Migrating Redshift Data to and from S3

Hands-On Lab


Fernando Medina Corey

Training Architect

Length: 01:00:00

Difficulty: Intermediate

Introduction

In this hands-on lab, we will migrate data between S3 and Redshift, and then from Redshift back to S3. Specifically, we will upload a data set into S3 and then load it into a Redshift cluster with a COPY command. Then we will use the UNLOAD command to export data from Redshift into S3.

Logging In

There are links and credentials provided on the hands-on lab overview page. Use these to log into the AWS Management Console and the Jupyter notebook that we'll be working with.

Get Software Installed

Run the pip install command in the notebook, either by pressing the "run" button right next to it, or by highlighting the cell and pressing Run just below the top menu.
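
For reference, the install cell looks something like the following. The exact package names are an assumption here (they depend on the lab version), but a notebook like this typically needs a PostgreSQL driver for Redshift plus the AWS SDK:

    # Package names may differ in your notebook; these are typical for this kind of lab.
    # psycopg2 speaks the PostgreSQL wire protocol that Redshift uses;
    # boto3 is the AWS SDK we'll use to talk to S3.
    !pip install psycopg2-binary boto3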

Set Up

Once things are installed, let's head over to the AWS console. In a couple of new tabs, open up Storage > S3 and Database > Amazon Redshift.

In the S3 tab, there should already be a bucket sitting there. Click on its name, then copy the full name from the next screen. It will start with cfst-.

Back over in the Jupyter notebook, in the second code cell, we want to uncomment the DATA_BUCKET line by removing the # at the start of it. Then replace the existing cfst- value with the bucket name we just copied from the S3 dashboard.

We'll do the same kind of thing for the HOST line. In the Redshift dashboard (in the browser tab we opened up a little bit ago), click on Clusters, then click on the cluster that is sitting in there. Copy the cluster's endpoint and paste it into the Jupyter notebook, but delete :5439 from the end. That's a port number, and we've got it already, a little further down in the code cell.
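
Once both edits are made, the cell should look roughly like this. The values below are placeholders rather than the real lab values, and the exact variable layout may differ slightly in your notebook:

    DATA_BUCKET = 'cfst-0000-0123456789abcdef'  # bucket name copied from the S3 dashboard
    HOST = 'redshift-cluster.abc123xyz.us-east-1.redshift.amazonaws.com'  # endpoint, minus :5439
    PORT = 5439  # Redshift's default port, set a little further down in the cell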

Now we can run this code cell, which will set these variables, and move on.

Connection

The next code cell uses all of the variables from the last one to connect to the database server. When we click the button to run this cell, the output at the bottom will be the results of the query near the top: the names of the databases on the server. dev is the one we'll be interacting with.
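
Here's a minimal sketch of what that connection cell is doing, assuming psycopg2 as the driver and that the previous cell defined HOST, PORT, USER, and PASSWORD (USER and PASSWORD stand in for the lab's credentials):

    import psycopg2

    # Connect to the cluster using the variables set in the previous cell.
    conn = psycopg2.connect(host=HOST, port=PORT, dbname='dev',
                            user=USER, password=PASSWORD)

    # List the databases on the server; dev is the one we'll work with.
    with conn.cursor() as cur:
        cur.execute('SELECT datname FROM pg_database;')
        print(cur.fetchall())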

Create a Table

If we click the button to run the next code cell, we will create the movies table that's going to hold our data. We can run the next cell, which is a query that should confirm that movies was created.
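
The notebook's actual table definition will have the real columns for the data set; the sketch below just shows the shape of these two cells, with assumed column names:

    # Create the table that will hold the movie data (columns here are assumptions).
    with conn.cursor() as cur:
        cur.execute('''
            CREATE TABLE movies (
                title  VARCHAR(200),
                year   INTEGER,
                rating DECIMAL(3,1)
            );
        ''')
        conn.commit()

    # Confirm the table was created by looking it up in Redshift's catalog.
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM pg_table_def WHERE tablename = 'movies';")
        print(cur.fetchall())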

Get data.csv into S3

There is another code cell to run. This one will gzip our CSV file, then send it over to our S3 bucket. If we browse the bucket after we run the code, we'll see data.csv.gz in there.
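
Under the hood, that cell is doing something like this (a sketch, assuming boto3 and the standard library's gzip module):

    import gzip, shutil, boto3

    # Compress data.csv into data.csv.gz.
    with open('data.csv', 'rb') as src, gzip.open('data.csv.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)

    # Upload the compressed file to our lab bucket.
    boto3.client('s3').upload_file('data.csv.gz', DATA_BUCKET, 'data.csv.gz')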

Import the Data into Redshift

Our data is in S3, but can Redshift access it? We need to edit the next code cell, uncommenting the IAM_ROLE line first, then putting the right value between the quotes. Over in the Redshift dashboard, still in Clusters, click on See IAM roles. Copy the role ARN from the window that pops up (it will start with arn:aws:iam) and paste it over in our Jupyter notebook. Once that's done, run the cell.
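
The cell runs a Redshift COPY command along these lines (a sketch; the real cell's options may differ, and IAM_ROLE holds the ARN we just pasted):

    # Load the gzipped CSV from S3 into the movies table.
    copy_sql = f"""
        COPY movies
        FROM 's3://{DATA_BUCKET}/data.csv.gz'
        IAM_ROLE '{IAM_ROLE}'
        CSV
        GZIP
        IGNOREHEADER 1;
    """
    with conn.cursor() as cur:
        cur.execute(copy_sql)
        conn.commit()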

Run a Test Query

We should have data in Redshift now. If we look at the next code cell, we'll see a query. Run that, and we should see some results underneath. This means that the data survived the trip into Redshift.
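
The test query is just a simple SELECT, something like:

    # Pull a few rows back out to confirm the COPY worked.
    with conn.cursor() as cur:
        cur.execute('SELECT * FROM movies LIMIT 10;')
        for row in cur.fetchall():
            print(row)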

Unload Data from Redshift to S3

Use the UNLOAD command in the next code cell to move some data from Redshift into S3. Yes, we may get some errors, but running the rollback command in the final code cell will clear the failed transaction. We can look in S3 though and see (alongside the data.csv.gz sitting there) that we have an output directory, which is where Redshift exported our data.
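
Here's a sketch of what that UNLOAD cell runs, with the rollback handling the errors mentioned above (the exact query and options in the notebook may differ):

    # Export the movies table from Redshift back out to S3.
    unload_sql = f"""
        UNLOAD ('SELECT * FROM movies')
        TO 's3://{DATA_BUCKET}/output/'
        IAM_ROLE '{IAM_ROLE}'
        CSV;
    """
    try:
        with conn.cursor() as cur:
            cur.execute(unload_sql)
        conn.commit()
    except Exception as err:
        print(err)
        conn.rollback()  # clear the failed transaction, like the final cell does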

Conclusion

In this hands-on lab, we moved data from an S3 bucket into Redshift, and then back again. Congratulations!