Introduction to Jupyter Notebooks (AWS SageMaker)

Hands-On Lab

 

Length

01:00:00

Difficulty

Beginner

Jupyter Notebooks are the standard tool for interacting with and manipulating data. Data scientists and engineers at many companies use them to experiment with their data sets and assist in product development. In this activity, we will cover the basic structure of a notebook, how to execute code, and how to make changes. We'll also create a simple machine learning model and use it to make inferences. This lab uses AWS SageMaker Notebooks and provides you with the foundational knowledge required to use this service for more advanced topics. The files used in this lab can be found on GitHub.



Introduction

This hands-on lab isn't quite the same as most others at Linux Academy. We don't have a set task or scenario that needs to be accomplished. Instead, it's more of a follow-along: watch the videos, then play around in the lab environment afterward to get familiar with the concepts we've covered. Let's get started...


The model we work with here is simple. We've got a set of sample heights and weights of individual penguins. We're going to use the data (stored in a CSV file) to train a model and make inferences, so that we can estimate what a penguin weighs if we know its height.

The files used in this lab can be found on GitHub.

Logging In

Log into AWS with the credentials provided in the hands-on lab page. In the Find Services box, look for SageMaker, and it should be the first item that pops up in the list. Once we're in the SageMaker dashboard, find Notebook instances in the left-hand menu. There should be one sitting there. If there isn't one, double check to make sure that we're in the N. Virginia region (using the dropdown in the upper right of the dashboard). When it shows up in the list of notebooks, click it and we'll land in the notebook server. To open up the notebook we'll be working with, click Open Jupyter over toward the right.

Browsing Jupyter Notebooks

When we first get into a notebook, it looks rather like a file explorer window. We can click on items we see, and they'll open up in new tabs. So if we click on an image, for example, that image will open up in a new browser tab.

Over on the right-hand side of the screen, there are a couple of buttons: New and Upload.

Notebook Files

The actual Jupyter notebook file will have an .ipynb file extension. When we click on one in the notebook, it will open up in a new tab. It looks a lot like a word processor.

Structure

Notebook files are made up of cells. There are three different types of cells: Code, Markdown, and Raw NBConvert. If the cell is Markdown, we can type Markdown syntax into it and click the Run button to "run" that code. In this case, the cell will then show the HTML rendering of what we typed in Markdown.

If it's a Code type of cell, then we can actually type some executable code in there. When we type something in there like !whoami, and hit the Run button, the results of that command will show up under the code part of the cell.

Raw NBConvert cells are used for converting the output to something else, like HTML or PDF (via LaTeX).

Getting Our Hands On Things

Ok, we've seen what a notebook looks like, and delved into an actual notebook file to see how it's built. Now let's actually play around a little.

Creating New Things

Out in the notebook's directory (where we can see all of the files involved with this notebook), let's make a new directory. Click the New button in the upper right, and scroll down (almost at the bottom) to Folder. We'll have a new folder named Untitled Folder.

If we want to rename it, check the box next to it, and then click the Rename button that appears up above the list of files and folders. Let's call this one my_new_folder. If we click on it, we'll end up inside of it. To get back out, either click the double-dot link (the .. shorthand for "up one directory") or click somewhere in the blue path up above (currently a picture of a folder, followed by / my_new_folder).

The Notebook File

Back out in the main directory, click on our notebook (the .ipynb file), which will then open up in a new tab. Take a peek back at the other tab we were in, and notice that the notebook icon is now green. That means it's running. Get back into the running notebook and let's do some editing.

This notebook has already been built, and includes several blocks of different types.

Editing a Cell

We can click on any of the cells here, and edit the text in it. If it's a Markdown cell, we'd type Markdown. In a code cell, we're going to be typing Python.

We can edit a Markdown cell, then hit the Run button (up near the menu) to see our text get rendered.

Adding a Cell

With any cell highlighted, we can click the Insert menu, and choose whether we want our cell to be before or after the one we highlighted. By default, we get a new code cell.

Markdown Cells

Let's make this new one Markdown by (again, near the top of the screen) clicking on the dropdown that currently says Code and choosing Markdown. Now, we've got a Markdown cell. Let's try putting an image in there. In HTML, the image tag would look like <img src="blahblah.jpg" alt="alt text" />, but in Markdown, it's going to be ![alt text](blahblah.jpg). When we hit the Run button, we're going to see the image, with alt text of alt text.

Code Cells

Now let's take a look at the cell down in the 2) Command-Line Operations section of the notebook. It looks like this:

!whoami
!which python

The exclamation point at the beginning of each line tells the code cell to run that command on a command line. This is just like running these commands (whoami and which python) while sitting in a Unix/Linux terminal. And if we think back to the little theory lesson on cells, this runs just like a Markdown cell: by hitting the Run button.

Below the code cell, we'll see the results of running those commands displayed:

ec2-user
/home/ec2-user/anaconda3/envs/python3/bin/python

Now let's make a new code cell. Let's get our exact Python version. Put this in the cell:

!python --version

Now, if we run that, we'll get command line output: Python 3.6.5 :: Anaconda, Inc

Python Code Cells

This notebook has been created with a Python 3 kernel, so we can run actual Python code in it, without a preceding !python invocation.

So a cell like this:

words = ['awesome', 'amazing', 'great']
for w in words:
    print('This Linux Academy lab is %s!' % w)

Now if we hit the Run button, we'll get:

This Linux Academy lab is awesome!
This Linux Academy lab is amazing!
This Linux Academy lab is great!

Digging into Data

One of the primary reasons to use Jupyter notebooks is data manipulation. Let's get into a few different ways to do that here.

Python Lists

Let's look at a list:

myList = [0, 1, 2, 3, 4, 5]
myList

We're setting a variable called myList. If we put that in a code cell and hit Run, the output will be [0, 1, 2, 3, 4, 5].

Now if we do something crazy like adding a color (a string) to the end of our list of integers, Python will let us do that with no problem. This:

myList.append('blue')
myList

will produce [0, 1, 2, 3, 4, 5, 'blue'].

Now if we run myList[3], Python will return the element at index 3 (which is 3 in this case, since Python lists are zero-indexed). Running myList[3:] will return everything from index 3 onward: [3, 4, 5, 'blue']. One last little command we'll play with is len(myList). Running this will output the length of our list (7 with this one).
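The indexing behavior described above can be sketched in a single code cell (same list as before):

```python
# Zero-indexed list with the string we appended earlier
myList = [0, 1, 2, 3, 4, 5, 'blue']

print(myList[3])    # element at index 3
print(myList[3:])   # everything from index 3 onward
print(len(myList))  # length of the list
```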

NumPy

This is a package designed for scientific computing with Python. We import it under the alias np, so instead of typing numpy everywhere, we can just type np. Once it's imported, we can use it for things like printing out the value of pi.

import numpy as np
np.pi

If we run the cell, np.pi will print out the value of pi.

\begin{equation}
c = 2\pi r
\end{equation}

Now, we can use it for performing some actual calculations too. We can find the circumference of a circle (using the c = 2πr equation we all remember from high school), once we've provided a radius of 10.

radius = 10
circumference = 2 * np.pi * radius
circumference

Once we hit Run on this, we'll get our answer (62.831...)

NumPy Arrays

Let's get into manipulation of data a bit now, using NumPy arrays. These are like lists, but on steroids. We can pass in what's essentially a list of lists, and then do things with that data.

data = np.array([['','Col1','Col2'],
                ['Row1',1,2],
                ['Row2',3,4],
                ['Row3',5,6]])

print(data)

If we run this cell, we'll get back pretty much what we put in. data now holds this array. When we take it a step further, with something like print(data[1:,1:]), we are picking and choosing what we're getting back. This will just give the data below the first row (the column headers, essentially) and to the right of the first column:

[['1' '2']
 ['3' '4']
 ['5' '6']]
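To make the slicing explicit, here is the same array with each of the pieces we'll use pulled out individually (note that NumPy stores everything as strings here, because the array mixes strings and numbers):

```python
import numpy as np

data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4],
                 ['Row3', 5, 6]])

print(data[0, 1:])   # the column headers
print(data[1:, 0])   # the row labels
print(data[1:, 1:])  # the body of the array
```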

pandas

Like we did with NumPy, we're going to import pandas as pd:

import pandas as pd

Once we've run that and gotten pandas imported, we can run the next code cell:

df = pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])

df

What we've done is create a pandas DataFrame. Remember that arrays are like lists on steroids? This takes it a step further. It's yet another way to store data. This particular DataFrame takes the body of our NumPy array (data[1:,1:]), uses the array's first column (data[1:,0]) as the row labels, and uses the array's first row (data[0,1:]) as the column names. We get something nice looking, about like this:

     Col1 Col2
Row1    1    2
Row2    3    4
Row3    5    6
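Once the DataFrame exists, individual values can be looked up by those row and column labels. A quick sketch using .loc (the values come back as strings, since the original NumPy array mixed strings and numbers):

```python
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'],
                 ['Row1', 1, 2],
                 ['Row2', 3, 4],
                 ['Row3', 5, 6]])

df = pd.DataFrame(data=data[1:, 1:],
                  index=data[1:, 0],
                  columns=data[0, 1:])

print(df.loc['Row2', 'Col1'])  # look up a single value by its labels
```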

Loading Data from an External Source

Back out in our Home tab, there is a list of files. One is the notebook file, remember? Another is a csv file. If we click on this, we'll see a set of data points, two per line, separated by commas. This is our list of penguin heights and weights. We are actually going to put this data to use now.

Back in the notebook, we have a code cell:

penguin_data = pd.read_csv("penguin-data.csv")

penguin_data.shape

It's followed by another:

penguin_data.head()

So what's happening is we'll hit the Run button on the first cell. This will import the CSV file, and (because of the penguin_data.shape) it will also give us a bird's-eye view of the file: it's got 20 rows and two columns ((20, 2) is what we see as output).

Now, when we run the second cell, we'll get the first five rows of our data, in a nicely formatted HTML table. With no arguments, head() returns the first five rows by default.

We can put a number between the parentheses (penguin_data.head(2)) to pull a different number of rows of data.
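Note that head() only accepts a single row count, so something like head(2, 10) won't pull a range of rows; to get a specific range, slice the DataFrame instead. A small sketch with a stand-in DataFrame (hypothetical values, not the real penguin-data.csv):

```python
import pandas as pd

# Stand-in for penguin_data, with made-up values
penguin_data = pd.DataFrame({'Height': [10, 11, 12, 13, 14, 15],
                             'Weight': [12, 14, 16, 18, 20, 22]})

print(penguin_data.head(2))  # the first two rows
print(penguin_data[2:5])     # the rows at positions 2 through 4
```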

Matplotlib

This is another Python library that works together with pandas and pandas DataFrames.

%matplotlib inline
import matplotlib.pyplot as plt

penguin_data.plot(kind='scatter',x='Height',y='Weight',color='red')

First, we've imported the library, like we did with the others (and in this case we imported it as plt). Then we're calling plot on the DataFrame we were looking at earlier. When we click Run, now we'll get a plotted graph instead of a table.

scikit-learn

But wait! There's more! There's a collection of scikit-learn libraries that we can also import, and we can do some other things with those. In our case, we're going to create a model that produces a "line of best fit" for our data points in the graph.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Set up the Linear Regression model
model = LinearRegression()

# Train the model with our data
model.fit(penguin_data[['Height']], penguin_data['Weight'])

Here we call in the appropriate libraries (the first two lines) and then we can see how to put this to use. We're going to set up a Linear Regression model, then we're going to run our data through it, making our data fit the model, with the model.fit line.

If we hit Run here, we'll get some output after a few seconds that might look like a failure (because the Out [21]: text is red). But it is in fact totally fine. This output is simply a linear regression model, with some of the attributes included in the output.

Ok, so now we've got a model. Now we're going to actually create the new graph, using our model, that will include the line of best fit drawn over the data points.


# Plot our original training data
axes = plt.axes()
axes.scatter(x=penguin_data['Height'], y=penguin_data['Weight'])

# Determine the best fit line
slope = model.coef_[0]
intercept = model.intercept_

# Plot our model line
x = np.linspace(10,20)
y = slope*x+intercept
axes.plot(x, y, 'r')

# Add some labels to the graph
axes.set_xlabel('Height')
axes.set_ylabel('Weight')

plt.show()

Highlight the cell with that code in it, hit the Run button, and we'll get our graph of data points with the line. Using that line, we can make an inference. We can guess pretty closely what a penguin would weigh based on its height, or the other way around. That's actually what the next code cell will do.

height = 14

# Reshape the height into an array
new_height = np.reshape([height],(1, -1))

# Pass the new height to the model so that a predicted weight can be inferred
weight = model.predict(new_height)[0]

# Print the information back to the user
print("If you see a penguin that's %.2f tall, you can expect it to be %.2f in weight." % (height, weight))

This will take a height of 14, then give us a pretty good idea of what that penguin will weigh. Hit the Run button on this, and we'll get output in the form of a sentence telling us that a 14-tall penguin (we never specified a unit of measure; this could be inches, cm, cubits, etc.) will weigh 18.84.

If we look at the newest graph, with the line, we'll see that it's right on the money. And we can play with the code, changing 14 to something else and clicking Run again, to get other results that should still be in agreement with what our line of best fit says.
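If we want a sanity check on what the model is doing, the same line of best fit can be computed with NumPy alone: np.polyfit with degree 1 performs an ordinary least-squares linear fit, the same math LinearRegression uses for one input variable. This sketch uses hypothetical heights and weights in place of penguin-data.csv:

```python
import numpy as np

# Hypothetical sample data standing in for penguin-data.csv
heights = np.array([10, 12, 14, 16, 18])
weights = np.array([12, 15, 19, 22, 26])

# np.polyfit with degree 1 returns (slope, intercept) of the least-squares line
slope, intercept = np.polyfit(heights, weights, 1)

# Predict a weight for a height of 14, just as model.predict would
predicted = slope * 14 + intercept
print("Predicted weight at height 14: %.2f" % predicted)
```

Comparing slope and intercept here against model.coef_[0] and model.intercept_ from the scikit-learn model (on the same data) should give matching values.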

What's It All Mean?

What we've done, essentially, is build a machine learning model that will tell us approximately what a penguin should weigh, based on its height.

Building Our Own

Back in the main Jupyter tab (where we can see the list of files associated with our notebook), click on the New button. Choose conda_python3 from the dropdown list. This will fire up a whole new notebook, using the Python3 kernel.

Click on the word Untitled at the top of the screen, and give it a name. Let's do something very original, like my_notebook.

We'll see that the first thing we've got is a code cell. Change that to be a Markdown type cell, just so we can let people know right off the bat what we're doing. The rest is just playing. We can do whatever we want in the notebook, using tools and methods that we went over earlier.

As the lab comes to an end (you'll know when you see "You have five minutes remaining" types of messages), go to the File menu, and the Download as flyout, then choose Notebook (.ipynb) from the list. You can download it onto your local machine and play with it further.

Conclusion

You'll need to have a Jupyter Notebook server if you're going to play with the file locally, but don't be afraid to run the lab again and just upload the file to it. Play with it some more. If you want to keep on, just download the newest version of your notebook file, then fire the lab up again and repeat the process, picking up where you left off. Good luck!