scikit-learn Random Forest Classifier (AWS SageMaker)

Hands-On Lab

 

Length

01:00:00

Difficulty

Intermediate

scikit-learn is a great place to start working with machine learning and artificial intelligence. In this activity, we will use scikit-learn to create a Random Forest Classifier that performs a basic classification of people to see if they are likely to prefer dogs or cats. In this lab, we use a fictitious data set that could easily be replaced with one of your own. The files used in this lab can be found on GitHub.

Logging In

Log in to AWS with the credentials provided on the hands-on lab page. In the Find Services box, search for SageMaker. Once we're in the SageMaker dashboard, click Notebook instances in the left-hand menu. There should be one notebook instance listed. If there isn't, double-check that you're in the N. Virginia region (using the dropdown in the upper right of the dashboard). When the instance shows up in the list, click it and we'll land in the notebook server. To open the notebook we'll be working with, click Open Jupyter on the right.

Navigate to the Jupyter Notebook

Once you're redirected to Jupyter, click to open the listed scikit-learn-random-forest-classifier.ipynb notebook. It may take a few minutes to fully spin up. Once it's ready, you should see a cat and dog image appear.

Run the Libraries

  1. With the Libraries code block selected, click Run at the top.

Load a CSV File into the Running Jupyter Notebook Environment

  1. Click the next two code blocks and then click Run to load the CSV file into a Pandas DataFrame and review:

    df = pd.read_csv("data.csv")
    df.head(10)

    This will list out some of our data.

  2. Click the next code block and then click Run to change the names of the DataFrame columns:

    df.columns = ['walk', 'run', 'color', 'distance', 'label']

  3. To see the new column labels, click Insert > Insert Cell Below and enter:

    df.head(10)

    We should see the same data as before, but the column labels will be updated.

  4. Click the next code block and then click Run to see the types of data:

    df.dtypes

  5. Click the next code block and then click Run to convert the columns to more suitable data types:

    df['walk'] = df['walk'].astype('bool')
    df['run'] = df['run'].astype('bool')
    # Set the category list explicitly (this form works on current pandas releases)
    df['color'] = pd.Categorical(df['color'], categories=['red', 'green', 'blue'])
    df['label'] = df['label'].astype('bool')

  6. Click the next code block and then click Run to re-review the data types:

    df.dtypes

  7. Click Insert > Insert Cell Below and enter:

    df.head(10)

    Now, we should see different values (True instead of numerals in a few columns).

  8. Click the next code block and then click Run to split the colors into different columns:

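    # One-hot encode only the color column, producing color_red, color_green, and color_blue
    df = pd.get_dummies(df, columns=['color'], prefix=['color'])

  9. Click the next code block and then click Run to re-review the data: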

    df.head(10)
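
To make the effect of the type conversion and one-hot encoding steps concrete, here is a minimal sketch on a three-row DataFrame. The rows are invented for illustration and are not taken from data.csv. The column names match the ones assigned above, and the sketch assumes the color column holds the strings red, green, and blue.

```python
import pandas as pd

# Invented sample rows (not from data.csv), using the lab's column names
df = pd.DataFrame({
    "walk": [1, 0, 1],
    "run": [0, 1, 1],
    "color": ["red", "blue", "green"],
    "distance": [2, 10, 5],
    "label": [1, 0, 1],
})

# Convert the 0/1 columns to booleans
for column in ["walk", "run", "label"]:
    df[column] = df[column].astype("bool")

# Declare the full category list so every color gets a column,
# even one that happens not to appear in the data
df["color"] = pd.Categorical(df["color"], categories=["red", "green", "blue"])

# One-hot encode the color column; the other columns keep their order
encoded = pd.get_dummies(df, columns=["color"], prefix=["color"])
print(list(encoded.columns))
```

The untouched columns stay in place and the new color_* columns are appended at the end, which is the column order the prediction step at the end of the lab depends on.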

Split the Data, and Then Use the Training Set to Train the Random Forest Model

  1. Click the next code block and then click Run to use train_test_split to prepare the data, and then check that the data looks right:

    X_train, X_test, y_train, y_test = train_test_split(df.drop('label', axis=1), df['label'], test_size=.2, random_state=10)
    X_train.head(10)

    We should see a randomized list of data.

  2. Click the next code block and then click Run to create the model object:

    model = RandomForestClassifier(max_depth=5, n_estimators=15)

  3. Click the next code block and then click Run to fit the data to the model (train the model):

    model.fit(X_train, y_train)
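
The split-then-fit sequence above can be tried outside the notebook with a small stand-in data set. The sketch below is only an illustration: the 50 rows are randomly generated, the rule that sets the label is made up, and the extra random_state on the classifier is an assumption added here so the run is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data (not the lab's data.csv)
rng = np.random.default_rng(0)
n_rows = 50
toy = pd.DataFrame({
    "walk": rng.integers(0, 2, n_rows).astype(bool),
    "run": rng.integers(0, 2, n_rows).astype(bool),
    "distance": rng.integers(0, 20, n_rows),
})
toy["label"] = toy["walk"]  # made-up rule, purely for illustration

# Hold out 20% of the rows for testing, as in the lab
X_train, X_test, y_train, y_test = train_test_split(
    toy.drop("label", axis=1), toy["label"], test_size=0.2, random_state=10
)

# Same hyperparameters as the lab; random_state is an addition for repeatability
model = RandomForestClassifier(max_depth=5, n_estimators=15, random_state=10)
model.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print("Accuracy on the held-out rows:", model.score(X_test, y_test))
```

With test_size=.2, 40 of the 50 rows go to training and 10 are held back, so the accuracy reported at the end is measured on rows the model never saw.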

Review the Trained Random Forest Model

  1. Click the next code block and then click Run to obtain a single estimator (tree) from the model as well as the names of the features:

    estimator = model.estimators_[0]
    feature_names = list(X_train.columns)

  2. Click the next code block and then click Run to use export_graphviz to display the tree graphically:

    export_graphviz(estimator, out_file='tree.dot',
                    feature_names = feature_names,
                    rounded = True,
                    filled = True)
    
    call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
    
    Image(filename = 'tree.png')

    We should then see a decision tree display.
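
If the Graphviz dot command is not available in your own environment, one alternative is scikit-learn's export_text, which prints the same tree as indented text. This is a sketch of that alternative, not part of the lab notebook, and the tiny training set is invented for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Invented training data, just large enough to grow a small forest
X = pd.DataFrame({
    "walk": [1, 1, 0, 0, 1, 0, 1, 0],
    "distance": [1, 2, 9, 12, 3, 10, 2, 11],
})
y = [True, True, False, False, True, False, True, False]

model = RandomForestClassifier(max_depth=5, n_estimators=15, random_state=10)
model.fit(X, y)

# Take one tree from the forest and render it as plain text
estimator = model.estimators_[0]
tree_text = export_text(estimator, feature_names=list(X.columns))
print(tree_text)
```

Each indented line is either a split on a feature or a leaf with its predicted class. Because every tree in the forest is trained on a different bootstrap sample, other entries of model.estimators_ will generally look different.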

Perform Predictions with the Random Forest Model, and Produce a Confusion Matrix

  1. Click the next code block and then click Run to pass test data into the model and get test results:

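    # Predicted class labels for the test set
    y_predict = model.predict(X_test)
    # Predicted probability of the True (dog) class, used later for the ROC curve
    y_pred_quant = model.predict_proba(X_test)[:, 1]
    # The same predicted labels, under the name used by the confusion matrix step
    y_pred_bin = model.predict(X_test)

  2. Click the next code block and then click Run to use scikit-learn's confusion_matrix function to create a confusion matrix: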

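    # Note: this assignment rebinds the name confusion_matrix from the function to its result,
    # so running this cell a second time fails until the Libraries cell is run again
    confusion_matrix = confusion_matrix(y_test, y_pred_bin)
    confusion_matrix

  3. Optionally, click the next code block and then click Run to format the confusion matrix with matplotlib: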

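    y_true = ["Dog", "Cat"]
    y_pred = ["Dog", "Cat"]
    # np.unique sorts the names to ['Cat', 'Dog'], which matches the matrix order (False = cat, True = dog)
    df_cm = pd.DataFrame(confusion_matrix, columns=np.unique(y_true), index = np.unique(y_true))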
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'
    
    df_cm.dtypes
    
    plt.figure(figsize = (8,5))
    plt.title('Confusion Matrix')
    sn.set(font_scale=1.4)  # label size
    sn.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # annotation font size

    We should then see the confusion matrix display.
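
It helps to know how scikit-learn arranges the matrix before reading it: rows are the actual classes, columns are the predicted classes, and the classes are sorted, so False (cat) comes before True (dog). The sketch below uses five invented labels rather than the lab's test set, and stores the result under the hypothetical name cm.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: True = dog person, False = cat person
y_actual = [True, True, True, False, False]
y_predicted = [True, False, True, False, True]

cm = confusion_matrix(y_actual, y_predicted)
print(cm)

# For a binary problem, flattening row by row gives TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Here cm[0, 0] counts cat people correctly predicted as cat people, and cm[1, 1] counts dog people correctly predicted as dog people. The off-diagonal cells are the two kinds of mistakes.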

Calculate the Sensitivity and Specificity of the Model

  1. Click the next code block and then click Run to calculate the total number of results as well as the sensitivity and specificity:

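    # scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] (rows = actual, columns = predicted)
    total = confusion_matrix.sum()

    # Sensitivity (true positive rate) = TP / (TP + FN)
    sensitivity = confusion_matrix[1,1]/(confusion_matrix[1,1]+confusion_matrix[1,0])

    # Specificity (true negative rate) = TN / (TN + FP)
    specificity = confusion_matrix[0,0]/(confusion_matrix[0,0]+confusion_matrix[0,1])

  2. Click the next code block and then click Run to display the values: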

    print('Sensitivity : ', sensitivity )
    print('Specificity : ', specificity)
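
As a cross-check on the two rates, they can also be computed by counting outcomes directly, without a confusion matrix. The helper below, sensitivity_specificity, is a hypothetical name introduced for this sketch, and the labels are invented. True (dog) is treated as the positive class.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity), treating True as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # dog predicted as dog
    fn = sum(1 for a, p in pairs if a and not p)      # dog predicted as cat
    tn = sum(1 for a, p in pairs if not a and not p)  # cat predicted as cat
    fp = sum(1 for a, p in pairs if not a and p)      # cat predicted as dog
    return tp / (tp + fn), tn / (tn + fp)


# Invented labels: True = dog person, False = cat person
y_actual = [True, True, True, False, False]
y_predicted = [True, False, True, False, True]

sensitivity, specificity = sensitivity_specificity(y_actual, y_predicted)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

Two of the three dog people are found, so sensitivity is 2/3. One of the two cat people is found, so specificity is 1/2.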

Create an ROC Graph for our Random Forest Model, and Calculate the AUC

  1. Click the next code block and then click Run to calculate the false positive rate (fpr) and the true positive rate (tpr) using scikit-learn's roc_curve function, as well as use matplotlib to plot an ROC graph for our model:

    fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)
    
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr)
    ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.rcParams['font.size'] = 12
    plt.title('ROC curve for Dog vs Cat people')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)

  2. Click the next code block and then click Run to calculate the AUC:

    auc(fpr, tpr)
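
To see where the points on an ROC curve come from, the sketch below rebuilds them by hand: it sweeps a threshold over the predicted probabilities, records the false positive rate and true positive rate at each threshold, and then measures the area under the points with the trapezoidal rule. The function names roc_points and trapezoid_auc and the four scores are invented for illustration. This is a simplified stand-in for what roc_curve and auc compute, not scikit-learn's implementation.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step marks more rows as positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented data: True = dog person, scores = predicted probability of True
labels = [True, False, True, False]
scores = [0.9, 0.8, 0.4, 0.3]

points = roc_points(labels, scores)
area = trapezoid_auc(points)
print(points)
print("AUC:", area)
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (no better than guessing), and 1.0 means every dog person received a higher score than every cat person. In this toy case three of the four dog/cat pairs are ranked correctly, which gives 0.75.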

Use the Random Forest Model to Make Predictions from "Real-World" Data

  1. Click the next code block and then click Run to set the input variables for the model, make a prediction, and display the model output:

    like_walking = 0
    like_running = 1
    distance_run = 10
    
    # One-hot encoded color: set exactly one of these to 1
    red = 0
    green = 1
    blue = 0
    
    prediction = model.predict([[like_walking, like_running, distance_run, red, green, blue]])
    
    if prediction[0]:
        print('This is a DOG person!')
    else:
        print('This is a CAT person!')
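
The prediction above depends on the input values being listed in exactly the same order as the training columns. One way to guard against ordering mistakes is to pass a one-row DataFrame built from named values, as in this sketch. The training rows and labels here are invented, and the column names are assumed to match the ones produced earlier in the lab. Recent scikit-learn releases also emit a warning when a model trained on a DataFrame is given a plain list, and this approach avoids that warning.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_columns = ["walk", "run", "distance", "color_red", "color_green", "color_blue"]

# Invented training rows and labels (True = dog person), for illustration only
X_train = pd.DataFrame(
    [
        [1, 0, 2, 1, 0, 0],
        [0, 1, 10, 0, 1, 0],
        [1, 1, 5, 0, 0, 1],
        [0, 0, 1, 1, 0, 0],
        [0, 1, 12, 0, 1, 0],
        [1, 0, 3, 0, 0, 1],
    ],
    columns=feature_columns,
)
y_train = [True, False, True, True, False, True]

model = RandomForestClassifier(max_depth=5, n_estimators=15, random_state=10)
model.fit(X_train, y_train)

# Name each input value, then let pandas arrange the columns like the training data
person = {"color_green": 1, "color_red": 0, "color_blue": 0,
          "distance": 10, "run": 1, "walk": 0}
row = pd.DataFrame([person], columns=X_train.columns)

prediction = model.predict(row)
print("This is a DOG person!" if prediction[0] else "This is a CAT person!")
```

Because the values are looked up by name, the order of the keys in person does not matter.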

Conclusion

Congratulations on successfully completing this (follow-along) hands-on lab!