scikit-learn Random Forest Classifier (AWS SageMaker)

Introduction
scikit-learn is a great place to start working with machine learning and artificial intelligence. In this activity, we will use scikit-learn to create a Random Forest Classifier that performs a basic classification of people to see if they are likely to prefer dogs or cats. In this lab, we use a fictitious data set that could easily be replaced with one of your own. The files used in this lab can be found on GitHub.
Logging In
Log in to AWS with the credentials provided on the hands-on lab page. In the Find Services box, search for SageMaker. In the SageMaker dashboard, click Notebook instances in the left-hand menu. One notebook instance should be listed. If none appears, double-check that you're in the N. Virginia region (using the dropdown in the upper-right of the dashboard). When the instance shows up in the list, click it to land in the notebook server. To open the notebook we'll be working with, click Open Jupyter on the right.
Navigate to the Jupyter Notebook
Once you're redirected to Jupyter, click to open the listed scikit-learn-random-forest-classifier.ipynb notebook. It may take a few minutes to fully spin up. Once it's ready, you should see a cat and dog image.
Run the Libraries
- With The Libraries code block selected, click Run at the top.
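The contents of the Libraries cell are not reproduced in this guide. Judging only from the names used in the later steps (pd, np, plt, sn, call, Image, and the scikit-learn functions), it presumably contains imports along these lines. This is an inferred sketch, not a copy of the notebook's cell, and the seaborn alias in particular is an assumption:

```python
# Assumed imports, inferred from the names used later in this lab.
# The notebook's actual Libraries cell may differ.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn                     # assumed from the alias `sn`
from subprocess import call              # used to run the Graphviz `dot` command
from IPython.display import Image        # used to show the rendered tree image
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import confusion_matrix, roc_curve, auc
```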
Load a CSV File into the Running Jupyter Notebook Environment
Click the next two code blocks and then click Run to load the CSV file into a Pandas DataFrame and review:
df = pd.read_csv("data.csv")
df.head(10)
This will list out some of our data.
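To try the loading step without the lab's data.csv, a small in-memory CSV can stand in for the file. The column names and values below are invented for illustration and are not the lab's actual data:

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; the real file's columns and values may differ.
csv_text = """likes_walking,likes_running,favorite_color,distance_run,dog_person
1,1,red,5.0,1
0,0,blue,0.0,0
1,0,green,1.5,1
"""

# read_csv accepts any file-like object, so StringIO works in place of a path.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head(10))  # head(n) returns at most n rows
print(df_demo.shape)     # (rows, columns)
```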
Click the next code block and then click Run to change the names of the DataFrame columns:
df.columns = ['walk', 'run', 'color', 'distance', 'label']
To see the new column labels, click Insert > Insert Cell Below and enter:
df.head(10)
We should see the same data as before, but the column labels will be updated.
Click the next code block and then click Run to see the types of data:
df.dtypes
Click the next code block and then click Run to change the data types of the data:
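df['walk'] = df['walk'].astype('bool')
df['run'] = df['run'].astype('bool')
# Newer pandas releases reject astype('category', categories=...), so pass a CategoricalDtype instead.
df['color'] = df['color'].astype(pd.CategoricalDtype(categories=['red', 'green', 'blue']))
df['label'] = df['label'].astype('bool')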
Click the next code block and then click Run to re-review the data types:
df.dtypes
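The effect of the type conversion can be checked on a tiny, made-up DataFrame. This sketch uses pd.CategoricalDtype to fix the allowed color categories, which is the form current pandas releases accept. The data values are invented:

```python
import pandas as pd

# Hypothetical miniature of the lab's DataFrame (values invented for illustration).
df_demo = pd.DataFrame({
    "walk": [1, 0, 1],
    "color": ["red", "blue", "green"],
})

# Integer flags become booleans.
df_demo["walk"] = df_demo["walk"].astype("bool")

# A CategoricalDtype fixes the set of allowed categories and their order.
color_type = pd.CategoricalDtype(categories=["red", "green", "blue"])
df_demo["color"] = df_demo["color"].astype(color_type)

print(df_demo.dtypes)
```

Fixing the categories up front means every color gets its own column when the data is one-hot encoded, even if a color happens to be missing from the rows.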
Click Insert > Insert Cell Below and enter:
df.head(10)
Now, we should see different values (True instead of numerals in a few columns).
Click the next code block and then click Run to split the colors into different columns:
df = pd.get_dummies(df, prefix=['color'])
Click the next code block and then click Run to re-review the data:
df.head(10)
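As a small illustration of what get_dummies does to the color column, the sketch below one-hot encodes a two-row DataFrame. The data is invented, and the column names follow from the prefix and the category names:

```python
import pandas as pd

# Invented two-row example; "green" is a declared category that no row uses.
df_demo = pd.DataFrame({
    "distance": [5.0, 0.0],
    "color": pd.Categorical(["red", "blue"], categories=["red", "green", "blue"]),
})

# Each category becomes its own indicator column named <prefix>_<category>.
encoded = pd.get_dummies(df_demo, prefix=["color"])

print(list(encoded.columns))
print(encoded)
```

Because the column is categorical, an indicator column is created for every declared category, including ones that do not occur in the data.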
Split the Training Data, and Then Use It to Train the Random Forest Model
Click the next code block and then click Run to use train_test_split to prepare the data, and then check that the data looks right:
X_train, X_test, y_train, y_test = train_test_split(df.drop('label', axis=1), df['label'], test_size=0.2, random_state=10)
X_train.head(10)
We should see a randomized list of data.
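The split can be sketched on synthetic data to show what test_size and random_state control. The arrays below are invented; only the train_test_split arguments mirror the lab:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented samples with two features each, plus alternating labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
print(X_tr.shape, X_te.shape)

# Calling it again with the same seed yields the same split.
X_tr_again, _, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
print(np.array_equal(X_tr, X_tr_again))
```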
Click the next code block and then click Run to create the model object:
model = RandomForestClassifier(max_depth=5, n_estimators=15)
Click the next code block and then click Run to fit the data to the model (train the model):
model.fit(X_train, y_train)
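To see what the two constructor arguments do, the sketch below fits a forest with the same settings on invented data and then inspects it. The random_state argument is added here only to make the example repeatable and is not part of the lab's code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented data: 40 samples, 3 binary features; the label copies the first feature.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(40, 3))
y_demo = X_demo[:, 0].astype(bool)

# n_estimators is the number of trees; max_depth caps how deep each tree may grow.
model_demo = RandomForestClassifier(max_depth=5, n_estimators=15, random_state=0)
model_demo.fit(X_demo, y_demo)

print(len(model_demo.estimators_))                          # number of fitted trees
print(max(t.get_depth() for t in model_demo.estimators_))   # deepest tree in the forest
```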
Review the Trained Random Forest Model
Click the next code block and then click Run to obtain a single estimator (tree) from the model as well as the names of the features:
estimator = model.estimators_[0]
feature_names = list(X_train.columns)
Click the next code block and then click Run to use export_graphviz to display the tree graphically:
export_graphviz(estimator, out_file='tree.dot', feature_names=feature_names, rounded=True, filled=True)
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
Image(filename='tree.png')
We should then see a decision tree display.
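If the Graphviz dot command is not available, scikit-learn's export_text can print a tree as indented text instead. This is an alternative to the lab's approach, not a step in the notebook, and the data and feature names below are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Invented data: two binary features; the label copies the first one.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5)
y_demo = X_demo[:, 0].astype(bool)
names = ["walk", "run"]  # hypothetical feature names

model_demo = RandomForestClassifier(max_depth=5, n_estimators=15, random_state=0)
model_demo.fit(X_demo, y_demo)

# Render the first tree of the forest as plain text.
tree_text = export_text(model_demo.estimators_[0], feature_names=names)
print(tree_text)
```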
Perform Predictions with the Random Forest Model, and Produce a Confusion Matrix
Click the next code block and then click Run to pass test data into the model and get test results:
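y_predict = model.predict(X_test)  # predicted class for each test row
y_pred_quant = model.predict_proba(X_test)[:, 1]  # predicted probability of the True class
y_pred_bin = model.predict(X_test)  # same class predictions, used for the confusion matrix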
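Click the next code block and then click Run to use scikit-learn's confusion_matrix function to create a confusion matrix:
# Note: this rebinds the name confusion_matrix from the function to its result, so re-running this cell requires re-importing the function first.
confusion_matrix = confusion_matrix(y_test, y_pred_bin)
confusion_matrix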
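The layout of scikit-learn's confusion matrix is easy to misread, so here is a small sketch with invented labels. Rows are the actual classes and columns are the predicted classes, both sorted so that False comes before True. The result is stored under the name cm (a name chosen for this example) so that the function itself stays callable:

```python
from sklearn.metrics import confusion_matrix

# Invented labels: True = dog person, False = cat person.
actual    = [True, True,  True, False, False, True]
predicted = [True, False, True, False, True,  True]

cm = confusion_matrix(actual, predicted)
print(cm)

# For a binary problem, ravel() unpacks the cells in this order.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```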
Optionally, click the next code block and then click Run to format the confusion matrix with matplotlib:
y_true = ["Dog", "Cat"]
y_pred = ["Dog", "Cat"]
df_cm = pd.DataFrame(confusion_matrix, columns=np.unique(y_true), index=np.unique(y_true))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
df_cm.dtypes
plt.figure(figsize=(8, 5))
plt.title('Confusion Matrix')
sn.set(font_scale=1.4)  # for label size
sn.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})  # font size
We should then see the confusion matrix display.
Calculate the Sensitivity and Specificity of the Model
Click the next code block and then click Run to calculate the total number of results as well as the sensitivity and specificity:
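total = sum(sum(confusion_matrix))
# scikit-learn puts actual classes in rows and predicted classes in columns, with True (dog) at index 1.
sensitivity = confusion_matrix[1,1]/(confusion_matrix[1,1]+confusion_matrix[1,0])  # TP / (TP + FN)
specificity = confusion_matrix[0,0]/(confusion_matrix[0,0]+confusion_matrix[0,1])  # TN / (TN + FP)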
Click the next code block and then click Run to display the values:
print('Sensitivity : ', sensitivity)
print('Specificity : ', specificity)
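For reference, the standard definitions are sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), where the positive class is taken here to be True (dog person). A minimal worked sketch with invented counts, using a hypothetical helper function:

```python
def sensitivity_specificity(tn, fp, fn, tp):
    """Return (sensitivity, specificity) from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives that were found
    specificity = tn / (tn + fp)  # share of actual negatives that were found
    return sensitivity, specificity


# Invented counts for illustration only.
sens, spec = sensitivity_specificity(tn=50, fp=10, fn=5, tp=35)
print(sens)  # 35 / 40
print(spec)  # 50 / 60
```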
Create an ROC Graph for our Random Forest Model, and Calculate the AUC
Click the next code block and then click Run to calculate the false positive rate (fpr) and the true positive rate (tpr) using scikit-learn's roc_curve function, as well as use matplotlib to plot an ROC graph for our model:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)
fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.plot([0, 1], [0, 1], transform=ax.transAxes, ls="--", c=".3")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.rcParams['font.size'] = 12
plt.title('ROC curve for Dog vs Cat people')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
Click the next code block and then click Run to calculate the AUC:
auc(fpr, tpr)
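The AUC is the area under the ROC curve, which auc computes with the trapezoidal rule. The sketch below checks that by hand on four invented scores. The data is illustrative and unrelated to the lab's model:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Invented labels and predicted probabilities for the positive class.
labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, _ = roc_curve(labels, scores)
area = auc(fpr_demo, tpr_demo)

# Trapezoidal rule by hand: width of each step times the average height.
manual = sum(
    (fpr_demo[i + 1] - fpr_demo[i]) * (tpr_demo[i + 1] + tpr_demo[i]) / 2
    for i in range(len(fpr_demo) - 1)
)

print(area, manual)
```

An AUC of 0.5 corresponds to the dashed diagonal in the ROC plot (no better than chance), and 1.0 corresponds to a perfect ranking.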
Use the Random Forest Model to Make Predictions from "Real-World" Data
Click the next code block and then click Run to set the input variables for the model, make a prediction, and display the model output:
like_walking = 0
like_running = 1
distance_run = 10
# Only one of:
red = 0
green = 1
blue = 0
prediction = model.predict([[like_walking, like_running, distance_run, red, green, blue]])
if prediction[0]:
    print('This is a DOG person!')
else:
    print('This is a CAT person!')
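The order of the input values has to match the order of the training columns. One way to make that explicit is to pass a one-row DataFrame with named columns and to look at predict_proba as well as the class. The sketch below does this on a tiny invented training set. The column names assume the renaming and one-hot steps from earlier in the lab, and the model here is a stand-in, not the lab's trained model:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed column order after the earlier renaming and one-hot encoding steps.
columns = ["walk", "run", "distance", "color_red", "color_green", "color_blue"]

# Invented training rows; True = dog person, False = cat person.
train = pd.DataFrame(
    [[1, 1, 5, 1, 0, 0], [0, 0, 0, 0, 0, 1], [1, 0, 2, 0, 1, 0], [0, 1, 0, 0, 0, 1]],
    columns=columns,
)
labels = [True, False, True, False]
model_demo = RandomForestClassifier(max_depth=5, n_estimators=15, random_state=0)
model_demo.fit(train, labels)

# One new person, with values keyed by column name rather than by position.
person = pd.DataFrame([[0, 1, 10, 0, 1, 0]], columns=columns)
proba = model_demo.predict_proba(person)[0]

print(model_demo.classes_)  # order of the classes in `proba`
print(proba)                # probability for each class
```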
Conclusion
Congratulations on successfully completing this (follow-along) hands-on lab!