
Using Python for Data Management and Reporting

Course


Larry Fritts

Python Development Training Architect II

Larry lives in Everett, WA with his fabulous husband Harley, Shiba Dexter, and two cats Alfred and Petra. In his spare time he reads sci-fi/fantasy and has decided he does not like to read e-books—in his words, “Paper please!” He also reminisces that his best memory is starting the Nerd Herd for non-athletic students at the high school where he used to teach Chemistry. His ideal life would be spending all of his time with deep snow outside, a fireplace and a good book.

Length

06:00:00

Difficulty

Intermediate

Videos

23

Hands-on Labs

4

Course Details

Data abounds in our world. It is collected in various forms, in various ways, and at various levels of quality, and for a multitude of purposes, including collecting simply to collect for some undefined future use. Being able to access that data and make some sense of it is an important skill now and for the future.

Python is an excellent resource for exploring data and making it usable for a variety of purposes. Using Python and Jupyter Notebooks, we will look at accessing data in some common databases such as MySQL, PostgreSQL, and MongoDB. We will explore Excel operations: reading from and writing to Excel. We will also explore simple report formatting and data visualization.

Come on in and let's have some fun exploring these topics together.

Syllabus

Overview

About This Course

00:02:15

Lesson Description:

In this course, we will explore ways to get and use data with Python. In particular, we will be using Jupyter Notebooks because they excel both as a learning tool and as a data science/analysis tool for looking at data and generating reports. We will introduce Anaconda and the conda virtual environment system and take a close look at both pandas and matplotlib.

About the Author

00:00:43

Lesson Description:

I live in Everett, WA, with my fabulous husband Harley, Shiba Dexter, and two cats Alfred and Petra. In my spare time, I read sci-fi/fantasy and have decided I do not like to read e-books. “Paper please!” My best memory is starting the Nerd Herd for non-athletic students at the high school where I used to teach Chemistry. My ideal life would be spending all of my time with deep snow outside, a fireplace, and a good book.

Using Jupyter Notebooks

Installing Jupyter Notebooks, Opening a Notebook, and Setting the Kernel, Part 1

00:04:19

Lesson Description:

This lesson has been broken into three parts. This first part discusses what Jupyter Notebooks are and why we've chosen to use them. Using Jupyter Notebooks will aid in learning, retention, and recall. Students can download the Notebooks themselves or export them to PDF to have a record of the lesson and the work done as part of the lesson. Jupyter Notebooks are used in a variety of settings, including education, scientific computing, data analysis, and data science. There are many uses for Jupyter Notebooks, and having knowledge of them is useful throughout a career.

Installing Jupyter Notebooks, Opening a Notebook, and Setting the Kernel, Part 2

00:15:00

Lesson Description:

A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF:
Creating Your Playground Server
It is advised you start a Playground server to follow along with the lessons and practice on your own.
Create a Playground server
We are going to create a simple Ubuntu 18 server and install what we need as we go along. Choosing Ubuntu 18.04 Bionic Beaver LTS will install a basic Ubuntu 18 server.
Playground Server Settings
Choose the following settings when making your Playground server. You may choose a different tag if desired.
Distribution: Ubuntu 18.04 Bionic Beaver LTS
Zone: North America
Size: Micro
Tag: Using Python for Database Operations and Reporting
SSH into your Playground server
Once the server indicates it is ready, using your terminal:

ssh cloud_user@<the public IP address shown>
Type yes when asked to allow connection to the server, then enter <the temporary password shown>. You will be asked to change your password:
For Old Password, enter <the temporary password shown>
For New Password, enter your new password
Enter the New Password again
Don't forget your new password! You will use this new password when you log in to your Playground server.
Update Ubuntu
Let's make sure we have all security patches installed on the server:
  sudo apt update -y
  sudo apt upgrade -y
NOTE: Sometimes this results in an error when the server is newly created. The error refers to a lock on packages. The easiest way to correct this is to shut the server down and then restart it. After that, the server should upgrade correctly. Though rare, sometimes during the upgrade process an information screen may pop up; just press Enter/Return. Another information screen may pop up after it; press Space and then Return/Enter. We need to install one more package for our work. This package is needed later to turn our notebooks into PDFs.
sudo apt install -y texlive-xetex
Installing Anaconda
We are using Anaconda for this course. Anaconda is a data science and machine learning platform based on Python that can also be used to write standard Python apps. We will be using it because it makes the use of Jupyter Notebooks very easy. The work with Jupyter Notebooks could be done without it, but Project Jupyter suggests that using Anaconda is the best solution. To install, we first need to download the Anaconda package:
cd /tmp
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
We should check that the package we downloaded is not corrupt:
  sha256sum Anaconda3-2019.03-Linux-x86_64.sh
The result should be:
45c851b7497cc14d5ca060064394569f724b67d9b5f98a926ed49b834a6bb73a Anaconda3-2019.03-Linux-x86_64.sh
If the checksums do not match, the file may be corrupt in some way. Stop the installation of the Playground server and ask for help from Linux Academy before proceeding. If the checksums do match, begin the installation of Anaconda:
bash Anaconda3-2019.03-Linux-x86_64.sh
Near the beginning of the install you will get this message:
Welcome to Anaconda3 2019.03

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
After reading the license agreement, you will get the message:
Do you approve the license terms? [yes|no]
If you enter yes, this message will appear:
Anaconda3 will now be installed into this location:
/home/cloud_user/anaconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/cloud_user/anaconda3] >>>
Please make the appropriate selection. After installation is complete, this message will appear:
installation finished.
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /home/cloud_user/.bashrc ? [yes|no]
[no] >>> 
Please enter yes. This will update your ~/.bashrc file so that you can use conda on the command line. Move back to your home directory:
cd ~
We need to restart the shell to include the changes made:
source ~/.bashrc
You will see (base) prepended to your command line. This indicates that the base conda virtual environment has been activated. This will happen every time we start the terminal. We don't want that behavior; to turn it off:
conda config --set auto_activate_base false
Now let's update the conda base code:
conda update -n base -c defaults conda
Create a conda Environment for This Course
We want to create a conda virtual environment for this course using Python 3. I am naming this environment python_data_course using the -n flag; you may choose a different virtual environment name. python=3 tells the virtual environment to use the latest version of Python 3 installed in conda.
conda create -n python_data_course python=3
You will need to activate this virtual environment before you work on the code in the repo we are about to clone. To do so:
conda activate python_data_course
To leave the virtual environment:
conda deactivate
Please note: You will receive errors if you try to run the code without activating the virtual environment. If you do encounter errors when starting work, please check that you have activated the virtual environment.
Use the git Repo for the Course to Practice What We Do in Lessons
Now we will make a directory to hold the GitHub repo.

mkdir python_data_course
You can change the above name to something you like if desired. Move to the directory, activate the virtual environment (if it is not active; look for the parenthesized name prepended to your command line, and it does not hurt to run the activate command more than once), and install two packages we need.
cd python_data_course
conda activate python_data_course
conda install jupyter psutil
To clone the repo:
git clone https://github.com/linuxacademy/content-python-for-database-and-reporting.git .

Installing Jupyter Notebooks, Opening a Notebook, and Setting the Kernel, Part 3

00:06:00

Lesson Description:

A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF:
Connecting to the Jupyter Notebook Server
On the Playground Server
Make sure the virtual environment is activated! To activate the virtual environment:

conda activate python_data_course
To start the server, run the following:
python get_notebook_token.py
This is a simple script that starts the Jupyter Notebook server and sets it to continue running outside of the terminal:
nohup jupyter notebook --no-browser --port=8086
We've selected port 8086, but the actual port number is not important as long as we know what it is and it does not conflict with other running services. The script then searches the resulting text file nohup.out to find and print the token. The token is printed to the terminal; copy it and save it to a text file on the local machine. (A sketch of what such a script might look like appears at the end of this lesson.)
On the Local Machine
In a terminal window, enter the following:
ssh -N -L localhost:8087:localhost:8086 cloud_user@<the public IP address of the Playground server>
-N indicates there will be no remote commands. -L maps the local port to the remote port. We've selected local port 8087 and remote port 8086, but the actual port numbers are not important as long as we know what they are and they do not conflict with other running services. It will ask for a password; this is the password we use to log in to the Playground remote server. Leave this terminal open. It will appear that nothing has happened, but it must remain open while we use the Jupyter Notebook server in this session. In a browser, enter the following address: http://localhost:8087 This will open a Jupyter Notebook site that asks for the token copied from the remote server.
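The get_notebook_token.py script itself is not reproduced in the lesson text. Based on the description above, a rough sketch of what it might look like follows (hypothetical; the version in the course repo may differ):

# get_notebook_token.py - hypothetical sketch; the course repo's version may differ
import re
import subprocess
import time

# start the notebook server detached from the terminal; output goes to nohup.out
subprocess.Popen("nohup jupyter notebook --no-browser --port=8086 &", shell=True)

# give the server a moment to start and write its log
time.sleep(5)

# search the log for the access token and print it
with open("nohup.out") as log:
    match = re.search(r"token=(\w+)", log.read())
print(f"Token: {match.group(1)}" if match else "Token not found; check nohup.out.")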

Using and Evaluating Both Code and Markdown Cells, Part 1

00:08:54

Lesson Description:

In this lesson, we learn the basics of Jupyter Notebook: how to assign a kernel to the notebook, how to evaluate cells, and how to move cells around. We also learn a little bit of Markdown and practice getting around in this environment. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF:
Markdown Cell
It is designed to allow for text explanations. It can be used to create intricate layouts, including tables, with extra packages. It can create ordered and unordered lists. It can create code snippets. This is the *Markdown Guide*.

for i in range(5):
    print(i)
Before evaluating the cell, think about these two questions: What is going to be the result of evaluating this cell? What will be the output?
a = 'pears'
b = 'bananas'
x = 8
y = -23
There was no output because all the cell does is assign values to variables; there are no print statements or return statements.
print(f"a = {a}")
print(f"b = {b}")
print(f"x = {x}")
print(f"y = {y}")
print(f"x * a = {a*x}")
import math

def get_sqr_root(x):
    return print(f"The square root of {x} is {math.sqrt(x)}")
get_sqr_root(8)
Now we are going to examine the changes made in previous cells. In the cell where variables were assigned, change a = "apples" to a = "pears". Don't forget to re-evaluate the cell. Consider the following questions: Why were the cells after this cell not changed? What is the value of a? What will happen if we re-evaluate the cell where we multiply a and x, but not the cell where we print the value of a?
get_sqr_root(16)
get_sqr_root(100)
# imports and matplotlib setup
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# set x to be 0 to 4*pi in 0.1 increments
x = np.arange(0,4*np.pi,0.1)

# the sine and cosine values
y = np.sin(x)
z = np.cos(x)

# plot them
plt.plot(x,y,x,z)
plt.show()

Using and Evaluating Both Code and Markdown Cells, Part 2

00:08:28

Lesson Description:

Continued from Part 1. In this lesson, we continue learning the basics of Jupyter Notebook: assigning a kernel to the notebook, evaluating cells, and moving cells around, plus a little Markdown. The PDF and notebook content are the same as in Part 1 above.

pandas DataFrames in Jupyter Notebooks

Creating a pandas DataFrame and Examining Its Properties, Part 1

00:06:43

Lesson Description:

In this lesson, we explore pandas DataFrame objects. The information below is composed of markdown and code cells we can copy and paste into a Jupyter Notebook. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it's suggested to download the PDF, watch the entire lesson, and then complete the steps outside of the lesson. Of course, follow along as desired. This is the same information as in the PDF: pandas DataFrames - Examining Data (lesson_2_2_1). Import packages:

import pandas as pd
Creating a Basic DataFrame from Records
# define the data as a list
data = [
    ("Dexter","Johnsons","dog","shiba inu","red sesame",1.5,35,"m",False,"both",True),
    ("Alfred","Johnsons","cat","mix","tuxedo",4,12,"m",True,"indoor",True),
    ("Petra","Smith","cat","ragdoll","calico",6,None,"f",False,"both",True),
    ("Ava","Smith","dog","mix","blk/wht",12,32,"f",True,"both",False),
    ("Schroder","Brown","cat","mix","orange",13,15,"m",False,"indoor",True),
    ("Blackbeard","Brown","bird","parrot","multi",5,3,"f",False,"indoor",),
]

# define the labels
labels = ["name","owner","type","breed","color","age","weight","gender","health issues","indoor/outboor","vaccinated"]

# create dataframe
vet_records = pd.DataFrame.from_records(data, columns=labels)
Examining the Data in a DataFrame
There are several different ways to examine data using a pandas DataFrame. Two are .head() and .tail(), which show the first five and the last five rows of the DataFrame, respectively.
# displays the first five rows in the dataframe
vet_records.head()
# displays the last five rows in the dataframe
vet_records.tail()
# displays all the records of the dataframe
vet_records
.dtypes shows the types of data in the DataFrame by column. If the dtype is object, this indicates that pandas is seeing that data as more than one type.
# object means a mixed type column
vet_records.dtypes
Notice that all the string columns are listed as object. This is because a fixed string dtype takes a maximum length argument, so when importing from CSV, strings are stored as object so they can be variable length.
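As a small extra sketch (not in the lesson PDF): if a string column holds a small set of repeated values, it can be converted to pandas' category dtype, which is one way to give an object column a concrete type:

# convert the type column from object to pandas' category dtype
vet_records["type"] = vet_records["type"].astype("category")
vet_records.dtypes  # type is now listed as category rather than object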

Creating a pandas DataFrame and Examining Its Properties, Part 2

00:07:16

Lesson Description:

In this lesson, we continue to explore pandas DataFrame objects. The information below is composed of markdown and code cells that can be copied and pasted into a Jupyter Notebook to follow along. .describe() shows summary statistics for the columns on which those operations can be performed.

# `.describe` shows statistical information on columns that the operations can be performed on
vet_records.describe()
# to show all columns in `.describe` add `include="all"`
vet_records.describe(include="all")
.at allows the user to change the value of a specific cell.
# change a specific value with `.at`
vet_records.at[0, "weight"] = 34.7 
# notice the weight was changed for Dexter
vet_records
.assign is used to add another column of data.
# we are going to add the ratio age:weight as a column to the dataframe
# notice that this method iterates through the dataframe
vet_records = vet_records.assign(age_weight=(vet_records['age']/vet_records['weight']))
# review the new dataframe
vet_records

Slicing and Other Operations on pandas DataFrames, Part 1

00:06:11

Lesson Description:

In this lesson, we learn about some of the operations we can use to manipulate data in a pandas DataFrame. The information below is composed of markdown and code cells that can be copied and pasted into a Jupyter Notebook to follow along. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but we suggest downloading the PDF, watching through the entire lesson, and then completing the steps outside of the lesson. Of course, follow along as desired. This is the same information as in the PDF: pandas DataFrames - Slicing and Filtering (lesson_2_2_2). We will use the same DataFrame as last lesson. Import packages:

import pandas as pd
Creating a Basic DataFrame from Records
# define the data as a list
data = [
    ("Dexter","Johnsons","dog","shiba inu","red sesame",1.5,35,"m",False,"both",True),
    ("Alfred","Johnsons","cat","mix","tuxedo",4,12,"m",True,"indoor",True),
    ("Petra","Smith","cat","ragdoll","calico",None,10,"f",False,"both",True),
    ("Ava","Smith","dog","mix","blk/wht",12,32,"f",True,"both",False),
    ("Schroder","Brown","cat","mix","orange",13,15,"m",False,"indoor",True),
    ("Blackbeard","Brown","bird","parrot","multi",5,3,"f",False,"indoor",),
]

# define the labels
labels = ["name","owner","type","breed","color","age","weight","gender","health issues","indoor/outboor","vaccinated"]

# create dataframe
vet_records = pd.DataFrame.from_records(data, columns=labels)
A Note of Caution
Changes and updates to a DataFrame are only permanent if saved to the DataFrame. So, for example, we might say vet_records = ... to permanently change the DataFrame vet_records. In many cases, keeping a reference DataFrame is a good practice. For example, vet_records_dogs = vet_records[vet_records.type=="dog"] instead of vet_records = vet_records[vet_records.type=="dog"]. This will leave a DataFrame to reference that contains the unadulterated data.
Grouping and Counting Data
Using counting and grouping can help provide a better grasp of the data.
# How many types of pets do we have?
vet_records.type.count()
vet_records.groupby('type').count()
vet_records.type.value_counts()
Slicing (Filtering) Data
Slicing data (i.e., picking parts of the data we want to use for a specific purpose) is easy with pandas once we have the concepts down. Here we slice the data to get only the weight column.
# Create a pandas series from the dataframe
weight = vet_records['weight']
weight
Notice that vet_records was not changed.
vet_records.head()
While weight does show us all the weights for the animals in the DataFrame, unless we are interested in straight weight values for some calculation, it is not very useful data. A list of numbers by themselves is usually not data that can be used. So, instead let's get all the dog weights.
# Collect the dog weights only using a boolean filter
dog_weight = vet_records.weight[vet_records.type=='dog']
dog_weight
While this is still only a list of values, at least we know these are the weights of all the dogs in the sample because of the variable name. A better way might be to just slice all the dog data.
dogs = vet_records[vet_records.type=='dog']
dogs
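As a small extra sketch (not in the lesson PDF), boolean filters can also be combined; here we select only the dogs over 30 pounds by joining two conditions with &:

# combine two boolean filters; each condition needs its own parentheses
big_dogs = vet_records[(vet_records.type == 'dog') & (vet_records.weight > 30)]
big_dogs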

Slicing and Other Operations on pandas DataFrames, Part 2

00:05:15

Lesson Description:

Continued from Part 1. The PDF and notebook content (pandas DataFrames - Slicing and Filtering, lesson_2_2_2) are the same as in Part 1 above.

Slicing and Other Operations on pandas DataFrames, Part 3

00:07:05

Lesson Description:

Continued from Part 2. The PDF and notebook content (pandas DataFrames - Slicing and Filtering, lesson_2_2_2) are the same as in Part 1 above.

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, without any extra charge or accounts to manage.

00:45:00

SQL Python Database Packages

PEP 249 and Python Database Packages

00:06:16

Lesson Description:

In this lesson, we consider PEP 249 and how it helps in working with databases using Python. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF:
PEP 249
The PEP 249 documentation can be found here. It provides a blueprint for developers to make database connection packages that are similar. Since the packages are similar, it is easy to learn one for the database you are currently using and transfer that knowledge to any future database you may encounter.
Connection Objects
A connection object is usually generated with <package_name>.connect(host, dbname, user, password, port, driver, character set, ...). The connection object does not require all of the above parameters, but the connection to your db might.
Common Connection Object Methods
close() - closes the connection to the database
commit() - commits any pending transactions; should be used before calling close()
cursor() - creates a cursor object for the database
Cursor Objects
A cursor object is used to manage a database operation. All the cursor methods except close take a SQL statement as the operation.
Commonly Used Cursor Methods
close() - closes the cursor
execute() - prepares and executes a database operation
fetchone() - gets the next row of a query result from the db
fetchmany() - gets the next set of rows of a query result from the db
fetchall() - gets all the (remaining) rows of a query result from the db
Database Packages: Python Drivers for Some Common Databases
These can be pip installed as needed.
SQL Server - pyodbc
PostgreSQL - psycopg2
MySQL - mysql-connector
MongoDB - pymongo
Others - search the web for "python driver for <database>"
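Because the DB-API gives every driver the same shape, the whole connect/cursor/execute/fetch cycle can be shown with sqlite3 from the standard library, which is also PEP 249-compliant. This is a minimal sketch using an in-memory database, not one of the servers from this course:

import sqlite3

# connect() returns a connection object; accepted parameters vary by driver
cxn = sqlite3.connect(":memory:")

# cursor() creates the cursor object used to run operations
cur = cxn.cursor()
cur.execute("CREATE TABLE pets (name text, type text)")
cur.execute("INSERT INTO pets VALUES ('Dexter', 'dog')")
cxn.commit()           # commit pending transactions before closing

cur.execute("SELECT * FROM pets")
print(cur.fetchall())  # [('Dexter', 'dog')]

cur.close()
cxn.close()            # always close the connection when finished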

Connecting to and Operations on a Database, Part 1

00:04:43

Lesson Description:

In this lesson, we use PostgreSQL as an example database for connecting to and exploring operations on a database. Because of PEP 249, what we learn from one database is transferrable to another, which is what makes using Python with databases so easy. This lesson is broken into three parts:
Part 1 - Installing PostgreSQL on the Playground Server
Part 2 - Using psycopg2 for Common Operations
Part 3 - Using pandas to Read Data From a Database
A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF:
Install PostgreSQL. At the command line (the first command will require your password):
sudo apt update
sudo apt install postgresql postgresql-contrib
Gain Access to the psql Command Line
psql is the interactive terminal for working with PostgreSQL. At the command line:
sudo -u postgres psql
You are now logged in as the "postgres" superuser.
Create User, Database, and Grant Access
Create the database:
CREATE DATABASE cloud_user;
Create the user:
CREATE USER cloud_user WITH ENCRYPTED PASSWORD 'cloud_user';
Grant the user access to the database:
GRANT ALL PRIVILEGES ON DATABASE cloud_user TO cloud_user;
You now have a database named cloud_user that you can access as the user cloud_user.
Leave psql:
\q
Install the PostgreSQL driver to your virtual environment:
conda activate python_data_course
conda install psycopg2
Start and connect to the Jupyter Notebook server as usual.
Server Operations Using Python's psycopg2

import pandas as pd
import psycopg2

CONNECT_DB = "host=localhost port=5432 dbname=cloud_user user=cloud_user password=cloud_user"
A Word About Database Connections
In the cell below, you will see I used a try/except/finally block. There are a couple of main reasons, and I thought it important to call them out.
Exceptions: Exceptions can occur when trying to connect to a database like PostgreSQL. Maybe the server is down, or perhaps it has already exceeded the maximum number of connections. It is important to catch those errors and report them to the user (even if that is only you).
Connections: There are a limited number of connections a database server can accept. While this is a rather large number, it is possible to reach it and be refused a connection. Be a good neighbor: only open a connection for an operation and then close it. Don't open a connection and leave it open while you review the data you have retrieved.
Create Table
create_table_query = '''CREATE TABLE tips (
    ID SERIAL PRIMARY KEY,
    weekday varchar (10),
    meal_type varchar (10),
    wait_staff varchar (10),
    party_size smallint,
    meal_total float4,
    tip float4
); '''

try:
    # Make connection to db
    cxn = psycopg2.connect(CONNECT_DB)

    # Create a cursor to db
    cur = cxn.cursor()

    # Send sql query to request
    cur.execute(create_table_query)
    records = cxn.commit()

except (Exception, psycopg2.Error) as error :
    print ("Error while connecting to PostgreSQL", error)

finally:
    #closing database connection.
    if(cxn):
        cur.close()
        cxn.close()
        print("PostgreSQL connection is closed")

print(f'Records:\n {records}')
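As a side note (not in the lesson PDF): psycopg2 2.5 and later also lets the connection and cursor act as context managers, which commits on success and rolls back on an exception. Leaving the with block does not close the connection itself, so an explicit close() is still needed. A sketch of the same create-table step:

cxn = psycopg2.connect(CONNECT_DB)
try:
    with cxn:                      # commits on success, rolls back on error
        with cxn.cursor() as cur:  # cursor is closed on leaving the block
            cur.execute(create_table_query)
finally:
    cxn.close()                    # the with block does NOT close the connection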
Add the Data to Table
try:
    # Make connection to db
    cxn = psycopg2.connect(CONNECT_DB)

    # Create a cursor to db
    cur = cxn.cursor()

    with open('./tips.csv', 'r') as f:
        # skip first row, header row
        next(f)
        cur.copy_from(f, 'tips', sep=",")
        cxn.commit()

except (Exception, psycopg2.Error) as error :
    print ("Error while connecting to PostgreSQL", error)

finally:
    #closing database connection.
    if(cxn):
        cur.close()
        cxn.close()
        print("PostgreSQL connection is closed")
        print("tips table populated")
Selecting Data From a Server
Use .fetchall() with LIMIT or TOP (#). LIMIT works for most databases but does not work with SQL Server; TOP (#) is used in place of LIMIT on SQL Server.
def db_server_fetch(sql_query):
    try:
        # Make connection to db
        cxn = psycopg2.connect(CONNECT_DB)

        # Create a cursor to db
        cur = cxn.cursor()

        # Send sql query to request
        cur.execute(sql_query)
        records = cur.fetchall()

    except (Exception, psycopg2.Error) as error :
        print ("Error while connecting to PostgreSQL", error)

    finally:
        #closing database connection.
        if(cxn):
            cur.close()
            cxn.close()
            print("PostgreSQL connection is closed")
        return records
Test that the table is populated by selecting the first five rows.
select_query = '''SELECT * FROM tips LIMIT 5;'''

records = db_server_fetch(select_query)
print(records)
def db_server_change(sql_query):
    try:
        # Make connection to db
        cxn = psycopg2.connect(CONNECT_DB)

        # Create a cursor to db
        cur = cxn.cursor()

        # Send sql query to request
        cur.execute(sql_query)
        records = cxn.commit()

    except (Exception, psycopg2.Error) as error :
        print ("Error while connecting to PostgreSQL", error)

    finally:
        #closing database connection.
        if(cxn):
            cur.close()
            cxn.close()
            print("PostgreSQL connection is closed")
        return records
Add a new record with the following data: On Saturday, new wait staff Alfred had one person at Breakfast for 10.76 and received a 0.50 tip.
add_data = '''INSERT INTO tips
    (id, weekday, meal_type, wait_staff, party_size, meal_total, tip)
    VALUES
    (504, 'Saturday', 'Breakfast', 'Alfred', 1, 10.76, 0.50);'''

db_server_change(add_data)
Make a SELECT Request to Get New Records
select_query = '''SELECT * FROM tips WHERE wait_staff='Alfred';'''

records = db_server_fetch(select_query)
print(records)
Accessing a SQL Database with pandas
pandas.read_sql() - loads data from a database
pandas.to_sql() - writes data to a database
CAUTION: Please don't write to a database unless you know what you are doing and are authorized to do so. If you are not, your permissions should allow read-only access.
def pandas_db_server_fetch(sql_query):
    try:
        # Make connection to db
        cxn = psycopg2.connect(CONNECT_DB)

        # Send sql query to request and create dataframe
        df = pd.read_sql(sql_query, cxn)

    except (Exception, psycopg2.Error) as error :
        print ("Error while connecting to PostgreSQL", error)

    finally:
        #closing database connection.
        if(cxn):
            cxn.close()
            print("PostgreSQL connection is closed")
        return df
select_query = '''SELECT * FROM tips WHERE wait_staff='Alfred';'''

alfred_df = pandas_db_server_fetch(select_query)
alfred_df.head()
tips_df = pandas_db_server_fetch('''SELECT * FROM tips;''')
tips_df.head()
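to_sql() is not demonstrated in the lesson. As a hedged sketch that respects the caution above, here is the idea using a local SQLite file instead of the course's PostgreSQL server, so nothing on a shared server is modified:

import sqlite3

# write the tips DataFrame to a local SQLite database file
local = sqlite3.connect("tips_local.db")
tips_df.to_sql("tips", local, if_exists="replace", index=False)

# read it back to confirm the round trip
print(pd.read_sql("SELECT * FROM tips LIMIT 5;", local))
local.close()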

Connecting to and Operations on a Database, Part 2

00:10:57

Lesson Description:

Continued from Part 1. The PDF and notebook content are the same as in Part 1 above.

Connecting to and Operations on a Database, Part 3

00:01:59

Lesson Description:

Continued from Part 2. The PDF and notebook content are the same as in Part 1 above.

Hands-on Labs are real, live environments that put you in a real scenario to practice what you have learned, without any extra charge or accounts to manage.

01:00:00

MongoDB Package: pymongo

Using Python MongoDB Package - pymongo

00:09:56

Lesson Description:

In this lesson, we learn how to access a MongoDB database and perform operations on it using Python and the PyMongo package. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but we suggest downloading the PDF, watching through the entire lesson, and then completing the steps outside of the lesson. Of course, feel free to use it as desired. This is the same information as in the PDF:
Installing MongoDB on the Playground Server
Information on installation is taken from the MongoDB docs. Information on remote access can be found at How to connect to your remote MongoDB server.
Import the GPG key from MongoDB:
wget -qO - https://www.mongodb.org/static/pgp/server-4.2.asc | sudo apt-key add -
Create a list file for MongoDB:
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.2 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.2.list
Reload the package database:
sudo apt update
Install MongoDB:
sudo apt install -y mongodb-org
Start the MongoDB service:
sudo systemctl start mongod
This starts the MongoDB service as needed. To have the MongoDB service start every time we start the server:
sudo systemctl enable mongod
We only start the service when necessary, but this is an option.
Create User, Database, and Grant Access
Create User
At the terminal:
mongo
Now at the mongo shell:
use cloud_user

db.createUser({
    user: 'cloud_user',
    pwd: 'cloud_user',
    roles: [{ role: 'readWrite', db:'cloud_user'}]
})
Leave the mongo shell with CTRL-C.
Create Collection from File
A special thanks to MongoDB for the zips dataset. Import the collection:
mongoimport -v --db=cloud_user --file=/home/cloud_user/python_data_course/data/zips.json
Install the MongoDB Driver to Our Virtual Environment
conda activate python_data_course
conda install pymongo
Start and Connect to the Jupyter Notebook Server as Usual
Using Python to Connect to the Server
Imports and DB Connection
from pymongo import MongoClient
import pandas as pd

client = MongoClient("mongodb://cloud_user:cloud_user@localhost:27017/cloud_user")
df = pd.DataFrame.from_records(client.cloud_user.zips.find())
df.head()
oz_data = {"_id": 99990, "city": "EMERALD", "loc": [-510.9, -600.89], "pop": 564372, "state": "Munchkin Land"}
client.cloud_user.zips.insert_one(oz_data)
df1 = pd.DataFrame.from_records(client.cloud_user.zips.find())
df1.tail()
df.count()
df1.count()
client.close()
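As a small extra sketch (not in the lesson PDF), find() also accepts a filter document; the field names here come from the zips dataset imported above:

# re-open a connection and pull only the Massachusetts zip codes into a DataFrame
client = MongoClient("mongodb://cloud_user:cloud_user@localhost:27017/cloud_user")
ma_df = pd.DataFrame.from_records(client.cloud_user.zips.find({"state": "MA"}))
print(ma_df.head())
client.close()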

Delimited Data Text Files

What Are Delimited Data Files

00:04:33

Lesson Description:

In this lesson, we learn what delimited data text files are. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but we suggest downloading the PDF, watching through the entire lesson, and then completing the steps outside of the lesson. Of course, use it to follow along as desired. This is the same information as in the PDF:
Delimited Data
Repeated value types per record are separated by a delimiter, and records are separated by a '\n'. The values can be determined based on the delimiter: each time a delimiter character is found, it signals the end of one value, and the next character is the start of a new value.
Common Types of Delimiters
The most common delimiter is the comma (','). Essentially anything can be used: ';', '\t', or a fixed size or spacing (e.g., every 10 characters is a new value), though this is uncommon.
Example of a CSV File (Comma-Separated Values)

'color1', 'color2', 'color3'
'red', 'green', 'blue'
'blue', 'pink', 'green'
'purple', 'egg white', 'beige'
What Are Non-Delimited Data Text Files?
Non-Delimited Data
Non-delimited data is rare. It is also impossible to predict. Review the data in the file and see if a pattern can be discerned. If so, it may be possible to write a Python script that will parse the data into a CSV file, as sketched after the example below.
('color1', 'color2', 'color3'), 'red', 'green', 'blue', [('blue', 'pink'), 'green'] 'purple' 'egg white' 'beige'
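As a hedged sketch of that idea: once a pattern has been worked out, a few lines of Python can normalize the file into CSV. Here we assume a simple fixed-width layout (every 10 characters is one value, as mentioned above) with hypothetical file names, rather than the irregular example shown:

import csv

# hypothetical fixed-width input: every 10 characters is one value
with open("fixed_width.txt") as src, open("parsed.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        stripped = line.rstrip("\n")
        # split the line into 10-character fields and trim the padding
        row = [stripped[i:i + 10].strip() for i in range(0, len(stripped), 10)]
        writer.writerow(row)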

Reading and Writing CSV Files as pandas DataFrames

00:04:28

Lesson Description:

In this lesson, we learn how to read and write CSV files using pandas. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but we suggest downloading the PDF, watching through the entire lesson, and then completing the steps outside of the lesson. Of course, use it to follow along as desired. This is the same information as in the PDF:
Reading CSV Files with pandas
Documentation for pandas.read_csv(). pandas has a function named read_csv() that will read a CSV file directly into a DataFrame. This function takes the following arguments:
filepath
sep=',': This has a default of ',', but we encourage setting it explicitly so that when we encounter a file delimited with something else, we know how to set it.
header=0: This tells it to treat the first line as the column names. Refer to the documentation for other options.
There are other arguments to skip blank lines, set encoding, and provide column names. Please refer to the documentation for more information.
Read tips.csv into a DataFrame

import pandas as pd

tips_df = pd.read_csv('./tips.csv', sep=',', header=0)
tips_df.head()
Add the tip_percent Column and Calculation to the DataFrame
percent_tip = pd.Series(tips_df['tip']/tips_df['meal_total'], name='tip_percent')
tips_df = pd.concat([tips_df, percent_tip], axis=1)
tips_df.head()
Writing CSV Files with pandas
Documentation for pandas.to_csv(). pandas has a method named to_csv() that will write a DataFrame to a CSV file. This method takes the following arguments:
filepath
sep=',': This has a default of ',', but we encourage its use.
header=True: This tells it to write the column names as the first line.
index=False: Prevents index values from being written to the file.
There are other arguments. Please refer to the documentation for more information.
Write tips_df to tips_percent.csv
tips_df.to_csv('tips_percent.csv', sep=',', header=True, index=False)
Check That the File Was Written Correctly by Reading It into a New DataFrame
tips_df_from_file = pd.read_csv('./tips_percent.csv', sep=',', header=0)
tips_df_from_file.head()
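The same call handles other delimiters by changing sep; for example, for a hypothetical tab-delimited copy of the data (not part of the course files):

# read a tab-delimited file by changing the sep argument
tips_tsv_df = pd.read_csv('./tips.tsv', sep='\t', header=0)
tips_tsv_df.head()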

Excel Data Files

Using pandas DataFrames with Excel

00:07:17

Lesson Description:

In this lesson, we use pandas to read and write Excel spreadsheets. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF:
pandas and Excel
pandas can read and write Excel files. pandas uses openpyxl; to install it:
conda install openpyxl xlrd
pandas.to_excel() documentation
pandas.read_excel() documentation
Read tips.xlsx as a DataFrame

import pandas as pd

tips_df = pd.read_excel('tips.xlsx', index_col=0)  

tips_df.head()
Create a separate df for each meal type
breakfast_df = tips_df[tips_df.meal_type=='Breakfast']
lunch_df = tips_df[tips_df.meal_type=='Lunch']
dinner_df = tips_df[tips_df.meal_type=='Dinner']
Use pd.to_excel() to create an Excel workbook with the breakfast data
breakfast_df.to_excel("breakfast_tips.xlsx")  
Test the file is created and has data
breakfast_tips_df = pd.read_excel('breakfast_tips.xlsx', index_col=0)  

breakfast_tips_df.head()
Write an Excel file meal_type_tips.xlsx with a worksheet for each meal type and one for the original data. NOTE: It is advised to keep a copy of the original data, so I suggest you always save to a new file. To write to separate worksheets, we will use ExcelWriter with .to_excel().
with pd.ExcelWriter('meal_type_tips.xlsx') as writer:  
    breakfast_df.to_excel(writer, sheet_name='breakfast')
    lunch_df.to_excel(writer, sheet_name='lunch')
    dinner_df.to_excel(writer, sheet_name='dinner')
    tips_df.to_excel(writer, sheet_name='tips_orig')
To read all sheets in as an ordered dict of DataFrames keyed by sheet name, pass sheet_name=None:
meal_type_tips_df = pd.read_excel('meal_type_tips.xlsx', sheet_name=None)  

meal_type_tips_df.keys()
meal_type_tips_df['breakfast']
To read each sheet into a dataframe separately, use the sheet_name argument.
breakfast_tips_df = pd.read_excel('meal_type_tips.xlsx', sheet_name='breakfast')  

breakfast_tips_df.head()
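As an alternative sketch (not covered in the lesson), pd.ExcelFile can list a workbook's sheet names before deciding which sheets to load:

# list the workbook's sheet names, then parse a single sheet on demand
xl = pd.ExcelFile('meal_type_tips.xlsx')
xl.sheet_names
breakfast_again_df = xl.parse('breakfast', index_col=0)
breakfast_again_df.head()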

Hands-on Labs are real live environments that put you in a real scenario to practice what you have learned without any extra charge or account to manage.

00:30:00

Writing the Report

Brief Overview of Adding LaTeX to a Report

00:05:56

Lesson Description:

In this lesson, we learn how to use LaTeX to make professional-looking mathematical formulas in our reports. A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF: LaTeX LaTeX is a document preparation system that is designed to ease the creation of technical and scientific documents. It is really good at displaying mathematical formulas and is built in to Jupyter Notebooks. Depending on your use case, it may be something you need. In this lesson, we are going to cover the basic use of LaTeX to add mathematical formulas to your report. You can easily find many examples of how to use LaTeX on the web; just search for "latex matrices" or whatever subject you need. LaTeX Documentation Inline LaTeX Equations LaTeX formulas are written in a Markdown cell. Inline formulas are those that are within a sentence. The Pythagorean theorem is a^2 + b^2 = c^2. While the above statement is factual, it doesn't look nice for a report or publication. The Pythagorean theorem is $a^2 + b^2 = c^2$. Created with $a^2 + b^2 = c^2$. This looks much better and is generated by placing $ at the beginning and the end of the equation. Inline formulas can be designated with $. LaTeX Equations Not Inline LaTeX formulas that are not inline are denoted with a $$ at the start and end of the formula. By default, the formulas will be centered. Greek characters are denoted by a backslash and the Greek letter; for example, \beta results in $\beta$. To create a matrix, use \begin{matrix} and end with \end{matrix}.

$$
\begin{bmatrix}
\alpha & \beta^{*} \\
\gamma^{*} & \delta
\end{bmatrix}
$$
becomes: $$ \begin{bmatrix} \alpha & \beta^{*} \\ \gamma^{*} & \delta \end{bmatrix} $$ Here is one last example demonstrating a summation and a fraction.
$$
\sum_{i=1}^{n} i = \frac{n(n+1)}{2}
$$
Becomes: $$ \sum_{i=1}^{n} i = \frac{n(n+1)}{2} $$
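As one more sketch you can paste into a Markdown cell, the quadratic formula combines a fraction, a square root, and superscripts:

$$
x = \frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}
$$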

Good Charting Practices and Review of a Complete Report

00:17:36

Lesson Description:

A PDF is provided here. This PDF documents the work we will be doing during this lesson. It can be used in a variety of ways, but it is suggested that you download the PDF, watch through the entire lesson, and then complete the steps on your own outside of the lesson. Of course, you can use it to follow along with the lesson as you wish. This is the same information as in the PDF: Install matplotlib into the conda virtual environment first. Predicting Suicide Rates From Human Development Index Suicide is a complex social issue that touches many families and individuals. Individual opinions on suicide are varied. Reasons for committing or considering suicide are even more varied. We all have those moments where we say to ourselves, "I should just kill myself." These are not truly suicidal thoughts, just like "I could kill him" is not truly a call for murder. Those contemplating suicide may show certain behaviors that are often not recognized until after the suicide; others show no signs. The reasons for suicide can only be guessed at by those left behind from a note, behaviors, or other artifacts. Others are well documented or can be decidedly clear. Terry Pratchett Suicide Documentary; Other Documentaries. I am not attempting to sway opinions on suicide, just to gain a better understanding of contributing factors when the reasons seem unclear. Questions This data set does not allow us to come any closer to an answer for the questions indicated in the above statements, but it does allow us to consider some aspects of socioeconomic factors that may impact it. What factors lead to high suicide rates across the world? Why do some countries have higher suicide rates than others? Do the rates vary by age group and gender and by country? What other analysis could be performed on this data, or what other information might make this a more valuable dataset? Dataset The dataset was obtained from Kaggle on 2019-May-20 and was last updated on 2018-Dec-01. The dataset was created by Rusty and was developed from: United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506; World Bank. (2018). World development indicators: GDP (current US$) by country: 1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#; [Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook; World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/. Initial Examination of Dataset

# imports needed for analyzing data
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random

%matplotlib inline
# read the suicide dataset into pandas
df = pd.read_csv('./master.csv')
# review data provided by the dataset
df.head()
The dataset shows redundant information in some cases and data that may be subject to interpretation in others. Specifically, 'country-year' is just a repeat of 'country' and 'year'. Also, 'generation' may have categories that are not well-defined and subject to interpretation to determine the value. I will be using 'country', 'year', 'suicides/100k pop', 'age', 'sex', and 'HDI for year'. HDI (Human Development Index) is an index that is calculated for a country based on life expectancy, attainable levels of education, and income. It is important to note that the calculation of HDI changed in 2010. While this would have to be accounted for in a more thorough study, for this instance, I will only be using the HDI for the year 2010. It is an assumption that the HDI was approximately the same for previous years and may represent a slight bias. suicides/100k pop is a normalized rate of suicide determined by: $$ \frac{suicides}{100k\ pop} = \frac{suicides\_no}{\frac{population}{100000}} $$ Examination of ranges of parameters being used
# review parameters of data that is of interest in this examination
print(f'Genders: {df.sex.unique()}\n')
print(f'Age groups: {df.age.unique()}\n')
print(f'Data year range: {df.year.min()} - {df.year.max()}\n')
print('How many countries are reported in each year; is the data normalizable by country?')
number_countries_year = df['country'].groupby(df['year']).unique()
for i in number_countries_year.keys():
    print(f'{i}: {len(number_countries_year[i])}')
It appears that the data is not complete for all years; some years have more countries than others. This makes a full examination of the data across all countries impossible, so I will be narrowing what I look at throughout the project. The decade years 1990, 2000, and 2010 will be the years included. There is no specific reason I picked those years for this report, just a draw to the decade years. A straightforward comparison of suicide rates per year by gender
# create dataframe for male and female
df_male = df[df['sex']=='male'].copy()
df_female = df[df['sex']=='female'].copy()

# Comparison on same graph
suicides_by_male = df_male['suicides/100k pop'].groupby(df['year']).mean().reset_index(level=0)
suicides_by_female = df_female['suicides/100k pop'].groupby(df['year']).mean().reset_index(level=0)
plt.figure(figsize=(33, 5))
plt.subplot(1,2,1)
plt.plot(suicides_by_male['year'], suicides_by_male['suicides/100k pop'], color='red', alpha=0.5, label='male')
plt.plot(suicides_by_female['year'], suicides_by_female['suicides/100k pop'], color='blue', alpha=0.5, label='female')
plt.title(f'Male and Female')
plt.xlabel('year')
plt.xticks(rotation=90)
ax = plt.gca()
ax.set_ylim([0, 30])
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper right')
plt.show()

# group suicide rates by year and plot
suicides_by_male = df_male['suicides/100k pop'].groupby(df['year']).mean().reset_index(level=0)
suicides_by_female = df_female['suicides/100k pop'].groupby(df['year']).mean().reset_index(level=0)
plt.figure(figsize=(15, 5))
plt.subplot(1,2,1)
plt.plot(suicides_by_male['year'], suicides_by_male['suicides/100k pop'], color='purple', alpha=0.5)
plt.title(f'Male')
plt.xlabel('year')
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')

# plt.figure(figsize=(15, 5))
plt.subplot(1,2,2)
plt.plot(suicides_by_female['year'], suicides_by_female['suicides/100k pop'], color='purple', alpha=0.5)
plt.title(f'Female')
plt.xlabel('year')
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')

plt.show()
Female suicide rates are much lower than male suicide rates. Both show a peak at about 1995 and have been declining since to a relatively low rate. There is an uptick that appears around 2016, the last year in the dataset. However, this is a possible outlier, since there are only 16 countries in the dataset for 2016. The countries for 2016 are predominantly in the area commonly known as Eastern Europe. Later, this notebook will supply evidence that these countries tend to have a high suicide rate. Examine HDI data As noted above, I decided to only use the HDI data from 2010 since that is the year a new calculation to determine this value was put into place. If you would like to look at HDI from other years, you can do so by changing the value in the variable below, year_of_interest.
# pick the year you are interested in looking at
year_of_interest = 2010

# get the data
df_year_interest = df[df['year']==year_of_interest].copy()
# There are null values in the column 'HDI for year'; these can be safely ignored
missing_values = df_year_interest['HDI for year'].isnull()
# isnull() returns true for null values; negate the mask to keep actual values
hdi = df_year_interest['HDI for year'][~missing_values]

plt.figure(figsize=(15, 5))
plt.hist(hdi, facecolor='orange', alpha=0.5)
# add mean
plt.axvline(hdi.mean(), color='b', linestyle='solid', linewidth=2, label='mean')
# +1 sd
plt.axvline(hdi.mean() + hdi.std(), color='b', linestyle='dashed', linewidth=2, label='+1 standard deviation')
# -1 sd
plt.axvline(hdi.mean() - hdi.std(), color='b', linestyle='dashed', linewidth=2, label='-1 standard deviation')

plt.title(f'HDI for {year_of_interest}')
plt.xlabel('HDI value')
plt.ylabel('count of specific HDI values')
plt.legend(loc='upper left')
plt.show()
hdi_describe = hdi.describe()
print(f'Stats for HDI for {year_of_interest}:\n {hdi_describe}')
A thorough examination of the HDI effect on the suicide rate would be an exhaustive process and outside the scope of this study. As a result, I have decided to only look at one country from each of the four quartiles shown above and see if any conclusions can be drawn. Since there is no guarantee that a country will have an HDI exactly that of a random HDI selection, I have determined to use a range around the mean of the quartile. After calculating the mean for the quartile, I then find an upper and lower bound by adding or subtracting a small percentage from the mean. This percentage can be changed to explore its effect; changes should be made to quartile_range below. Using this method results in a list of countries for each quartile. To make sure I do not select a country based on bias, I choose the country to use by generating a random integer that is used as an index into the list. I did not set a random seed, so multiple runs result in different countries for an easy observational study of the data.
# set the range desired for the quartile mean
quartile_range = 0.05
qtr_1_mean = (hdi_describe['min'] + hdi_describe['25%']) / 2 
qtr_2_mean = (hdi_describe['25%'] + hdi_describe['50%']) / 2
qtr_3_mean = (hdi_describe['50%'] + hdi_describe['75%']) / 2
qtr_4_mean = (hdi_describe['75%'] + hdi_describe['max']) / 2

print(f'Randomly selected countries:')
# qtr 1 df country
qtr_1_bool_low = df['HDI for year'] > (qtr_1_mean - (quartile_range * qtr_1_mean))
qtr_1_bool_high = df['HDI for year'] < (qtr_1_mean + (quartile_range * qtr_1_mean))
qtr_1_df = df[qtr_1_bool_low & qtr_1_bool_high]
qtr_1_countries = qtr_1_df['country'].unique()
qtr_1_country = qtr_1_countries[random.randint(0, len(qtr_1_countries) - 1)]
print(f'qtr_1_country: {qtr_1_country}')

# qtr 2 df country
qtr_2_bool_low = df['HDI for year'] > (qtr_2_mean - (quartile_range * qtr_2_mean))
qtr_2_bool_high = df['HDI for year'] < (qtr_2_mean + (quartile_range * qtr_2_mean))
qtr_2_df = df[qtr_2_bool_low & qtr_2_bool_high]
qtr_2_countries = qtr_2_df['country'].unique()
qtr_2_country = qtr_2_countries[random.randint(0, len(qtr_2_countries) - 1)]
print(f'qtr_2_country: {qtr_2_country}')

# qtr 3 df country
qtr_3_bool_lo = df['HDI for year'] > (qtr_3_mean - (quartile_range * qtr_3_mean))
qtr_3_bool_hi = df['HDI for year'] < (qtr_3_mean + (quartile_range * qtr_3_mean))
qtr_3_df = df[qtr_3_bool_lo & qtr_3_bool_hi]
qtr_3_countries = qtr_3_df['country'].unique()
qtr_3_country = qtr_3_countries[random.randint(0, len(qtr_3_countries) - 1)]
print(f'qtr_3_country: {qtr_3_country}')

# qtr 4 df country
qtr_4_bool_lo = df['HDI for year'] > (qtr_4_mean - (quartile_range * qtr_4_mean))
qtr_4_bool_hi = df['HDI for year'] < (qtr_4_mean + (quartile_range * qtr_4_mean))
qtr_4_df = df[qtr_4_bool_lo & qtr_4_bool_hi]
qtr_4_countries = qtr_4_df['country'].unique()
qtr_4_country = qtr_4_countries[random.randint(0, len(qtr_4_countries) - 1)]
print(f'qtr_4_country: {qtr_4_country}')
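The four blocks above repeat the same selection logic; a more compact sketch (functionally equivalent, assuming the qtr_*_mean variables computed above) loops over the quartile means instead:

# loop over the quartile means instead of repeating the block four times
quartile_means = [qtr_1_mean, qtr_2_mean, qtr_3_mean, qtr_4_mean]
for n, q_mean in enumerate(quartile_means, start=1):
    low = q_mean - (quartile_range * q_mean)
    high = q_mean + (quartile_range * q_mean)
    in_range = df[(df['HDI for year'] > low) & (df['HDI for year'] < high)]
    countries = in_range['country'].unique()
    print(f'qtr_{n}_country: {random.choice(countries)}')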
Replace the category labels in 'age' to allow for controlling the output in later graphs The string values in the 'age' column do not sort in a meaningful order in matplotlib, which results in graphs that are not easily comparable. Instead, the category labels will be replaced with integers, and the xticks will then be renamed.
to_replace_string = ['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years', '75+ years']
replace_with_string = [1, 2, 3, 4, 5, 6]
# The dataframe with the years as integers
df_replace = df.replace(to_replace=to_replace_string, value=replace_with_string).copy()
# create dfs for each year used in the graphs
df_1990 = df_replace[df_replace['year']==1990].copy()
df_2000 = df_replace[df_replace['year']==2000].copy()
df_2010 = df_replace[df_replace['year']==2010].copy()

# Get data for countries of interest
country_name = qtr_4_country
df_1990_qtr_4 = df_1990[df_1990.country == country_name].copy()
df_2000_qtr_4 = df_2000[df_2000.country == country_name].copy()
df_2010_qtr_4 = df_2010[df_2010.country == country_name].copy()

country_name = qtr_3_country
df_1990_qtr_3 = df_1990[df_1990.country == country_name].copy()
df_2000_qtr_3 = df_2000[df_2000.country == country_name].copy()
df_2010_qtr_3 = df_2010[df_2010.country == country_name].copy()

country_name = qtr_2_country
df_1990_qtr_2 = df_1990[df_1990.country == country_name].copy()
df_2000_qtr_2 = df_2000[df_2000.country == country_name].copy()
df_2010_qtr_2 = df_2010[df_2010.country == country_name].copy()

country_name = qtr_1_country
df_1990_qtr_1 = df_1990[df_1990.country == country_name].copy()
df_2000_qtr_1 = df_2000[df_2000.country == country_name].copy()
df_2010_qtr_1 = df_2010[df_2010.country == country_name].copy()
Generate the graphs
# Quartile 4 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_4_male = df_1990_qtr_4[df_1990_qtr_4.sex == 'male'].copy()
df_2000_qtr_4_male = df_2000_qtr_4[df_2000_qtr_4.sex == 'male'].copy()
df_2010_qtr_4_male = df_2010_qtr_4[df_2010_qtr_4.sex == 'male'].copy()

# generate the male plot for the three years
# plt.subplot(1,2,1)
plt.subplot(4,2,1)
plt.scatter(df_1990_qtr_4_male['age'], df_1990_qtr_4_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_4_male['age'], df_2000_qtr_4_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_4_male['age'], df_2010_qtr_4_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 4: {qtr_4_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1, 2, 3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_4_female = df_1990_qtr_4[df_1990_qtr_4.sex == 'female'].copy()
df_2000_qtr_4_female = df_2000_qtr_4[df_2000_qtr_4.sex == 'female'].copy()
df_2010_qtr_4_female = df_2010_qtr_4[df_2010_qtr_4.sex == 'female'].copy()

# generate the female plot for the three years
#plt.subplot(1,2,2)
plt.subplot(4,2,2)
plt.scatter(df_1990_qtr_4_female['age'], df_1990_qtr_4_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_4_female['age'], df_2000_qtr_4_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_4_female['age'], df_2010_qtr_4_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 4: {qtr_4_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# Quartile 3 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_3_male = df_1990_qtr_3[df_1990_qtr_3.sex == 'male'].copy()
df_2000_qtr_3_male = df_2000_qtr_3[df_2000_qtr_3.sex == 'male'].copy()
df_2010_qtr_3_male = df_2010_qtr_3[df_2010_qtr_3.sex == 'male'].copy()

# generate the male plot for the three years
#plt.subplot(1,2,1)
plt.subplot(4,2,3)
plt.scatter(df_1990_qtr_3_male['age'], df_1990_qtr_3_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_3_male['age'], df_2000_qtr_3_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_3_male['age'], df_2010_qtr_3_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 3: {qtr_3_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_3_female = df_1990_qtr_3[df_1990_qtr_3.sex == 'female'].copy()
df_2000_qtr_3_female = df_2000_qtr_3[df_2000_qtr_3.sex == 'female'].copy()
df_2010_qtr_3_female = df_2010_qtr_3[df_2010_qtr_3.sex == 'female'].copy()

# generate the female plot for the three years
# plt.subplot(1,2,2)
plt.subplot(4,2,4)
plt.scatter(df_1990_qtr_3_female['age'], df_1990_qtr_3_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_3_female['age'], df_2000_qtr_3_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_3_female['age'], df_2010_qtr_3_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 3: {qtr_3_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# Quartile 2 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_2_male = df_1990_qtr_2[df_1990_qtr_2.sex == 'male'].copy()
df_2000_qtr_2_male = df_2000_qtr_2[df_2000_qtr_2.sex == 'male'].copy()
df_2010_qtr_2_male = df_2010_qtr_2[df_2010_qtr_2.sex == 'male'].copy()

# generate the male plot for the three years
#plt.subplot(1,2,1)
plt.subplot(4,2,5)
plt.scatter(df_1990_qtr_2_male['age'], df_1990_qtr_2_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_2_male['age'], df_2000_qtr_2_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_2_male['age'], df_2010_qtr_2_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 2: {qtr_2_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_2_female = df_1990_qtr_2[df_1990_qtr_2.sex == 'female'].copy()
df_2000_qtr_2_female = df_2000_qtr_2[df_2000_qtr_2.sex == 'female'].copy()
df_2010_qtr_2_female = df_2010_qtr_2[df_2010_qtr_2.sex == 'female'].copy()

# generate the female plot for the three years
# plt.subplot(1,2,2)
plt.subplot(4,2,6)
plt.scatter(df_1990_qtr_2_female['age'], df_1990_qtr_2_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_2_female['age'], df_2000_qtr_2_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_2_female['age'], df_2010_qtr_2_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 2: {qtr_2_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# Quartile 1 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_1_male = df_1990_qtr_1[df_1990_qtr_1.sex == 'male'].copy()
df_2000_qtr_1_male = df_2000_qtr_1[df_2000_qtr_1.sex == 'male'].copy()
df_2010_qtr_1_male = df_2010_qtr_1[df_2010_qtr_1.sex == 'male'].copy()

# generate the male plot for the three years
# plt.subplot(1,2,1)
plt.subplot(4,2,7)
plt.scatter(df_1990_qtr_1_male['age'], df_1990_qtr_1_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_1_male['age'], df_2000_qtr_1_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_1_male['age'], df_2010_qtr_1_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 1: {qtr_1_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_1_female = df_1990_qtr_1[df_1990_qtr_1.sex == 'female'].copy()
df_2000_qtr_1_female = df_2000_qtr_1[df_2000_qtr_1.sex == 'female'].copy()
df_2010_qtr_1_female = df_2010_qtr_1[df_2010_qtr_1.sex == 'female'].copy()

# generate the female plot for the three years
# plt.subplot(1,2,2)
plt.subplot(4,2,8)
plt.scatter(df_1990_qtr_1_female['age'], df_1990_qtr_1_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_1_female['age'], df_2000_qtr_1_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_1_female['age'], df_2010_qtr_1_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 1: {qtr_1_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

plt.show()
# Quartile 4 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_4_male = df_1990_qtr_4[df_1990_qtr_4.sex == 'male'].copy()
df_2000_qtr_4_male = df_2000_qtr_4[df_2000_qtr_4.sex == 'male'].copy()
df_2010_qtr_4_male = df_2010_qtr_4[df_2010_qtr_4.sex == 'male'].copy()

# generate the male plot for the three years
plt.subplot(1,2,1)
plt.scatter(df_1990_qtr_4_male['age'], df_1990_qtr_4_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_4_male['age'], df_2000_qtr_4_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_4_male['age'], df_2010_qtr_4_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 4: {qtr_4_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1, 2, 3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_4_female = df_1990_qtr_4[df_1990_qtr_4.sex == 'female'].copy()
df_2000_qtr_4_female = df_2000_qtr_4[df_2000_qtr_4.sex == 'female'].copy()
df_2010_qtr_4_female = df_2010_qtr_4[df_2010_qtr_4.sex == 'female'].copy()

# generate the female plot for the three years
plt.subplot(1,2,2)
plt.scatter(df_1990_qtr_4_female['age'], df_1990_qtr_4_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_4_female['age'], df_2000_qtr_4_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_4_female['age'], df_2010_qtr_4_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 4: {qtr_4_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# Quartile 3 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_3_male = df_1990_qtr_3[df_1990_qtr_3.sex == 'male'].copy()
df_2000_qtr_3_male = df_2000_qtr_3[df_2000_qtr_3.sex == 'male'].copy()
df_2010_qtr_3_male = df_2010_qtr_3[df_2010_qtr_3.sex == 'male'].copy()

# generate the male plot for the three years
plt.subplot(1,2,1)
plt.scatter(df_1990_qtr_3_male['age'], df_1990_qtr_3_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_3_male['age'], df_2000_qtr_3_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_3_male['age'], df_2010_qtr_3_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 3: {qtr_3_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_3_female = df_1990_qtr_3[df_1990_qtr_3.sex == 'female'].copy()
df_2000_qtr_3_female = df_2000_qtr_3[df_2000_qtr_3.sex == 'female'].copy()
df_2010_qtr_3_female = df_2010_qtr_3[df_2010_qtr_3.sex == 'female'].copy()

# generate the female plot for the three years
plt.subplot(1,2,2)
plt.scatter(df_1990_qtr_3_female['age'], df_1990_qtr_3_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_3_female['age'], df_2000_qtr_3_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_3_female['age'], df_2010_qtr_3_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 3: {qtr_3_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# Quartile 2 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_2_male = df_1990_qtr_2[df_1990_qtr_2.sex == 'male'].copy()
df_2000_qtr_2_male = df_2000_qtr_2[df_2000_qtr_2.sex == 'male'].copy()
df_2010_qtr_2_male = df_2010_qtr_2[df_2010_qtr_2.sex == 'male'].copy()

# generate the male plot for the three years
plt.subplot(1,2,1)
plt.scatter(df_1990_qtr_2_male['age'], df_1990_qtr_2_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_2_male['age'], df_2000_qtr_2_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_2_male['age'], df_2010_qtr_2_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 2: {qtr_2_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_2_female = df_1990_qtr_2[df_1990_qtr_2.sex == 'female'].copy()
df_2000_qtr_2_female = df_2000_qtr_2[df_2000_qtr_2.sex == 'female'].copy()
df_2010_qtr_2_female = df_2010_qtr_2[df_2010_qtr_2.sex == 'female'].copy()

# generate the female plot for the three years
plt.subplot(1,2,2)
plt.scatter(df_1990_qtr_2_female['age'], df_1990_qtr_2_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_2_female['age'], df_2000_qtr_2_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_2_female['age'], df_2010_qtr_2_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 2: {qtr_2_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# Quartile 1 Suicide Plot
plt.figure(figsize=(15, 5))

# create dfs for male
df_1990_qtr_1_male = df_1990_qtr_1[df_1990_qtr_1.sex == 'male'].copy()
df_2000_qtr_1_male = df_2000_qtr_1[df_2000_qtr_1.sex == 'male'].copy()
df_2010_qtr_1_male = df_2010_qtr_1[df_2010_qtr_1.sex == 'male'].copy()

# generate the male plot for the three years
plt.subplot(1,2,1)
plt.scatter(df_1990_qtr_1_male['age'], df_1990_qtr_1_male['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_1_male['age'], df_2000_qtr_1_male['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_1_male['age'], df_2010_qtr_1_male['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 1: {qtr_1_country} Male')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

# create dfs for females
df_1990_qtr_1_female = df_1990_qtr_1[df_1990_qtr_1.sex == 'female'].copy()
df_2000_qtr_1_female = df_2000_qtr_1[df_2000_qtr_1.sex == 'female'].copy()
df_2010_qtr_1_female = df_2010_qtr_1[df_2010_qtr_1.sex == 'female'].copy()

# generate the female plot for the three years
plt.subplot(1,2,2)
plt.scatter(df_1990_qtr_1_female['age'], df_1990_qtr_1_female['suicides/100k pop'], color='red', alpha=0.5, label='1990')
plt.scatter(df_2000_qtr_1_female['age'], df_2000_qtr_1_female['suicides/100k pop'], color='blue', alpha=0.5, label='2000')
plt.scatter(df_2010_qtr_1_female['age'], df_2010_qtr_1_female['suicides/100k pop'], color='green', alpha=0.5, label='2010')
plt.title(f'Qtr 1: {qtr_1_country} Female')
plt.xlabel('age group')
ax = plt.gca() # grab the current axes
ax.set_xticks([1,2,3, 4, 5, 6]) # choose which x locations to have ticks
ax.set_xticklabels(to_replace_string) # list to replace ticks label
plt.xticks(rotation=90)
plt.ylabel('suicides/100k pop')
plt.legend(loc='upper left')

plt.show()
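The scatter blocks above repeat the same pattern eight times; as an optional refactor (a sketch, not part of the lesson), a helper function could collapse them:

# helper that draws one male/female panel for a quartile's three years
def plot_quartile_panel(year_dfs, sex, title, position):
    plt.subplot(1, 2, position)
    for year_df, color, label in zip(year_dfs, ['red', 'blue', 'green'],
                                     ['1990', '2000', '2010']):
        subset = year_df[year_df.sex == sex]
        plt.scatter(subset['age'], subset['suicides/100k pop'],
                    color=color, alpha=0.5, label=label)
    plt.title(title)
    plt.xlabel('age group')
    ax = plt.gca()
    ax.set_xticks([1, 2, 3, 4, 5, 6])
    ax.set_xticklabels(to_replace_string)
    plt.xticks(rotation=90)
    plt.ylabel('suicides/100k pop')
    plt.legend(loc='upper left')

# example: redraw the Quartile 4 figure with two calls
plt.figure(figsize=(15, 5))
qtr_4_dfs = [df_1990_qtr_4, df_2000_qtr_4, df_2010_qtr_4]
plot_quartile_panel(qtr_4_dfs, 'male', f'Qtr 4: {qtr_4_country} Male', 1)
plot_quartile_panel(qtr_4_dfs, 'female', f'Qtr 4: {qtr_4_country} Female', 2)
plt.show()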
Conclusions After several runs of the notebook, I reached the conclusion that there is no discernible connection between HDI and suicide rate. There is clear evidence that males have a higher suicide rate than females for the time periods in this dataset. A general conclusion can be drawn that the older a person becomes, the greater the chance they will commit suicide; however, older ages are fraught with medical conditions, and that is more likely the cause of the increased suicide rate. Finally, since 1995, suicide rates have been in decline. 2016 shows an uptick in the suicide rate; however, the countries reporting in 2016 are mostly from an area I would call Eastern Europe, and they seem to have a high suicide rate for the three years examined here as well. Further Study Studying the data from several runs indicates that the area of the world in which a country resides may be a better predictor of the suicide rate than HDI. While the other trends (age, gender, etc.) seem to hold true regardless of the area of the world, the suicide rates do appear to be higher in some areas than others. For example, the area I would call Eastern Europe shows high suicide rates, while the area I would call the Bahamas shows low suicide rates. This may be a function of government stability and war as opposed to living standards. To study this, a determination of the areas of the world and the countries they comprise would have to be decided on and then added to the dataset; then graphs of suicide rates by area could be examined, as in the sketch below.
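A minimal sketch of that further study, assuming a hypothetical hand-built country-to-area mapping (only a few illustrative entries are shown; a real study would map every country in the dataset):

# hypothetical mapping from country to world area (illustrative entries only)
region_map = {
    'Lithuania': 'Eastern Europe',
    'Russian Federation': 'Eastern Europe',
    'Bahamas': 'Caribbean',
    'Jamaica': 'Caribbean',
}
df['area'] = df['country'].map(region_map)
# mean suicide rate per mapped area; unmapped countries are left out
df.groupby('area')['suicides/100k pop'].mean()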

Hands-on Labs are real live environments that put you in a real scenario to practice what you have learned without any extra charge or account to manage.

01:30:00

Conclusion

What's Next

00:02:12

Lesson Description:

Congratulations, you have made it to the end of the course. We have covered a lot of material, and I truly appreciate you spending your time with me on this course. I hope you found the information as exciting and valuable as I do.
