Skip to main content

Tricks for Working with Data in Python

Posted on June 5, 2020 by Larry FrittsLarry Fritts

What is Data Examination?

The terms ‘Data Science’ and ‘Data Analyst’ are considered hot topics in the IT world. The problem is that these are generic terms and mean different things to different people and organizations. As a result, I am calling this ‘Data Examination’. In this case, Data Examination refers to the ability to use the Pandas DataFrame to get a first look at the data, slice it into useful bits of information to make a decision, clean the data, get stats on the data, and add more information to a dataframe.
In the end, you will have a better understanding of how Pandas can be used for data examination and, as such, can aid a data scientist, data analyst, or anyone that examines data.

Why Python and Pandas?

Python is an easy-to-use programming language that is widely adopted for writing scripts, APIs, and most other IT work. It’s great for connections between servers, examining how a server is performing, writing apps that aggregate data, and many other uses. To learn more about Python, I suggest you refer to this documentation or complete the following courses:

 

Programming Use Cases with Python

Certified Associate in Python Programming Certification

Certified Entry-Level Python Programmer Certification

 

Pandas is an open-source data manipulation and analysis tool. It is built on Python and allows a user to easily examine and manipulate the data they are processing.

You can also get more information for any of these tips by taking my course, Using Python for Data Management and Reporting.

Creating a Dataframe

There are many ways data can be loaded into a Pandas DataFrame. In the case below, we are going to use this link which can be used to load lists to a dataframe. You can also load data directly from Excel using read_excel or read CSV text files using read_csv. You can create a dataframe from a database using SQL. All of the methods described here are covered even more in my course linked above. And to learn more about SQL, check out our SQL Deep Dive course.

import pandas as pd

Define the Data as a List


data = [
("Dexter","Johnsons","dog","shiba inu","red sesame",1.5,35,"m",False,"both",True),
("Alfred","Johnsons","cat","mix","tuxedo",4,12,"m",True,"indoor",True),
("Petra","Smith","cat","ragdoll","calico",6,None,"f",False,"both",True),
("Ava","Smith","dog","mix","blk/wht",12,32,"f",True,"both",False),
("Schroder","Brown","cat","mix","orange",13,15,"m",False,"indoor",True),
("Blackbeard","Brown","bird","parrot","multi",5,3,"f",False,"indoor",),
]

Define the Labels


labels = ["name","owner","type","breed","color","age","weight","gender","health issues","indoor/outboor","vaccinated"]

Create DataFrame

vet_records = pd.DataFrame.from_records(data, columns=labels)

vet_records

nameownertypebreedcolorageweightgenderhealth issuesindoor/outdoorvaccinated
0DexterJohnsonsdogshiba inured sesame1.535.0mFalsebothTrue
1AlfredJohnsonscatmixtuxedo4.012.0mTrueindoorTrue
2PetraSmithcatragdollcalico6.0NaNfFalsebothTrue
3AvaSmithdogmixblk/wht12.032.0fTruebothFalse
4SchroderBrowncatmixorange13.015.0mFalseindoorTrue
5BlackbeardBrownbirdparrotmulti5.03.0fFalseindoorNone

As you can see above, vet_records dataframe is very similar to a spreadsheet and can be thought of that way. This is a small dataset and just rpinting the dataframe gives us all the information we need to examine the data. However, generally, the datasets are going to be much larger and we will need to use other methods to get an idea of what the data looks like.

Determining What the Data Looks Like

.head() will display just the first five rows of the data. This is useful when the data set is larger than 10 rows. .tail() as you might expect shows the last five rows of the dataframe.

vet_records.head()
nameownertypebreedcolorageweightgenderhealth issuesindoor/outdoorvaccinated
0DexterJohnsonsdogshiba inured sesame1.535.0mFalsebothTrue
1AlfredJohnsonscatmixtuxedo4.012.0mTrueindoorTrue
2PetraSmithcatragdollcalico6.0NaNfFalsebothTrue
3AvaSmithdogmixblk/wht12.032.0fTruebothFalse
4SchroderBrowncatmixorange13.015.0mFalseindoorTrue
vet_records.tail()
nameownertypebreedcolorageweightgenderhealth issuesindoor/outdoorvaccinated
1AlfredJohnsonscatmixtuxedo4.012.0mTrueindoorTrue
2PetraSmithcatragdollcalico6.0NaNfFalsebothTrue
3AvaSmithdogmixblk/wht12.032.0fTruebothFalse
4SchroderBrowncatmixorange13.015.0mFalseindoorTrue
5BlackbeardBrownbirdparrotmulti5.03.0fFalseindoorNone

To get more information about the dataframe we have several other methods.

.shape will give the dimensions of the dataframe.
len(DF) will give the number of rows in the dataframe.
.size will give you the number of elements in the dataframe.
.dtypes will tell the type of data in each row.

shape = vet_records.shape # gives the dimension of dataframe, ie (5, 6)
rows = len(vet_records) 
num_elements = vet_records.size # gives the number of individual datum
data_types = vet_records.dtypes # gives the type of data for each column

print(f"Shape of dataframe: {shape}\n")
print(f"Number of rows: {rows}\n")
print(f"Number of elements: {num_elements}\n")
print(f"Types of data: {data_types}\n")
Shape of dataframe: (6, 11)

Number of rows: 6

Number of elements: 66

Types of data: name               object
owner              object
type               object
breed              object
color              object
age               float64
weight            float64
gender             object
health issues        bool
indoor/outboor     object
vaccinated         object
dtype: object

Notice all the string columns are listed as object. This is because a string type takes a maximum length argument, so they are imported as a object so they can be variable length.

Stats and Cleaning Data

Cleaning data is one of the most important and time-consuming jobs when examining data. The data we loaded is mostly “clean”, however, most data, especially text, will have errors that will need to be addressed. For example, maybe you need to determine the cost of articles that the researchers at your lab have ordered. When the researcher orders an article they enter the Title, Journal, and Cost into a spreadsheet. As you might imagine the data might look something like this:

TitleJournalCost
The Moons of JupJournal of Jupyter$23.89
The Spot on JupyterJrnl of Jup34.67

As you can see comparing the articles from the Journal of Jupyter will be hard when there is not a consistent name. Also, the costs are entered in different formats which will make it difficult for you to analyze. There is not a shortcut to fix this, it will be up to you to decide what to call the journal and to make the changes; the same for cost. This requires a process of looking at each entry and making the corrections.

Using .describe() will give you some statistics on the columns that have numeric data.

vet_records.describe()
ageweight
count6.0000005.000000
mean6.91666719.400000
std4.58712013.649176
min1.5000003.000000
25%4.25000012.000000
50%5.50000015.000000
75%10.50000032.000000
max13.00000035.000000

As you can see, describe() gives you the number of records, the mean, the standard deviation, the min and max, and the quartiles. Please Note: We had 6 records for age but only 5 for weight. Pandas will not include any column that is not filled in. Let’s use .isna() to determine if there are missing data points. .isna() returns a boolean for each element; False if it is there, and True if it is missing.

missing_data = vet_records.isna()
missing_data
nameownertypebreedcolorageweightgenderhealth issuesindoor/outboorvaccinated
0FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseTrueFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
5FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseTrue

If you look closely you can see that there are a missing weight and a missing vaccinated record. We can use missing_data to return the rows missing the data.

vet_records[missing_data.any(axis=1)]
nameownertypebreedcolorageweightgenderhealth issuesindoor/outboorvaccinated
2PetraSmithcatragdollcalico6.0NaNfFalsebothTrue
5BlackbeardBrownbirdparrotmulti5.03.0fFalseindoorNone

Depending on your needs, you may need to change these values. There are several ways to do this, some are great if there is a lot of missing data. For this example, we are going to use .at to fix the weight for Petra.

vet_records.at[2, "weight"] = 34.7 
vet_records
nameownertypebreedcolorageweightgenderhealth issuesindoor/outboorvaccinated
0DexterJohnsonsdogshiba inured sesame1.535.0mFalsebothTrue
1AlfredJohnsonscatmixtuxedo4.012.0mTrueindoorTrue
2PetraSmithcatragdollcalico6.034.7fFalsebothTrue
3AvaSmithdogmixblk/wht12.032.0fTruebothFalse
4SchroderBrowncatmixorange13.015.0mFalseindoorTrue
5BlackbeardBrownbirdparrotmulti5.03.0fFalseindoorNone

Here you can see that the weight for Petra has been updated. Let’s look at the stats again.

vet_records.described()

Now the weight statistics are based on 6 records instead of 5.

Slicing the Dataframe

Slicing a dataframe is very similar to slicing any list or type in python. There are many ways to slice data, in this article we are going to use the .loc() which allows slicing the data by column headers. To see more ways to slice the data please refer to my course linked above.

# get the pet name and owner for the 2nd record in the dataframe
vet_records.loc[1,["name", "owner"]]
name       Alfred
owner    Johnsons
Name: 1, dtype: object
# get the pet name and owner for all pets in the dataframe
vet_records.loc[:,["name", "owner"]]
nameowner
0DexterJohnsons
1AlfredJohnsons
2PetraSmith
3AvaSmith
4SchroderBrown
5BlackbeardBrown

We can also extract a specific column. This will create a Pandas’ series rather than a dataframe since it only has a single set of values.

Create a Pandas Series From the DataFrame



weight = vet_records['weight']
weight

0    35.0
1    12.0
2    34.7
3    32.0
4    15.0
5     3.0
Name: weight, dtype: float64

Counting and Groupby

We can also use count and groupby to look at the data in other ways.

# How many types of pets do we have?
vet_records.type.count()
6

So we know that we have 6 pets. Notice that this counts each rtype individual, so why we may have 2 dogs and it would be logical for them to be counted as 1 type, this results in each individual record being counted, thus 6.

To get a better idea of the count by types, we can use .groupby().

vet_records.groupby('type').count()
nameownerbreedcolorageweightgenderhealth issuesindoor/outdoorvaccinated
type
bird1111111110
cat3333333333
dog2222222222

So we can now tell we have 1 bird, 3 cats, and 2 dogs. However, this looks rather untidy. Let’s try value_counts() instead of count() and groupby() tp see if we get something that looks a little better.

vet_records.type.value_counts()
cat     3
dog     2
bird    1
Name: type, dtype: int64

That looks much better and is easier to read.

Adding Data to a Dataframe

You would like to look at the ratio of age: weight and have it be a new column in our dataframe. We can do this using the assign().

We are going to add the ratio age: weight as a column to the datagram.

Notice that this method iterates throughout the dataframe.



vet_records = vet_records.assign(age_weight=(vet_records['age']/vet_records['weight']))
vet_records
nameownertypebreedcolorageweightgenderhealth issuesindoor/outdoorvaccinatedage_weight
0DexterJohnsonsdogshiba inured sesame1.535.0mFalsebothTrue0.042857
1AlfredJohnsonscatmixtuxedo4.012.0mTrueindoorTrue0.333333
2PetraSmithcatragdollcalico6.034.7fFalsebothTrue0.172911
3AvaSmithdogmixblk/wht12.032.0fTruebothFalse0.375000
4SchroderBrowncatmixorange13.015.0mFalseindoorTrue0.866667
5BlackbeardBrownbirdparrotmulti5.03.0fFalseindoorNone1.666667

Wow! What a Journey

In this article, we looked at:

  1. How to create a dataframe
  2. How to determine what the data looks like
  3. Cleaning the data and getting basic statistical information about it
  4. How to slice the data
  5. How to group the data and count it
  6. Finally, how we can add our own data to the dataframe.

This was a brief intro to what Pandas can do and how to use just the basic functions to examine data for decision making. Pandas can do much more, and I talk about these things more in-depth and show you how to plot data, get data from a database, and use Jupyter Notebooks for your data examination in my course Using Python for Data Management and Reporting.

I hope you enjoyed this article and I hope it helps you on your journey to becoming a data professional.

0 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *