Working with a Dataframe in Python

In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the task we will complete include the following…

  • Import data
  • Examine data
  • Work with strings
  • Calculating descriptive statistics

Import Data 

First, you need data, therefore, we will use the Titanic dataset, which is readily available on the internet. We will need to use the pd.read_csv() function from the pandas package. This means that we must also import pandas. Below is the code.

import pandas as pd
df=pd.read_csv('FILE LOCATION HERE')

In the code above we imported pandas as pd so we can use the functions within it. Next, we create an object called ‘df’. Inside this object, we used the pd.read_csv() function to read our file into the system. The location of the file needs to type in quotes inside the parentheses. Having completed this we can now examine the data.

Data Examination

Now we want to get an idea of the size of our dataset, any problems with missing. To determine the size we use the .shape function as shown below.

df.shape
Out[33]: (891, 12)

Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many we can use is.null() in combination with the values_count. variables.

df['Cabin'].isnull().value_counts()
Out[36]: 
True     687
False    204
Name: Cabin, dtype: int64

The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters. You can see we have over 200 missing examples. For categorical varable, you can also see how many examples are part of each category as shown below.

df['Embarked'].value_counts()
Out[39]: 
S    644
C    168
Q     77
Name: Embarked, dtype: int64

This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. THen we will check the size of the dataframe again with the “shape” function.

df=df.dropna(how='any')
df.shape
Out[40]: (183, 12)

You can see our dataframe is much smaller going 891 examples to 183. We can now move to other operations such as dealing with strings.

Working with Strings

What you do with strings really depends or your goals. We are going to look at extraction, subsetting, determining the length. Our first step will be to extract the last name of the first five people. We will do this with the code below.

df['Name'][0:5].str.extract('(\w+)')
Out[44]: 
1 Cumings
3 Futrelle
6 McCarthy
10 Sandstrom
11 Bonnell
Name: Name, dtype: object

As you can see we got the last names of the first five examples. We did this by using the following format…

dataframe name[‘Variable Name’].function.function(‘whole word’))

.str is a function for dealing with strings in dataframes. The .extract() function does what its name implies.

If you want, you can even determine how many letters each name is. We will do this with the .str and .len() function on the first five names in the dataframe.

df['Name'][0:5].str.len()
Out[64]: 
1 51
3 44
6 23
10 31
11 24
Name: Name, dtype: int64

Hopefully, the code is becoming easier to read and understand.

Aggregation

We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes so we will run all of them at once. Below we are calculating the mean, max, minimum, and standard deviation  for the price of a fare on the Titanic

df['Fare'].mean()
Out[77]: 78.68246885245901

df['Fare'].max()
Out[78]: 512.32920000000001

df['Fare'].min()
Out[79]: 0.0

df['Fare'].std()
Out[80]: 76.34784270040574

Conclusion

This post provided you with some ways in which you can maneuver around a dataframe in Python.

Leave a Reply