In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the tasks we will complete include the following…
- Import data
- Examine data
- Work with strings
- Calculate descriptive statistics
Import Data
First, you need data. We will use the Titanic dataset, which is readily available on the internet. To load it we will use the pd.read_csv() function from the pandas package, which means that we must also import pandas. Below is the code.
import pandas as pd
df = pd.read_csv('FILE LOCATION HERE')
In the code above we imported pandas as pd so we can use the functions within it. Next, we created an object called ‘df’ to hold the result of the pd.read_csv() function, which reads our file into memory. The location of the file needs to be typed in quotes inside the parentheses. Having completed this we can now examine the data.
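If you do not have the file handy, the same call can be tried on a small in-memory CSV. This is only a minimal sketch: the three rows below are made up for illustration and are not the Titanic data, and io.StringIO simply stands in for a file location.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk (not the real Titanic data)
csv_text = """Name,Fare,Cabin
"Smith, Mr. John",7.25,
"Jones, Mrs. Anna",71.28,C85
"Brown, Miss. Lucy",8.05,
"""

# pd.read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first rows of the dataframe
print(sample.shape)    # (rows, columns)
```

Empty fields, such as the missing cabins here, are read in as NaN, which is how missing values appear in the real dataset as well.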
Data Examination
Now we want to get an idea of the size of our dataset and of any problems with missing values. To determine the size we use the .shape attribute as shown below.
df.shape
Out[33]: (891, 12)
Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many, we can use .isnull() in combination with .value_counts().
df['Cabin'].isnull().value_counts()
Out[36]:
True     687
False    204
Name: Cabin, dtype: int64
The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters: .isnull() first turns each value into True or False, and .value_counts() then tallies them. You can see we have 687 missing values and only 204 rows with a cabin recorded. For a categorical variable, you can also see how many examples are part of each category as shown below.
df['Embarked'].value_counts()
Out[39]:
S    644
C    168
Q     77
Name: Embarked, dtype: int64
This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. With how='any', a row is dropped if any of its columns has a missing value. Then we will check the size of the dataframe again with the .shape attribute.
df = df.dropna(how='any')
df.shape
Out[40]: (183, 12)
You can see our dataframe is much smaller, going from 891 examples to 183. We can now move to other operations such as dealing with strings.
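Dropping every row that has any missing value is a blunt tool, and here it removes most of the data because “Cabin” is largely empty. As a rough sketch of two related options, the code below counts the missing values in every column at once with .isnull().sum() and then drops rows based only on chosen columns through the subset argument of .dropna(). The small dataframe is invented for illustration and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not the real Titanic rows)
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 8.05, 53.10],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Count the missing values in every column at once
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# how='any' drops a row if any of its columns is missing
strict = toy.dropna(how='any')

# subset= limits the check to the listed columns
lenient = toy.dropna(subset=['Embarked'])

print(strict.shape, lenient.shape)
```

Which approach is appropriate depends on which variables you actually need for the analysis that follows.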
Working with Strings
What you do with strings really depends on your goals. We are going to look at extraction, subsetting, and determining the length. Our first step will be to extract the last name of the first five people. We will do this with the code below.
df['Name'][0:5].str.extract(r'(\w+)', expand=False)
Out[44]:
1       Cumings
3      Futrelle
6      McCarthy
10    Sandstrom
11      Bonnell
Name: Name, dtype: object
As you can see we got the last names of the first five examples. We did this by using the following format…
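dataframe name[‘Variable Name’].str.function(‘pattern’)
.str is an accessor for dealing with strings in dataframes. The .extract() function does what its name implies: it pulls out the part of each string that matches the pattern inside the parentheses. Here the pattern (\w+) captures the first run of word characters, which is the last name because each name in this dataset begins with it.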
If you want, you can even determine how many characters each name contains, counting spaces and punctuation. We will do this with the .str accessor and the .len() function on the first five names in the dataframe.
df['Name'][0:5].str.len()
Out[64]:
1     51
3     44
6     23
10    31
11    24
Name: Name, dtype: int64
Hopefully, the code is becoming easier to read and understand.
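The same pattern extends to other parts of a name. The sketch below assumes names laid out as “Last, Title. First”, which is how the names shown above appear to be structured; the three names themselves are made up for illustration. Passing expand=False asks .extract() to return a Series rather than a dataframe.

```python
import pandas as pd

# Made-up names in the "Last, Title. First" layout (illustration only)
names = pd.Series(['Smith, Mr. John', 'Jones, Mrs. Anna', 'Brown, Miss. Lucy'])

# Last name: everything before the first comma
last_names = names.str.extract(r'^([^,]+),', expand=False)

# Title: the word between the comma and the following period
titles = names.str.extract(r',\s*(\w+)\.', expand=False)

# Number of characters in each name, counting spaces and punctuation
lengths = names.str.len()

print(last_names.tolist())
print(titles.tolist())
print(lengths.tolist())
```

Matching up to the comma, rather than using (\w+), also keeps last names that contain spaces or hyphens in one piece.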
Aggregation
We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes, so we will run all of them at once. Below we calculate the mean, maximum, minimum, and standard deviation for the price of a fare on the Titanic.
df['Fare'].mean()
Out[77]: 78.68246885245901
df['Fare'].max()
Out[78]: 512.32920000000001
df['Fare'].min()
Out[79]: 0.0
df['Fare'].std()
Out[80]: 76.34784270040574
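Because only the function name changes between these calls, they can also be collapsed into a single call. The sketch below shows one way to do that with .agg(), plus .describe() for a broader summary; the fares are made-up numbers for illustration, not values from the dataset.

```python
import pandas as pd

# Made-up fares for illustration only
fares = pd.Series([0.0, 10.0, 20.0, 30.0], name='Fare')

# .agg() runs several named aggregations in one call
summary = fares.agg(['mean', 'max', 'min', 'std'])
print(summary)

# .describe() reports count, mean, std, min, quartiles, and max
print(fares.describe())
```

On the real data the equivalent call would be df['Fare'].agg(['mean', 'max', 'min', 'std']). Note that .std() in pandas is the sample standard deviation (it divides by n - 1) by default.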
Conclusion
This post provided you with some ways in which you can maneuver around a dataframe in Python.