In this post and future posts, we will work with actual data that is available in R to apply various skills. For now, we will work with the ‘mtcars’ data that comes with R. This dataset contains information about 32 different cars that were built in the 1970s.
Often in research, the person who collects the data and the person who analyzes it (the statistician) are different people. As a result, the statistician often receives data that is messy or has many missing values and cannot be analyzed in its current state, so the data has to be cleaned before it can be analyzed.
One of the first things a statistician does after receiving some data is to see how many different variables there are and perhaps decide if any can be converted to factors. In order to achieve these two goals, we need to answer the following questions:
- How many variables are there? Many functions can answer this question
- How many unique values does each variable have? This will tell us if the variable is a candidate for becoming a factor.
Below is the code for doing this with the ‘mtcars’ dataset.
> sapply(mtcars, function(x) length(unique(x)))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
  25    3   27   22   22   29   30    2    2    3    6
Here is what we did
- We used the ‘sapply’ function because we want to apply our function to every column of the data frame at once.
- Inside the ‘sapply’ function we tell R that the dataset is ‘mtcars’
- Next is the function we want to use. We write ‘function(x)’, which creates an anonymous function (a function with no name). The ‘x’ is its argument and stands for each column of ‘mtcars’ in turn.
- After indicating that the function is anonymous we tell R what is inside the anonymous function.
- The anonymous function contains the length function with the unique function within it.
- This means that we want the length (or number) of unique values in each variable.
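The anonymous function can also be written as a named helper, which some readers find easier to follow. The sketch below assumes a helper name, `count_unique`, that is chosen here only for illustration and is not part of R.

```r
# Hypothetical helper name; it does the same job as the anonymous function.
count_unique <- function(x) {
  length(unique(x))  # number of distinct values in the vector x
}

# sapply() calls count_unique() once for each column of mtcars
named_result <- sapply(mtcars, count_unique)

# The anonymous version used in the post
anon_result <- sapply(mtcars, function(x) length(unique(x)))

identical(named_result, anon_result)  # TRUE
```

Both versions return the same named vector, so the choice between them is a matter of readability.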
We now have the answers to our questions:
- There are eleven variables in the ‘mtcars’ dataset, as listed in the output above.
- The number of unique values for each variable is also listed in the output above.
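Since many functions can answer the first question, here is a short sketch of a few base R options for counting the variables directly. All of them are standard base R functions applied to the built-in ‘mtcars’ data.

```r
ncol(mtcars)    # number of columns (variables): 11
length(mtcars)  # a data frame is a list of columns, so this is also 11
dim(mtcars)     # rows and columns: 32 11
names(mtcars)   # the variable names themselves
```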
A common rule of thumb for determining whether to convert a variable to a factor is that the variable has fewer than 10 unique values. Based on this rule, the cyl, vs, am, gear, and carb variables could be converted to factors. Converting to a factor turns a numeric variable into a categorical one and opens up various analysis possibilities.
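The rule of thumb can also be applied in code rather than by reading the output by eye. This is a minimal sketch; the names `n_unique`, `threshold`, and `factor_candidates` are made up for this example, and the cutoff of 10 is only the rule of thumb mentioned above.

```r
# Count the unique values in each column, as before
n_unique <- sapply(mtcars, function(x) length(unique(x)))

# Rule-of-thumb cutoff from the text; adjust to suit your own data
threshold <- 10

# Keep the names of the variables that fall below the cutoff
factor_candidates <- names(n_unique)[n_unique < threshold]
factor_candidates  # "cyl" "vs" "am" "gear" "carb"
```

Whether a candidate should actually become a factor still depends on what the variable means, so treat this list as a starting point.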
Now that we have an idea of the characteristics of the dataset, we have to decide the following:
- What variables are we going to use?
- What type of analysis are we going to do with the variables we use?
For question 1, we are going to use the 1st (mpg), the 2nd (cyl), the 9th (am) and the 10th (gear) variables for our analysis. Then we are going to make the ‘am’ variable a factor. Lastly, we are going to make the ‘gear’ variable an ordered factor so that its values have an order. Below is the code.
> cars <- mtcars[c(1,2,9,10)]
> cars$am <- factor(cars$am, labels=c('auto', 'manual'))
> cars$gear <- ordered(cars$gear)
Here is what we did
- We created the data frame ‘cars’ and assigned it a subset of variables from the ‘mtcars’ dataset. In particular, we took the 1st (mpg), the 2nd (cyl), the 9th (am) and the 10th (gear) variables from ‘mtcars’ and saved them in ‘cars’.
- We converted the ‘am’ variable to a factor. Data points that were once 0 became ‘auto’ and data points that were once 1 became ‘manual’.
- We made the ‘gear’ variable an ordered factor. This means that 5 is more than 4, 4 is more than 3, etc.
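To check that these steps did what we expected, we can inspect the result. The sketch below rebuilds ‘cars’ so that it runs on its own. It selects the columns by name instead of by position, which is an alternative to the code in this post and gives the same columns.

```r
# Rebuild 'cars' as in the post, selecting columns by name instead of position
cars <- mtcars[c("mpg", "cyl", "am", "gear")]
cars$am <- factor(cars$am, labels = c("auto", "manual"))
cars$gear <- ordered(cars$gear)

# Same columns as mtcars[c(1, 2, 9, 10)]
identical(names(cars), names(mtcars[c(1, 2, 9, 10)]))  # TRUE

str(cars)              # shows the type of each column
levels(cars$am)        # "auto" "manual"
table(cars$am)         # counts of automatic and manual cars
levels(cars$gear)      # "3" "4" "5", in increasing order
is.ordered(cars$gear)  # TRUE

# Comparisons are meaningful for an ordered factor
sum(cars$gear > "3")   # number of cars with more than 3 gears
```

Selecting by name tends to be safer than selecting by position, because the code keeps working if the column order of the source data changes.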
Our next post on R will focus on analyzing the ‘cars’ data frame that we created.