In the last post on R, we began to look at how to pull information from a data set. In this post, we will continue this by examining how to describe continuous variables. One of the first steps in any data analysis is to look at the descriptive statistics to determine if there are any problems, errors or issues with normality. Below is the code for the data frame we created.
> cars <- mtcars[c(1,2,9,10)]
> cars$am <- factor(cars$am, labels=c('auto', 'manual'))
> cars$gear <- ordered(cars$gear)
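Before computing any statistics, it can help to confirm how each column is stored. The following is a minimal sketch that rebuilds the data frame from the built-in 'mtcars' data and inspects it. The name 'numeric_cols' is only an illustrative choice and is not from the original post.

```r
# Rebuild the data frame from the built-in mtcars data set
cars <- mtcars[c(1, 2, 9, 10)]
cars$am <- factor(cars$am, labels = c("auto", "manual"))
cars$gear <- ordered(cars$gear)

# str() lists each column with its type and its first few values
str(cars)

# Flag which columns are stored as numbers (illustrative variable name)
numeric_cols <- sapply(cars, is.numeric)
print(numeric_cols)
```

Note that 'cyl' is still stored as a number here, even though it takes only a few distinct values and is usually handled as a category.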
Finding the Mean, Median, Standard Deviation, Range, and Quartiles
The mean is useful to know as it gives you an idea of the center of the data. Finding the mean is simple and involves the use of the “mean” function. Keep in mind that there are four variables in our data frame: mpg, cyl, am, and gear. Only ‘mpg’ is continuous. The variables ‘am’ and ‘gear’ are factors, and although ‘cyl’ is stored as a number, it takes only a few distinct values, so we treat it as categorical. For this reason, we only find the mean of ‘mpg’. Below is the script for finding the mean of ‘mpg’ with the answer.
> mean(cars$mpg)
[1] 20.09062
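To see what the “mean” function is doing, the same value can be computed by hand as the sum of the values divided by how many there are. This is only a sketch, and the variable names are illustrative.

```r
# Rebuild the mpg column used in this post
cars <- mtcars[c(1, 2, 9, 10)]

# The mean is the total of all values divided by the number of values
mpg_total <- sum(cars$mpg)
mpg_count <- length(cars$mpg)
manual_mean <- mpg_total / mpg_count
print(manual_mean)

# na.rm = TRUE tells mean() to skip missing values; mpg has none,
# so the result is the same here
mean_no_na <- mean(cars$mpg, na.rm = TRUE)
print(mean_no_na)
```

The 'na.rm' argument matters in data sets with missing values, where “mean” would otherwise return NA.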
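The median is the number found exactly in the middle of a dataset once the values are sorted. When there is an even number of values, it is the average of the two middle values. Below is the median of ‘mpg.’
> median(cars$mpg)
[1] 19.2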
The answer above indicates that the median of ‘mpg’ is 19.2. This means half the values are above this number while half the values are below it.
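As a rough illustration of how the median is found, the values can be sorted and the middle one (or the average of the two middle ones) picked out by hand. The variable names below are illustrative only.

```r
# Rebuild the mpg column used in this post
cars <- mtcars[c(1, 2, 9, 10)]

# Sort the values from lowest to highest
sorted_mpg <- sort(cars$mpg)
n <- length(sorted_mpg)

if (n %% 2 == 0) {
  # Even count: average the two middle values
  manual_median <- mean(sorted_mpg[c(n / 2, n / 2 + 1)])
} else {
  # Odd count: take the single middle value
  manual_median <- sorted_mpg[(n + 1) / 2]
}
print(manual_median)
```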
Standard deviation measures roughly how far a typical data point falls from the mean. For example, if the standard deviation is 3 and the mean is 10, a typical data point lies about 3 units from the mean, and for roughly normal data about two-thirds of the values fall between 7 and 13. It does not mean that every data point falls inside that interval. Below is a calculation of the standard deviation of ‘mpg’ in the ‘cars’ data frame using the ‘sd’ function.
> sd(cars$mpg)
[1] 6.026948
What this tells us is that the typical deviation from the mean of 20.09 for ‘mpg’ in the ‘cars’ data frame is about 6.03. In simple terms, most of the data points are between about 14.1 and 26.1 mpg, which is one standard deviation on either side of the mean.
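The calculation behind ‘sd’ can be sketched by hand. R's ‘sd’ uses the sample formula, which divides the sum of squared deviations by n - 1 before taking the square root. The sketch below also checks what share of the values fall within one standard deviation of the mean. The variable names are illustrative.

```r
# Rebuild the mpg column used in this post
cars <- mtcars[c(1, 2, 9, 10)]

# Distance of each value from the mean
deviations <- cars$mpg - mean(cars$mpg)

# Sample standard deviation: square root of the sum of squared
# deviations divided by (n - 1)
manual_sd <- sqrt(sum(deviations^2) / (length(cars$mpg) - 1))
print(manual_sd)

# Proportion of values within one standard deviation of the mean
within_one_sd <- mean(abs(deviations) <= manual_sd)
print(within_one_sd)
```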
Range is the highest and lowest number in a data set. It gives you an idea of the scope of a dataset. This can be calculated in R using the ‘range’ function. Below is the range for ‘mpg’ in the ‘cars’ data frame.
> range(cars$mpg)
[1] 10.4 33.9
These results mean that the lowest ‘mpg’ in the cars data frame is 10.4 while the highest is 33.9. There are no values lower or higher than these.
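The two numbers returned by ‘range’ are simply the minimum and the maximum, and the distance between them gives the spread as a single value. A small sketch, with illustrative variable names:

```r
# Rebuild the mpg column used in this post
cars <- mtcars[c(1, 2, 9, 10)]

# range() returns the minimum and maximum together
mpg_range <- range(cars$mpg)

# min() and max() return each end separately
mpg_min <- min(cars$mpg)
mpg_max <- max(cars$mpg)

# diff() on the range gives the width of the spread as one number
mpg_spread <- diff(mpg_range)
print(mpg_spread)
```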
Lastly, quartiles break the data into four groups based on percentages. For example, the 25th percentile is the value that about 25% of the data fall below and about 75% fall above. If a dataset has 100 data points and the 25th percentile is 25, then about 25 of the values are lower than 25 and about 75 are higher. Below is an example using the ‘mpg’ information from the ‘cars’ data frame using the ‘quantile’ function.
> quantile(cars$mpg)
    0%    25%    50%    75%   100%
10.400 15.425 19.200 22.800 33.900
In this example above, 15.4 is the 25th percentile. This means that 75% of the values are above 15.4 while 25% are below it. In addition, you may have noticed that the 0% and the 100% are the same as the range. Furthermore, the 50% is the same as the median. In other words, calculating the quartiles gives you the range and median. Therefore, calculating the quartiles saves you time.
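If only some percentiles are needed, ‘quantile’ accepts a ‘probs’ argument, and base R also provides ‘IQR’ and ‘summary’ for related one-line overviews. The following is a brief sketch with illustrative variable names.

```r
# Rebuild the mpg column used in this post
cars <- mtcars[c(1, 2, 9, 10)]

# Request only the three quartile cut points
mpg_quartiles <- quantile(cars$mpg, probs = c(0.25, 0.5, 0.75))
print(mpg_quartiles)

# IQR() is the distance between the 75th and 25th percentiles
mpg_iqr <- IQR(cars$mpg)
print(mpg_iqr)

# summary() shows the minimum, quartiles, mean, and maximum together
print(summary(cars$mpg))
```

Because ‘summary’ also reports the mean, it is a convenient first look at a continuous variable.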
These are the basics of describing continuous variables. The next post on R will look at describing categorical variables.