When conducting research, you sometimes want to know whether there is a difference in categorical data. For example, is there a difference between the number of men who have blue eyes and the number who have brown eyes? Or is there a relationship between gender and hair color? In other words, is there a difference in the count of a particular characteristic, or is there a relationship between two or more categorical variables?
For our example, we are going to use a dataset that is already available in R called “HairEyeColor”. Below is the data.
> HairEyeColor
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8
As you can see, the data comes in the form of a three-dimensional table and shows hair and eye color for men and women in separate tables. In this form, the data is not ready for comparing hair and eye color across the whole sample. However, by using the ‘margin.table’ function we can combine the two tables into one, as shown in the example below.
> HairEyeNew <- margin.table(HairEyeColor, margin = c(1,2))
> HairEyeNew
       Eye
Hair    Brown Blue Hazel Green
  Black    68   20    15     5
  Brown   119   84    54    29
  Red      26   17    14    14
  Blond     7   94    10    16
Here is what we did. We created the variable ‘HairEyeNew’ and stored the information from ‘HairEyeColor’ in one table using the ‘margin.table’ function. The margin was set to c(1,2), which keeps the first dimension (Hair) and the second dimension (Eye) and sums the counts over the third dimension (Sex).
Now all of our data for men and women are combined into one table.
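As a quick check on the step above, the following sketch collapses the Sex dimension in two ways and confirms that they agree. The variable names ‘hair_eye’ and ‘hair_eye_apply’ are only illustrative and are not part of the original example.

```r
# Collapse the Sex dimension with margin.table, as in the example above.
# margin = c(1, 2) keeps dimension 1 (Hair) and dimension 2 (Eye).
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))

# An equivalent approach: sum over every dimension not listed in the margin.
hair_eye_apply <- apply(HairEyeColor, c(1, 2), sum)

dim(HairEyeColor)                    # hair colors x eye colors x sexes
dim(hair_eye)                        # hair colors x eye colors only
all(hair_eye == hair_eye_apply)      # both approaches give the same counts
sum(hair_eye) == sum(HairEyeColor)   # no observations are lost
```

Checking the total before and after is a simple way to make sure that collapsing a dimension has not dropped or duplicated any observations.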
We now want to see whether hair color and eye color are related, that is, whether some combinations of hair and eye color are more common than we would expect by chance. To do this, we calculate the chi-square statistic as in the example below.
> chisq.test(HairEyeNew)

	Pearson's Chi-squared test

data:  HairEyeNew
X-squared = 138.29, df = 9, p-value < 2.2e-16
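To make the numbers in this output less of a black box, here is a minimal sketch of how the same statistic can be computed by hand. It assumes the standard Pearson formula, in which the expected count for each cell is the row total times the column total divided by the grand total. The variable names are illustrative only.

```r
# Rebuild the combined hair-by-eye table so this sketch runs on its own.
observed <- margin.table(HairEyeColor, margin = c(1, 2))
n <- sum(observed)

# Expected counts if hair and eye color were unrelated:
# (row total * column total) / grand total for each cell.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

x_squared
df
p_value
```

The hand-computed statistic and degrees of freedom should match the values reported by ‘chisq.test’, since no continuity correction is applied to a table larger than 2 x 2.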
The very small p-value tells us that hair color and eye color are not independent: one or more combinations occur more often (or less often) than we would expect if the two were unrelated. To see which combinations of hair and eye color are the most common, we will calculate the proportions for the table as seen below.
> HairEyeNew/sum(HairEyeNew)
       Eye
Hair          Brown        Blue       Hazel       Green
  Black 0.114864865 0.033783784 0.025337838 0.008445946
  Brown 0.201013514 0.141891892 0.091216216 0.048986486
  Red   0.043918919 0.028716216 0.023648649 0.023648649
  Blond 0.011824324 0.158783784 0.016891892 0.027027027
As you can see from the table, brown hair and brown eyes are the most common combination (0.20 or 20%), followed by blond hair and blue eyes (0.16 or 16%).
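Proportions show which combinations are most frequent, but a frequent combination is not necessarily the one that departs most from what independence would predict. One possible complement, sketched here, is to look at the expected counts and Pearson residuals that ‘chisq.test’ returns. The names ‘hair_eye’ and ‘test’ are illustrative only.

```r
# Rebuild the combined table and rerun the test, keeping the result object.
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))
test <- chisq.test(hair_eye)

# Counts we would expect in each cell if hair and eye color were unrelated.
round(test$expected, 1)

# Pearson residuals: (observed - expected) / sqrt(expected).
# Large positive values mark combinations that are more common than expected;
# large negative values mark combinations that are less common than expected.
round(test$residuals, 2)

# Locate the cell with the largest positive residual.
which(test$residuals == max(test$residuals), arr.ind = TRUE)
```

Reading the residuals alongside the proportions helps separate combinations that are simply common from those that drive the significant result.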
The chi-square test serves to determine differences among categorical data. It is a useful tool for assessing potential relationships among non-continuous variables.
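The first question from the introduction, whether the number of men with blue eyes differs from the number with brown eyes, can be sketched with the same dataset. This is only an illustration: it assumes a goodness-of-fit test against equal counts is the comparison of interest, and the names ‘eye_male’ and ‘blue_brown’ are hypothetical.

```r
# Eye color counts for men only, summed over hair color.
eye_male <- colSums(HairEyeColor[, , "Male"])

# Keep just the two eye colors being compared.
blue_brown <- eye_male[c("Blue", "Brown")]
blue_brown

# With a single vector of counts, chisq.test runs a goodness-of-fit test.
# By default it checks whether the counts are consistent with equal proportions.
fit <- chisq.test(blue_brown)
fit
```

A large p-value here would suggest there is no evidence that the two counts differ, while a small one would suggest that one eye color is more common among the men in this sample.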