In this post, we are going to analyze some data in order to develop classification rules. Specifically, we will look at the “Males” dataset from the “Ecdat” package. Our goal is to derive rules that explain when a male is married or not. Below is the code to load the needed packages as well as the “Males” dataset.
library(Ecdat)
library(RWeka)
data("Males")
1. Explore the Data
The first step, as always, is to explore the data. We need to determine what kind of variables we are working with by using the “str” function.
str(Males)
## 'data.frame': 4360 obs. of 12 variables:
## $ nr : int 13 13 13 13 13 13 13 13 17 17 ...
## $ year : int 1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
## $ school : int 14 14 14 14 14 14 14 14 13 13 ...
## $ exper : int 1 2 3 4 5 6 7 8 4 5 ...
## $ union : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ ethn : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ maried : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ health : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ wage : num 1.2 1.85 1.34 1.43 1.57 ...
## $ industry : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
## $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
## $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...
The first two variables, “nr” and “year”, are not useful for us. “nr” is an identification number and “year” is the year the data was collected. We should not expect much change in marriage rates over a few years, and the identification number has no meaning in our analysis, so we will ignore these two variables.
Next, we visualize the data with some tables and histograms. Integer variables will be visualized with histograms and factor variables with tables. The code and results are below.
prop.table(table(Males$maried))
##
## no yes
## 0.5610092 0.4389908
prop.table(table(Males$union))
##
## no yes
## 0.7559633 0.2440367
prop.table(table(Males$ethn))
##
## other black hisp
## 0.7284404 0.1155963 0.1559633
prop.table(table(Males$industry))
##
## Agricultural Mining
## 0.03211009 0.01559633
## Construction Trade
## 0.07500000 0.26811927
## Transportation Finance
## 0.06559633 0.03692661
## Business_and_Repair_Service Personal_Service
## 0.07591743 0.01674312
## Entertainment Manufacturing
## 0.01513761 0.28233945
## Professional_and_Related Service Public_Administration
## 0.07637615 0.04013761
prop.table(table(Males$health))
##
## no yes
## 0.98302752 0.01697248
prop.table(table(Males$residence))
##
## rural_area north_east nothern_central south
## 0.02728732 0.23531300 0.30947030 0.42792937
prop.table(table(Males$occupation))
##
## Professional, Technical_and_kindred Managers, Officials_and_Proprietors
## 0.10389908 0.09151376
## Sales_Workers Clerical_and_kindred
## 0.05344037 0.11146789
## Craftsmen, Foremen_and_kindred Operatives_and_kindred
## 0.21422018 0.20206422
## Laborers_and_farmers Farm_Laborers_and_Foreman
## 0.09197248 0.01467890
## Service_Workers
## 0.11674312
#explore histograms
hist(Males$school)
hist(Males$exper)
hist(Males$wage)
There is no time or space to explain the tables and histograms in detail. Only two things are worth mentioning: 1. “maried”, our classifying variable, is fairly balanced between those who are married and those who are not (56% to 44%). 2. The “health” variable is horribly unbalanced (98% no vs 2% yes) and needs to be removed.
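Putting these observations together with the earlier decision to ignore “nr” and “year”, one way to proceed is to build a trimmed copy of the dataset up front. Below is a minimal sketch; the name “Males_clean” is just for illustration, and the models later in the post achieve the same effect by listing only the wanted predictors in the model formula.

```r
library(Ecdat)
data("Males")

# drop the identifier, the year, and the badly unbalanced health variable
Males_clean <- Males[, !(names(Males) %in% c("nr", "year", "health"))]
str(Males_clean)  # the nine remaining variables
```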
2. Develop and Evaluate the Model
We can now train our first model. The first model will be a one-rule model, which means that R will develop a single rule for classification purposes. In this post, we are not doing any prediction since we simply want to derive rules for the sample data. Below is the code.
Males_1R<-OneR(maried~ethn+union+industry+school+exper+
occupation+residence, data=Males)
We used the “OneR” function to create the model. This function analyzes the data and makes a single rule for it. We will now evaluate the model by first looking at the rule that was generated.
Males_1R
## exper:
## < 7.5 -> no
## < 12.5 -> yes
## < 13.5 -> no
## >= 13.5 -> yes
## (1973/3115 instances correct)
The “exper” variable was selected for generating the rule. Stated plainly, the rule means: “If a man has less than 7.5 years of experience he is not married; if he has between 7.5 and 12.5 years of experience he is married; if he has between 12.5 and 13.5 years of experience he is not married; and if he has 13.5 or more years of experience he is married.”
Explaining this can take many interpretations. Young guys have less experience so they aren’t ready to marry. After about 8 years they marry. However, after about 12 years of experience males are suddenly not married. This is probably due to divorce. After 13 years, the typical male is married again. This may be because his marriage survived the scary 12th year or may be due to remarriage.
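To make the rule fully concrete, here is a small base-R sketch that restates it as a function (the function name is made up for illustration; it is not part of RWeka):

```r
# base-R restatement of the single OneR rule on experience
classify_maried <- function(exper) {
  ifelse(exper < 7.5,  "no",
  ifelse(exper < 12.5, "yes",
  ifelse(exper < 13.5, "no", "yes")))
}

classify_maried(c(3, 10, 13, 20))  # "no" "yes" "no" "yes"
```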
However, as we look at the accuracy of the model we will see some problems, as you will notice below after typing in the following code.
summary(Males_1R)
##
## === Summary ===
##
## Correctly Classified Instances 1973 63.3387 %
## Incorrectly Classified Instances 1142 36.6613 %
## Kappa statistic 0.2287
## Mean absolute error 0.3666
## Root mean squared error 0.6055
## Relative absolute error 75.2684 %
## Root relative squared error 122.6945 %
## Coverage of cases (0.95 level) 63.3387 %
## Mean rel. region size (0.95 level) 50 %
## Total Number of Instances 3115
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 1351 457 | a = no
## 685 622 | b = yes
We only correctly classified 63% of the data, which is pretty bad. Perhaps if we change our approach and develop more than one rule, we will have more success.
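The 63% figure, and the kappa statistic as well, can be recovered by hand from the confusion matrix, which is a useful sanity check on how these summary numbers are defined:

```r
# confusion matrix from the OneR summary (rows = actual, columns = predicted)
cm <- matrix(c(1351, 457,
                685, 622), nrow = 2, byrow = TRUE)

accuracy <- sum(diag(cm)) / sum(cm)                # observed agreement
p_e <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # agreement expected by chance
kappa <- (accuracy - p_e) / (1 - p_e)

round(accuracy, 4)  # 0.6334
round(kappa, 4)     # 0.2287
```

These match the “Correctly Classified Instances” and “Kappa statistic” lines in the summary output above.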
We will now use the “JRip” function to develop multiple classification rules. Below is the code.
Males_JRip<-JRip(maried~ethn+union+school+exper+industry+occupation+residence, data=Males)
Males_JRip
## JRIP rules:
## ===========
##
## (exper >= 7) and (occupation = Craftsmen, Foremen_and_kindred) and (school >= 9) and (residence = south) and (exper >= 8) and (union = yes) => maried=yes (28.0/3.0)
## (exper >= 6) and (exper >= 8) and (school >= 11) and (ethn = other) => maried=yes (649.0/238.0)
## (exper >= 6) and (residence = south) and (ethn = hisp) => maried=yes (102.0/36.0)
## (exper >= 5) and (school >= 14) and (school >= 15) => maried=yes (76.0/25.0)
## (exper >= 5) and (ethn = other) and (occupation = Craftsmen, Foremen_and_kindred) => maried=yes (232.0/93.0)
## => maried=no (2028.0/615.0)
##
## Number of Rules : 6
There are six rules altogether. Their meaning is below.
- If a male has at least 8 years of experience, is a craftsman or foreman, has at least nine years of school, resides in the south, and belongs to a union, then he is married.
- If a male has at least 8 years of experience, has at least 11 years of school, and his ethnicity is other, then he is married.
- If a male has at least 6 years of experience, resides in the south, and his ethnicity is Hispanic, then he is married.
- If a male has at least 5 years of experience and has at least 15 years of school, then he is married.
- If a male has at least 5 years of experience, his ethnicity is other, and his occupation is craftsman or foreman, then he is married.
- If none of these rules apply, then he is not married.
Notice how all of the rules begin with “exper”. This is one reason why the “OneR” function made its rule on experience: experience is the best predictor of marriage in this dataset. However, our accuracy has not improved much, as you will see in the following code.
summary(Males_JRip)
##
## === Summary ===
##
## Correctly Classified Instances 2105 67.5762 %
## Incorrectly Classified Instances 1010 32.4238 %
## Kappa statistic 0.3184
## Mean absolute error 0.4351
## Root mean squared error 0.4664
## Relative absolute error 89.3319 %
## Root relative squared error 94.5164 %
## Coverage of cases (0.95 level) 100 %
## Mean rel. region size (0.95 level) 100 %
## Total Number of Instances 3115
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 1413 395 | a = no
## 615 692 | b = yes
We are only at 67%, which is not much better. Since this is a demonstration, the actual numbers do not matter as much.
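One caveat: both accuracy figures above are computed on the training data itself. RWeka can estimate out-of-sample accuracy with cross-validation through its “evaluate_Weka_classifier” function. A sketch, assuming the “Males_JRip” model fit above:

```r
# 10-fold cross-validated estimate of the JRip model's accuracy
evaluate_Weka_classifier(Males_JRip, numFolds = 10, seed = 1)
```

The cross-validated accuracy is typically a little lower than the training accuracy, but it is a more honest estimate of how the rules would perform on new data.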
Conclusion
Classification rules provide easy-to-understand rules for organizing data homogeneously. This is yet another way to analyze data with machine learning approaches.
There is a new native R package for the OneR algorithm on CRAN: https://cran.rstudio.com/web/packages/OneR/
Additionally, it provides several enhancements, e.g. missing value handling, numeric data, tie breaking, diagnostics, and so forth.
More info can be found here: http://vonjd.github.io/OneR/
(Full disclosure: I am the author of this package)
Cool, nice to meet the author of this package. I really appreciate your work.
With the new OneR package you get the same accuracy with only half of the rules (2 instead of 4)!
> library(OneR)
> library(Ecdat)
> data(Males)
>
> data <- optbin(maried ~ exper, data = Males)
> model <- OneR(data)
> summary(model)
Rules:
If exper = (-0.016,7.22] then maried = no
If exper = (7.22,16] then maried = yes
Accuracy:
1972 of 3115 instances classified correctly (63.31%)
Contingency table:
exper
maried (-0.016,7.22] (7.22,16] Sum
no * 1344 464 1808
yes 679 * 628 1307
Sum 2023 1092 3115
—
Maximum in each column: ‘*’
Pearson’s Chi-squared test:
X-squared = 165.99, df = 1, p-value < 2.2e-16
> plot(model) # not shown here
Thanks for this information.