# Developing Classification Rules in R

``````library(Ecdat)
library(RWeka)
data("Males")``````

1. Explore the Data

The first step, as always, is to explore the data. We need to determine what kind of variables we are working with by using the “str” function.

``str(Males)``
``````## 'data.frame':    4360 obs. of  12 variables:
##  $ nr        : int  13 13 13 13 13 13 13 13 17 17 ...
##  $ year      : int  1980 1981 1982 1983 1984 1985 1986 1987 1980 1981 ...
##  $ school    : int  14 14 14 14 14 14 14 14 13 13 ...
##  $ exper     : int  1 2 3 4 5 6 7 8 4 5 ...
##  $ union     : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ ethn      : Factor w/ 3 levels "other","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ maried    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ health    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ wage      : num  1.2 1.85 1.34 1.43 1.57 ...
##  $ industry  : Factor w/ 12 levels "Agricultural",..: 7 8 7 7 8 7 7 7 4 4 ...
##  $ occupation: Factor w/ 9 levels "Professional, Technical_and_kindred",..: 9 9 9 9 5 2 2 2 2 2 ...
##  $ residence : Factor w/ 4 levels "rural_area","north_east",..: 2 2 2 2 2 2 2 2 2 2 ...``````

The first two variables, “nr” and “year,” are not very useful to us: “nr” is an identification number, and “year” is the year the data was collected. The identification number has no meaning in our analysis, and we shouldn’t expect much change in marriage rates over a few years, so we will ignore these two variables.
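The model formulas later in the post simply leave these two columns out; as an alternative sketch using only base R, you could also set them aside up front:

```r
# Drop the identifier and year columns, which carry no predictive meaning here
males <- Males[, setdiff(names(Males), c("nr", "year"))]
str(males)  # 10 variables remain instead of 12
```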

Next, we visualize the data with some tables and histograms. Integer variables will be visualized with histograms and factor variables with tables. The code and results are below.

``prop.table(table(Males$maried))``
``````##
##        no       yes
## 0.5610092 0.4389908``````
``prop.table(table(Males$union))``
``````##
##        no       yes
## 0.7559633 0.2440367``````
``prop.table(table(Males$ethn))``
``````##
##     other     black      hisp
## 0.7284404 0.1155963 0.1559633``````
``prop.table(table(Males$industry))``
``````##
##                     Agricultural                           Mining
##                       0.03211009                       0.01559633
##                     Construction                            Trade
##                       0.07500000                       0.26811927
##                   Transportation                          Finance
##                       0.06559633                       0.03692661
##      Business_and_Repair_Service                 Personal_Service
##                       0.07591743                       0.01674312
##                    Entertainment                    Manufacturing
##                       0.01513761                       0.28233945
## Professional_and_Related Service            Public_Administration
##                       0.07637615                       0.04013761``````
``prop.table(table(Males$health))``
``````##
##         no        yes
## 0.98302752 0.01697248``````
``prop.table(table(Males$residence))``
``````##
##      rural_area      north_east nothern_central           south
##      0.02728732      0.23531300      0.30947030      0.42792937``````
``prop.table(table(Males$occupation))``
``````##
## Professional, Technical_and_kindred Managers, Officials_and_Proprietors
##                          0.10389908                          0.09151376
##                       Sales_Workers                Clerical_and_kindred
##                          0.05344037                          0.11146789
##      Craftsmen, Foremen_and_kindred              Operatives_and_kindred
##                          0.21422018                          0.20206422
##                Laborers_and_farmers           Farm_Laborers_and_Foreman
##                          0.09197248                          0.01467890
##                     Service_Workers
##                          0.11674312``````
``````#explore histograms
hist(Males$school)``````

``hist(Males$exper)``

``hist(Males$wage)``

There is no time or space to explain the tables and histograms in detail. Only two things are worth mentioning:

1. “maried,” our classifying variable, is fairly well balanced between those who are married and those who are not (56% no vs. 44% yes).
2. The “health” variable is badly unbalanced (98% no vs. 2% yes) and needs to be removed.
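As a sketch using base R (the model formulas below simply leave “health” out of the formula instead), the unbalanced variable could be dropped like this:

```r
# Remove the badly unbalanced "health" variable before modeling
males <- subset(Males, select = -health)
# Confirm the class balance of the target variable
round(prop.table(table(males$maried)), 2)  # roughly 0.56 no / 0.44 yes
```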

2. Develop and Evaluate the Model

We can now train our first model. The first model will be a one-rule model, which means that R will develop a single rule for classification purposes. In this post, we are not doing any prediction, since we simply want to make a rule for the sample data. Below is the code.

``````Males_1R<-OneR(maried~ethn+union+industry+school+exper+
occupation+residence, data=Males)``````

We used the “OneR” function to create the model. This function analyzes the data and derives a single rule from it. We will now evaluate the model by first looking at the rule that was generated.

``Males_1R``
``````## exper:
##  < 7.5   -> no
##  < 12.5  -> yes
##  < 13.5  -> no
##  >= 13.5 -> yes
## (1973/3115 instances correct)``````

The “exper” variable was selected for generating the rule. Stated plainly, the rule means: “If a man has less than 7.5 years of experience, he is not married; if he has between 7.5 and 12.5 years of experience, he is married; if he has between 12.5 and 13.5 years of experience, he is not married; and if he has 13.5 or more years of experience, he is married.”

This rule is open to many interpretations. Young guys have less experience, so they aren’t ready to marry. After about 8 years they marry. However, after about 12 years of experience males are suddenly not married. This is probably due to divorce. After 13 years, the typical male is married again. This may be because his marriage survived the scary 12th year, or it may be due to remarriage.
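To make the rule concrete, here is a hand-coded sketch of the same cut points applied with nested ifelse calls (assuming the thresholds printed above):

```r
# Apply the OneR cut points on "exper" by hand
rule_pred <- ifelse(Males$exper < 7.5,  "no",
             ifelse(Males$exper < 12.5, "yes",
             ifelse(Males$exper < 13.5, "no", "yes")))
# Proportion of men whose marital status the rule reproduces
mean(rule_pred == Males$maried)
```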

However, when we look at the accuracy of the model, we will see some problems, as you will notice after typing in the following code.

``summary(Males_1R)``
``````##
## === Summary ===
##
## Correctly Classified Instances        1973               63.3387 %
## Incorrectly Classified Instances      1142               36.6613 %
## Kappa statistic                          0.2287
## Mean absolute error                      0.3666
## Root mean squared error                  0.6055
## Relative absolute error                 75.2684 %
## Root relative squared error            122.6945 %
## Coverage of cases (0.95 level)          63.3387 %
## Mean rel. region size (0.95 level)      50      %
## Total Number of Instances             3115
##
## === Confusion Matrix ===
##
##     a    b   <-- classified as
##  1351  457 |    a = no
##   685  622 |    b = yes``````

We only correctly classified 63% of the data. This is pretty bad. Perhaps if we change our approach and develop more than one rule we will have more success.
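As a quick sanity check, the 63% figure can be recomputed directly from the confusion matrix in the summary:

```r
# Rebuild the confusion matrix from the summary output above
conf <- matrix(c(1351, 457,
                 685,  622),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("no", "yes"),
                               predicted = c("no", "yes")))
# Accuracy = correct classifications / total instances
sum(diag(conf)) / sum(conf)  # (1351 + 622) / 3115, about 0.633
```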

We will now use the “JRip” function to develop multiple classification rules. Below is the code.

``````Males_JRip<-JRip(maried~ethn+union+school+exper+industry+occupation+residence, data=Males)
Males_JRip``````
``````## JRIP rules:
## ===========
##
## (exper >= 7) and (occupation = Craftsmen, Foremen_and_kindred) and (school >= 9) and (residence = south) and (exper >= 8) and (union = yes) => maried=yes (28.0/3.0)
## (exper >= 6) and (exper >= 8) and (school >= 11) and (ethn = other) => maried=yes (649.0/238.0)
## (exper >= 6) and (residence = south) and (ethn = hisp) => maried=yes (102.0/36.0)
## (exper >= 5) and (school >= 14) and (school >= 15) => maried=yes (76.0/25.0)
## (exper >= 5) and (ethn = other) and (occupation = Craftsmen, Foremen_and_kindred) => maried=yes (232.0/93.0)
##  => maried=no (2028.0/615.0)
##
## Number of Rules : 6``````

There are six rules altogether; their meaning is below.

1. If a male has at least eight years of experience, is a craftsman or foreman, has at least nine years of school, resides in the south, and belongs to a union, then he is married.
2. If a male has at least eight years of experience, has at least 11 years of school, and his ethnicity is other, then he is married.
3. If a male has at least six years of experience, resides in the south, and his ethnicity is Hispanic, then he is married.
4. If a male has at least five years of experience and at least 15 years of school, then he is married.
5. If a male has at least five years of experience, his ethnicity is other, and his occupation is craftsman or foreman, then he is married.
6. If none of these apply, then he is not married.

Notice how all of the rules begin with “exper.” This is one reason why the “OneR” function built its rule on experience: experience is the best predictor of marriage in this dataset. However, our accuracy has not improved much, as you will see in the following code.

``summary(Males_JRip)``
``````##
## === Summary ===
##
## Correctly Classified Instances        2105               67.5762 %
## Incorrectly Classified Instances      1010               32.4238 %
## Kappa statistic                          0.3184
## Mean absolute error                      0.4351
## Root mean squared error                  0.4664
## Relative absolute error                 89.3319 %
## Root relative squared error             94.5164 %
## Coverage of cases (0.95 level)         100      %
## Mean rel. region size (0.95 level)     100      %
## Total Number of Instances             3115
##
## === Confusion Matrix ===
##
##     a    b   <-- classified as
##  1413  395 |    a = no
##   615  692 |    b = yes``````

We are only at 67%, which is not much better. Since this is a demonstration, the actual numbers do not matter as much.
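For reference, the gain from OneR to JRip works out to about four percentage points (numbers taken from the two summaries above):

```r
# Training accuracy of each model, from the summary output
oneR_acc <- 1973 / 3115  # about 0.633
jrip_acc <- 2105 / 3115  # about 0.676
round(100 * (jrip_acc - oneR_acc), 1)  # about a 4.2-point improvement
```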

Conclusion

Classification rules provide easy-to-understand rules for organizing data homogeneously. This is yet another way to analyze data with machine-learning approaches.

## 5 thoughts on “Developing Classification Rules in R”

1. Holger K. von Jouanne-Diedrich

With the new OneR package you get the same accuracy with only half of the rules (2 instead of 4)!

> library(OneR)
> library(Ecdat)
> data(Males)
>
> data <- optbin(maried ~ exper, data = Males)
> model <- OneR(data)
> summary(model)

Rules:
If exper = (-0.016,7.22] then maried = no
If exper = (7.22,16] then maried = yes

Accuracy:
1972 of 3115 instances classified correctly (63.31%)

Contingency table:
        exper
maried   (-0.016,7.22]  (7.22,16]   Sum
  no          * 1344        464    1808
  yes           679       * 628    1307
  Sum          2023       1092    3115

Maximum in each column: ‘*’

Pearson’s Chi-squared test:
X-squared = 165.99, df = 1, p-value < 2.2e-16

> plot(model) # not shown here