Category Archives: Data mining

Primary Tasks in Data Analysis

Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people, plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.

In light of this challenge, there are at least five core tasks that you need to consider when preparing to analyze data. These five tasks are:

  1. Developing your question(s)
  2. Data exploration
  3. Developing a statistical model
  4. Interpreting the results
  5. Sharing the results

Developing Your Question(s)

You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff, but you will not know if something you find is interesting unless it helps to answer your question(s).

There are several types of research questions. The point is you need to ask them in order to answer them.

Data Exploration

Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.

In addition, exploration of the data allows you to determine if there are any problems with the data set, such as missing data or strange variables, and, if necessary, to develop a data dictionary so you know the characteristics of the variables.

Data exploration also allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist's time and is not easy at all. Mastery of this in many ways means being a master of data science.
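As a rough illustration of what a first look at a data set can involve, here is a minimal sketch in base R. The “sales” data frame and its columns are made up for this example; the point is only to show how variable types, missing values, and suspicious entries can be surfaced before any modeling.

```r
# Hypothetical data set with the kinds of problems exploration should reveal.
sales <- data.frame(
  region = c("north", "south", "north", NA, "south"),
  units  = c(12, NA, 7, 30, 11),
  price  = c("4.50", "5.00", "4.50", "five", "5.25"),
  stringsAsFactors = FALSE
)

str(sales)      # variable types: note that price was stored as character
summary(sales)  # ranges and NA counts for the numeric columns

# Count the missing values in each column.
missing.per.column <- colSums(is.na(sales))
missing.per.column

# Entries that cannot be read as numbers hint at wrangling still to be done.
not.numeric <- sales$price[is.na(suppressWarnings(as.numeric(sales$price)))]
not.numeric
```

The notes you take from output like this (types, units, odd values) are the raw material for a data dictionary.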

Developing a Statistical Model

Your research questions and the data exploration process help you to determine what kind of model to develop. The factors that can affect this are whether the learning is supervised or unsupervised and whether you want to classify or predict numerical values.

This is probably the most fun part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s).

Interpreting the Results

Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models normally need to be explainable to non-technical stakeholders.

With interpretation you are trying to determine “what does this answer mean to the stakeholders?” For example, if you find that people who smoke are 5 times more likely to die before the age of 50, what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?

Communication of Results

Now is the time to actually share the answer(s) to your question(s). How this is done varies, but it can be written, verbal, or both. Whatever the mode of communication, it is necessary to consider the following:

  • The audience or stakeholders
  • The actual answers to the questions
  • The benefits of knowing this

You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from how you speak to academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations, etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.

Conclusion

The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.

Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This work is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial data set. You can create the table below in R or Excel. If you created it in Excel, just save it as a csv and load it into R. Below is the initial code.

library(readr)
apple <- read_csv("~/Desktop/apple.csv")
apple # print the data
## # A tibble: 10 × 2
##        weight      location
##         <chr>         <chr>
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States

This is a small dataset with the columns “weight” and “location”. Here are some of the problems:

  • Weights are in different units
  • Weights are written in different ways
  • Location is not consistent

In order to have any success with data wrangling you need to state specifically what it is you want to do. Here are our goals for this project

  • Convert the “weight” variable to a numerical variable instead of character
  • Remove the text and have only numbers in the “weight” variable
  • Change weights in grams to kilograms
  • Convert the “location” variable to a factor variable instead of character
  • Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple$weight))
corrected.weight
##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0

Here is what we did.

  1. We created a variable called “corrected.weight”
  2. We used the function “as.numeric”, which makes whatever results inside it a numerical variable.
  3. Inside “as.numeric” we used the “gsub” function, which allows us to substitute one value for another.
  4. Inside “gsub” we set the argument pattern to “[[:alpha:]]” and the replacement to “”. This told R to look for any lowercase or uppercase letters and replace them with nothing, that is, remove them. This all pertains to the “weight” variable in the apple dataframe.

We now need to convert the weights in grams to kilograms so that everything is the same unit. Below is the code

gram.error<-grep(pattern = "[[:digit:]]{4}",apple$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight
##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0

Here is what we did

  1. We created a variable called “gram.error”
  2. We used the grep function to search the “weight” variable in the apple data frame for input that contains four digits in a row; this is what the “[[:digit:]]{4}” argument means. We do not change any values yet; we just store the positions of the matches in “gram.error”.
  3. Once this information is stored in “gram.error” we use it to subset the “corrected.weight” variable.
  4. We tell R to take every value of “corrected.weight” at the positions stored in “gram.error”, divide it by 1000, and save the result back in the same place. Dividing by 1000 converts the value from grams to kilograms.
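The four-digit rule works for this table because none of the kilogram values has four digits in a row. As an alternative sketch, the conversion can key on the unit label instead. This is not the approach used above; it assumes that every weight in grams is labeled with the word “gram” and that kilograms are always abbreviated “kg”.

```r
# Weights copied from the table at the top of this post.
weight <- c("3.2", "4.2kg", "1.3 kg", "7200 grams", "42",
            "2.3", "2.1kg", "3.1kg", "2700 grams", "24")

# Strip letters and spaces, then convert to numeric.
value <- as.numeric(gsub(pattern = "[[:alpha:][:space:]]", "", weight))

# Assumption: an entry is in grams when its label contains "gram".
# (A spelled-out "kilogram" would also match, so check the labels first.)
in.grams <- grepl(pattern = "gram", weight, ignore.case = TRUE)
value[in.grams] <- value[in.grams] / 1000
value
```

Keying on the label avoids wrongly converting a kilogram value that happens to have four digits, at the cost of depending on consistent unit labels.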

We have completed the transformation of the “weight” variable and will move on to the problems with the “location” variable in the “apple” dataframe. To do this we will first deal with the values that relate to Europe and then with the values that relate to the United States. Below is the code.

europe<-agrep(pattern = "europe",apple$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location
##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"

The code is a little complicated to explain, but in short we used the “agrep” function to tell R to search the “location” variable for values similar to our term “europe”. The other arguments set how far a value may differ from the term and still count as a match, so that close misspellings are caught. This process is repeated for the term “us”. We then store the location variable from the “apple” dataframe in a new variable called “corrected.location”. Next, we use the two objects we made, “europe” and “america”, to overwrite the matching entries of “corrected.location”. Finally, we deal with “United States” separately, replacing it with “US” using the “gsub” function.
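Fuzzy matching is one way to clean up the spellings. When the set of spellings is small and known, an explicit lookup table is a simpler alternative. The sketch below is not the method used above; the “lookup” vector is a hypothetical table written by hand for this data set.

```r
# Locations copied from the table at the top of this post.
location <- c("Europe", "europee", "U.S.", "USA", "United States",
              "europee", "Europe", "USA", "U.S.", "United States")

# Hypothetical lookup table: every spelling seen in the data maps to one label.
lookup <- c("europe" = "europe", "europee" = "europe",
            "u.s." = "US", "usa" = "US", "united states" = "US")

# Lowercase the raw values and look each one up by name.
corrected <- unname(lookup[tolower(location)])
corrected

# Any spelling missing from the lookup table would show up as NA.
anyNA(corrected)
```

The lookup table has to be extended by hand whenever a new spelling appears, but every change it makes is explicit and easy to review.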

We are almost done. Now we combine our two variables “corrected.weight” and “corrected.location” into a new data.frame. The code is below.

cleaned.apple<-data.frame(corrected.weight,corrected.location,stringsAsFactors = TRUE) # not the default in R 4.0 and later
names(cleaned.apple)<-c('weight','location')
cleaned.apple
##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was converted to a factor. In versions of R before 4.0 this happened automatically; in R 4.0 and later the “stringsAsFactors = TRUE” argument is needed.

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

Ensemble Learning for Machine Models

One way to improve a machine learning model is to not make just one model. Instead, you can make several models  that all have different strengths and weaknesses. This combination of diverse abilities can allow for much more accurate predictions.

The use of multiple models is known as ensemble learning. This post will provide insights into ensemble learning as it is used in developing machine learning models.

The Major Challenge

The biggest challenges in creating an ensemble of models are deciding what models to develop and how the various models are combined to make predictions. Dealing with these challenges involves the use of training data and several different functions.

The Process

Developing an ensemble model begins with training data. The next step is the use of some sort of allocation function. The allocation function determines how much data each model receives in order to make predictions. For example, it may give each model a subset of the data or limit how many features each model can use. However, if several different algorithms are used, the allocation function may pass all the data to each model without making any changes.
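To make the idea of an allocation function concrete, here is a minimal sketch in base R. The function name “allocate” and its arguments are invented for this illustration. It assumes one common scheme: each model receives a bootstrap sample of the rows and a random subset of the features. The built-in “iris” data stands in for training data.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical allocation function: each model gets a bootstrap sample of the
# rows and a random subset of the feature columns, plus the target column.
allocate <- function(data, target, n.models, n.features) {
  features <- setdiff(names(data), target)
  lapply(seq_len(n.models), function(i) {
    rows <- sample(nrow(data), replace = TRUE)
    cols <- sample(features, n.features)
    data[rows, c(cols, target)]
  })
}

pieces <- allocate(iris, target = "Species", n.models = 3, n.features = 2)
sapply(pieces, names)  # which columns each model would see
sapply(pieces, nrow)   # every piece has as many rows as the original data
```

Each element of “pieces” would then be handed to its own model for training.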

After the data is allocated, it is necessary for the models to be created. From there, the next step is to determine how to combine the models. The decision on how to combine the models is made with a combination function.

The combination function can take one of several approaches for determining final predictions. For example, a simple majority vote can be used, which means that if 5 models were developed and 3 vote “yes” then the example is classified as a yes. Another option is to weight the models so that some have more influence than others in the final predictions.
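Both kinds of combination function can be sketched in a few lines of base R. The votes and the weights below are made up for illustration; in practice the weights might come from how well each model performed on validation data.

```r
# Hypothetical predictions from five models for three examples (one row each).
votes <- rbind(c("yes", "yes", "no",  "yes", "no"),
               c("no",  "no",  "no",  "yes", "no"),
               c("yes", "no",  "yes", "no",  "no"))

# Simple majority vote: the most common label in each row wins.
majority.vote <- function(row) names(which.max(table(row)))
apply(votes, 1, majority.vote)

# Weighted vote: sum the weights behind each label instead of counting models.
weights <- c(0.4, 0.1, 0.3, 0.1, 0.1)  # assumed weights, one per model
weighted.vote <- function(row, w) names(which.max(tapply(w, row, sum)))
apply(votes, 1, weighted.vote, w = weights)
```

For the third example the two functions disagree: three of the five models vote “no”, but the two models voting “yes” carry most of the weight.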

Benefits of Ensemble Learning

Ensemble learning provides several advantages. One, ensemble learning improves the generalizability of your model. With the combined strengths of many different models and/or algorithms, the weaknesses of any single model have less influence on the final prediction.

Two, ensemble learning approaches allow for tackling large datasets. The biggest enemy to machine learning is memory. With ensemble approaches, the data can be broken into smaller pieces for each model.

Conclusion

Ensemble learning is yet another critical tool in the data scientist’s toolkit. The complexity of the world today makes it too difficult to lean on a singular model to explain things. Therefore, understanding the application of ensemble methods is a necessary step.

 

Improving the Performance of a Machine Learning Model

For many, especially beginners, making a machine learning model is difficult enough. Trying to understand what to do and how to specify the model, among other things, is confusing in itself. However, after developing a model it is necessary to assess ways in which to improve performance.

This post will serve as an introduction to understanding how to improve model performance. In particular, we will look at the following:

  • When it is necessary to improve performance
  • Parameter tuning

When to Improve

It is not always necessary to try and improve the performance of a model. There are times when a model does well and you know this through evaluating it. If the commonly used measures are adequate there is no cause for concern.

However, there are times when improvement is necessary. Complex problems, noisy data, and trying to look for subtle/unclear relationships can make improvement necessary. Normally, real-world data has these problems, so model improvement is usually necessary.

Model improvement requires the application of scientific means in an artistic manner. It requires a sense of intuition at times and also brute trial-and-error effort as well. The point is that there is no singular agreed upon way to improve a model. It is better to focus on explaining how you did it if necessary.

Parameter Tuning

Parameter tuning is the actual adjustment of model fit options. Different machine learning models have different options that can be adjusted. Often, this process can be automated in R through the use of the “caret” package.

When trying to decide what to do when tuning parameters it is important to remember the following.

  • What machine learning model and algorithm you are using for your data.
  • Which parameters you can adjust.
  • What criteria you are using to evaluate the model

Naturally, you need to know what kind of model and algorithm you are using in order to improve the model. There are three types of models in machine learning: those that classify, those that employ regression, and those that can do both. Understanding this helps you to make decisions about what you are trying to do.

Next, you need to understand what exactly you or R are adjusting when analyzing the model. For example, for C5.0 decision trees “trials” is one parameter you can adjust. If you don’t know this, you will not know how the model was improved.
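The logic of tuning can be shown without any extra packages. The sketch below is a manual stand-in for what a tool like “caret” automates, not an example of “caret” itself. It assumes a simple setup chosen for illustration: the built-in “mtcars” data, a polynomial regression of “mpg” on “hp”, and the polynomial degree as the one parameter being tuned.

```r
set.seed(42)  # only so the split is reproducible

# Hold out part of the built-in mtcars data for evaluation.
test.rows <- sample(nrow(mtcars), 10)
train <- mtcars[-test.rows, ]
test  <- mtcars[test.rows, ]

# The parameter being tuned: the degree of a polynomial in horsepower.
degrees <- 1:4
rmse <- sapply(degrees, function(d) {
  fit <- lm(mpg ~ poly(hp, d), data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

# One row per candidate value, with the criterion used to compare them.
data.frame(degree = degrees, rmse = rmse)
best.degree <- degrees[which.min(rmse)]
best.degree
```

The three things to keep track of are all visible here: the model (a linear regression), the parameter (“degrees”), and the criterion (RMSE on held-out data).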

Lastly, it is important to know what criteria you are using to compare the various models. For classifying models you can look at kappa, the various measures derived from the confusion matrix, or the ROC curve. For regression-based models you may look at R-squared or the RMSE (root mean squared error).
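These criteria are straightforward to compute by hand, which can help in understanding what they measure. The labels and numbers below are invented for illustration.

```r
# Hypothetical labels for a two-class problem.
actual    <- c("yes", "yes", "no", "no", "yes", "no", "no", "yes", "no", "no")
predicted <- c("yes", "no",  "no", "no", "yes", "yes", "no", "yes", "no", "no")

confusion <- table(actual, predicted)
accuracy  <- sum(diag(confusion)) / sum(confusion)

# Cohen's kappa: agreement beyond what the row and column totals give by chance.
expected    <- sum(rowSums(confusion) * colSums(confusion)) / sum(confusion)^2
cohen.kappa <- (accuracy - expected) / (1 - expected)

# Hypothetical numeric predictions for a regression problem.
y     <- c(3.1, 4.0, 5.2, 6.1, 7.3)
y.hat <- c(2.9, 4.3, 5.0, 6.4, 7.0)
rmse      <- sqrt(mean((y - y.hat)^2))
r.squared <- 1 - sum((y - y.hat)^2) / sum((y - mean(y))^2)

c(accuracy = accuracy, kappa = cohen.kappa, rmse = rmse, r.squared = r.squared)
```

Kappa is lower than accuracy here because some of the agreement between the predicted and actual labels would be expected by chance alone.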

Conclusion

As you can perhaps tell there is an incredible amount of choice and options in trying to improve a model. As such, model improvement requires a clearly developed strategy that allows for clear decision-making.

In a future post, we will look at an example of model improvement.

Understanding Market Basket Analysis

Market basket analysis is a machine learning approach that attempts to find relationships among a group of items in a data set. For example, a famous use of this method was when retailers discovered an association between beer and diapers.

Upon closer examination, the retailers found that when men came to purchase diapers for their babies they would often buy beer in the same trip. With this knowledge, the retailers placed beer and diapers next to each other in the store and this further increased sales.

In addition, many of the recommendation systems we experience when shopping online use market basket analysis results to suggest additional products to us. As such, market basket analysis is an intimate part of our lives without us even knowing.

In this post, we will look at some of the details of market basket analysis such as association rules, apriori, and the role of support and confidence.

Association Rules

Association rules are the heart of market basket analysis. Association rules explain patterns of relationship among items. Below is an example.

{rice, seaweed} -> {soy sauce}

Everything in curly braces { } is an itemset, which is a set of items that occurs often in the dataset according to some criterion. Rice and seaweed is our itemset on the left and soy sauce is our itemset on the right. The arrow -> indicates what comes first as we read from left to right. If we put this association rule in simple English it would say “if someone buys rice and seaweed then they are likely to buy soy sauce”.

The practical application of this rule is to place rice, seaweed and soy sauce near each other in order to reinforce this rule when people come to shop.

The Algorithm

Market basket analysis commonly uses the Apriori algorithm. This is an unsupervised learning algorithm: there is no labeled target to train on and thus nothing to predict. The Apriori algorithm is especially useful with large datasets because it employs simple procedures to find useful relationships among the items.

The shortcut that this algorithm uses is the “apriori property”, which states that all subsets of a frequent itemset must also be frequent. What this means in simple English is that the items in an itemset need to be common in the overall dataset, so any itemset that contains an uncommon item can be skipped without being counted. This simple rule saves a tremendous amount of computational time.
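The saving from the apriori property can be seen in a small base R sketch. This is only the pruning idea applied to pairs of items, not a full implementation of the algorithm, and the baskets are made up for illustration.

```r
# Hypothetical transactions; each element is one shopping basket.
baskets <- list(c("rice", "seaweed", "soy sauce"),
                c("rice", "soy sauce"),
                c("rice", "seaweed", "soy sauce", "tea"),
                c("bread", "tea"),
                c("rice", "seaweed"))

# Share of baskets that contain every item in the itemset.
support <- function(items) mean(sapply(baskets, function(b) all(items %in% b)))

min.support    <- 0.5
all.items      <- sort(unique(unlist(baskets)))
item.support   <- sapply(all.items, support)
frequent.items <- all.items[item.support >= min.support]

# Apriori property: a pair can only be frequent if both of its items are,
# so candidate pairs are built from the frequent items alone.
candidates     <- combn(frequent.items, 2, simplify = FALSE)
pair.support   <- sapply(candidates, support)
frequent.pairs <- candidates[pair.support >= min.support]

length(combn(all.items, 2, simplify = FALSE))  # pairs without pruning
length(candidates)                             # pairs after pruning
frequent.pairs
```

With five items there are ten possible pairs, but only the pairs built from the three frequent items need to be counted. The saving grows quickly as the number of items increases.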

Support and Confidence

Two key pieces of information that can further refine the work of the apriori algorithm are support and confidence. Support is a measure of the frequency of an itemset ranging from 0 (no support) to 1 (highest support). High support indicates the importance of the itemset in the data and contributes to the itemset being used to generate association rule(s).

Returning to our rice, seaweed, and soy sauce example, we can say that the support for soy sauce is 0.4. This means that soy sauce appears in 40% of the purchases in the dataset, which is pretty high.

Confidence is a measure of the accuracy of an association rule which is measured from 0 to 1. The higher the confidence the more accurate the association rule. If we say that our rice, seaweed, and soy sauce rule has a confidence of 0.8 we are saying that when rice and seaweed are purchased together, 80% of the time soy sauce is purchased as well.

Support and confidence can be used to influence the apriori algorithm by setting cutoff values to be searched for. For example, if we set a minimum support of 0.5 and a minimum confidence of 0.65, we are telling the computer to only report to us association rules that are above these cutoff points. This helps to remove rules that are obvious or useless.
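Support and confidence can be computed directly from their definitions. The sketch below uses made-up baskets arranged so that the rice and seaweed rule has a confidence of 0.8. The support cutoff of 0.3 is an assumption chosen to suit this tiny data set.

```r
# Hypothetical baskets, invented for illustration.
baskets <- list(c("rice", "seaweed", "soy sauce"),
                c("rice", "seaweed", "soy sauce"),
                c("rice", "seaweed", "soy sauce", "tea"),
                c("rice", "seaweed", "soy sauce"),
                c("rice", "seaweed"),
                c("bread", "tea"),
                c("bread", "rice"),
                c("tea", "soy sauce"),
                c("bread", "tea"),
                c("rice", "tea"))

# Support: share of baskets that contain every item in the itemset.
support <- function(items) mean(sapply(baskets, function(b) all(items %in% b)))

# Confidence of lhs -> rhs: support of both sides together over support of lhs.
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

rules <- data.frame(
  rule       = c("{rice, seaweed} -> {soy sauce}", "{tea} -> {bread}"),
  support    = c(support(c("rice", "seaweed", "soy sauce")),
                 support(c("tea", "bread"))),
  confidence = c(confidence(c("rice", "seaweed"), "soy sauce"),
                 confidence("tea", "bread"))
)
rules

# Keep only the rules that clear both cutoffs.
kept <- rules[rules$support >= 0.3 & rules$confidence >= 0.65, ]
kept
```

Rice and seaweed appear together in five baskets, and four of those also contain soy sauce, which gives the confidence of 0.8. The tea and bread rule falls below both cutoffs and is dropped.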

Conclusion

Market basket analysis is a useful tool for mining information from large datasets. The rules are easy to understand. In addition, market basket analysis can be used in many fields beyond shopping, such as finding relationships within DNA or in other forms of human behavior. However, care must be taken so that unsound conclusions are not drawn from random patterns in the data.

Steps for Approaching Data Science Analysis

Research is difficult for many reasons. One major challenge of research is knowing exactly what to do. You have to develop your way of approaching your problem, data collection and analysis that is acceptable to peers.

This level of freedom leads to people literally freezing and not completing a project. Now imagine having several gigabytes or terabytes of data and being expected to “analyze” it.

This is a daily problem in data science. In this post, we will look at one simple six-step process for approaching data science. The process involves the following six steps:

  1. Acquire data
  2. Explore the data
  3. Process the data
  4. Analyze the data
  5. Communicate the results
  6. Apply the results

Step 1 Acquire the Data

This may seem obvious but it needs to be said. The first step is to access data for further analysis. Not always, but often, data scientists are given data that was already collected by others who want answers from it.

In contrast with traditional empirical research, in which you are often involved from the beginning to the end, in data science you jump in to analyze a mess of data that others collected. This is challenging as it may not be clear what people want to know or what exactly they collected.

Step 2 Explore the Data

Exploring the data allows you to see what is going on. You have to determine what kinds of potential feature variables you have and the level of measurement of the data that was collected (nominal, ordinal, interval, ratio). In addition, exploration allows you to determine what you need to do to prep the data for analysis.

Since data can come in many different formats, from structured to unstructured, it is critical to take a look at the data using summary statistics and various visualization options such as plots and graphs.

Another purpose for exploring data is that it can provide insights into how to analyze the data. If you are not given specific instructions as to what stakeholders want to know, exploration can help you to determine what may be valuable for them to know.

Step 3 Process the Data

Processing data involves cleaning it. This involves dealing with missing data, transforming features, addressing outliers, and other necessary processes for preparing for analysis. The primary goal is to organize the data for analysis.

This is a critical step as various machine learning models have different assumptions that must be met. Some models can handle missing data; some cannot. Some models are affected by outliers; some are not.
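Two of the processing tasks mentioned above, handling missing data and addressing outliers, can be sketched in base R. The “income” values are made up, and the choices shown (median imputation and the 1.5 IQR rule) are common conventions used here as assumptions, not the only reasonable options.

```r
# Hypothetical measurements with one missing value and one extreme value.
income <- c(42, 38, NA, 51, 47, 40, 400, 45)

# Missing data: replace NA with the median of the observed values.
income.filled <- ifelse(is.na(income), median(income, na.rm = TRUE), income)

# Outliers: flag values more than 1.5 IQRs outside the quartiles.
q     <- quantile(income.filled, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
is.outlier <- income.filled < q[1] - fence | income.filled > q[2] + fence

income.filled
income.filled[is.outlier]
```

Whether a flagged value should be removed, capped, or kept depends on the model you plan to use and on whether the value is an error or a genuine observation.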

Step 4 Analyze the Data

This is often the most enjoyable part of the process. At this step, you actually get to develop your model. How this is done depends on the type of model you selected.

In machine learning, analysis is almost never complete until some form of validation of the model takes place. This involves taking the model developed on one set of data and seeing how well the model predicts the results on another set of data. One of the greatest fears of statistical modeling is overfitting, which is a model that only works on one set of data and lacks the ability to generalize.
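A small simulation can show why validation on a second set of data matters. Everything below is an assumption made for illustration: the data is generated from a straight line plus noise, the first 40 rows are used for training, and a degree-10 polynomial plays the role of an overly flexible model.

```r
set.seed(7)  # only so the sketch is reproducible

# Hypothetical data: a straight-line relationship plus noise.
x <- runif(60, 0, 10)
y <- 2 + 0.5 * x + rnorm(60)
d <- data.frame(x = x, y = y)

train <- d[1:40, ]
test  <- d[41:60, ]

# Root mean squared error of a fitted model on a given data set.
rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, newdata = data))^2))

simple  <- lm(y ~ x, data = train)            # matches the true relationship
complex <- lm(y ~ poly(x, 10), data = train)  # flexible enough to chase noise

data.frame(model = c("simple", "complex"),
           train = c(rmse(simple, train), rmse(complex, train)),
           test  = c(rmse(simple, test),  rmse(complex, test)))
```

The complex model cannot do worse than the simple one on the training data, because it contains the straight line as a special case. The test column is what tells you whether that extra flexibility generalizes.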

Step 5 Communicate Results

This step is self-explanatory. The results of your analysis need to be shared with those involved. This is actually an art in data science called storytelling. It involves the use of visuals as well as spoken explanations.

Step 6 Apply the Results

This is the chance to actually use the results of a study. Again, how this is done depends on the type of model developed. If a model was developed to predict which people to approve for home loans, then the model will be used to analyze applications by people interested in applying for a home loan.

Conclusion

The steps in this process are just one way to approach data science analysis. One thing to keep in mind is that these steps are iterative, which means that it is common to go back and forth and to skip steps as necessary. This process is just a guideline for those who need direction in doing an analysis.

Types of Machine Learning

Machine learning is a tool used in analytics for using data to make decisions for action. This field of study is at the crossroads of regular academic research and action research used in professional settings. This juxtaposition of skills has led to exciting new opportunities in the domains of academics and industry.

This post will provide information on basic types of machine learning, which include predictive models, supervised learning, descriptive models, and unsupervised learning.

Predictive Models and Supervised Learning

Predictive models do as their name implies. Predictive models predict one value based on other values. For example, a model might predict who is most likely to buy a plane ticket or purchase a specific book.

Predictive models are not limited to the future. They can also be used to predict something that has already happened but we are not sure when. For example, data can be collected from expectant mothers to determine the date that they conceived. Such information would be useful in preparing for birth.

Predictive models are intimately connected with supervised learning. Supervised learning is a form of machine learning in which the predictive model is given clear direction as to what it needs to learn and how to do it.

For example, if we want to predict who will be accepted or rejected for a home loan we would provide clear instructions to our model. We might include such features as salary, gender, credit score, etc. These features would be used to predict whether an individual person should be accepted or rejected for the home loan. The supervisors in this example are the features (salary, gender, credit score) used to predict the target feature (home loan).

The target feature can either be a classification or a numeric prediction. A classification target feature is a nominal variable such as gender, race, type of car, etc. A classification feature has a limited number of choices or classes that the feature can take. In addition, the classes are mutually exclusive. At least in machine learning, someone can only be classified as male or female; current algorithms cannot place a person in both classes.

A numeric prediction predicts a number that has an infinite number of possibilities. Examples include height, weight, and salary.
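The two kinds of target feature can be seen side by side using the built-in “mtcars” data. The choice of data set, features, and models (logistic regression and linear regression) is ours, made only to illustrate the distinction.

```r
# Built-in mtcars data: "am" is a two-class label (0 = automatic, 1 = manual)
# and "mpg" is a number.

# Classification: logistic regression predicts the class of each car.
classifier <- glm(am ~ wt + hp, data = mtcars, family = binomial)
predicted.class <- ifelse(predict(classifier, type = "response") > 0.5, 1, 0)
table(actual = mtcars$am, predicted = predicted.class)

# Numeric prediction: linear regression predicts a number for each car.
regressor <- lm(mpg ~ wt + hp, data = mtcars)
head(predict(regressor))
```

In both cases the same features (weight and horsepower) are used; what changes is whether the target is a class or a number.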

Descriptive Models and Unsupervised Learning

Descriptive models summarize data to provide interesting insights. There is no target feature that you are trying to predict. Since there is no specific goal or target to predict there are no supervisors or specific features that are used to predict the target feature. Instead, descriptive models use a process of unsupervised learning. There are no instructions given to the model as to what to do per se.

Descriptive models are very useful for discovering patterns. For example, one descriptive model analysis found a relationship between beer purchases and diaper purchases. It was later found that when men went to the store they often would buy beer for themselves and diapers for their small children. Stores used this information and they placed beer and diapers next to each other in the stores. This led to an increase in profits as men could now find beer and diapers together. This kind of relationship would be hard to find without machine learning techniques.
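As a small illustration of unsupervised learning, the sketch below runs k-means clustering on the built-in “iris” measurements. The choice of k-means and of three clusters is an assumption made for this example.

```r
set.seed(3)  # k-means starts from random centers

# Only the four measurements are used; the Species label is left out.
measurements <- iris[, 1:4]
clusters <- kmeans(measurements, centers = 3, nstart = 10)

# The groups were found without labels. The known species are used here
# only to see what the clusters correspond to.
table(cluster = clusters$cluster, species = iris$Species)
```

The model was never told which flower belongs to which species, yet the clusters it finds can be compared against the species afterwards.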

Conclusion

The model you use depends on what you want to know. Prediction is for, as you can guess, predicting. With this model you are not as concerned about relationships as you are about understanding what specifically affects the target feature. If you want to explore relationships then descriptive models can be of use. Machine learning models are tools that are appropriate for different situations.

Data Mining Process

Processes serve the purpose of providing people with a clear step-by-step procedure to accomplish a task. In many ways a process serves as a shortcut in solving a problem. As data mining is a complex situation with an endless number of problems, several processes have been developed for completing a data mining project. In this post we will look at the Cross-Industry Standard Process for Data Mining (CRISP-DM).

CRISP-DM

The CRISP-DM is an iterative process that has the following steps…

  1. Organizational understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment

We will look at each step briefly.

1. Organizational Understanding

Step 1 (called business understanding in the CRISP-DM documentation) involves assessing the current goals of the organization and the current context. This information is then used in deciding goals or research questions for data mining. Data mining needs to be done with a sense of purpose and not just to see what’s out there. Organizational understanding is similar to the introduction section of a research paper in which you often include the problem, questions, and even the intended audience of the research.

2. Data Understanding

Once a purpose and questions have been developed for data mining, it is necessary to determine what it will take to answer the questions. Specifically, the data scientist assesses the data requirements, description, collection, and quality. In many ways, data understanding is similar to the methodology section of a standard research paper in which you assess how you will answer the research questions.

It is particularly common to go back and forth between steps one and two. Organizational understanding influences data understanding, which in turn influences organizational understanding.

3. Data Preparation

Data preparation involves cleaning the data. Another term for this is data munging. This is the main part of an analysis in data mining. Often the data comes in a very messy way, with information spread all over the place and incoherently organized. This requires the researcher to carefully deal with this problem.

4. Modeling

A model provides a numerical explanation of something in the data. How this is done depends on the type of analysis that is used. As you develop various models you are arriving at various answers to your research questions. It is common to move back and forth between step 3 and 4 as the preparation affects the modeling and the type of modeling you may want to develop may influence data preparation. The results of this step can also be seen as being similar to the results section of an empirical paper.

5. Evaluation

Evaluation is about comparing the results of the study with the original questions. In many ways it is about determining the appropriateness of the answers to the research questions. This experience leads to ideas for additional research. As such, this step is similar to the discussion section of a research paper.

6. Deployment

The last step is when the results are actually used for decision-making or action. For example, if the results indicate that a company should target people under 25, then this is what the company does.

Conclusion

The CRISP-DM process is a useful way to begin the data mining experience. The primary goal of data mining is providing evidence for making decisions and/or taking action. This end goal has shaped the development of a clear process for taking action.