Processes give people clear step-by-step procedures for accomplishing a task. In many ways, a process serves as a shortcut for solving a problem. Because data mining is complex and raises an endless number of problems, several processes have been developed for completing a data mining project. In this post we will look at the Cross-Industry Standard Process for Data Mining (CRISP-DM).
The CRISP-DM is an iterative process that has the following steps…
- Organizational understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
We will look at each step briefly.
1. Organizational Understanding
Step 1 involves assessing the current goals of the organization and its current context. This information is then used to decide on goals or research questions for data mining. Data mining needs to be done with a sense of purpose and not just to see what's out there. Organizational understanding is similar to the introduction section of a research paper, in which you often include the problem, the questions, and even the intended audience of the research.
2. Data Understanding
Once a purpose and questions have been developed for data mining, it is necessary to determine what it will take to answer the questions. Specifically, the data scientist assesses the data requirements, describes and collects the data, and assesses its quality. In many ways, data understanding is similar to the methodology section of a standard research paper, in which you explain how you will answer the research questions.
It is particularly common to go back and forth between steps one and two. Organizational understanding influences data understanding, which in turn influences organizational understanding.
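As a rough illustration of assessing data quality, here is a minimal Python sketch. The records, the column names, and the `describe_quality` helper are all made up for this example; a real project would use its own data and probably a dedicated library.

```python
from collections import Counter

# Hypothetical customer records; None or "" marks a missing value.
records = [
    {"age": 23, "region": "north", "spend": 120.0},
    {"age": None, "region": "south", "spend": 80.5},
    {"age": 41, "region": "", "spend": None},
    {"age": 23, "region": "north", "spend": 120.0},
]

def describe_quality(rows):
    """Return the row count, missing values per column, and duplicate row count."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    # Turn each row into a hashable tuple (sorted by column name) to spot exact repeats.
    seen = Counter(tuple(sorted(row.items(), key=lambda kv: kv[0])) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}

report = describe_quality(records)
print(report)
```

A summary like this gives an early sense of how much cleaning the data will need before any modeling can happen.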
3. Data Preparation
Data preparation involves cleaning the data. Another term for this is data munging. This is the main part of an analysis in data mining. The data often arrives messy, with information scattered across many places and recorded inconsistently, and the researcher has to deal with this carefully before any analysis can begin.
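To make the idea of cleaning more concrete, here is a small sketch in Python. The rows, the field names, and the `clean` function are hypothetical, and the cleaning rules (trim whitespace, normalize capitalization, drop rows with an unparseable age, remove duplicates) are only one possible set of choices.

```python
# Hypothetical messy input: stray whitespace, mixed case, bad values, a repeated row.
raw_rows = [
    {"name": "  Alice ", "age": "23", "city": "NEW YORK"},
    {"name": "bob", "age": "31", "city": "new york "},
    {"name": "Carol", "age": "thirty", "city": "Boston"},
    {"name": "Dave", "age": "", "city": "Boston"},
    {"name": "  Alice ", "age": "23", "city": "NEW YORK"},
]

def clean(rows):
    """Normalize text fields, parse age as an integer, and drop bad or duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        try:
            age = int(row["age"])
        except ValueError:
            continue  # skip rows whose age cannot be parsed as a number
        key = (name, age, city)
        if key in seen:
            continue  # skip exact duplicates after normalization
        seen.add(key)
        cleaned.append({"name": name, "age": age, "city": city})
    return cleaned

cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Dropping rows is the bluntest option. Depending on the research questions, it may be better to fill in or flag missing values instead.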
4. Modeling
A model provides a numerical explanation of something in the data. How this is done depends on the type of analysis that is used. As you develop various models, you arrive at various answers to your research questions. It is common to move back and forth between steps 3 and 4, because the preparation affects the modeling and the type of model you want to develop may influence how you prepare the data. The results of this step are similar to the results section of an empirical paper.
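As one example of a "numerical explanation", the sketch below fits a straight line to a handful of points using ordinary least squares. The numbers are invented, and the `fit_line` helper and variable names are assumptions for illustration; simple linear regression is just one of many kinds of models.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: customer age and monthly spending.
ages = [20, 30, 40, 50]
spend = [50.0, 70.0, 90.0, 110.0]

slope, intercept = fit_line(ages, spend)
print(f"spend = {slope:.2f} * age + {intercept:.2f}")
```

With this toy data the fitted line says spending rises by about 2 units for every extra year of age, which is the kind of answer a model can give to a research question.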
5. Evaluation
Evaluation is about comparing the results of the study with the original questions. In many ways, it is about determining how well the answers address the research questions. This experience often leads to ideas for additional research. As such, this step is similar to the discussion section of a research paper.
6. Deployment
The last step is when the results are actually used for decision-making or action. For example, if the results indicate that a company should target people under 25, then that is what the company does.
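In code, acting on such a result can be as simple as applying the finding as a rule. The sketch below uses the under-25 example from above; the customer list and the `select_targets` function are hypothetical.

```python
AGE_CUTOFF = 25  # taken from the under-25 example in the text

# Made-up customer list.
customers = [
    {"id": 1, "age": 22},
    {"id": 2, "age": 37},
    {"id": 3, "age": 24},
    {"id": 4, "age": 25},
]

def select_targets(rows, cutoff=AGE_CUTOFF):
    """Return the ids of customers strictly younger than the cutoff."""
    return [row["id"] for row in rows if row["age"] < cutoff]

targets = select_targets(customers)
print(targets)
```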
The CRISP-DM process is a useful way to begin the data mining experience. The primary goal of data mining is to provide evidence for making decisions or taking action. This end goal has shaped the development of a clear process for taking action.