Machine learning is about using data to take action. This post explains the common steps that are taken when using machine learning to analyze data. In general, there are six steps when applying machine learning.
- Collecting data
- Preparing data
- Exploring data
- Training a model on the data
- Assessing the performance of the model
- Improving the model
We will go through each step briefly.
Step One, Collecting Data
Data can come from almost anywhere. It can come in a structured format from a database such as MySQL, or it can be unstructured, such as tweets collected from Twitter. However you get the data, you need to develop a way to clean and process it so that it is ready for analysis.
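As a rough illustration of the two kinds of sources, here is a minimal Python sketch. It assumes an in-memory SQLite table as a stand-in for a real database such as MySQL, and the table, the tweet text, and the `clean_text` helper are all made up for this example.

```python
import re
import sqlite3

# Structured source: an in-memory SQLite table stands in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", 52000.0), ("Ben", 61000.0)],
)
rows = conn.execute("SELECT name, salary FROM employees ORDER BY name").fetchall()
conn.close()

# Unstructured source: made-up tweet text that needs cleaning before analysis.
raw_tweets = ["Loving #MachineLearning!! http://example.com", "  so much DATA  "]


def clean_text(text):
    """Lowercase, drop URLs, keep only letters, digits and spaces, collapse whitespace."""
    text = re.sub(r"http\S+", "", text.lower())
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())


cleaned = [clean_text(t) for t in raw_tweets]
```

The structured rows arrive ready to use, while the text needs a cleaning rule that you have to design yourself. The rule shown here is only one possible choice.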
There are some distinct terms used in machine learning that those coming from traditional research may not be familiar with.
- Example-An example is one observation, that is, a single record of data. In an Excel spreadsheet, an example would be one row. In empirical social science research, we would call an example a respondent or participant.
- Unit of observation-This is the smallest entity whose properties are measured in a study. Common units of observation are people, objects, transactions, time points, or geographic regions.
- Feature-A feature is a characteristic of an example. In other forms of research, we normally call a feature a variable. For example, ‘salary’ would be a feature in machine learning but a variable in traditional research.
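To make the terms above concrete, here is a small Python sketch. The people and values are invented for illustration. Each dictionary is one example (a row), and each key is a feature (a column).

```python
# Hypothetical data: each dict is one example; each key is a feature.
examples = [
    {"name": "Ana", "salary": 52000, "height_cm": 165},
    {"name": "Ben", "salary": 61000, "height_cm": 180},
    {"name": "Cy", "salary": 58000, "height_cm": 172},
]

# Number of examples (rows in a spreadsheet).
n_examples = len(examples)

# Names of the features (columns in a spreadsheet).
feature_names = list(examples[0].keys())

# All values of a single feature across the examples.
salary_feature = [row["salary"] for row in examples]
```

In this sketch the unit of observation is a person, since each example describes one individual.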
Step Two, Preparing Data
This is often the most difficult step in machine learning analysis, and it can take up to 80% of the total time. With data coming from multiple sources and in multiple formats, it is a real challenge to get everything where it needs to be for an analysis.
Missing data, duplicate records, and other issues need to be addressed as part of this process. Once these challenges are dealt with, it is time to explore the data.
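Two of the issues mentioned above, duplicate records and missing values, can be sketched in a few lines of Python. The records are made up, and filling missing salaries with the median is just one common choice assumed here, not the only way to handle missing data.

```python
from statistics import median

# Hypothetical records with one missing value and one duplicate.
records = [
    {"id": 1, "salary": 52000},
    {"id": 2, "salary": None},   # missing value
    {"id": 3, "salary": 61000},
    {"id": 3, "salary": 61000},  # exact duplicate of the previous record
    {"id": 4, "salary": 58000},
]

# Drop exact duplicates while preserving the original order.
seen = set()
deduped = []
for rec in records:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

# Fill missing salaries with the median of the observed salaries.
observed = [rec["salary"] for rec in deduped if rec["salary"] is not None]
fill_value = median(observed)
prepared = [
    {**rec, "salary": fill_value if rec["salary"] is None else rec["salary"]}
    for rec in deduped
]
```

Whether to fill, drop, or flag missing values depends on the data and the analysis, so treat the median fill as a placeholder for whatever rule fits your case.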
Step Three, Exploring the Data
Before analyzing the data, it is critical to explore it. This is often done visually with plots and graphs, but also with summary statistics. You are looking for insights into the data and the characteristics of different features. You are also looking out for things that might be unusual, such as outliers. There are also times when a variable needs to be transformed because its distribution departs from normality.
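Here is a minimal sketch of the non-visual side of exploration, using made-up income values. The 1.5 × IQR rule for flagging outliers and the log transform for reducing skew are common conventions assumed for this example, not the only options.

```python
import math
from statistics import mean, median, quantiles

# Hypothetical incomes in thousands; the last value is suspiciously large.
incomes = [32, 35, 38, 40, 41, 43, 45, 48, 52, 250]

# Summary statistics: a mean far above the median hints at right skew.
income_mean = mean(incomes)
income_median = median(incomes)

# Flag outliers with the 1.5 * IQR rule.
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [x for x in incomes if x < lower_bound or x > upper_bound]

# A log transform compresses large values and can reduce skew.
log_incomes = [math.log(x) for x in incomes]
```

An outlier flagged this way is not automatically an error. It is a prompt to look at that example more closely before deciding what to do with it.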
Step Four, Training a Model
After exploring the data, you should have an idea of what you want to know, if you did not know already. Determining what you want to know helps you decide which algorithm is most appropriate for developing a model.
To develop a model, we normally split the data into a training set and a testing set. The model is fit on the training set and then evaluated on the testing set, which allows us to assess its strengths and weaknesses on data it has not seen.
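The split itself can be sketched in plain Python. The function name `train_test_split`, the 25% test fraction, and the fixed seed are assumptions made for this illustration. Libraries such as scikit-learn provide their own versions of this step.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=42):
    """Shuffle a copy of the examples and split it into training and testing sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of 20 examples, represented here by their ids.
data = list(range(20))
train_set, test_set = train_test_split(data)
```

Shuffling before splitting guards against any ordering in the original data, and fixing the seed makes the split reproducible.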
Step Five, Assessing the Model
In this step, the strength of the model is determined. Every model has certain biases that limit its usefulness. How to assess a model depends on the type of model developed and the purpose of the analysis. Whenever suitable, we want to try to improve the model.
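As one possible illustration, the sketch below fits a straight line to made-up training data and assesses it on held-out test data. The choice of a linear model, mean absolute error as the metric, and a "predict the training mean" baseline are all assumptions for this example. Other model types call for other metrics.

```python
# Hypothetical (x, y) pairs: the training set fits the model,
# and the held-out testing set is used to assess it.
train_set = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
test_set = [(6, 12.2), (7, 13.8)]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(points, predict):
    """Average absolute gap between actual and predicted y values."""
    return sum(abs(y - predict(x)) for x, y in points) / len(points)


slope, intercept = fit_line(train_set)
model_mae = mean_absolute_error(test_set, lambda x: slope * x + intercept)

# Baseline: always predict the mean of the training targets.
baseline_value = sum(y for _, y in train_set) / len(train_set)
baseline_mae = mean_absolute_error(test_set, lambda x: baseline_value)
```

Comparing the model's error against a simple baseline gives the number some context: a model is only useful if it clearly beats a trivial alternative.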
Step Six, Improving the Model
Improvement can happen in many ways. You might decide to normalize the variables in a different way, or you may choose to add or remove features from the model. You might also switch to a different model altogether.
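To show what normalizing "in a different way" can look like, here is a short sketch comparing two common approaches, min-max scaling and z-score standardization. The salary values and function names are made up for the example.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Rescale values to have mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical salary feature.
salaries = [40000, 50000, 60000, 70000, 80000]

min_max_scaled = min_max_scale(salaries)
z_scaled = z_score_scale(salaries)
```

Which scaling helps, if any, depends on the algorithm and the data, so the usual approach is to try a change, reassess the model as in step five, and keep the change only if performance improves.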
Success in data analysis involves having a clear path for achieving your goals. The steps presented here provide one way of tackling machine learning.