Analyzing data can be extremely challenging, and it is common not to know where to begin. Perhaps you know some basic ways of analyzing data, but it is unclear what should be done first and what should follow.
This is where a data analysis framework can come in handy. Having a basic step-by-step process that you always follow can make it much easier to start and complete a project. One example of a data analysis framework is the OSEMN model. OSEMN is an acronym in which each letter names a step of the data analysis process. The steps are as follows: Obtain, Scrub, Explore, Model, and iNterpret.
We will now go through each of these steps.
The first step of this model is obtaining data. Depending on the context, this may already be done for you because the stakeholders have provided the data for analysis. In other situations, you have to find the data you need to answer the questions you are investigating.
Data can be found almost anywhere, so the important thing is that the data you obtain actually helps achieve the goals of the analysis. It is also necessary to have the skills or connections to get the data. For example, data may have to be scraped from the web, pulled from a database, or even collected through surveys you develop yourself. Each of these approaches requires its own specific skills.
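As one illustration of pulling data from a database, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are all made up for this example, and an in-memory database stands in for whatever real source you would connect to.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, major TEXT, score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Ann", "music", 88.0), ("Ben", "math", 75.5), ("Cara", "music", 92.0)],
)


def fetch_scores(connection, major):
    """Pull only the rows needed to answer the question at hand."""
    cursor = connection.execute(
        "SELECT student, score FROM scores WHERE major = ? ORDER BY student",
        (major,),
    )
    return cursor.fetchall()


music_rows = fetch_scores(conn, "music")
print(music_rows)
```

Filtering in the query itself, rather than after loading everything, keeps the obtained data tied to the goal of the analysis.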
Once data is obtained, it must be scrubbed, or cleaned. This involves several tasks: duplicates need to be removed, missing data must be addressed, outliers considered, and the shape of the data adjusted, among other things. In addition, it is often useful to look at descriptive statistics and visualizations to identify potential problems. Lastly, you often need to clean the categories within a variable if they are misspelled or contain other errors, such as stray punctuation, and to convert numbers that were stored in the wrong format.
The tasks mentioned above are just some of the steps that need to be taken to clean data. Dirty data will lead to bad insights. Therefore, this step must be done well.
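A few of these cleaning tasks can be sketched in plain Python. The records, the field layout, and the helper names below are hypothetical, and filling missing scores with the median is only one possible choice (dropping the row is another).

```python
from statistics import median

# Hypothetical raw records: (student_id, major, score); None marks a missing score.
raw = [
    (1, "Music", 88.0),
    (1, "Music", 88.0),    # exact duplicate
    (2, " music.", None),  # stray space, punctuation, missing score
    (3, "MATH", 75.0),
    (4, "math", 61.0),
]


def clean_category(label):
    """Normalize case, surrounding whitespace, and trailing punctuation."""
    return label.strip().strip(".,;").lower()


def scrub(records):
    # Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(records))
    # Fill missing scores with the median of the observed scores.
    observed = [score for _, _, score in unique if score is not None]
    fill = median(observed)
    return [
        (sid, clean_category(major), fill if score is None else score)
        for sid, major, score in unique
    ]


cleaned = scrub(raw)
for row in cleaned:
    print(row)
```

After scrubbing, "Music", " music.", "MATH", and "math" collapse into two consistent categories, which is what makes later grouping reliable.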
Exploring data and scrubbing data will often happen at the same time. With exploration, you are looking for insights into your data. One of the easiest ways to do this is to drill down as far as possible into your continuous variables by segmenting with the categorical variables.
For example, you might look at average scores by gender, then at average scores by gender and major, and then at average scores by gender, major, and class. Each time you will find slightly different patterns that may or may not be useful. Another approach would be to look at scatterplots split by different combinations of categorical variables.
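The drill-down described above can be sketched with a small grouping helper. The records, field names, and the mean_by function are assumptions made for illustration, not part of any particular library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; the field names are assumptions for illustration.
students = [
    {"gender": "F", "major": "music", "score": 90},
    {"gender": "F", "major": "math", "score": 80},
    {"gender": "M", "major": "music", "score": 70},
    {"gender": "M", "major": "math", "score": 85},
    {"gender": "F", "major": "music", "score": 94},
]


def mean_by(records, keys, value="score"):
    """Average `value` within each combination of the categorical `keys`."""
    groups = defaultdict(list)
    for row in records:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {group: mean(values) for group, values in groups.items()}


# First segment by one categorical variable, then drill down with a second.
by_gender = mean_by(students, ["gender"])
by_gender_major = mean_by(students, ["gender", "major"])
print(by_gender)
print(by_gender_major)
```

Adding another key to the list, such as a class field, goes one level deeper without changing the helper.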
If the objectives are clear, they can help you focus your exploration and reduce the chance of presenting irrelevant information to your stakeholders. Suppose the stakeholders want to know the average scores of women. In that case, there is probably no benefit to knowing the average score of male music majors.
Modeling involves regression/classification in the case of supervised learning or segmentation in the case of unsupervised learning. Modeling in the context of supervised learning helps in predicting future values, while segmentation helps develop insights into groups within a dataset that have similar traits.
Once again, the objectives of the analysis shape which tool to use. If you want to predict enrollment, then regression tools may be appropriate. If you want to predict what car a person will buy, then classification may help. If, on the other hand, you want to know some of the traits of high-performing students, then unsupervised approaches may be the best option.
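To make the enrollment case concrete, here is a minimal sketch of a one-predictor least-squares regression written by hand. The enrollment figures are made up for illustration, and fit_line is a hypothetical helper; in practice you would likely reach for a statistics or machine learning library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up enrollment figures, for illustration only.
years = [2019, 2020, 2021, 2022]
enrollment = [1000, 1050, 1100, 1150]

slope, intercept = fit_line(years, enrollment)
predicted_2023 = intercept + slope * 2023
print(slope, predicted_2023)
```

The fitted slope is the estimated change in enrollment per year, and the prediction simply extends that line one year forward, so it is only as trustworthy as the assumption that the trend continues.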
Interpreting involves sharing what all of this means. It is genuinely difficult to explain the intricacies of data analysis to a layperson, so this step calls for communication skills as well as analytical techniques. Breaking a complex analysis down so that people can understand it is hard. As such, ideas around storytelling have been developed to help data analysts connect the code with the audience.
The framework provided here is not the only way to approach data analysis. Furthermore, as you become more comfortable with analyzing data, you do not have to limit yourself to these steps or to the order in which they are performed. Frameworks are intended to get people started in the creative process of whatever task they are trying to achieve.