Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people, plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.
In light of this challenge, there are at least 5 core task that you need to consider when preparing to analyze data. These five tasks are
- Developing your question(s)
- Data exploration
- Developing a statistical model
- Interpreting the results
- Sharing the results
Developing Your Question(s)
You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).
There are several types of research questions. The point is you need to ask them in order to answer them.
Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.
In addition, exploration of the data allows you to determine if there are any problems with the data set such as missing data, strange variables, and if necessary to develop a data dictionary so you know the characteristics of the variables.
Data exploration allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist time and is not easy at all. Mastery of this in many ways means being a master of data science
Develop a Statistical Model
Your research questions and the data exploration process helps you to determine what kind of model to develop. The factors that can affect this is whether your data is supervised or unsupervised and whether you want to classify or predict numerical values.
This is probably the funniest part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s)
Interpreting the Results
Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models need to normally be explainable to non-technical stakeholders.
With interpretation, you are trying to determine “what does this answer mean to the stakeholders?” For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?
Communication of Results
Now is the time to actually share the answer(s) to your question(s). How this is done varies but it can be written, verbal or both. Whatever the mode of communication it is necessary to consider the following
- The audience or stakeholders
- The actual answers to the questions
- The benefits of knowing this
You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.
The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.