In data science, exploratory data analyst serves the purpose of assessing whether the data set that you have is suitable for answering the research questions of the project. As such, there are several steps that can be taken to make this process more efficient.
Therefore, the purpose of this post is to explain one process that can be used for exploratory data analyst. The steps include the following.
- Consult your questions
- Check the structure of the dataset
- Use visuals
Consult Your Questions
Research questions give a project a sense of direction. They help you to know what you want to know. In addition, research questions help you to determine what type of analyst to conduct as well.
During the data exploration stage, the purpose of a research question is not for analyst but rather to determine if your data can actually provide answers to the questions. For example, if you want to know what the average height of men in America are and your data tells you the salary of office workers there is a problem,. Your question (average height) cannot be answered with the current data that you have (office workers salaries).
As such, the research questions need to be answerable and specific before moving forward. By answerable, we mean that the data can provide the solution. By specific, we mean a question moves away from generalities and deals with a clearly defined phenomenon. For example, “what is the average height of males age 20-30 in the United states?” This question clearly identifies the what we want to know (average height) and among who (20-30, male Americans).
Not can you confirm if your questions are answerable you can also decide if you need to be more or less specific with your questions. Returning to our average height question. We may find that we can be more specific and check average height by state if we want. Or, we might learn that we can only determine the average height for a region. All this depends on the type of data we have.
Check the Structure
Checking the structure involves determining how many rows and columns in the dataset, the sample size, as well as looking for missing data and erroneous data. Data sets in data science almost always need some sort of cleaning or data wrangling before analyst and checking the structure helps to determine what needs to be done.
You should have a priori expectations for the structure of the dataset. If the stakeholders tell you that there should be several million rows in the data set and you check and there are only several thousand you know there is a problem. This concept also applies to the number of features you expect as well.
Visuals, which can be plots or tables, help you further develop your expectations as well as to look for deviations or outliers. Tables are an excellent source for summarizing data. Plots, on the other hand, allow you to see deviations from your expectations in the data.What kind of tables and plots to make depends heavily on
What kind of tables and plots to make depends heavily on the type of data as well as the type of questions that you have. For example, for descriptive questions tables of summary statistics with bar plots might be sufficient. For comparison questions, summary stats and boxplots may be enough. For relationship question, summary stat tables with a scatterplot may be enough. Please keep in mind that it is much more complicated than this.
Before questions can be answered the data needs to be explored. This will help to make sure that the potential answers that are developed are appropriate.