When assessing how to conduct quantitative data analysis there are several factors to consider. In this post, we look at common either-or choices in data analysis. The concepts are as follows
- Parametric vs Non-Parametric
- Bias vs Variance
- Accuracy vs Interpretability
- Supervised vs Unsupervised
- Regression vs Classifications
Parametric vs Non-Parametric
Parametric models make assumptions about the shape of a function. Often, the assumption is that the function is a linear in nature such as in linear regression. Making this assumption makes it much easier to estimate the actual function of a model.
Non-parametric models do not make any assumptions about the shape of the function. This allows the function to take any shape possible. Examples of non-parametric models include decision trees, support vector machines, and artificial neural networks.
The main concern with non-parametric models is they require a huge dataset when compared to parametric models.
Bias vs Variance Tradeoff
In relation to parametric and non-parametric models is the bias-variance tradeoff. A bias model is simply a model that does not listen to the data when it is tested on the test dataset. The function does what it wants as if it was not trained on the data. Parametric models tend to suffer from high bias.
Variance is the change in the function if it was estimated using new data. Variance is usually much higher in non-parametric models as they are more sensitive to the unique nature of each dataset.
Accuracy vs Interpretability
It is also important to determine what you want to know. If your goal is accuracy then a complicated model may be more appropriate. However, if you want to infer and make conclusions about your model then it is preferred to make a simpler model that can be explained.
A model can be made simpler or more complex through the inclusion of more features or the use of a more complex algorithm. For example, regression is much easier to interpret than artificial neural networks.
Supervised vs Unsupervised
Supervised learning is learning that involves a dependent variable. Examples of supervised learning include regression, support vector machines, K nearest neighbor, and random forest.
Unsupervised learning involves a dataset that does not have a dependent variable. In this situation, you are looking for patterns in the data. Examples of this include kmeans, principal component analysis, and cluster analysis.
Regression vs Classifications
Lastly, a problem can call for regression or classification. A regression problem involves a numeric dependent variable. A classification problem has a categorical dependent variable. Almost all models that are used for supervised machine learning can address regression or classification.
For example, regression includes numeric regression and logistic regression for classification. K nearest neighbor can do both as can random forest, support vector machines, and artificial neural networks.