Tag Archives: Big Data

Exploratory Data Analysis

In data science, exploratory data analysis serves the purpose of assessing whether the data set that you have is suitable for answering the research questions of the project. As such, there are several steps that can be taken to make this process more efficient.

Therefore, the purpose of this post is to explain one process that can be used for exploratory data analysis. The steps include the following.

  • Consult your questions
  • Check the structure of the dataset
  • Use visuals

Consult Your Questions

Research questions give a project a sense of direction. They help you to know what you want to know. In addition, research questions help you to determine what type of analysis to conduct as well.

During the data exploration stage, the purpose of a research question is not analysis but rather to determine if your data can actually provide answers to the questions. For example, if you want to know what the average height of men in America is and your data tells you the salary of office workers, there is a problem. Your question (average height) cannot be answered with the current data that you have (office workers' salaries).

As such, the research questions need to be answerable and specific before moving forward. By answerable, we mean that the data can provide the solution. By specific, we mean a question that moves away from generalities and deals with a clearly defined phenomenon. For example, “What is the average height of males aged 20-30 in the United States?” This question clearly identifies what we want to know (average height) and among whom (American males aged 20-30).

Not only can you confirm whether your questions are answerable, you can also decide if you need to be more or less specific with your questions. Returning to our average height question, we may find that we can be more specific and check average height by state if we want. Or we might learn that we can only determine the average height for a region. All of this depends on the type of data we have.

Check the Structure

Checking the structure involves determining how many rows and columns are in the dataset and the sample size, as well as looking for missing and erroneous data. Data sets in data science almost always need some sort of cleaning or data wrangling before analysis, and checking the structure helps to determine what needs to be done.

You should have a priori expectations for the structure of the data set. If the stakeholders tell you that there should be several million rows in the data set and you check and find only several thousand, you know there is a problem. This concept applies to the number of features you expect as well.
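Below is a minimal sketch of such a structure check, assuming the data has been loaded into a pandas DataFrame. The file name and the expected row and column counts are hypothetical placeholders.

```python
# A minimal sketch of a structure check with pandas.
# "survey_data.csv" and the expected counts are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("survey_data.csv")

# How many rows and columns do we have, and what type is each feature?
print(df.shape)        # (rows, columns)
print(df.dtypes)       # data type of each feature

# Where is data missing, and how much?
print(df.isna().sum())

# Simple sanity check against a priori expectations
expected_rows, expected_cols = 2_000_000, 25
if df.shape[0] < expected_rows or df.shape[1] != expected_cols:
    print("Structure does not match what the stakeholders described.")
```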

Make Visuals

Visuals, which can be plots or tables, help you further develop your expectations as well as to look for deviations or outliers. Tables are an excellent source for summarizing data. Plots, on the other hand, allow you to see deviations from your expectations in the data.

What kind of tables and plots to make depends heavily on the type of data as well as the type of questions that you have. For example, for descriptive questions, tables of summary statistics with bar plots might be sufficient. For comparison questions, summary statistics and boxplots may be enough. For relationship questions, summary statistic tables with a scatterplot may be enough. Please keep in mind that in practice it is often more complicated than this.
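As a rough sketch of this idea, the snippet below produces a summary table, a boxplot for a comparison question, and a scatterplot for a relationship question. The file and column names (height, state, weight) are hypothetical placeholders.

```python
# A rough sketch of quick tables and plots during exploration.
# The file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_data.csv")

# Table of summary statistics for a descriptive question
print(df["height"].describe())

# Boxplot by group for a comparison question
df.boxplot(column="height", by="state")
plt.show()

# Scatterplot for a relationship question
df.plot.scatter(x="weight", y="height")
plt.show()
```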

Conclusion

Before questions can be answered the data needs to be explored. This will help to make sure that the potential answers that are developed are appropriate.


Regularized Linear Regression

Traditional linear regression has been a tried and true model for making predictions for decades. However, with the growth of Big Data and datasets with hundreds of variables, problems have begun to arise. For example, using stepwise or best subset selection with regression can take hours, if not days, to converge on even some of the best computers.

To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts

  • Definition of regularized regression
  • Ridge regression
  • Lasso regression
  • Elastic net regression

Regularization

Regularization involves adding a shrinkage penalty to the residual sum of squares (RSS) that the model tries to minimize. This is done through selecting a value for a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed.

The lambda is multiplied by the normalized coefficients of the model and added to the RSS. Below is an equation of what was just said

RSS + λ(normalized coefficients)

The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of searching through a huge number of candidate models, as best subset selection does (with k variables there are 2^k possible subsets, so 50 variables would mean over a quadrillion models!), with regularization only one model is developed for each value of lambda you specify.

Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.

Finally, regularization can help to reduce or eliminate any multicollinearity in a model. As such, the benefits of using regularization make it clear that it should be considered when working with larger data sets.

Ridge Regression

Ridge regression penalizes the sum of the squared normalized coefficients, as shown in the equation below

RSS + λ(Σ(normalized coefficients)^2)

This is also referred to as the L2-norm. As lambda increases in value, the coefficients in the model are shrunk toward zero but never reach zero. The higher the lambda, the more the coefficients are shrunk, which reduces the variance of the model at the cost of some bias.

The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead, the coefficients are only shrunk toward zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain.
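As a rough illustration, ridge regression can be fit with scikit-learn in Python. The data here is simulated, and note that scikit-learn names its penalty-strength parameter alpha, which plays the role of the lambda discussed above.

```python
# A minimal ridge regression sketch with scikit-learn.
# Note: scikit-learn calls the lambda tuning parameter "alpha".
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))                  # simulated predictors
y = X @ rng.normal(size=10) + rng.normal(size=100)

X_scaled = StandardScaler().fit_transform(X)    # standardize so the penalty treats features equally

ridge = Ridge(alpha=1.0)                        # alpha here plays the role of lambda
ridge.fit(X_scaled, y)
print(ridge.coef_)                              # shrunk toward zero, but never exactly zero
```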

Lasso

Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm, which is the sum of the absolute values of the coefficients, as shown in the equation below

RSS + λ(Σ|normalized coefficients|)

This shrinkage penalty will reduce a coefficient all the way to zero, which is another way of saying that the variable is removed from the model. One problem is that highly correlated variables that need to be in your model may be removed when lasso shrinks coefficients. This is one reason why ridge regression is still used.
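A similar sketch for lasso, using the same kind of simulated data; larger values of the penalty (again called alpha in scikit-learn) push more coefficients exactly to zero.

```python
# A minimal lasso sketch with scikit-learn, using simulated data as above.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1)      # alpha plays the role of lambda
lasso.fit(X_scaled, y)
print(lasso.coef_)            # small coefficients may be set exactly to 0.0
```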

Elastic Net

Elastic net combines the strengths of ridge and lasso while avoiding some of the weaknesses of each. It can remove variables from the model as lasso does and ridge does not, while also handling groups of correlated variables as ridge does and lasso does not.

This is done by including a second tuning parameter called “alpha”. If alpha is set to 0, it is the same as ridge regression, and if alpha is set to 1, it is the same as lasso regression. For those who are interested, below is the formula used for elastic net regression

RSS + λ[(1 – alpha)(Σ(normalized coefficients)^2)/2 + alpha(Σ|normalized coefficients|)]

As such, when working with elastic net, you have to set two different tuning parameters (alpha and lambda) in order to develop a model.
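Below is a hedged scikit-learn sketch. Be aware of the naming mismatch: scikit-learn's alpha corresponds to the lambda described in this post, while its l1_ratio corresponds to the post's alpha mixing parameter.

```python
# A minimal elastic net sketch with scikit-learn. Naming note:
# scikit-learn's "alpha" is the post's lambda (overall penalty strength),
# and "l1_ratio" is the post's alpha (the ridge/lasso mixing parameter).
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
X_scaled = StandardScaler().fit_transform(X)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio=1 behaves like lasso, 0 like ridge
enet.fit(X_scaled, y)
print(enet.coef_)
```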

Conclusion

Regularized regression was developed as an answer to the growth in the size and number of variables in data sets today. Ridge, lasso, and elastic net all provide solutions for converging over large datasets and selecting features.

Numeric Prediction Trees

Decision trees are used for classifying examples into distinct classes or categories, such as pass/fail, win/lose, buy/sell/trade, etc. However, as we all know, categories are just one form of outcome in machine learning. Sometimes we want to make numeric predictions.

Making numeric predictions with trees involves the use of regression trees or model trees. In this post, we will look at each of these forms of numeric prediction with the use of trees.

Regression Trees and Model Trees

Regression trees have been around since the 1980s. They work by predicting the average value of the examples that reach a given leaf in the tree. Despite their name, there is no regression involved with regression trees. Regression trees are straightforward to interpret but at the expense of accuracy.

Model trees are similar to regression trees but employ multiple regression with the examples at each leaf in a tree. This leads to many different regression models being used throughout a tree. This makes model trees harder to interpret and understand in comparison to regression trees. However, they are normally much more accurate than regression trees.

Both types of trees have the goal of making groups that are as homogeneous as possible. For decision trees, entropy is used to measure the homogeneity of groups. For numeric prediction trees, the standard deviation reduction (SDR) is used. The details of SDR are somewhat complex and technical and will be avoided for that reason.
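As a small illustration of a regression tree, the sketch below fits one in Python with scikit-learn on simulated data; each leaf simply predicts the mean of the training examples that land in it. Model trees are not part of scikit-learn, so they are left out of the sketch.

```python
# A minimal regression tree sketch with scikit-learn.
# Each leaf predicts the mean of the training examples that reach it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))              # one simulated numeric feature
y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)    # non-linear target

tree = DecisionTreeRegressor(max_depth=3)          # a shallow tree stays interpretable
tree.fit(X, y)
print(tree.predict([[2.5]]))                       # average of the leaf that 2.5 falls into
```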

Strengths of Numeric Prediction Trees

Numeric prediction trees do not have the assumptions of linear regression. As such, they can be used to model non-normal and/or non-linear data. In addition, if a dataset has a large number of feature variables, a numeric prediction tree can easily select the most appropriate ones automatically. Lastly, numeric prediction trees also do not need the model to be specified in advance of the analysis.

Weaknesses of Numeric Prediction Trees

This form of analysis requires a large amount of data in the training set in order to develop a testable model. It is also hard to tell which variables are most important in shaping the outcome. Lastly, numeric prediction trees are sometimes hard to interpret. This naturally limits their usefulness among people who lack statistical training.

Conclusion

Numeric prediction trees combine the strengths of decision trees with the ability to digest a large number of numeric variables. This form of machine learning is useful when trying to rate or measure something that is very difficult to rate or measure. However, when possible, it is usually wise to try simpler methods first if permissible.

Classification Rules in Machine Learning

Classification rules represent knowledge in an if-else format. These types of rules involve the terms antecedent and consequent. The antecedent is the “before” and the consequent is the “after”. For example, I may have the following rule.

If the student studies 5 hours a week, then they will pass the class with an A

This simple rule can be broken down into the following antecedent and consequent.

  • Antecedent: if the student studies 5 hours a week
  • Consequent: then they will pass the class with an A

The antecedent determines if the consequent takes place. For example, the student must study 5 hours a week to get an A. This is the rule in this particular context.
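Expressed in code, such a rule is nothing more than a simple if-else statement. The sketch below uses the study-hours example; the threshold and labels are just the values from the rule above.

```python
# The study-hours rule as an if-else statement:
# the antecedent is the condition, the consequent is the predicted outcome.
def predict_grade(hours_studied_per_week: float) -> str:
    if hours_studied_per_week >= 5:   # antecedent
        return "A"                    # consequent
    return "not an A"

print(predict_grade(6))  # "A"
```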

This post will further explain the characteristics and traits of classification rules.

Classification Rules and Decision Trees

Classification rules are developed on current data to make decisions about future actions. They are highly similar to the more common decision trees. The primary difference is that decision trees involve a complex step-by-step process to make a decision.

Classification rules are stand-alone rules that are abstracted from a process. To appreciate a classification rule, you do not need to be familiar with the process that created it. With decision trees, however, you do need to be familiar with the process that generated the decision.

One catch with classification rules in machine learning is that the majority of the variables need to be nominal in nature. As such, classification rules are not as useful for data with many numeric variables. This is not a problem with decision trees.

The Algorithm

Classification rules use algorithms that employ a separate-and-conquer heuristic. What this means is that the algorithm will try to separate the data into smaller and smaller subsets by generating enough rules to make homogeneous subsets. The goal is always to separate the examples in the data set into subgroups that have similar characteristics.

Common algorithms used in classification rules include the One Rule Algorithm and the RIPPER Algorithm. The One Rule Algorithm analyzes data and generates one all-encompassing rule. This algorithm works by finding the single rule that contains the least amount of error. Despite its simplicity, it is surprisingly accurate.

The RIPPER algorithm grows as many rules as possible. When a rule begins to become so complex that it no longer helps to purify the various groups, the rule is pruned, or the part of the rule that is not beneficial is removed. This process of growing and pruning rules is continued until there is no further benefit.

RIPPER algorithm rules are more complex than those of the One Rule Algorithm. This allows for the development of complex models. The drawback is that the rules can become too complex to make practical sense.
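To make the One Rule idea concrete, here is a loose toy sketch (not a faithful implementation of any particular library): for each feature it predicts the majority class for each feature value and keeps the feature whose rule makes the fewest errors. The weather-style data is made up for illustration.

```python
# A toy sketch of the One Rule (OneR) idea: for each feature, predict the
# majority class for each of its values, then keep the feature with the
# fewest total errors. The data below is made up for illustration.
from collections import Counter

data = [
    {"outlook": "sunny",  "windy": "no",  "play": "yes"},
    {"outlook": "sunny",  "windy": "yes", "play": "no"},
    {"outlook": "rainy",  "windy": "no",  "play": "yes"},
    {"outlook": "rainy",  "windy": "yes", "play": "no"},
    {"outlook": "cloudy", "windy": "no",  "play": "yes"},
]

def one_rule(rows, features, target):
    best = None
    for feature in features:
        # Tally the target classes seen for each value of this feature
        counts = {}
        for row in rows:
            counts.setdefault(row[feature], Counter())[row[target]] += 1
        # The rule: each feature value predicts its majority class
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[feature]] != row[target] for row in rows)
        if best is None or errors < best[1]:
            best = (feature, errors, rule)
    return best

print(one_rule(data, ["outlook", "windy"], "play"))
# ('windy', 0, {'no': 'yes', 'yes': 'no'})
```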

Conclusion

Classification rules are a useful way to develop clear principles from the data. The advantage of such an approach is simplicity. However, numeric data is harder to use when trying to develop such rules.

Characteristics of Big Data

In a previous post, we talked about the types of Big Data. However, another way to look at Big Data and define it is by looking at its characteristics. In other words, what identifies Big Data as data that is “big”?

This post will explain the 6 main characteristics of Big Data. These characteristics are often known as the V’s of Big Data. They are as follows

  • Volume
  • Variety
  • Velocity
  • Veracity
  • Valence
  • Value

Volume

Volume has to do with the size of the data. For many people, it is hard to comprehend how volume is measured in computer science when it comes to memory. Most of the computers that the average person uses work in the range of gigabytes. For example, a DVD will hold about 5 gigabytes of data.

It is now becoming more and more common to find people with terabytes of storage. A terabyte is 1,000 gigabytes! At about 5 gigabytes per DVD, this is enough memory to hold roughly 200 DVDs’ worth of data. The next step up is the petabyte, which is 1,000 terabytes, or roughly 200,000 DVDs.

Big Data involves data that is large, as in the examples above. Such massive amounts of data call for new ways of analysis.

Variety

Variety is another term for complexity. Big Data can range from low to high complexity. There was a previous post about structured and unstructured data that we won’t repeat here. The point is that these varying levels of complexity make analysis highly difficult because of the tremendous amount of data munging, or cleaning of the data, that is often necessary.

Velocity

Velocity is the speed at which Big Data is created, stored, and/or analyzed. Two approaches to processing data are batch and real-time. Batch processing involves collecting and cleaning the data in “batches” for processing. It is necessary to wait for all the “batches” to come in before making a decision. As such, this is a slow process.

An alternative is real-time processing. This approach involves streaming the information into machines, which process the data immediately.

The speed at which data needs to be processed is linked directly with the cost. As such, faster may not always be better or necessary.
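A toy contrast of the two approaches is sketched below; the “events” are just made-up numbers. The batch job waits for everything to arrive and computes once, while the streaming job updates its answer as each record comes in.

```python
# A toy contrast of batch vs. real-time (streaming) processing.
events = [3, 7, 2, 9, 4]           # made-up incoming data points

# Batch: wait until everything has arrived, then compute once.
batch_average = sum(events) / len(events)
print("batch average:", batch_average)

# Streaming: update a running average as each event arrives.
count, running_avg = 0, 0.0
for value in events:
    count += 1
    running_avg += (value - running_avg) / count   # incremental mean update
    print("running average after", count, "events:", running_avg)
```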

Veracity

Veracity refers to the quality of the data. If the data is no good, the results are no good. The most reliable data tends to be collected by companies and other forms of enterprise. The next level down is social media data. Finally, the lowest level of data is often data that is captured by sensors. The difference between these levels is often the amount of discrimination involved in what gets collected.

Valence

Valence is a term used in chemistry that has to do with the electrons an element has available for bonding with other elements. This can lead to complex molecules due to elements being interconnected through sharing electrons.

In Big Data, valence is how interconnected the data is. As there are more and more connections among the data, the complexity of the analysis increases.

Value

Value is the ability to convert Big Data information into a monetary reward. For example, if you find a relationship between two products at a point of sale, you can recommend them to customers at a website or put the products next to each other in a store.

A lot of Big Data research is done with a motive of making money. However, there is also a lot of Big Data research that is not driven by a profit motive at all, such as the research being used to analyze the human genome. As such, the “value” characteristic is not always included when talking about the characteristics of Big Data.

Conclusion

Understanding the traits of Big Data allows an individual to identify Big Data when they see it. The traits here are the common ones of Big Data. However, this list is far from exhaustive and there is much more that could be said.

Nearest Neighbor Classification

There are times when the relationships among examples you want to classify are messy and complicated. This makes it difficult to actually classify them. Yet in this same situation, items of the same class have a lot of features in common even though the overall sample is messy. In such a situation, nearest neighbor classification may be useful.

Nearest neighbor classification uses a simple technique to classify unlabeled examples. The algorithm assigns an unlabeled example the label of its nearest labeled example. This is based on the assumption that if two examples are next to each other, they must be of the same class.

In this post, we will look at the characteristics of nearest neighbor classification as well as the strengths and weakness of this approach.

Characteristics

Nearest neighbor classification uses the features of the data set to create a multidimensional feature space. The number of features determines the number of dimensions. Therefore, two features lead to a two-dimensional feature space, three features lead to a three-dimensional feature space, etc. In this feature space, all the examples are placed based on their respective features.

The label of an unknown example is determined by its closest neighbor or neighbors. This calculation is based on Euclidean distance, which is the straight-line distance between two points. The number of neighbors used varies at the discretion of the researcher. For example, we could use one neighbor or several to determine the label of an unlabeled example. There are pros and cons to how many neighbors to use. The more neighbors used, the more complicated the classification becomes.
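A minimal sketch of this with scikit-learn is shown below. The points and labels are made up, the feature space is two-dimensional, and k (the number of neighbors consulted) is set to three.

```python
# A minimal nearest neighbor sketch with scikit-learn.
# Two features create a two-dimensional feature space; n_neighbors controls
# how many neighbors vote on the label of an unlabeled example.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],      # made-up labeled examples
     [8, 8], [8, 9], [9, 8]]
y = ["blue", "blue", "blue", "red", "red", "red"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))   # ['blue' 'red']
```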

Nearest neighbor classification is considered a type of lazy learning. What is meant by lazy is that no abstraction of the data happens. This means there is no real explanation or theory provided by the model to understand why there are certain relationships. Nearest neighbor tells you where the relationships are but not why or how. This is partly due to the fact that it is a non-parametric learning method and provides no parameters (summary statistics) about the data.

Pros and Cons

Nearest neighbor classification has the advantage of being simple, highly effective, and fast during the training phase. There are also no assumptions made about the data distribution. This means that common problems like a lack of normality are not an issue.

Some problems include the lack of a model. This deprives us of insights into the relationships in the data. Another concern is the headache of missing data, which forces you to spend time cleaning the data more thoroughly. One final issue is that the classification phase of a project is slow and computationally demanding, because the distance calculations are deferred until an example actually needs to be classified.

Conclusion

Nearest neighbor classification is one useful tool in machine learning. This approach is valuable for times when the data is heterogeneous but with clear homogeneous groups in the data. In a future post, we will go through an example of this classification approach using R.

The Types of Data in Big Data

A well-known quote in the business world is “cash is king.” Nothing will destroy a business faster than a lack of liquidity to meet a financial emergency. What you are worth may not matter as much as what you can spend.

However, there is now a challenge to this mantra. In the world of data science, there is the belief that data is king. This can potentially make sense, as using data to foresee financial disaster can help people to have cash ready.

In this post, we are going to examine the different types of data in the world of data science. Generally, there are two types of data: unstructured and structured.

Unstructured Data

Unstructured data is data that is produced by people. Normally, this data is text-heavy. Examples of unstructured data include tweets on Twitter, customer feedback on Amazon, blogs, emails, etc. This type of data is very challenging to work with because it is not necessarily in a format suitable for analysis.

Despite the challenges, there are techniques available for using this information to make decisions. Often, the analysis of unstructured data is used to target products and make recommendations for purchases by companies.

Structured Data

Structured data is in many ways the complete opposite of unstructured data. Structured data has a clear format and a specific place for various pieces of data. An Excel document is one example of structured data. A receipt is another example. A receipt has a specific place for different pieces of information such as price, total, date, etc. Often, structured data is made by organizations and machines.

Naturally, analyzing structured data is often much easier than unstructured data. With a consistent format there is less processing required before analysis.

Working With Data

When approaching a project, data often comes from several sources. Normally, the data has to be moved around and consolidated into one space for analysis. When working with unstructured and/or structured data that is coming from several different sources, there is a three-step process used to facilitate this. The process is called ETL, which stands for extract, transform, and load.

Extracting data means taking it from one place and planning to move it somewhere else. Transforming means changing the data in some way or another. For example, this often means organizing it for the purposes of answering research questions. How this is done is context specific.

Load simply means placing all the transformed data into one place for analysis. This is a critical last step as it is helpful to have what you are analyzing in one convenient place. The details of this will be addressed in a future post.
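A bare-bones sketch of the ETL idea in Python is shown below. The file names, the columns, and the SQLite table are all hypothetical placeholders; real pipelines are usually far more involved.

```python
# A bare-bones ETL sketch: extract from CSV files, transform with pandas,
# load into a single SQLite table for analysis. All names are placeholders.
import sqlite3
import pandas as pd

# Extract: pull data out of its original sources
sales = pd.read_csv("store_sales.csv")
online = pd.read_csv("online_sales.csv")

# Transform: consolidate and reshape for the research question
combined = pd.concat([sales, online], ignore_index=True)
combined["total"] = combined["price"] * combined["quantity"]

# Load: place the transformed data in one spot for analysis
with sqlite3.connect("analysis.db") as conn:
    combined.to_sql("sales", conn, if_exists="replace", index=False)
```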

Conclusion

In what may be an interesting contradiction, as we collect more and more data, data is actually becoming more valuable. Normally, an increase in a resource lessens its value, but not with data. Organizations are collecting data at a record-breaking pace in order to anticipate the behavior of people. This predictive power derived from data can lead to significant profits, which leads to the conclusion that perhaps data is now the king.