Tag Archives: Big Data

Exploratory Data Analysis

In data science, exploratory data analysis serves the purpose of assessing whether the data set that you have is suitable for answering the research questions of the project. As such, there are several steps that can be taken to make this process more efficient.

Therefore, the purpose of this post is to explain one process that can be used for exploratory data analysis. The steps include the following.

  • Consult your questions
  • Check the structure of the dataset
  • Use visuals

Consult Your Questions

Research questions give a project a sense of direction. They help you to know what you want to know. In addition, research questions help you to determine what type of analysis to conduct as well.

During the data exploration stage, the purpose of a research question is not analysis but rather to determine if your data can actually provide answers to the questions. For example, if you want to know what the average height of men in America is and your data tells you the salary of office workers, there is a problem. Your question (average height) cannot be answered with the current data that you have (office workers' salaries).

As such, the research questions need to be answerable and specific before moving forward. By answerable, we mean that the data can provide the solution. By specific, we mean a question that moves away from generalities and deals with a clearly defined phenomenon. For example, “What is the average height of males aged 20-30 in the United States?” This question clearly identifies what we want to know (average height) and among whom (American males aged 20-30).

Not only can you confirm whether your questions are answerable, you can also decide if you need to be more or less specific with your questions. Returning to our average height question, we may find that we can be more specific and check average height by state if we want. Or we might learn that we can only determine the average height for a region. All of this depends on the type of data we have.

Check the Structure

Checking the structure involves determining how many rows and columns are in the dataset and the sample size, as well as looking for missing and erroneous data. Data sets in data science almost always need some sort of cleaning or data wrangling before analysis, and checking the structure helps to determine what needs to be done.

You should have a priori expectations for the structure of the data set. If the stakeholders tell you that there should be several million rows in the data set and you check and find only several thousand, you know there is a problem. This concept applies to the number of features you expect as well.
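Below is a minimal sketch of such a structure check, assuming the data has been loaded into a pandas DataFrame. The file name and the expected row and column counts are hypothetical placeholders.

```python
# A minimal sketch of a structure check with pandas.
# "survey_data.csv" and the expected counts are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("survey_data.csv")

# How many rows and columns do we have, and what type is each feature?
print(df.shape)        # (rows, columns)
print(df.dtypes)       # data type of each feature

# Where is data missing, and how much?
print(df.isna().sum())

# Simple sanity check against a priori expectations
expected_rows, expected_cols = 2_000_000, 25
if df.shape[0] < expected_rows or df.shape[1] != expected_cols:
    print("Structure does not match what the stakeholders described.")
```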

Make Visuals

Visuals, which can be plots or tables, help you further develop your expectations as well as to look for deviations or outliers. Tables are an excellent source for summarizing data. Plots, on the other hand, allow you to see deviations from your expectations in the data.

What kind of tables and plots to make depends heavily on the type of data as well as the type of questions that you have. For example, for descriptive questions, tables of summary statistics with bar plots might be sufficient. For comparison questions, summary statistics and boxplots may be enough. For relationship questions, summary statistic tables with a scatterplot may be enough. Please keep in mind that in practice it is often more complicated than this.
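As a rough sketch of this idea, the snippet below produces a summary table, a boxplot for a comparison question, and a scatterplot for a relationship question. The file and column names (height, state, weight) are hypothetical placeholders.

```python
# A rough sketch of quick tables and plots during exploration.
# The file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("survey_data.csv")

# Table of summary statistics for a descriptive question
print(df["height"].describe())

# Boxplot by group for a comparison question
df.boxplot(column="height", by="state")
plt.show()

# Scatterplot for a relationship question
df.plot.scatter(x="weight", y="height")
plt.show()
```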

Conclusion

Before questions can be answered the data needs to be explored. This will help to make sure that the potential answers that are developed are appropriate.


Regularized Linear Regression

Traditional linear regression has been a tried and true model for making predictions for decades. However, with the growth of Big Data and datasets with hundreds of variables, problems have begun to arise. For example, using stepwise or best subset selection with regression can take hours, if not days, to converge on even some of the best computers.

To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts

  • Definition of regularized regression
  • Ridge regression
  • Lasso regression
  • Elastic net regression

Regularization

Regularization involves adding a shrinkage penalty to the residual sum of squares (RSS) that the model tries to minimize. This is done through selecting a value for a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed.

The lambda is multiplied by the normalized coefficients of the model and added to the RSS. Below is an equation of what was just said

RSS + λ(normalized coefficients)

The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of searching through a huge number of candidate models, as best subset selection does (with k variables there are 2^k possible subsets, so 50 variables would mean over a quadrillion models!), with regularization only one model is developed for each value of lambda you specify.

Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.

Finally, regularization can help to reduce or eliminate any multicollinearity in a model. As such, the benefits of using regularization make it clear that it should be considered when working with larger data sets.

Ridge Regression

Ridge regression penalizes the sum of the squared normalized coefficients, as shown in the equation below

RSS + λ(Σ(normalized coefficients)^2)

This is also referred to as the L2-norm. As lambda increases in value, the coefficients in the model are shrunk toward zero but never reach zero. The higher the lambda, the more the coefficients are shrunk, which reduces the variance of the model at the cost of some bias.

The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead, the coefficients are only shrunk toward zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain.
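As a rough illustration, ridge regression can be fit with scikit-learn in Python. The data here is simulated, and note that scikit-learn names its penalty-strength parameter alpha, which plays the role of the lambda discussed above.

```python
# A minimal ridge regression sketch with scikit-learn.
# Note: scikit-learn calls the lambda tuning parameter "alpha".
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))                  # simulated predictors
y = X @ rng.normal(size=10) + rng.normal(size=100)

X_scaled = StandardScaler().fit_transform(X)    # standardize so the penalty treats features equally

ridge = Ridge(alpha=1.0)                        # alpha here plays the role of lambda
ridge.fit(X_scaled, y)
print(ridge.coef_)                              # shrunk toward zero, but never exactly zero
```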

Lasso

Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm, which is the sum of the absolute values of the coefficients, as shown in the equation below

RSS + λ(Σ|normalized coefficients|)

This shrinkage penalty will reduce a coefficient all the way to zero, which is another way of saying that the variable is removed from the model. One problem is that highly correlated variables that need to be in your model may be removed when lasso shrinks coefficients. This is one reason why ridge regression is still used.
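A similar sketch for lasso, using the same kind of simulated data; larger values of the penalty (again called alpha in scikit-learn) push more coefficients exactly to zero.

```python
# A minimal lasso sketch with scikit-learn, using simulated data as above.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1)      # alpha plays the role of lambda
lasso.fit(X_scaled, y)
print(lasso.coef_)            # small coefficients may be set exactly to 0.0
```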

Elastic Net

Elastic net combines the strengths of ridge and lasso while avoiding some of the weaknesses of each. It can remove variables from the model as lasso does and ridge does not, while also handling groups of correlated variables as ridge does and lasso does not.

This is done by including a second tuning parameter called “alpha”. If alpha is set to 0, it is the same as ridge regression, and if alpha is set to 1, it is the same as lasso regression. For those who are interested, below is the formula used for elastic net regression

RSS + λ[(1 – alpha)(Σ(normalized coefficients)^2)/2 + alpha(Σ|normalized coefficients|)]

As such, when working with elastic net, you have to set two different tuning parameters (alpha and lambda) in order to develop a model.
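Below is a hedged scikit-learn sketch. Be aware of the naming mismatch: scikit-learn's alpha corresponds to the lambda described in this post, while its l1_ratio corresponds to the post's alpha mixing parameter.

```python
# A minimal elastic net sketch with scikit-learn. Naming note:
# scikit-learn's "alpha" is the post's lambda (overall penalty strength),
# and "l1_ratio" is the post's alpha (the ridge/lasso mixing parameter).
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)
X_scaled = StandardScaler().fit_transform(X)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # l1_ratio=1 behaves like lasso, 0 like ridge
enet.fit(X_scaled, y)
print(enet.coef_)
```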

Conclusion

Regularized regression was developed as an answer to the growth in the size and number of variables in data sets today. Ridge, lasso, and elastic net all provide solutions for converging over large datasets and selecting features.

Numeric Prediction Trees

Decision trees are used for classifying examples into distinct classes or categories, such as pass/fail, win/lose, buy/sell/trade, etc. However, as we all know, categories are just one form of outcome in machine learning. Sometimes we want to make numeric predictions.

Making numeric predictions with trees involves the use of regression trees or model trees. In this post, we will look at each of these forms of numeric prediction with the use of trees.

Regression Trees and Model Trees

Regression trees have been around since the 1980s. They work by predicting the average value of the examples that reach a given leaf in the tree. Despite their name, there is no regression involved with regression trees. Regression trees are straightforward to interpret but at the expense of accuracy.

Model trees are similar to regression trees but employ multiple regression with the examples at each leaf in a tree. This leads to many different regression models being used throughout a tree. This makes model trees harder to interpret and understand in comparison to regression trees. However, they are normally much more accurate than regression trees.

Both types of trees have the goal of making groups that are as homogeneous as possible. For decision trees, entropy is used to measure the homogeneity of groups. For numeric prediction trees, the standard deviation reduction (SDR) is used. The details of SDR are somewhat complex and technical and will be avoided for that reason.
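As a small illustration of a regression tree, the sketch below fits one in Python with scikit-learn on simulated data; each leaf simply predicts the mean of the training examples that land in it. Model trees are not part of scikit-learn, so they are left out of the sketch.

```python
# A minimal regression tree sketch with scikit-learn.
# Each leaf predicts the mean of the training examples that reach it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))              # one simulated numeric feature
y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)    # non-linear target

tree = DecisionTreeRegressor(max_depth=3)          # a shallow tree stays interpretable
tree.fit(X, y)
print(tree.predict([[2.5]]))                       # average of the leaf that 2.5 falls into
```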

Strengths of Numeric Prediction Trees

Numeric prediction trees do not have the assumptions of linear regression. As such, they can be used to model non-normal and/or non-linear data. In addition, if a dataset has a large number of feature variables, a numeric prediction tree can easily select the most appropriate ones automatically. Lastly, numeric prediction trees also do not need the model to be specified in advance of the analysis.

Weaknesses of Numeric Prediction Trees

This form of analysis requires a large amount of data in the training set in order to develop a testable model. It is also hard to tell which variables are most important in shaping the outcome. Lastly, numeric prediction trees are sometimes hard to interpret. This naturally limits their usefulness among people who lack statistical training.

Conclusion

Numeric prediction trees combine the strengths of decision trees with the ability to digest a large number of numeric variables. This form of machine learning is useful when trying to rate or measure something that is very difficult to rate or measure. However, when possible, it is usually wise to try simpler methods first if permissible.

Classification Rules in Machine Learning

Classification rules represent knowledge in an if-else format. These types of rules involve the terms antecedent and consequent. The antecedent is the “before” and the consequent is the “after”. For example, I may have the following rule.

If the student studies 5 hours a week, then they will pass the class with an A

This simple rule can be broken down into the following antecedent and consequent.

  • Antecedent: if the student studies 5 hours a week
  • Consequent: then they will pass the class with an A

The antecedent determines if the consequent takes place. For example, the student must study 5 hours a week to get an A. This is the rule in this particular context.
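Expressed in code, such a rule is nothing more than a simple if-else statement. The sketch below uses the study-hours example; the threshold and labels are just the values from the rule above.

```python
# The study-hours rule as an if-else statement:
# the antecedent is the condition, the consequent is the predicted outcome.
def predict_grade(hours_studied_per_week: float) -> str:
    if hours_studied_per_week >= 5:   # antecedent
        return "A"                    # consequent
    return "not an A"

print(predict_grade(6))  # "A"
```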

This post will further explain the characteristics and traits of classification rules.

Classification Rules and Decision Trees

Classification rules are developed on current data to make decisions about future actions. They are highly similar to the more common decision trees. The primary difference is that decision trees involve a complex step-by-step process to make a decision.

Classification rules are stand-alone rules that are abstracted from a process. To appreciate a classification rule, you do not need to be familiar with the process that created it. With decision trees, however, you do need to be familiar with the process that generated the decision.

One catch with classification rules in machine learning is that the majority of the variables need to be nominal in nature. As such, classification rules are not as useful for data with many numeric variables. This is not a problem with decision trees.

The Algorithm

Classification rules use algorithms that employ a separate-and-conquer heuristic. What this means is that the algorithm will try to separate the data into smaller and smaller subsets by generating enough rules to make homogeneous subsets. The goal is always to separate the examples in the data set into subgroups that have similar characteristics.

Common algorithms used in classification rules include the One Rule Algorithm and the RIPPER Algorithm. The One Rule Algorithm analyzes data and generates one all-encompassing rule. This algorithm works by finding the single rule that contains the least amount of error. Despite its simplicity, it is surprisingly accurate.

The RIPPER algorithm grows as many rules as possible. When a rule begins to become so complex that it no longer helps to purify the various groups, the rule is pruned, or the part of the rule that is not beneficial is removed. This process of growing and pruning rules is continued until there is no further benefit.

RIPPER algorithm rules are more complex than those of the One Rule Algorithm. This allows for the development of complex models. The drawback is that the rules can become too complex to make practical sense.
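To make the One Rule idea concrete, here is a loose toy sketch (not a faithful implementation of any particular library): for each feature it predicts the majority class for each feature value and keeps the feature whose rule makes the fewest errors. The weather-style data is made up for illustration.

```python
# A toy sketch of the One Rule (OneR) idea: for each feature, predict the
# majority class for each of its values, then keep the feature with the
# fewest total errors. The data below is made up for illustration.
from collections import Counter

data = [
    {"outlook": "sunny",  "windy": "no",  "play": "yes"},
    {"outlook": "sunny",  "windy": "yes", "play": "no"},
    {"outlook": "rainy",  "windy": "no",  "play": "yes"},
    {"outlook": "rainy",  "windy": "yes", "play": "no"},
    {"outlook": "cloudy", "windy": "no",  "play": "yes"},
]

def one_rule(rows, features, target):
    best = None
    for feature in features:
        # Tally the target classes seen for each value of this feature
        counts = {}
        for row in rows:
            counts.setdefault(row[feature], Counter())[row[target]] += 1
        # The rule: each feature value predicts its majority class
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[feature]] != row[target] for row in rows)
        if best is None or errors < best[1]:
            best = (feature, errors, rule)
    return best

print(one_rule(data, ["outlook", "windy"], "play"))
# ('windy', 0, {'no': 'yes', 'yes': 'no'})
```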

Conclusion

Classification rules are a useful way to develop clear principles from the data. The advantage of such an approach is simplicity. However, numeric data is harder to use when trying to develop such rules.

Characteristics of Big Data

In a previous post, we talked about the types of Big Data. However, another way to look at Big Data and define it is by looking at its characteristics. In other words, what identifies Big Data as data that is “big”?

This post will explain the 6 main characteristics of Big Data. These characteristics are often known as the V’s of Big Data. They are as follows

  • Volume
  • Variety
  • Velocity
  • Veracity
  • Valence
  • Value

Volume

Volume has to do with the size of the data. For many people, it is hard to comprehend how volume is measured in computer science when it comes to memory. Most of the computers that the average person uses work in the range of gigabytes. For example, a DVD will hold about 5 gigabytes of data.

It is now becoming more and more common to find people with terabytes of storage. A terabyte is 1,000 gigabytes! At about 5 gigabytes per DVD, this is enough memory to hold roughly 200 DVDs’ worth of data. The next step up is the petabyte, which is 1,000 terabytes, or roughly 200,000 DVDs.

Big Data involves data that is large, as in the examples above. Such massive amounts of data call for new ways of analysis.

Variety

Variety is another term for complexity. Big Data can range from low to high complexity. There was a previous post about structured and unstructured data that we won’t repeat here. The point is that these varying levels of complexity make analysis highly difficult because of the tremendous amount of data munging, or cleaning of the data, that is often necessary.

Velocity

Velocity is the speed at which Big Data is created, stored, and/or analyzed. Two approaches to processing data are batch and real-time. Batch processing involves collecting and cleaning the data in “batches” for processing. It is necessary to wait for all the “batches” to come in before making a decision. As such, this is a slow process.

An alternative is real-time processing. This approach involves streaming the information into machines, which process the data immediately.

The speed at which data needs to be processed is linked directly with the cost. As such, faster may not always be better or necessary.
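A toy contrast of the two approaches is sketched below; the “events” are just made-up numbers. The batch job waits for everything to arrive and computes once, while the streaming job updates its answer as each record comes in.

```python
# A toy contrast of batch vs. real-time (streaming) processing.
events = [3, 7, 2, 9, 4]           # made-up incoming data points

# Batch: wait until everything has arrived, then compute once.
batch_average = sum(events) / len(events)
print("batch average:", batch_average)

# Streaming: update a running average as each event arrives.
count, running_avg = 0, 0.0
for value in events:
    count += 1
    running_avg += (value - running_avg) / count   # incremental mean update
    print("running average after", count, "events:", running_avg)
```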

Veracity

Veracity refers to the quality of the data. If the data is no good, the results are no good. The most reliable data tends to be collected by companies and other forms of enterprise. The next level down is social media data. Finally, the lowest level of data is often data that is captured by sensors. The difference between these levels is often the amount of discrimination involved in what gets collected.

Valence

Valence is a term used in chemistry that has to do with the electrons an element has available for bonding with other elements. This can lead to complex molecules due to elements being interconnected through sharing electrons.

In Big Data, valence is how interconnected the data is. As there are more and more connections among the data, the complexity of the analysis increases.

Value

Value is the ability to convert Big Data information into a monetary reward. For example, if you find a relationship between two products at a point of sale, you can recommend them to customers at a website or put the products next to each other in a store.

A lot of Big Data research is done with a motive of making money. However, there is also a lot of Big Data research that is not driven by a profit motive at all, such as the research being used to analyze the human genome. As such, the “value” characteristic is not always included when talking about the characteristics of Big Data.

Conclusion

Understanding the traits of Big Data allows an individual to identify Big Data when they see it. The traits here are the common ones of Big Data. However, this list is far from exhaustive and there is much more that could be said.

Nearest Neighbor Classification

There are times when the relationships among examples you want to classify are messy and complicated. This makes it difficult to actually classify them. Yet in this same situation, items of the same class have a lot of features in common even though the overall sample is messy. In such a situation, nearest neighbor classification may be useful.

Nearest neighbor classification uses a simple technique to classify unlabeled examples. The algorithm assigns an unlabeled example the label of its nearest labeled example. This is based on the assumption that if two examples are next to each other, they must be of the same class.

In this post, we will look at the characteristics of nearest neighbor classification as well as the strengths and weakness of this approach.

Characteristics

Nearest neighbor classification uses the features of the data set to create a multidimensional feature space. The number of features determines the number of dimensions. Therefore, two features lead to a two-dimensional feature space, three features lead to a three-dimensional feature space, etc. In this feature space, all the examples are placed based on their respective features.

The label of an unknown example is determined by its closest neighbor or neighbors. This calculation is based on Euclidean distance, which is the straight-line distance between two points. The number of neighbors used varies at the discretion of the researcher. For example, we could use one neighbor or several to determine the label of an unlabeled example. There are pros and cons to how many neighbors to use. The more neighbors used, the more complicated the classification becomes.
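A minimal sketch of this with scikit-learn is shown below. The points and labels are made up, the feature space is two-dimensional, and k (the number of neighbors consulted) is set to three.

```python
# A minimal nearest neighbor sketch with scikit-learn.
# Two features create a two-dimensional feature space; n_neighbors controls
# how many neighbors vote on the label of an unlabeled example.
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1],      # made-up labeled examples
     [8, 8], [8, 9], [9, 8]]
y = ["blue", "blue", "blue", "red", "red", "red"]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))   # ['blue' 'red']
```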

Nearest neighbor classification is considered a type of lazy learning. What is meant by lazy is that no abstraction of the data happens. This means there is no real explanation or theory provided by the model to understand why there are certain relationships. Nearest neighbor tells you where the relationships are but not why or how. This is partly due to the fact that it is a non-parametric learning method and provides no parameters (summary statistics) about the data.

Pros and Cons

Nearest neighbor classification has the advantage of being simple, highly effective, and fast during the training phase. There are also no assumptions made about the data distribution. This means that common problems like a lack of normality are not an issue.

Some problems include the lack of a model. This deprives us of insights into the relationships in the data. Another concern is the headache of missing data, which forces you to spend time cleaning the data more thoroughly. One final issue is that the classification phase of a project is slow and computationally demanding, because the distance calculations are deferred until an example actually needs to be classified.

Conclusion

Nearest neighbor classification is one useful tool in machine learning. This approach is valuable for times when the data is heterogeneous but with clear homogeneous groups in the data. In a future post, we will go through an example of this classification approach using R.

The Types of Data in Big Data

A well-known quote in the business world is “cash is king.” Nothing will destroy a business faster than a lack of liquidity to meet a financial emergency. What you are worth may not matter as much as what you can spend.

However, there is now a challenge to this mantra. In the world of data science, there is the belief that data is king. This can potentially make sense, as using data to foresee financial disaster can help people to have cash ready.

In this post, we are going to examine the different types of data in the world of data science. Generally, there are two types of data: unstructured and structured.

Unstructured Data

Unstructured data is data that is produced by people. Normally, this data is text-heavy. Examples of unstructured data include tweets on Twitter, customer feedback on Amazon, blogs, emails, etc. This type of data is very challenging to work with because it is not necessarily in a format suitable for analysis.

Despite the challenges, there are techniques available for using this information to make decisions. Often, the analysis of unstructured data is used to target products and make recommendations for purchases by companies.

Structured Data

Structured data is in many ways the complete opposite of unstructured data. Structured data has a clear format and a specific place for various pieces of data. An Excel document is one example of structured data. A receipt is another example. A receipt has a specific place for different pieces of information such as price, total, date, etc. Often, structured data is made by organizations and machines.

Naturally, analyzing structured data is often much easier than unstructured data. With a consistent format there is less processing required before analysis.

Working With Data

When approaching a project, data often comes from several sources. Normally, the data has to be moved around and consolidated into one space for analysis. When working with unstructured and/or structured data that is coming from several different sources, there is a three-step process used to facilitate this. The process is called ETL, which stands for extract, transform, and load.

Extracting data means taking it from one place and planning to move it somewhere else. Transforming means changing the data in some way or another. For example, this often means organizing it for the purposes of answering research questions. How this is done is context specific.

Load simply means placing all the transformed data into one place for analysis. This is a critical last step as it is helpful to have what you are analyzing in one convenient place. The details of this will be addressed in a future post.
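A bare-bones sketch of the ETL idea in Python is shown below. The file names, the columns, and the SQLite table are all hypothetical placeholders; real pipelines are usually far more involved.

```python
# A bare-bones ETL sketch: extract from CSV files, transform with pandas,
# load into a single SQLite table for analysis. All names are placeholders.
import sqlite3
import pandas as pd

# Extract: pull data out of its original sources
sales = pd.read_csv("store_sales.csv")
online = pd.read_csv("online_sales.csv")

# Transform: consolidate and reshape for the research question
combined = pd.concat([sales, online], ignore_index=True)
combined["total"] = combined["price"] * combined["quantity"]

# Load: place the transformed data in one spot for analysis
with sqlite3.connect("analysis.db") as conn:
    combined.to_sql("sales", conn, if_exists="replace", index=False)
```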

Conclusion

In what may be an interesting contradiction, as we collect more and more data, data is actually becoming more valuable. Normally, an increase in a resource lessens its value, but not with data. Organizations are collecting data at a record-breaking pace in order to anticipate the behavior of people. This predictive power derived from data can lead to significant profits, which leads to the conclusion that perhaps data is now the king.