You have probably solved a high-school probability problem in which you had to find the probability of drawing a ball of a specific color from a bag containing differently colored balls, given the number of balls of each color. Random forests are simple to learn if we keep this analogy in mind.
A random forest (RF) is basically a bag containing n decision trees (DTs), each with a different set of hyper-parameters and trained on a different subset of the data. Let's say I have 100 decision trees in my random forest bag! Because each tree has its own hyper-parameters and its own subset of training data, the predictions these trees give can vary a lot. Suppose I have somehow trained all 100 trees on their respective subsets. Now I ask every tree in my bag for its prediction on my test data. Since we need a single decision for each test example, we take a simple vote: we go with whatever the majority of the trees have predicted for that example.
In the picture above, we can see how an example is classified using n trees, with the final prediction made by taking a vote across all n trees.
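The voting step can be sketched in a few lines. This is a minimal illustration, assuming each tree's prediction is already available as a class label (the labels and the five-tree forest here are made up):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from a 5-tree forest for one test example:
tree_predictions = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(tree_predictions))  # cat
```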
Random forests can be used for both regression and classification tasks, but we will discuss classification because it is more intuitive and easier to understand. Random forest is one of the most widely used algorithms because of its simplicity and stability.
The word "random" comes into the picture while building the subsets of data for the trees. A subset is made by randomly selecting x features (columns) and y examples (rows) from the original dataset of n features and m examples.
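Drawing one such random subset can be sketched with NumPy. The dataset shape and the subset sizes below are made up for illustration; rows are drawn with replacement (a bootstrap sample) and features without:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 10))  # m = 100 examples, n = 10 features

row_idx = rng.choice(100, size=80, replace=True)  # y = 80 rows, with replacement
col_idx = rng.choice(10, size=3, replace=False)   # x = 3 features, without replacement

subset = X[row_idx][:, col_idx]
print(subset.shape)  # (80, 3)
```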
Random forests are more stable and reliable than a single decision tree. It is like saying it's better to take a vote from all the cabinet ministers than to simply accept the decision of the PM alone.
As we have seen, random forests are nothing but collections of decision trees, so it is essential to understand decision trees first. Brush up on them if you haven't already!
In general, the more trees in the forest, the more robust it is. In the same way, for a random forest classifier, a higher number of trees tends to give more stable and accurate results, with diminishing returns beyond a point.
Why the random forest algorithm?
A random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could just call a "forest"), this model uses two key concepts that give it the name random:
- Random sampling of training data points when building trees
- Random subsets of features considered when splitting nodes
The reasons we use random forest algorithm are:
- The same random forest algorithm (or random forest classifier) can be used for both classification and regression tasks.
- Some random forest implementations can handle missing values.
- Adding more trees to the forest does not make a random forest classifier more prone to overfitting.
- The random forest classifier can also model categorical values.
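The classification/regression point above is easy to see in code. A minimal sketch, assuming scikit-learn is installed; the toy dataset is generated, not real:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Made-up toy data: 200 examples, 8 features, 2 classes
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Same algorithm, classification flavour...
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X, y)
print(clf.predict(X[:3]))

# ...and regression flavour (treating the labels as continuous targets)
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X, y.astype(float))
print(reg.predict(X[:3]))
```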
Random Forest vs. Decision Tree
Let’s explore this with an easy example.
Suppose you have to buy a $5 packet of cupcakes, and you must decide among several brands.
You use a decision tree algorithm. It checks the $5 packets and probably picks the best-selling one. You decide to go for the $5 chocolate cupcakes. You are happy!
But your friend uses the random forest approach: he makes several independent decisions and then goes with the majority. He compares strawberry, vanilla, blueberry, and orange flavoured cupcakes, and he notices that one particular $5 packet sold 3 units more than the one you picked: the vanilla-chocolate one. He buys those vanilla-chocolate cupcakes. He is the happiest, while you are left to regret your decision.
Decision Tree :
A decision tree is a supervised learning algorithm used in machine learning. It works for both classification and regression. As the name suggests, it is like a tree with nodes, and the branches depend on a number of criteria. It splits the data into branches like these until a stopping threshold is reached. A decision tree has a root node, internal (child) nodes, and leaf nodes.
The nodes are traversed recursively; you need no other algorithm. It handles data accurately, works well when the pattern is roughly linear, copes with large datasets easily, and takes little time to train.
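A single decision tree can be built in a few lines. This is a minimal sketch, assuming scikit-learn; the iris dataset and the depth limit of 3 are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One tree: recursive splits from the root node down to the leaves
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(tree.predict(X[:2]))  # predictions for the first two samples
```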
Random Forest :
It is also a supervised learning method, but a very powerful and very widely used one. The basic difference is that it does not rely on a single decision: it assembles many randomized decisions and makes the final decision based on the majority.
It does not search for the single best prediction. Instead, it makes multiple random predictions, so more diversity is involved and the final prediction becomes much smoother.
You can infer Random forest to be a collection of multiple decision trees!
Bagging is the process used to build random forests; the trees' decisions are made in parallel.
What is Bagging?
- Take a bootstrap sample of the training data set
- Build a decision tree on it
- Repeat the process for a fixed number of trees
- Take a majority vote; the class that wins is your decision
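The steps above can be sketched directly. A minimal bagging implementation, assuming scikit-learn; the tree count (25) and the toy data are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

trees = []
for _ in range(25):  # repeat for a fixed number of trees
    # Bootstrap sample: draw rows with replacement
    idx = rng.choice(len(X), size=len(X), replace=True)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote across trees for each example
votes = np.array([t.predict(X) for t in trees])              # shape (25, 150)
forest_pred = np.array([np.bincount(col).argmax() for col in votes.T])
print((forest_pred == y).mean())  # training accuracy of the bagged ensemble
```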
What is Bootstrapping?
Bootstrapping is randomly choosing samples from the training data with replacement. This is a random procedure.
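"With replacement" means some rows may appear more than once in a sample while others are left out entirely. A tiny NumPy sketch (the 10-row toy dataset is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.arange(10)  # pretend these are the row indices of a training set

# Bootstrap sample: same size as the data, drawn with replacement
sample = rng.choice(data, size=10, replace=True)
print(sample)  # duplicates are expected; some original rows will be missing
```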
Random Forest Step by Step (in simple terms) :
- Randomly choose subsets of the data and features
- Build a decision tree on each subset, starting from its root node
- Together, the trees form a forest
Advantages of Random Forest:
- Powerful and highly accurate
- No need to normalize features
- Can handle several features at once
- Trees can be trained in parallel
Disadvantages of Random Forest:
- Sometimes biased toward certain features
- Not well suited to extrapolating linear relationships
- Not good for high dimensional data
P.S. – A decision tree is much simpler than a random forest. A decision tree combines some decisions, whereas a random forest combines several decision trees. The forest is therefore a longer and slower process.
A decision tree, by contrast, is fast and operates easily on large data sets, especially roughly linear ones. The random forest model needs rigorous training, and when a project calls for more than one model, more random forests mean more time.
It depends on your requirements. If you have little time to work on a model, you may be bound to choose a decision tree. However, stability and reliable predictions are in the random forest's basket.
A really good article on the implementation of random forests: https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76