Random Forest classification algorithm in Python

by Alex

Random forest (RF) is a supervised learning algorithm. It can be used for both classification and regression, and it is one of the most flexible and easy-to-use algorithms. A forest consists of trees, and it is said that the more trees a forest has, the more robust it is. RF builds decision trees on randomly chosen data samples, gets a prediction from each tree, and chooses the best solution through voting. It also provides a fairly effective measure of feature importance. Random forest has many applications, such as recommendation engines, image classification, and feature selection. It can be used to identify reliable loan applicants, detect fraud, and predict disease. It is also the basis of the Boruta algorithm, which selects the most significant features in a dataset.

Random Forest Algorithm

Let’s understand the random forest algorithm using a non-technical analogy. Suppose you decide to go on a trip and want to pick a place you are sure to like. How do you choose the right place? You can look for information online, reading reviews and opinions on travel blogs, Q&A sites, and travel portals, or you can simply ask your friends. Suppose you decide to ask your friends about their travel experiences. Each friend gives you a recommendation, and you compile a list of possible destinations from them. Then you ask your friends to vote, that is, to choose the best option for a trip from the list you compiled. The place with the most votes becomes your final choice. This decision-making process consists of two parts.

  • The first part consists of asking friends about their individual experiences and getting a recommendation based on the places each friend has visited. This part corresponds to the decision tree algorithm: each friend chooses only one option among the locations he or she knows.
  • The second part is a voting procedure to determine the best place, conducted after all the recommendations have been collected. Voting means choosing the best location from those recommended, based on your friends’ experiences. The whole process (parts one and two), from collecting recommendations to voting for the best fit, corresponds to the random forest algorithm.

Technically, random forest is an ensemble method (based on the divide-and-conquer approach) that combines decision trees built on random samples of the dataset. The set of such classifier trees forms a forest. Each individual decision tree is grown using an attribute selection measure, such as information gain, gain ratio, or the Gini index, to choose the feature for each split. Every tree is built from an independent random sample. In a classification task, each tree votes and the most popular class is selected as the final result. In the case of regression, the final result is the average of all the tree outputs. Random forest is often simpler to use than other nonlinear classification algorithms and is competitive with them in accuracy.
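As a rough illustration of the regression case described above, the sketch below checks that the prediction of a scikit-learn random forest regressor matches the plain average of its individual trees. The synthetic dataset, the number of trees, and the variable names are assumptions made for this example and do not come from the article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used here only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Predictions of every individual tree, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# Average the trees by hand and compare with the forest's own output.
manual_mean = per_tree.mean(axis=0)
forest_pred = forest.predict(X[:5])
print(np.allclose(manual_mean, forest_pred))
```

The fitted trees are available through the estimators_ attribute, which makes it possible to inspect the ensemble member by member.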

How does the random forest work?

The algorithm consists of four stages:

  1. Create random samples from a given set of data.
  2. For each sample, build a decision tree and get a prediction result using this tree.
  3. Collect the predictions of all trees and count them as votes.
  4. Choose the prediction with the most votes as the final result.
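The four stages above can be sketched by hand with plain decision trees. This is a simplified illustration under stated assumptions, not the way scikit-learn implements its forest: the number of trees, the random seeds, and the use of max_features="sqrt" are choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rng = np.random.default_rng(0)
n_trees = 25  # illustrative choice
trees = []
for _ in range(n_trees):
    # Stage 1: draw a bootstrap sample (rows sampled with replacement).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Stage 2: fit one tree per sample; max_features="sqrt" adds
    # randomness in the features considered at each split.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Stage 3: collect one vote per tree for every test row.
votes = np.stack([tree.predict(X_test) for tree in trees]).astype(int)

# Stage 4: the most frequent class in each column wins.
y_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
manual_accuracy = (y_pred == y_test).mean()
print("Accuracy of the hand-built forest:", manual_accuracy)
```

Note that scikit-learn's RandomForestClassifier combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so its results can differ slightly from this sketch.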


Search for important features

Random forest also offers a good criterion for feature selection. A fitted scikit-learn random forest model exposes an additional attribute, feature_importances_, which shows the relative importance, that is, the contribution of each feature to the prediction. The library calculates an importance score for each feature automatically during training and then normalizes the scores so that they sum to 1. These scores help you keep the most relevant features and discard the least important ones when building a model. Random forest uses Gini importance, also known as mean decrease in impurity (MDI), to calculate the importance of each feature. Gini importance is sometimes also described as the total decrease in node impurity: it measures how much the splits on a given feature reduce impurity, weighted by the number of samples reaching each split and averaged over all trees. The greater the decrease, the more significant the feature. The mean decrease is therefore a useful quantity for variable selection, and it also gives an overall picture of the explanatory power of the features.
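To make the idea of impurity decrease concrete, here is a minimal sketch that computes the Gini impurity of a node and the reduction achieved by a split. The helper names gini_impurity and impurity_decrease and the toy labels are hypothetical and exist only for this example. scikit-learn computes the same kind of quantity internally for every split and then aggregates it per feature.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

# Hypothetical toy node: 4 samples of class 0 and 4 of class 1.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# A perfect split separates the classes completely.
perfect = impurity_decrease(parent, parent[:4], parent[4:])
# A useless split leaves both children as mixed as the parent.
useless = impurity_decrease(parent, parent[[0, 1, 4, 5]], parent[[2, 3, 6, 7]])
print(perfect, useless)  # 0.5 0.0
```

A feature that is used for many splits like the first one accumulates a high importance score, while a feature whose splits resemble the second one contributes almost nothing.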

Comparison of Random Forests and Decision Trees

  • A random forest is a set of many decision trees.
  • Deep decision trees can suffer from overfitting, but a random forest reduces overfitting by building its trees on random samples of the data.
  • Decision trees are computationally faster than random forests.
  • A random forest is difficult to interpret, while a decision tree is easy to interpret and transform into rules.
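The accuracy side of this comparison can be checked empirically. The sketch below evaluates a single decision tree and a random forest with 5-fold cross-validation. The choice of cross-validation, the fold count, and the random seeds are assumptions made for the example. On a small, easy dataset such as iris the two models may score about the same.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # Each model is trained and scored on 5 different train/test partitions.
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```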

Creating a classifier using Scikit-learn

You will build a model based on the iris flower dataset, a very well-known classification dataset. It includes the length and width of the sepal, the length and width of the petal, and the species of the flower. There are three species (classes) of irises: Setosa, Versicolor, and Virginica. You will build a model that determines the species of a flower from these four measurements. The dataset is available in the scikit-learn library, or you can download it from the UCI Machine Learning Repository. We start by importing datasets from scikit-learn and loading the iris dataset with load_iris().
from sklearn import datasets
# load dataset
iris = datasets.load_iris()
You can print the names of the target classes and the features to make sure this is the dataset you want:
print(iris.target_names)
print(iris.feature_names)
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
It is always advisable to explore your data at least a little, so that you know what you are working with. Printing iris.data[0:5] and iris.target shows the first five rows of the dataset and all the values of the target variable. Below we create a DataFrame from the iris dataset.
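import pandas as pd
# build a DataFrame with one column per feature plus the class label
data = pd.DataFrame({
    'sepal length': iris.data[:, 0],
    'sepal width': iris.data[:, 1],
    'petal length': iris.data[:, 2],
    'petal width': iris.data[:, 3],
    'species': iris.target,
})
data.head()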
Next, we separate the columns into independent variables (the features) and the dependent variable (the class label). Then we split the data into training and test sets.
from sklearn.model_selection import train_test_split
# features
X = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
# labels
y = data['species']
# hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=85)
We then generate a random forest model on the training set and perform predictions on the test set.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
After creating the model, it is worth checking its accuracy using the actual and predicted values.
from sklearn import metrics
# compare the predicted labels with the true labels of the test set
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.93333333333333
You can also make a prediction for a single observation. Suppose sepal length=3, sepal width=5, petal length=4, petal width=2. We can determine which type of flower the selected one belongs to, as follows:
clf.predict([[3, 5, 4, 2]])
# result is 2
Above, the number 2 indicates the flower class “virginica”.
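Instead of reading the class index by eye, you can look up the class name and inspect how confident the forest is. The following is a small self-contained sketch. For brevity it trains on all rows of the dataset and fixes random_state, which differs from the train/test split used in this article, so the predicted class may not match the result shown above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# Order: sepal length, sepal width, petal length, petal width.
sample = [[3, 5, 4, 2]]
class_index = clf.predict(sample)[0]
# target_names maps the numeric class index to the species name.
class_name = iris.target_names[class_index]
# predict_proba returns one probability per class, in the order of clf.classes_.
probabilities = clf.predict_proba(sample)[0]

print(class_index, class_name)
print(dict(zip(iris.target_names, probabilities.round(2))))
```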

Search for important traits with scikit-learn

In this section, you identify the most important features in the iris dataset. In scikit-learn, this takes three steps:

  • Create and train a random forest model.
  • Use the feature_importances_ attribute to see the importance score of each feature.
  • Visualize the resulting scores with the seaborn library.

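from sklearn.ensemble import RandomForestClassifier
import pandas as pd
clf = RandomForestClassifier(n_estimators=100)
# the model must be fitted before feature_importances_ is available
clf.fit(X_train, y_train)
feature_imp = pd.Series(clf.feature_importances_, index=iris.feature_names).sort_values(ascending=False)
print(feature_imp)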
petal width (cm) 0.470224
petal length (cm) 0.424776
sepal length (cm) 0.075913
sepal width (cm) 0.029087
You can also visualize the feature importances. A bar chart is easy to read and makes the ranking obvious at a glance. You can use matplotlib and seaborn together to build the plot: seaborn is built on top of matplotlib and adds themes and higher-level plot types, while matplotlib handles the labels, the title, and the display.
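import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic that renders plots inside the notebook; omit it in a plain script
%matplotlib inline
# horizontal bar chart: one bar per feature, sorted by importance
sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Feature importance score')
plt.ylabel('Features')
plt.title('Visualizing important features')
plt.show()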

Re-generating the model with the selected attributes

Next we drop “sepal width” and use the remaining three features, since sepal width has the lowest importance score.
from sklearn.model_selection import train_test_split
# keep only the three most important features
X = data[['petal length', 'petal width', 'sepal length']]
y = data['species']
# note: test_size=0.7 holds out 70% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=85)
After splitting, you will generate a random forest model for the selected traits of the training sample, perform a prediction on the test set, and compare the actual and predicted values.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
# train on the reduced feature set before predicting
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.9619047619047619
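You can see that after removing the least important feature (sepal width), the accuracy in this run is higher (about 0.96 versus 0.93). Read this difference with caution, however: the second split uses test_size=0.7 rather than 0.3, so the two scores are measured on different test sets and the models are trained on different amounts of data. In general, dropping a weak feature can remove misleading data and reduce noise, and limiting the number of features also shortens the training time of the model.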
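One way to compare the two feature sets on equal footing is to score both with the same cross-validation folds. The sketch below is an illustration under assumptions: the helper mean_cv_accuracy is hypothetical, and the fold count and random seed are arbitrary choices. It does not reproduce the numbers printed above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
X_all = iris.data                    # all four features
X_reduced = iris.data[:, [0, 2, 3]]  # sepal length, petal length, petal width

def mean_cv_accuracy(X, y):
    """Mean accuracy of a random forest over 5 cross-validation folds."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, X, y, cv=5).mean()

acc_all = mean_cv_accuracy(X_all, iris.target)
acc_reduced = mean_cv_accuracy(X_reduced, iris.target)
print(f"4 features: {acc_all:.3f}, 3 features: {acc_reduced:.3f}")
```

If the two scores are close, the dropped feature was contributing little, which is consistent with its low importance score.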

Advantages of Random Forest:

  • Random forest is considered a highly accurate and reliable method because multiple decision trees are involved in the prediction process.
  • Random forest is less prone to overfitting than a single decision tree. The main reason is that it averages the predictions of many trees trained on different random samples, which reduces variance.
  • RF can be used in both types of problems (classification and regression problems).
  • Random forest can also handle missing values. Two approaches are commonly described for this: filling continuous variables with the median value, or computing a proximity-weighted average of the missing values. Depending on the library and version you use, you may need to impute missing values yourself before training.
  • RF also calculates the relative importance of the indicators, which helps in selecting the most significant features for the classifier.
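The median-based way of handling missing values mentioned in the list above can be sketched in scikit-learn with an explicit imputation step. The missing values here are injected artificially, and the 10% rate, the seeds, and the pipeline layout are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Knock out roughly 10% of the values to simulate missing data (illustrative only).
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Fill each missing value with the column median, then fit the forest.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_missing, y)
train_accuracy = model.score(X_missing, y)
print("Training accuracy with imputed values:", train_accuracy)
```

Wrapping the imputer and the classifier in one pipeline ensures that the same medians learned during training are applied to any new data passed to predict.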

Disadvantages of Random Forest:

  • Random forest can be slow to generate predictions, because it uses many trees: each tree in the forest receives the same input and must return its own prediction, and then the predictions are put to a vote. The more trees there are, the longer this takes.
  • The Random Forest model is more difficult to interpret compared to a decision tree, where you easily determine the outcome by following the path in the tree.


You have learned about the random forest algorithm and how it works, how to find important features, the main differences between a random forest and a decision tree, and the advantages and disadvantages of the method. You also learned how to create and evaluate models and how to find the most important features in a model. Don’t stop there! I encourage you to try RF on different datasets and to read more about the confusion matrix.
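As a starting point for exploring the confusion matrix, here is a short self-contained sketch on the iris data. The random_state passed to the classifier is an assumption added for reproducibility, so the counts may differ from a run of the code earlier in this article.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=85
)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)
# Correct predictions sit on the diagonal, so accuracy is trace / total.
print("Accuracy from the matrix:", cm.trace() / cm.sum())
```

Unlike a single accuracy number, the matrix shows which classes are confused with each other.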
