How to Create a Random Forest Classifier in Python using sklearn

In this article, we show how to create a random forest classifier in Python using the sklearn module.

According to the scikit-learn.org website, "A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting."
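The "averaging" in that definition can be checked directly. The following is a minimal sketch on synthetic data from make_classification (a stand-in, not the article's Play.csv); the variable names are illustrative. It averages the class probabilities of the individual fitted trees and compares the result with the forest's own predict_proba() output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: this is not the article's data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is available in forest.estimators_.
tree_probas = np.array([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging the per-tree class probabilities should match the forest's output.
averaged = tree_probas.mean(axis=0)
matches = np.allclose(averaged, forest.predict_proba(X_demo))

print(len(forest.estimators_))
print(matches)
```

Each tree on its own may overfit its sub-sample; averaging many of them is what smooths the prediction.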

Thus, if you are working with a fairly large data sample in which several variables affect the predicted outcome, a random forest classifier can be very effective at predicting outcomes accurately, possibly even more so than a single decision tree classifier.

Like decision tree classifiers, random forest classifiers are predictors used in supervised machine learning, in which a computer program predicts what will happen based on past occurrences.

For example, if it's raining outside and the rain has caused children in the past to not play outside, then if we know that it's raining, the likely result is that children are not playing outside. If it is sunny outside and children normally play while it is sunny, we can predict that children are playing outside.

So, using a training set of data, a machine learning program can predict fairly accurately what will occur under the given circumstances.

So below, we will use a random forest classifier to classify outcomes.

Our scenario is this: we want to decide whether kids are likely to play outside given the weather conditions, namely the temperature, the humidity, and whether or not it is windy.

We put our data in a CSV file. This file can be found at the following link: Play.csv

Below is the Python code that uses a random forest classifier to predict whether the children are likely to play, given the temperature, the humidity, and whether it is windy.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data set
df = pd.read_csv('Play.csv')

# Separate the features (X) from the outcome column (y)
from sklearn.model_selection import train_test_split
X = df.drop(columns=['Played'])
y = df['Played']

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train the random forest and predict on the test set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# Evaluate the predictions
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, predictions))
print('\n')
print(classification_report(y_test, predictions))

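The first thing we have to do is import our modules, including pandas, numpy, matplotlib, seaborn, and sklearn. Only pandas and sklearn are actually used in this example; the others are commonly imported for data exploration and plotting.

We create a variable, df, and set it equal to pd.read_csv('Play.csv'), which reads the contents of the "Play.csv" file into a pandas DataFrame.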
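If the file is not at hand, the loading step can still be tried out. The following sketch reads a small, made-up CSV from a string instead of from disk. Only the column name 'Played' comes from the code above; the other column names (Temperature, Humidity, Windy) and all of the values are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for Play.csv; only 'Played' is a column name from the article.
csv_text = """Temperature,Humidity,Windy,Played
85,85,0,0
80,90,1,0
83,78,0,1
70,96,0,1
68,80,0,1
65,70,1,0
"""

# read_csv accepts a file path or any file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())                    # first rows of the table
print(df.dtypes)                    # the classifier needs numeric columns
print(df['Played'].value_counts())  # how many rows fall in each outcome class
```

Checking the column types and the class balance before training is a cheap way to catch a badly parsed file.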

We create a variable, X, which contains all columns of the DataFrame except the column that represents the outcome, which is whether the children went out to play.

We then create a variable, y, which represents the column of whether the children played or not.
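As a small illustration of this feature/outcome separation, here is a sketch on a hand-built DataFrame. The feature column names and the values are hypothetical; only 'Played' is taken from the article's code.

```python
import pandas as pd

# Hypothetical rows; only the 'Played' column name comes from the article.
df = pd.DataFrame({
    'Temperature': [85, 80, 83, 70],
    'Humidity': [85, 90, 78, 96],
    'Windy': [0, 1, 0, 0],
    'Played': [0, 0, 1, 1],
})

X = df.drop(columns=['Played'])  # features: everything except the outcome
y = df['Played']                 # outcome: a single column (a pandas Series)

print(X.columns.tolist())
print(y.name)
```

Note that drop() returns a new DataFrame, so df itself still contains the 'Played' column afterwards.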

The line, X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3), gives us X training data, X testing data, y training data, and y testing data. This is done using the train_test_split() function. With test_size=0.3, 30% of the rows are held out as testing data and the remaining 70% are used as training data.

We then create a variable, rf, and set it equal to RandomForestClassifier(), which creates a random forest with its default settings.
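The split is random, so it changes on every run unless it is seeded. The sketch below shows one way to make it repeatable and to keep the class proportions similar in both parts, using the random_state and stratify parameters of train_test_split(). The data is randomly generated; the 22-row size is an assumption, chosen only because a 30% split of 22 rows yields 7 test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data (assumption: not the contents of Play.csv).
rng = np.random.default_rng(0)
n_rows = 22
df = pd.DataFrame({
    'Temperature': rng.integers(60, 90, size=n_rows),
    'Humidity': rng.integers(60, 100, size=n_rows),
    'Windy': rng.integers(0, 2, size=n_rows),
    'Played': [0] * 15 + [1] * 7,
})

X = df.drop(columns=['Played'])
y = df['Played']

# random_state fixes the shuffle; stratify=y keeps the 0/1 ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_test.value_counts())
```

Stratifying matters most with small data sets, where an unlucky split could leave one class almost absent from the test set.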
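The constructor also accepts settings that control the forest. The sketch below sets a few commonly used ones; the specific values are arbitrary choices for illustration, not recommendations from the article.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings (the values are arbitrary assumptions):
rf = RandomForestClassifier(
    n_estimators=200,   # number of decision trees in the forest
    max_depth=3,        # limit the depth of each tree to reduce overfitting
    random_state=42,    # make the training reproducible
)

params = rf.get_params()
print(params['n_estimators'], params['max_depth'], params['random_state'])
```

On a very small data set, shallow trees and a fixed random_state make the results easier to compare between runs.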

We then train the model using the fit() function. We feed it the training data.

We then create a variable, predictions, which holds the outcomes the model predicts for the test data.
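Once trained, the same model can be asked about a single new day. The following is a minimal sketch with a made-up training table; the feature names and values are hypothetical, and only 'Played' comes from the article's code. It uses predict() for the class and predict_proba() for the class probabilities.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training rows; only the 'Played' column name comes from the article.
train = pd.DataFrame({
    'Temperature': [85, 80, 83, 70, 68, 65, 72, 75],
    'Humidity': [85, 90, 78, 96, 80, 70, 95, 70],
    'Windy': [0, 1, 0, 0, 0, 1, 1, 0],
    'Played': [0, 0, 1, 1, 1, 0, 0, 1],
})
X_train = train.drop(columns=['Played'])
y_train = train['Played']

rf = RandomForestClassifier(random_state=0)
rf.fit(X_train, y_train)

# A new observation must have the same feature columns, in the same order.
new_day = pd.DataFrame({'Temperature': [72], 'Humidity': [75], 'Windy': [0]})

prediction = rf.predict(new_day)           # predicted class, 0 or 1
probabilities = rf.predict_proba(new_day)  # one probability per class

print(prediction[0])
print(probabilities[0])
```

The probabilities are often more informative than the bare class, since they show how strongly the trees agree.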

We then want to see the metrics of how well the model predicted data from the test set.

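We print a confusion matrix and a classification report to show how well the model predicted outcomes.

The results of one run are shown below. Because neither train_test_split() nor RandomForestClassifier() is given a random_state, your numbers may differ from run to run.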

[[5 0]
 [0 2]]


              precision    recall  f1-score   support

           0       1.00      1.00      1.00         5
           1       1.00      1.00      1.00         2

    accuracy                           1.00         7
   macro avg       1.00      1.00      1.00         7
weighted avg       1.00      1.00      1.00         7

The confusion matrix tells us about true negatives, false positives, false negatives, and true positives. For a two-class problem, sklearn lays it out as [[true negatives, false positives], [false negatives, true positives]]. In this case, there were 5 true negatives and 2 true positives, and no false positives or false negatives.
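The four counts can be pulled out of the matrix by name. The sketch below rebuilds the matrix shown above from label lists that are consistent with it (5 correctly predicted 0s and 2 correctly predicted 1s); the lists themselves are reconstructed for illustration, not taken from the data file.

```python
from sklearn.metrics import confusion_matrix

# Labels reconstructed to be consistent with the matrix in the article.
y_test = [0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_test, predictions)

# For binary labels, ravel() flattens the 2x2 matrix row by row.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

Rows correspond to the actual classes and columns to the predicted classes, which is why the flattened order is tn, fp, fn, tp.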

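The classification report shows a precision, recall, and f1-score of 1.00 for both classes, which means every one of the 7 test samples was classified correctly. With a test set this small, a perfect score is encouraging but is not strong evidence of how the model will perform on new data.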
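To see what these metrics measure when the predictions are not perfect, here is a sketch with one deliberately wrong prediction. The labels are made up for illustration. Precision is tp / (tp + fp), recall is tp / (tp + fn), and the f1-score is their harmonic mean.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: the fifth sample is a false positive.
y_true = [0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1]

# Scores for the positive class (label 1): tp = 2, fp = 1, fn = 0.
precision = precision_score(y_true, y_pred)  # 2 / (2 + 1)
recall = recall_score(y_true, y_pred)        # 2 / (2 + 0)
f1 = f1_score(y_true, y_pred)                # 2 * p * r / (p + r)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

A single mistake among 7 samples already moves precision from 1.00 to about 0.67, which shows how coarse these numbers are on a tiny test set.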

So a random forest classifier can be used to help predict outcomes based on certain given conditions.

In general, the more representative training data you feed into the machine learning model, the better it can learn the patterns it needs to predict test data accurately. So keep in mind that you want to give it a good amount of training data.
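One way to explore the effect of training set size is to train on progressively larger slices of the same data and score each model on a fixed test set. The following sketch does this on synthetic data from make_classification; the data, the slice sizes, and the variable names are all assumptions for illustration, and the exact accuracies will depend on the data used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the article's data set).
X_all, y_all = make_classification(n_samples=1000, n_features=6, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0
)

# Train on growing slices of the training pool; always score on the same test set.
scores = {}
for n_train in (20, 100, 500):
    rf = RandomForestClassifier(random_state=0)
    rf.fit(X_pool[:n_train], y_pool[:n_train])
    scores[n_train] = accuracy_score(y_test, rf.predict(X_test))

for n_train, acc in scores.items():
    print(n_train, round(acc, 3))
```

Keeping the test set fixed is what makes the accuracies comparable across the different training sizes.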

And this is how to create a random forest classifier in Python using the sklearn module.



Learning about Electronics
