How to Create a Classifier with Supervised Learning with scikit-learn in Python


In this article, we show how to create a classifier with supervised learning with scikit-learn in Python.

A classifier is a software program that can classify objects based on some defined characteristics. Classification is a core task in machine learning. With supervised learning, the classifier predicts the object type based on training data given to it.

Let's say we want a classifier that is capable of classifying whether an object is an apple or an orange.

We can use supervised learning, which means giving the program examples of the characteristics of apples and oranges, so that it can determine which one a given object most closely fits.

So we give the program what is called training data. Training data tells the program what characteristics apples have and what characteristics oranges have. Otherwise, the program wouldn't know how to classify an object. With enough training data, scikit-learn's built-in functionality allows new data to be classified based on that training data. This is what supervised learning is: through training, the program is taught (or supervised) how to classify new data.

So once we have trained our software with training data, we can feed it new data, and it outputs the classification for this new data (apple or orange, wine or beer, chicken or turkey, etc.).

So let's now go into our program for building a classifier using the supervised learning method in machine learning.

The first thing you have to do, if you haven't done it already, is install scikit-learn.

You can install scikit-learn through pip with the following command: pip install scikit-learn

Once that is installed, you're all ready to go.
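As a quick sanity check (a minimal sketch, not part of the original program), you can confirm from Python that the package imports and see which version you have. The variable name installed_version is just for illustration.

```python
# Confirm that scikit-learn is installed and importable.
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If this raises an ImportError, the install step above did not succeed in the Python environment you are running.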

Let's say we're going to use 2 characteristics to differentiate apples from oranges. These 2 characteristics are the weight of the fruit and whether the fruit has a smooth or bumpy texture.

In our sample data, the oranges weigh more than the apples, and the apples have a smooth outer texture, while the oranges are bumpy in texture.

So we have the following table of data.

Weight (grams)    Texture
140               Smooth
130               Smooth

The above table is the table for apples.

Below is the data table for oranges.

Weight (grams)    Texture
150               Bumpy
170               Bumpy

The above represents our training data for the software that will classify whether an object is an apple or an orange.

So let's now go and write our code.
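The listing below is a minimal sketch of the program, reconstructed from the walkthrough that follows rather than copied from an original listing. The variable names features, labels, and clf come from the text; the exact import form and the layout of the if statement are assumptions.

```python
from sklearn.tree import DecisionTreeClassifier  # import form is an assumption

# Training data: [weight in grams, texture], with smooth = 0 and bumpy = 1.
features = [[140, 0], [130, 0], [150, 1], [170, 1]]

# One label per row of features: apple = 0, orange = 1.
labels = [0, 0, 1, 1]

# Create the decision tree classifier and train it on the data.
clf = DecisionTreeClassifier()
clf.fit(features, labels)

# Classify a new fruit: 160 grams, bumpy texture.
prediction = clf.predict([[160, 1]])

if prediction[0] == 0:
    print("apple")
elif prediction[0] == 1:
    print("orange")
```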

So let's go over the code and make sense of it.

The first thing we have to do is import scikit-learn, which we do through the line, import sklearn. Note that the package is installed as scikit-learn but imported as sklearn.

We then have to create our training data, which we store in the variable features.

When you give training data to scikit-learn, you cannot give it string feature values. They have to be numerical values. This means that if a characteristic is a string value, such as a "smooth" or "bumpy" texture for fruit, you have to assign "smooth" a numerical value and "bumpy" a numerical value. Otherwise, you will get errors.

In this case, I give "smooth" a value of 0 and "bumpy" a value of 1.

So we store a list of arrays in the features variable, one per fruit, each giving the values of the characteristics (weight and texture) for that fruit.
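One way to do this encoding, sketched here under the assumption that the raw data starts out as weight and texture-name pairs, is a small lookup dictionary. The names raw_fruits and TEXTURE_CODES are hypothetical helpers for illustration; the values come from the two tables above.

```python
# Raw observations from the tables: (weight in grams, texture as a string).
# raw_fruits is a hypothetical name used only for this sketch.
raw_fruits = [(140, "smooth"), (130, "smooth"), (150, "bumpy"), (170, "bumpy")]

# Encoding used in this article: smooth = 0, bumpy = 1.
TEXTURE_CODES = {"smooth": 0, "bumpy": 1}

# Build the numerical rows that will be passed to scikit-learn.
features = [[weight, TEXTURE_CODES[texture]] for weight, texture in raw_fruits]

print(features)
```

Keeping the mapping in one place makes it harder to mix up which number stands for which texture later on.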

Next, we must label each of these data arrays as what it is: either an apple or an orange. These labels go in the labels variable, in the same order as the rows in features.

scikit-learn classifiers can generally accept string class labels, but to stay consistent with the features variable, we again give numerical values to "apples" and "oranges".

In this case, an "apple" is 0 and an "orange" is 1.
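The pairing of rows and labels can be sketched as follows. LABEL_NAMES is a hypothetical helper that turns the numbers back into readable names; it is not something scikit-learn requires.

```python
# Each row of features has exactly one label at the same position.
features = [[140, 0], [130, 0], [150, 1], [170, 1]]
labels = [0, 0, 1, 1]  # apple = 0, orange = 1

# Hypothetical helper for printing readable names.
LABEL_NAMES = {0: "apple", 1: "orange"}

# The two lists must be the same length, or training will fail.
assert len(features) == len(labels)

readable = [LABEL_NAMES[label] for label in labels]
for row, name in zip(features, readable):
    print(row, "->", name)
```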

We use scikit-learn's DecisionTreeClassifier() to classify whether the new data we feed into the program is an apple or an orange. So it's a decision tree that classifies whether a new data point is an apple or an orange.

Next, we train our program with the clf.fit() function. We feed the features and labels variables into this function.

Once this runs, our program is trained.
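The creation and training steps might look like the sketch below. Passing random_state is my own addition, not something the walkthrough mentions; it only makes the result repeatable when two splits are equally good. The classes_ attribute and get_depth() method are used here just to peek at what was learned.

```python
from sklearn.tree import DecisionTreeClassifier

features = [[140, 0], [130, 0], [150, 1], [170, 1]]
labels = [0, 0, 1, 1]

# random_state is optional (an assumption of this sketch, for repeatability).
clf = DecisionTreeClassifier(random_state=0)

# Training: the tree looks for splits that separate the two labels.
clf.fit(features, labels)

learned_classes = list(clf.classes_)  # distinct labels seen during training
tree_depth = clf.get_depth()          # number of split levels in the tree

print(learned_classes, tree_depth)
```

With this training data a single split is enough to separate apples from oranges, so the learned tree is very shallow.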

The last thing to do now is to feed new data values into our program, and our program will output whether this new data is an apple or an orange.

As an example, I feed the new data 160 and 1 into the clf.predict() function. So the new data is 160 grams in weight and is bumpy in texture.

Remember that an output of 0 represents an apple and an output of 1 represents an orange.

So I write an if statement that if the output is 0, then the object is an apple.

If the output is 1, then the object is an orange.

In this case, the output, based on the new data, is an orange, because of the weight and texture.
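The prediction step can be sketched like this. One detail worth noting is that predict() takes a list of samples and returns one result per sample, so a single fruit is wrapped in an outer list and the answer is read from position 0. The names new_fruit and result are illustrative.

```python
from sklearn.tree import DecisionTreeClassifier

features = [[140, 0], [130, 0], [150, 1], [170, 1]]
labels = [0, 0, 1, 1]
clf = DecisionTreeClassifier().fit(features, labels)

# A single fruit still has to be passed as a list of samples.
new_fruit = [[160, 1]]  # 160 grams, bumpy

# predict() returns an array with one entry per sample.
prediction = clf.predict(new_fruit)

if prediction[0] == 0:
    result = "apple"
else:
    result = "orange"

print(result)
```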

So this is the power of machine learning.

Feel free to change the values and play around with them. Also feel free to increase the amount of training data provided. In this program, there were only 4 training examples. This can be increased. The more, the better, because the program then has more data to learn from.
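As a starting point for experimenting, here is a sketch that classifies several fruits in one call. The sample values are made up for illustration. The third sample (150 grams, smooth) is deliberately ambiguous: with only 4 training rows, both weight and texture separate the data perfectly, so the answer can depend on which feature the tree happens to split on.

```python
from sklearn.tree import DecisionTreeClassifier

features = [[140, 0], [130, 0], [150, 1], [170, 1]]
labels = [0, 0, 1, 1]
clf = DecisionTreeClassifier().fit(features, labels)

# Made-up fruits to try: [weight in grams, texture (smooth = 0, bumpy = 1)].
samples = [[160, 1], [135, 0], [150, 0]]
predictions = clf.predict(samples)

names = {0: "apple", 1: "orange"}
for sample, predicted in zip(samples, predictions):
    print(sample, "->", names[int(predicted)])
```

Adding more training rows that cover cases like heavy smooth fruit is what removes this kind of ambiguity.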

With machine learning, we no longer have to hardcode everything with if and else statements. We can just give the program some training data and then have it predict what classification an object should get based on the training data.
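To see the rules the program worked out for itself, you can print the trained tree as text. This is a sketch using export_text from sklearn.tree; the feature names "weight" and "texture" are labels I chose for readability, and random_state is again only there for repeatability.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

features = [[140, 0], [130, 0], [150, 1], [170, 1]]
labels = [0, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(features, labels)

# Render the learned decision rules as indented text.
rules = export_text(clf, feature_names=["weight", "texture"])
print(rules)
```

The printed rules play the role of the if and else statements you would otherwise have written by hand, except that they were derived from the training data.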

And this is how to create a classifier with supervised learning in Python with the scikit-learn module.

