One of the most popular applications of machine learning is solving classification problems. Classification tasks are situations where you have a set of data and you want to assign each observation from that set to a certain category. A well-known example is the spam filter for email: Gmail uses machine learning techniques to automatically place emails in the spam folder based on their content, subject line, and other characteristics. Two machine learning models do most of the work when it comes to classification tasks:
- K-nearest neighbors method
- K-means method
In this tutorial, you’ll learn how to implement the K-nearest neighbors and K-means algorithms in Python.
K-Nearest Neighbor Models
The K-nearest neighbor algorithm is one of the most popular among ML models for solving classification problems. A common exercise for machine learning students is to apply the K-nearest neighbor algorithm to a dataset whose categories are unknown. A real-world example of such a situation would be when you need to make predictions using ML models trained on classified government data. In this tutorial, you will learn the K-nearest neighbor machine learning algorithm and write an implementation of it in Python. We will work with an anonymous dataset, as in the situation described above.
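Before turning to scikit-learn, it may help to see the idea in miniature. The following sketch (on toy numbers, not the tutorial's dataset) implements K-nearest neighbors by hand: for each query point, find the k closest training points by Euclidean distance and take a majority vote of their labels.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    """Predict a label for each query row by majority vote among
    the k training points closest in Euclidean distance."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # distance to every training point
        nearest = np.argsort(dists)[:k]               # indices of the k closest points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])       # majority vote among neighbors
    return np.array(preds)

# Toy data: two well-separated groups
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([[0.2, 0.1], [5.1, 5.1]]), k=3))  # → [0 1]
```

This is exactly what scikit-learn's `KNeighborsClassifier` does for us, only faster and with more options.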
The dataset to use
The first thing you need to do is to download the dataset we will be using in this tutorial. You can download it from Gitlab. Next, you need to move the downloaded dataset file to your working directory. After that, open the Jupyter Notebook – now we can start writing Python code!
The libraries we need
To write the K-nearest neighbors algorithm, we will take advantage of several open-source Python libraries, including NumPy, pandas, matplotlib, seaborn, and (later) scikit-learn. Get started by adding the following import statements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
The next step is to add the classified_data.csv file to our Python code. The pandas library makes it fairly easy to import data into a DataFrame. Since the dataset is stored in a csv file, we will use the read_csv method:
raw_data = pd.read_csv('classified_data.csv')
By displaying the resulting DataFrame in the Jupyter Notebook, you can see what our data looks like. It’s worth noting that the table starts with an unnamed column whose values equal the row numbers of the DataFrame. We can fix this by slightly modifying the command that imports our dataset:
raw_data = pd.read_csv('classified_data.csv', index_col = 0)
Then let’s look at the attributes contained in this dataset. You can output the list of column names with the following instruction:
raw_data.columns
The resulting output:
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')
Since this set contains secret data, we have no idea what any of these columns mean. At this point, it is enough to recognize that every column is numeric and therefore well suited for modeling with machine learning techniques.
Since the K-nearest neighbors algorithm makes predictions about a data point using the observations closest to it, the scale of the features in the dataset matters a great deal. Because of this, machine learning practitioners typically standardize the dataset, which means adjusting every x value so that they are all roughly on the same scale. Fortunately, the scikit-learn library lets you do this without much trouble. First, we need to import the StandardScaler class from scikit-learn. To do that, add the following command to your Python script:
from sklearn.preprocessing import StandardScaler
This class is a lot like the LogisticRegression class we used earlier in this course: we create an instance of StandardScaler and then use that object to transform our data. First, let’s create an instance of the StandardScaler class named scaler with the following instruction:
scaler = StandardScaler()
Now we can train the scaler on our dataset using the fit method:
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
Now we can apply the transform method to standardize all the features so that they have roughly the same scale. We will save the transformed samples in the scaled_features variable:
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
As a result, we get a NumPy array with all the data points from the dataset, but we would like to convert it to a pandas DataFrame. Fortunately, this is quite easy to do: we simply wrap the scaled_features variable in a pd.DataFrame call and assign the resulting DataFrame to a new scaled_data variable, passing an argument that specifies the column names:
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)
Now that we have imported our dataset and standardized its features, we are ready to split this dataset into training and test samples.
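As a quick sanity check, standardized columns should end up with a mean of roughly 0 and a standard deviation of roughly 1. A self-contained sketch on toy numbers (your classified_data.csv values will of course differ):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real features: two columns on very different scales
toy = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # each column's mean is ~0 after scaling
print(scaled.std(axis=0))   # each column's standard deviation is ~1
```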
Dividing the dataset into training and test data
We will use the train_test_split function of the scikit-learn library in conjunction with list unpacking to create training and test datasets from our secret dataset. First, you need to import train_test_split from scikit-learn’s model_selection module:
from sklearn.model_selection import train_test_split
Then we need to specify the x and y values to be passed to the train_test_split function. The x values are the DataFrame scaled_data we created earlier. The y values are stored in the "TARGET CLASS" column of our original raw_data table. You can create these variables as follows:
x = scaled_data
y = raw_data['TARGET CLASS']
Then you need to run the train_test_split function with these two arguments and a reasonable test_size. We will use a test_size of 30%, which gives the following function call:
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
Now that our dataset has been split into training data and test data, we are ready to start training our model!
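A quick way to confirm the split worked as intended is to check the shapes of the four resulting arrays. A self-contained sketch on toy data (with a 30% test_size, 100 rows split into 70 and 30):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_toy = np.arange(200).reshape(100, 2)   # 100 samples, 2 features
y_toy = np.arange(100)                   # one label per sample

x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, test_size=0.3, random_state=42)
print(x_tr.shape, x_te.shape)  # → (70, 2) (30, 2)
```

The optional random_state argument makes the split reproducible, which is handy while experimenting.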
Training the K-nearest neighbor model
We start by importing
the KNeighborsClassifier from scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
Then let’s create an instance of the KNeighborsClassifier class and assign it to a variable named model. This requires passing the parameter n_neighbors, which equals the K value of the K-nearest neighbors algorithm you choose. To start, let’s specify n_neighbors = 1:
model = KNeighborsClassifier(n_neighbors = 1)
Now we can train our model using the fit method and the x_training_data and y_training_data variables:
model.fit(x_training_data, y_training_data)
Now let’s make some predictions with the resulting model!
Making predictions using the K-nearest neighbor algorithm
The method for generating predictions with the K-nearest neighbors algorithm is the same as for the linear and logistic regression models we built earlier in this course: just call the predict method, passing in the variable x_test_data. In particular, this is how you can make predictions and assign them to a predictions variable:
predictions = model.predict(x_test_data)
Let’s see how accurate our predictions are in the next section of this tutorial.
Evaluating the accuracy of our model
In the logistic regression tutorial, we saw that scikit-learn comes with built-in functions that make it easy to measure the performance of classification-based machine learning models. We start by importing two functions, classification_report and confusion_matrix, into our script:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
Now let’s work with each of them in turn, starting with
classification_report. You can use it to create a report as follows:
print(classification_report(y_test_data, predictions))
The resulting output:
              precision    recall  f1-score   support

           0       0.92      0.91      0.91       148
           1       0.91      0.92      0.92       152

    accuracy                           0.91       300
   macro avg       0.91      0.91      0.91       300
weighted avg       0.91      0.91      0.91       300
Judging by these performance metrics, our model is already quite accurate, but it can still be improved. The next section will show how we can affect the performance of the K-nearest neighbors model by choosing a better value for K.
In the same way, you can generate a confusion matrix:
print(confusion_matrix(y_test_data, predictions))
# [[134  14]
#  [ 12 140]]
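The numbers in the classification report can be recovered from the confusion matrix by hand, which is a good way to internalize what precision and recall mean. Using the matrix printed above, for the positive class "1":

```python
# Confusion matrix from above: rows are true classes, columns are predictions
# [[134  14]
#  [ 12 140]]
tn, fp, fn, tp = 134, 14, 12, 140

precision = tp / (tp + fp)                 # of everything predicted 1, how much was right
recall    = tp / (tp + fn)                 # of all true 1s, how many we caught
accuracy  = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(accuracy, 2))  # → 0.91 0.92 0.91
```

These match the classification_report values for class 1 and the overall accuracy.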
Choosing the optimal value for K using the “Elbow” method
In this section, we will use the elbow method to choose the optimal value of
K for our K-nearest neighbor algorithm. The elbow method involves iterating over different values of
K and choosing the value with the lowest error rate when applied to our test data. First, we create an empty list named error_rates; we will loop through different K values and append each one’s error rate to this list:
error_rates = []
Then we need to create a Python loop that iterates over the different values of K we want to test and, at each iteration, does the following:
- Creates a new instance of the
KNeighborsClassifier class from scikit-learn.
- Trains this model using our training data.
- Makes predictions based on our test data.
- Calculates the fraction of incorrect predictions (the lower it is, the more accurate our model is).
Implement the described loop for K values from 1 to 100:
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))
Let’s visualize how the error rate changes for different values of K using matplotlib:
plt.plot(error_rates)
As you can see from the plot, the error rate reaches its minimum at a K value of approximately 35. This means that 35 is an appropriate choice for K, balancing simplicity and prediction accuracy. You can find all the code in the notebook at GitLab: https://gitlab.com/PythonRu/notebooks/-/blob/master/sklearn_kmeans_and_knn.ipynb
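Rather than reading the best K off the plot by eye, you can also pick it programmatically with np.argmin. A sketch on a synthetic stand-in for error_rates (your actual values will differ):

```python
import numpy as np

# Synthetic stand-in for error_rates: error falls, bottoms out, then creeps back up
error_rates_demo = [0.30, 0.20, 0.12, 0.09, 0.08, 0.10, 0.11, 0.13]

best_index = int(np.argmin(error_rates_demo))   # position of the lowest error
best_k = best_index + 1                         # K values in our loop started at 1
print(best_k)  # → 5
```

Note that argmin returns the position in the list, so we add 1 to recover the K value because the loop counted from 1.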
K-means clustering models
The K-means clustering algorithm is usually the first unsupervised machine learning model that students learn. It allows machine learning practitioners to create groups of data points with similar quantitative characteristics within a dataset. This is useful for tasks such as generating customer segments or identifying high-crime urban areas. In this section, you will learn how to build your first K-means clustering model in Python.
The dataset used
In this tutorial, we will use a dataset created using scikit-learn. Let’s import the
make_blobs function from scikit-learn to generate the necessary data. Open a Jupyter Notebook and start your Python script with the following instruction:
from sklearn.datasets import make_blobs
Now let’s use the make_blobs function to generate some dummy data! Specifically, here’s how to create a dataset of 200 samples with 2 features and 4 cluster centers, where the standard deviation within each cluster is 1.8:
raw_data = make_blobs(
    n_samples = 200,
    n_features = 2,
    centers = 4,
    cluster_std = 1.8
)
If you print the
raw_data object, you will notice that it is actually a Python tuple. Its first element is the NumPy array with 200 observations. Each observation contains 2 features (as we specified in our
make_blobs function). Now that our data has been created, we can move on to importing other necessary open source libraries into our Python script.
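Because make_blobs returns a (points, labels) tuple, it is also common to unpack it into two names up front. A quick self-contained check of the shapes (random_state is added here only to make the sketch reproducible):

```python
from sklearn.datasets import make_blobs

points, labels = make_blobs(n_samples=200, n_features=2, centers=4,
                            cluster_std=1.8, random_state=0)
print(points.shape)     # → (200, 2): 200 observations with 2 features each
print(labels.shape)     # → (200,): one cluster index per observation
print(labels.max() + 1) # → 4: cluster indices run from 0 to 3
```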
This tutorial will use a number of popular open source Python libraries, including pandas, NumPy, and matplotlib. Let's continue writing the script by adding the following imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
The first two libraries in this block of code are for working with large datasets, while the last two are for visualizing the results. Now let’s move on to creating a visual representation of our dataset.
Visualizing the dataset
In our call to the make_blobs function, we specified that our dataset should have 4 cluster centers. The best way to make sure that this is indeed the case is to create a simple scatter plot. To do this, we use the plt.scatter function, passing in all the values from the first column of our dataset as x and the corresponding values from the second column as y:
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
Note: your dataset will differ from mine, since its data are randomly generated. The resulting image seems to indicate that there are only three clusters in our dataset. This is because two of the clusters are very close to each other. To fix this, we can refer to the second element of the raw_data tuple: a NumPy array containing the index of the cluster each observation belongs to. If we use a unique color for each cluster when plotting, we can easily distinguish the 4 groups of observations. Here is the code for this:
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1]);
Now we can see that there are four unique clusters in our dataset. Let’s move on to building our K-means model in Python!
Creating and training a K-means clustering model
To start using the K-means method, import the appropriate class from scikit-learn. To do this, add the following command to your script:
from sklearn.cluster import KMeans
Then let’s create an instance of the KMeans class with the parameter n_clusters=4 and assign it to the variable model:
model = KMeans(n_clusters=4)
Now we train our model by calling the fit method on it and passing in the first element of our raw_data tuple:
model.fit(raw_data[0])
In the next section, we’ll look at how to make predictions with the K-means clustering model. Before moving on, I would like to point out one difference you may have noticed between building a model with the K-means method (an unsupervised clustering algorithm) and the supervised machine learning algorithms we worked with earlier in this course: we don’t need to split the dataset into a training sample and a test sample. This is an important distinction, because you never need to split a dataset in this way when building unsupervised machine learning models!
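Even without a train/test split, there are ways to judge the quality of a clustering that don't rely on true labels. One common option (not used in this tutorial) is the silhouette score from scikit-learn, which rewards tight, well-separated clusters. A sketch on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 tight, well-separated clusters
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=1)
km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
score = silhouette_score(X, km.labels_)
print(round(score, 2))
```

In practice you would compute this score for several candidate values of n_clusters and pick the one with the highest silhouette, much like the elbow method for K-nearest neighbors.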
Applying our K-means clustering model to get predictions
Machine learning professionals typically use clustering algorithms to make two types of predictions:
- Which cluster each data point belongs to.
- Where the center of each cluster is located.
Now that our model is trained, we can easily generate these predictions. First, let’s predict which cluster each data point belongs to. To do this, we access the labels_ attribute of the model object using the dot operator:
model.labels_
In this way, we get a NumPy array with a prediction for each sample:
array([3, 2, 1, 1, 1, 3, 2, 1, 0, 0, 0, 0, 0, 0, 3, 2, 1, 2, 1, 3, 3, 3,
       3, 1, 1, 1, 2, 2, 3, 1, 3, 2, 1, 0, 1, 3, 1, 1, 3, 2, 0, 1, 3, 2,
       ...])
To find out where the center of each cluster is, refer to the cluster_centers_ attribute in a similar way:
model.cluster_centers_
We get a two-dimensional NumPy array containing the coordinates of each cluster’s center. It will look something like this:
array([[ 5.2662658 , -8.20493969],
       [ 8.78032251,  5.1722511 ],
       [ 2.40247618, -2.78480268],
       ...])
Visualizing the accuracy of the model predictions
The last thing we will do in this tutorial is visualize the accuracy of our model. To do this, we can use the following code:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our predictions')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Real values')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1])
It generates two scatter plots: one shows the clusters according to the predictions made by our model, and the other uses the actual labels from our dataset. Although the coloring of the two plots differs, you can see that our model did a pretty good job of predicting the clusters in the dataset. You may also notice that the model is not perfect: data points at the edges of the clusters are sometimes misclassified.
One last thing worth mentioning about estimating the accuracy of our model: in this example, we knew which cluster each observation belonged to because we generated the dataset ourselves. Such a situation is extremely rare. The K-means method is usually applied when neither the number of clusters nor their inherent qualities are known; machine learning experts use the algorithm to discover patterns in a dataset they do not yet know anything about. You can find all the code in the notebook at GitLab: https://gitlab.com/PythonRu/notebooks/-/blob/master/sklearn_kmeans_and_knn.ipynb
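Since we happen to have the true cluster labels here, the visual comparison can also be complemented with a numeric one. A hedged option (not part of the original tutorial) is the adjusted Rand index from scikit-learn, which measures agreement between two labelings while ignoring how the cluster numbers themselves are permuted:

```python
from sklearn.metrics import adjusted_rand_score

# The score ignores the arbitrary numbering of clusters:
# these two labelings group the points identically, so the score is perfect.
true_labels = [0, 0, 1, 1, 2, 2]
predicted   = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_score(true_labels, predicted))  # → 1.0

# For the model in this tutorial, the equivalent call would be:
# adjusted_rand_score(raw_data[1], model.labels_)
```

A score near 1.0 means the model's clusters line up almost exactly with the true groups; a score near 0 means the agreement is no better than chance.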
In this tutorial, you learned how to build machine learning models in Python using the K-nearest neighbors and K-means methods. Here’s a summary of what you learned about K-nearest neighbors models in Python:
- How classified datasets are commonly used to teach students the K-nearest neighbors algorithm.
- Why it is important to standardize the dataset when building K-nearest neighbors models.
- How to split a dataset into training and test samples using the train_test_split function.
- How to train your first K-nearest neighbors model and how to get its predictions.
- How to evaluate the effectiveness of a K-nearest neighbors model.
- The elbow method for choosing the optimal K value in a K-nearest neighbors model.
And here’s a summary of what you learned about K-means clustering models in Python:
- How to generate dummy data in scikit-learn using the make_blobs function.
- How to create and train a K-means clustering model using scikit-learn.
- The fact that unsupervised ML methods don’t require you to split the dataset into training data and test data.
- How to visualize the effectiveness of the K-means algorithm when you have prior information about the clusters.