K-nearest neighbor and K-means algorithms in Python

by Alex

One of the most popular applications of machine learning is solving classification problems. Classification tasks are situations where you have a set of data and want to assign each observation in that set to a certain category. A well-known example is the spam filter for email: Gmail uses supervised machine learning techniques to automatically place emails in the spam folder based on their content, subject, and other characteristics. Two machine learning models do most of the work when it comes to classification tasks:

  • K-Nearest Neighbor Method
  • K-means method

In this tutorial, you’ll learn how to implement the K-nearest neighbor and K-means algorithms in Python.

K-Nearest Neighbor Models

The K-nearest neighbor algorithm is one of the most popular among ML models for solving classification problems. A common exercise for machine learning students is to apply the K-nearest neighbor algorithm to a dataset whose categories are unknown. A real-world example of such a situation would be when you need to make predictions using ML models trained on classified government data. In this tutorial, you will learn the K-nearest neighbor machine learning algorithm and write an implementation of it in Python. We will work with an anonymous dataset, as in the situation described above.
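Before turning to scikit-learn, it may help to see the idea itself. Here is a minimal from-scratch sketch (the function name and toy data are my own, not part of the tutorial's dataset) of what "K nearest neighbors" means: measure the distance from a new point to every training point, take the K closest, and let them vote:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two well-separated groups
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # → 0 (nearest to group 0)
```

This also makes clear why feature scaling matters, as discussed below: the distance computation treats every feature's units equally.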

The dataset to use

The first thing you need to do is to download the dataset we will be using in this tutorial. You can download it from Gitlab. Next, you need to move the downloaded dataset file to your working directory. After that, open the Jupyter Notebook – now we can start writing Python code!

The libraries we need

To write the K-nearest neighbor algorithm, we will take advantage of several open-source Python libraries, including NumPy, pandas, matplotlib, seaborn, and scikit-learn. Get started by adding the following import instructions:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Import dataset

The next step is to add the classified_data.csv file to our Python code. The pandas library makes it fairly easy to import data into a DataFrame. Since the dataset is stored in a csv file, we will use the read_csv method:
raw_data = pd.read_csv('classified_data.csv')
By displaying the resulting DataFrame in the Jupyter Notebook, you will see what our data looks like. It’s worth noting that the table starts with an unnamed column whose values are equal to the row numbers of the DataFrame. We can fix this by slightly modifying the command that imports our dataset into the Python script:
raw_data = pd.read_csv('classified_data.csv', index_col=0)
Then let’s look at the features (attributes) contained in this dataset. You can output a list of column names using the following instruction:
raw_data.columns
We get: Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ', 'TARGET CLASS'], dtype='object')
Since this set contains secret data, we have no idea what any of these columns mean. At this point, it is sufficient to recognize that each column is numeric in nature and therefore well suited for modeling with machine learning techniques.

Dataset standardization

Since the K-nearest neighbor algorithm makes predictions about a data point (sample) using the observations closest to it, the scale of the features in the dataset matters a great deal. Because of this, machine learning practitioners typically standardize the dataset, meaning that every feature value is rescaled so that all features fall within roughly the same range. Fortunately, the scikit-learn library lets you do this without much trouble. First, we need to import the StandardScaler class from scikit-learn. To do that, add the following command to your Python script:
from sklearn.preprocessing import StandardScaler
This class is a lot like the LinearRegression and LogisticRegression classes we used earlier in this course. We need to create an instance of StandardScaler, and then use that object to transform our data. First, let’s create an instance of the StandardScaler class named scaler with the following instruction:
scaler = StandardScaler()
Now we can train the scaler on our dataset using the fit method:
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
Now we can apply the transform method to standardize all the features so that they have roughly the same scale. We will save the transformed samples in the scaled_features variable:
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
As a result, we get a NumPy array with all the data points from the dataset, but we would like to convert it to a pandas DataFrame. Fortunately, this is quite easy to do: we simply wrap the scaled_features variable in pd.DataFrame and assign the result to a new variable scaled_data, using the columns argument to specify the column names:
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)
Now that we have imported our dataset and standardized its metrics, we are ready to split this dataset into training and test samples.
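As a quick sanity check, standardized features should end up with a mean of roughly 0 and a standard deviation of roughly 1 in every column. The following self-contained sketch illustrates this (the toy matrix is my own stand-in, not the tutorial's dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small stand-in matrix with two features on very different scales
features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(features)  # fit + transform in one call

# After standardization, each column has mean ~0 and std ~1
print(np.allclose(scaled.mean(axis=0), 0))  # → True
print(np.allclose(scaled.std(axis=0), 1))   # → True
```

The second feature no longer dominates the first purely because of its larger units, which is exactly what a distance-based model like K-nearest neighbors needs.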

Dividing the dataset into training and test data

We will use the train_test_split function of the scikit-learn library in conjunction with list unpacking to create training and test datasets from our secret dataset. First, you need to import train_test_split from the model_selection module of the scikit-learn library:
from sklearn.model_selection import train_test_split
Then we need to specify the x and y values to be passed to the train_test_split function. The x values represent the DataFrame scaled_data we created earlier. The y values are stored in the "TARGET CLASS" column of our original raw_data table. You can create these variables as follows:
x = scaled_data
y = raw_data['TARGET CLASS']
Then you need to call the train_test_split function with these two arguments and a sensible test_size. We will use a test_size of 0.3, meaning 30% of the observations go to the test set, which gives the following call:
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
Now that our dataset is divided into data for training and data for testing, we are ready to start training our model!
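If you want to confirm what the split produced, a small sketch like the following can help (the array sizes and the random_state argument are my own additions; random_state simply makes the split reproducible between runs):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the tutorial's data: 1000 rows, 10 features
x = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# test_size=0.3 sends 30% of rows to the test set
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3,
                                          random_state=42)
print(x_tr.shape, x_te.shape)  # → (700, 10) (300, 10)
```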

Training the K-nearest neighbor model

We start by importing the KNeighborsClassifier from scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
Then let’s create an instance of the KNeighborsClassifier class and assign it to the variable model. This requires passing the parameter n_neighbors, which is equal to the K-value of the K-nearest neighbors algorithm you choose. First, let’s specify n_neighbors = 1:
model = KNeighborsClassifier(n_neighbors = 1)
Now we can train our model using the fit method and the variables x_training_data and y_training_data:
model.fit(x_training_data, y_training_data)
Now let’s make some predictions with the resulting model!

Making predictions using the K-nearest neighbor algorithm

The method for generating predictions with the K-nearest neighbor algorithm is the same as the linear and logistic regression models we built earlier in this course: just call the predict method by passing in the variable x_test_data to make the predictions. In particular, this is how you can make predictions and assign them to the predictions variable:
predictions = model.predict(x_test_data)
Let’s see how accurate our predictions are in the next section of this tutorial.

Evaluating the accuracy of our model

In the logistic regression tutorial, we saw that scikit-learn comes with built-in functions that make it easy to measure the performance of classification models. We start by importing the two functions classification_report and confusion_matrix into our script:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
Now let’s work with each of them in turn, starting with classification_report. You can use it to create a report as follows:
print(classification_report(y_test_data, predictions))
The resulting output:
              precision    recall  f1-score   support

           0       0.92      0.91      0.91       148
           1       0.91      0.92      0.92       152

    accuracy                           0.91       300
   macro avg       0.91      0.91      0.91       300
weighted avg       0.91      0.91      0.91       300
In the same way you can generate a confusion matrix:
print(confusion_matrix(y_test_data, predictions))
# Output:
# [[134 14]
# [ 12 140]]
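The numbers in the classification report can be recomputed by hand from this confusion matrix. The following sketch verifies the arithmetic for accuracy and for class 0's precision and recall:

```python
import numpy as np

# Confusion matrix from above: rows = true class, columns = predicted class
cm = np.array([[134, 14],
               [ 12, 140]])

accuracy = np.trace(cm) / cm.sum()       # (134 + 140) / 300
precision_0 = cm[0, 0] / cm[:, 0].sum()  # 134 / (134 + 12)
recall_0 = cm[0, 0] / cm[0, :].sum()     # 134 / (134 + 14)

print(round(accuracy, 2), round(precision_0, 2), round(recall_0, 2))
# → 0.91 0.92 0.91
```

These match the first row of the classification report, which is a useful way to internalize what precision (column-wise) and recall (row-wise) mean.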
Looking at these performance metrics, our model already seems quite accurate, but it can still be improved. The next section shows how the performance of a K-nearest neighbor model depends on the chosen value of K.

Choosing the optimal value for K using the “Elbow” method

In this section, we will use the elbow method to choose an optimal value of K for our K-nearest neighbor algorithm. The elbow method involves iterating over different values of K and choosing the one that yields the lowest error rate on our test data. First, we create an empty list named error_rates; we will loop over the different K values and append each one’s error rate to this list.
error_rates = []
Then we need to create a Python loop that goes through the different values of K that we want to test, and at each iteration does the following:

  • Creates a new instance of the KNeighborsClassifier class from scikit-learn.
  • Trains this model using our training data.
  • Makes predictions based on our test data.
  • Calculates the fraction of incorrect predictions (the lower it is, the more accurate our model is).

Implement the described loop for K values from 1 to 100:
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors=i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))
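Besides eyeballing the plot, you can read the best K straight from the list of error rates. A small sketch (the error values here are illustrative stand-ins for the real loop's output):

```python
import numpy as np

# Illustrative error rates; in the tutorial these come from the loop above
error_rates = [0.12, 0.10, 0.09, 0.07, 0.08, 0.09]

# K values start at 1 while list indices start at 0, hence the +1
best_k = int(np.argmin(error_rates)) + 1
print(best_k)  # → 4
```

Note that np.argmin returns the first minimum; when several K values tie, it is common to prefer the smallest (simplest) one anyway.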
Let’s visualize how the error rate changes with different values of K using matplotlib: plt.plot(error_rates). As the plot shows, we reach the minimum error rate at a K value of approximately 35. This means that 35 is an appropriate choice for K, combining simplicity and prediction accuracy. You can find all the code in the notebook on GitLab: https://gitlab.com/PythonRu/notebooks/-/blob/master/sklearn_kmeans_and_knn.ipynb

K-means clustering models

The K-means clustering algorithm is usually the first unsupervised machine learning model that students learn. It allows machine learning practitioners to group data points with similar quantitative characteristics within a dataset. This is useful for tasks such as building customer segments or identifying high-crime urban areas. In this section, you will learn how to create your first K-means clustering algorithm in Python.
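scikit-learn hides the algorithm's internals, so here is a minimal from-scratch sketch (the function name and toy points are my own) of a single K-means iteration: assign each point to its nearest center, then move each center to the mean of the points assigned to it. The real algorithm repeats this until the assignments stop changing.

```python
import numpy as np

def kmeans_step(points, centers):
    """One K-means iteration: assign points to nearest center,
    then move each center to the mean of its assigned points."""
    # Distance from every point to every center, shape (n_points, n_centers)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centers = np.array([points[labels == c].mean(axis=0)
                            for c in range(len(centers))])
    return labels, new_centers

# Two obvious groups and deliberately rough starting centers
pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
labels, centers = kmeans_step(pts, np.array([[0.0, 1.0], [5.0, 4.0]]))
print(labels)  # → [0 0 1 1]
```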

The dataset used

In this tutorial, we will use a dataset generated with scikit-learn itself. Let’s import the make_blobs function from scikit-learn to generate the necessary data. Open the Jupyter Notebook and start your Python script with the following instruction:
from sklearn.datasets import make_blobs
Now let’s use the make_blobs function to get dummy data! Specifically, here’s how you can create a dataset of 200 samples that has 2 indicators and 4 cluster centers. The standard deviation for each cluster will be 1.8.
raw_data = make_blobs(
    n_samples = 200,
    n_features = 2,
    centers = 4,
    cluster_std = 1.8
)
If you print the raw_data object, you will notice that it is actually a Python tuple. Its first element is the NumPy array with 200 observations. Each observation contains 2 features (as we specified in our make_blobs function). Now that our data has been created, we can move on to importing other necessary open source libraries into our Python script.
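To make the tuple structure concrete, here is a small sketch that unpacks the make_blobs result and checks its shapes (the random_state argument is my addition, for reproducibility):

```python
from sklearn.datasets import make_blobs

# Same parameters as in the tutorial; random_state fixes the randomness
points, cluster_ids = make_blobs(n_samples=200, n_features=2,
                                 centers=4, cluster_std=1.8,
                                 random_state=0)

print(points.shape)                      # → (200, 2)  observations
print(cluster_ids.shape)                 # → (200,)    cluster index per row
print(sorted(set(cluster_ids.tolist())))  # → [0, 1, 2, 3]
```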

Imported libraries

This tutorial will use a number of popular open source Python libraries, including pandas, NumPy, and matplotlib. Let's continue writing the script by adding the following imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The first group of libraries in this block of code is intended for working with datasets; the second group is intended for visualizing the results. Now let's move on to creating a visual representation of our dataset.

Visualization of dataset

In the make_blobs function, we specified that our dataset should have 4 cluster centers. The best way to make sure this is really the case is to create a few simple scatter plots. To do this, we use plt.scatter, passing in all values from the first column of our dataset as x and the corresponding values from the second column as y:
plt.scatter(raw_data[0][:,0], raw_data[0][:,1])
Note: your dataset will differ from mine, since its data are randomly generated. The resulting image may suggest that there are only three clusters in the dataset. This happens because two of the clusters lie very close to each other. To fix this, we can refer to the second element of the raw_data tuple, a NumPy array that contains the index of the cluster each observation belongs to. If we use a unique color for each cluster when plotting, we can easily distinguish the 4 groups of observations. Here is the code for this:
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1]);
Now we can see that there are four distinct clusters in our dataset. Let's move on to building a K-means model in Python!

Creating and training a K-means clustering model

To start using the K-means method, import the appropriate class from scikit-learn. To do this, add the following command to your script:
from sklearn.cluster import KMeans
Then let’s create an instance of the KMeans class with the parameter n_clusters=4 and assign it to the variable model:
model = KMeans(n_clusters=4)
Now let’s train our model by calling the fit method on it and passing the first element of our raw_data tuple:
model.fit(raw_data[0])
In the next section, we will look at how to make predictions with the K-means clustering model. Before moving on, I would like to point out one difference you may have noticed between building a model with the K-means method (which is an unsupervised clustering algorithm) and the supervised machine learning algorithms we worked with earlier in this course: we did not need to split the dataset into a training sample and a test sample. This is an important distinction, because you never need to split a dataset this way when building unsupervised machine learning models!

Applying our K-means clustering model to get predictions

Machine learning professionals typically use clustering algorithms to make two types of predictions:

  • Which cluster each data point belongs to.
  • Where the center of each cluster is located.

Now that our model is trained, we can easily generate these predictions. First, let’s predict which cluster each data point belongs to. To do this, access the labels_ attribute of the model object using the dot operator:
model.labels_
This gives us a NumPy array with a prediction for each sample: array([3, 2, 1, 1, 1, 3, 2, 1, 0, 0, 0, 0, 0, 0, 3, 2, 1, 2, 1, 3, 3, 3, 3, 1,
1, 1, 2, 2, 3, 1, 3, 2, 1, 0, 1, 3, 1, 1, 3, 2, 0, 1, 3, 2, 3, 3,
0, 3, 2, 2, 3, 0, 0, 0, 1, 1, 2, 1, 2, 0, 1, 2, 2, 1, 2, 3, 0, 3,
0, 2, 0, 0, 1, 1, 0, 3, 2, 3, 2, 0, 1, 2, 0, 2, 0, 3, 3, 0, 3, 3,
0, 3, 2, 3, 2, 1, 2, 1, 3, 3, 2, 2, 0, 2, 0, 2, 0, 2, 1, 0, 0, 2,
3, 2, 1, 2, 3, 0, 1, 1, 1, 3, 2, 2, 3, 3, 2, 1, 3, 0, 0, 3, 0, 1,
1, 3, 1, 0, 1, 1, 0, 3, 2, 0, 3, 0, 1, 2, 1, 2, 1, 2, 2, 3, 2, 1,
0, 2, 3, 3, 2, 0, 1, 3, 3, 2, 0, 0, 0, 3, 1, 2, 0, 2, 3, 3, 2, 2,
3, 1, 0, 1, 2, 3, 1, 3, 1, 1, 0, 2, 1, 0, 2, 1, 3, 1, 3, 3, 1, 3,
0, 3])
To find out where the center of each cluster is, refer to the cluster_centers_ attribute in the same way:
model.cluster_centers_
We get a two-dimensional NumPy array containing the coordinates of each cluster's center. It will look like this: array([[ 5.2662658 , -8.20493969],
[-9.39837945, -2.36452588],
[ 8.78032251, 5.1722511 ],
[ 2.40247618, -2.78480268]])

Visualizing the accuracy of the model predictions

The last thing we will do in this tutorial is to visualize the accuracy of our model. To do this, we can use the following code:
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(10,6))
ax1.set_title('Our predictions')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1],c=model.labels_)
ax2.set_title('Real values')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1],c=raw_data[1]);
This generates two scatter plots. The first shows the clusters predicted by our model, and the second shows the actual labels from our dataset. Although the coloring differs between the two plots, you can see that our model did a fairly good job of predicting the clusters. You may also notice that the model is not perfect: data points at the edges of clusters are sometimes misclassified. One last thing is worth mentioning about evaluating the accuracy of our model. In this example, we knew which cluster each observation belonged to because we generated the dataset ourselves. Such a situation is extremely rare: the K-means method is usually applied when neither the number of clusters nor their inherent qualities are known. Machine learning experts use this algorithm to discover patterns in a dataset they do not yet know anything about. You can find all the code in the notebook on GitLab: https://gitlab.com/PythonRu/notebooks/-/blob/master/sklearn_kmeans_and_knn.ipynb

Final Thoughts

In this tutorial, you learned how to create machine learning models in Python using K-nearest neighbor and K-means methods. Here’s a summary of what you learned about K-nearest neighbor models in Python:

  • How classified (anonymized) data is a common teaching tool for K-nearest neighbor problems.
  • Why it is important to standardize the dataset when building K-nearest neighbor models.
  • How to split a dataset into training and testing samples using the train_test_split function.
  • How to train your first K-nearest neighbor model and how to get its predictions.
  • How to evaluate the effectiveness of the K-nearest neighbor model.
  • The elbow method for choosing the optimal K value in a K-nearest neighbor model.

And here’s a summary of what you learned about K-means clustering models in Python:

  • How to generate dummy data in scikit-learn using make_blobs.
  • How to create and train a K-means clustering model.
  • The fact that unsupervised ML methods don’t require you to split the dataset into training data and testing data.
  • How to visualize the effectiveness of the K-means algorithm if you initially have information about clusters.
