One of the most popular applications of machine learning is solving classification problems. In a classification task, you have a set of data and want to assign each observation to a certain category. A well-known example is the spam filter for email: Gmail uses machine learning techniques to automatically place emails in the spam folder based on their content, subject line, and other characteristics. Two machine learning models do much of the work when it comes to classification tasks:

- The K-nearest neighbors method
- The K-means clustering method

In this tutorial, you'll learn how to implement the K-nearest neighbors and K-means algorithms in Python.


## K-Nearest Neighbor Models

The K-nearest neighbors algorithm is one of the most popular ML models for solving classification problems. A common exercise for machine learning students is to apply it to a dataset whose categories are unknown. A real-world example of such a situation would be making predictions with ML models trained on classified government data. In this tutorial, you will learn the K-nearest neighbors algorithm and write an implementation of it in Python. We will work with an anonymized dataset, as in the situation described above.

### The dataset to use

The first thing you need to do is download the dataset we will be using in this tutorial; it is available on GitLab. Next, move the downloaded file to your working directory and open the Jupyter Notebook – now we can start writing Python code!

### The libraries we need

To write the K-nearest neighbors algorithm, we will take advantage of several open-source Python libraries, including NumPy, pandas, matplotlib, and seaborn (we will import what we need from scikit-learn as we go). Get started by adding the following import instructions:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
```

### Import dataset

The next step is to load the classified_data.csv file into our Python code. The pandas library makes it fairly easy to import data into a DataFrame. Since the dataset is stored in a csv file, we will use the `read_csv` method:

```
raw_data = pd.read_csv('classified_data.csv')
```

By displaying the resulting DataFrame in the Jupyter Notebook, you will see what our data represents. It's worth noting that the table starts with an unnamed column whose values are equal to the row numbers of the DataFrame. We can fix this by slightly modifying the command that imports our dataset:

```
raw_data = pd.read_csv('classified_data.csv', index_col = 0)
```
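To see why `index_col = 0` matters, here is a minimal, self-contained sketch using a tiny stand-in for `classified_data.csv` (the rows and values below are placeholders, not the real data):

```python
import io
import pandas as pd

# A tiny stand-in for classified_data.csv: the first (unnamed) column
# holds the row numbers, just like in the real file.
csv_text = ",WTT,PTI,TARGET CLASS\n0,0.91,1.16,1\n1,0.64,1.00,0\n"

# Without index_col, pandas treats the row numbers as a data column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, that column becomes the DataFrame index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra.shape)  # (2, 4) - the row numbers count as a column
print(clean.shape)       # (2, 3)
```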

Then let’s look at the indicators (attributes) contained in this dataset. You can output the list of column names with the following instruction:

```
raw_data.columns
```

We get:

```
Index(['WTT', 'PTI', 'EQW', 'SBI', 'LQE', 'QWG', 'FDJ', 'PJF', 'HQE', 'NXJ',
       'TARGET CLASS'],
      dtype='object')
```

Since this set contains secret data, we have no idea what any of these columns mean. At this point, it is sufficient to recognize that each column is numeric and therefore well suited for modeling with machine learning techniques.

### Dataset standardization

Since the K-nearest neighbors algorithm makes predictions about a data point using the observations closest to it, the scale of the indicators in the dataset matters. Because of this, machine learning experts typically standardize the dataset, which means adjusting every `x` value so that they are all roughly in the same range. Fortunately, the scikit-learn library lets you do this without much trouble. First, we will need to import the `StandardScaler` class from scikit-learn. To do that, add the following command to your Python script:

```
from sklearn.preprocessing import StandardScaler
```

This class is a lot like the `LinearRegression` and `LogisticRegression` classes we used earlier in this course. We need to create an instance of `StandardScaler` and then use that object to transform our data. First, let's create an instance of the `StandardScaler` class named `scaler` with the following instruction:

```
scaler = StandardScaler()
```

Now we can train the scaler on our dataset using the `fit` method:

```
scaler.fit(raw_data.drop('TARGET CLASS', axis=1))
```

Now we can apply the `transform` method to standardize all the features so that they have roughly the same scale. We will save the transformed samples in the `scaled_features` variable:

```
scaled_features = scaler.transform(raw_data.drop('TARGET CLASS', axis=1))
```

As a result, we get a NumPy array with all the data points from the dataset, but we would like to convert it to a pandas DataFrame. Fortunately, this is quite easy to do: we simply wrap the `scaled_features` variable in the `pd.DataFrame` constructor and assign the result to a new `scaled_data` variable, passing an argument to specify the column names:

```
scaled_data = pd.DataFrame(scaled_features, columns = raw_data.drop('TARGET CLASS', axis=1).columns)
```

Now that we have imported our dataset and standardized its metrics, we are ready to split it into training and test samples.
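As a quick sanity check that standardization did what we expect, here is a small self-contained sketch; the two synthetic columns on very different scales stand in for the dataset's real indicators:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the classified features: two columns
# on very different scales.
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.normal(100, 20, 500),   # large-scale column
    rng.normal(0, 0.5, 500),    # small-scale column
])

scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# After standardization every column has mean ~0 and std ~1.
print(np.allclose(scaled.mean(axis=0), 0.0, atol=1e-9))  # True
print(np.allclose(scaled.std(axis=0), 1.0, atol=1e-9))   # True
```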

### Dividing the dataset into training and test data

We will use the `train_test_split` function from scikit-learn, in conjunction with list unpacking, to create training and test datasets from our secret dataset. First, you need to import `train_test_split` from the `model_selection` module of the scikit-learn library:

```
from sklearn.model_selection import train_test_split
```

Then we need to specify the `x` and `y` values to be passed to the `train_test_split` function. The `x` values are the `scaled_data` DataFrame we created earlier. The `y` values are stored in the `'TARGET CLASS'` column of our original `raw_data` table. You can create these variables as follows:

```
x = scaled_data
y = raw_data['TARGET CLASS']
```

Then you need to run the `train_test_split` function with these two arguments and a reasonable `test_size`. We will use a `test_size` of 30%, which gives the following call:

```
x_training_data, x_test_data, y_training_data, y_test_data = train_test_split(x, y, test_size = 0.3)
```

Now that our dataset is divided into training data and test data, we are ready to start training our model!
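To sanity-check the split, you can inspect the sizes of the resulting samples. A quick sketch on a synthetic stand-in array (the shape here is just an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 10 features, binary target.
x = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# test_size=0.3 puts 30% of the 1000 rows in the test sample.
print(len(x_train), len(x_test))  # 700 300
```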

### Training the K-nearest neighbor model

We start by importing the `KNeighborsClassifier` class from scikit-learn:

```
from sklearn.neighbors import KNeighborsClassifier
```

Then let's create an instance of the `KNeighborsClassifier` class and assign it to the variable `model`. This requires passing the parameter `n_neighbors`, which is the `K` value of the K-nearest neighbors algorithm. To start, let's specify `n_neighbors = 1`:

```
model = KNeighborsClassifier(n_neighbors = 1)
```

Now we can train our model using the `fit` method and the variables `x_training_data` and `y_training_data`:

```
model.fit(x_training_data, y_training_data)
```

Now let's make some predictions with the resulting model!

### Making predictions using the K-nearest neighbor algorithm

The method for generating predictions with the K-nearest neighbors algorithm is the same as for the linear and logistic regression models we built earlier in this course: just call the `predict` method, passing in the variable `x_test_data`. In particular, this is how you can make predictions and assign them to the `predictions` variable:

```
predictions = model.predict(x_test_data)
```

Let's see how accurate our predictions are in the next section of this tutorial.

### Evaluating the accuracy of our model

In the logistic regression tutorial, we saw that scikit-learn comes with built-in functions that make it easy to measure the performance of classification models. We start by importing `classification_report` and `confusion_matrix`:

```
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
```

Now let's work with each of them in turn, starting with `classification_report`. You can use it to create a report as follows:

```
print(classification_report(y_test_data, predictions))
```

The resulting output:

```
              precision    recall  f1-score   support

           0       0.92      0.91      0.91       148
           1       0.91      0.92      0.92       152

    accuracy                           0.91       300
   macro avg       0.91      0.91      0.91       300
weighted avg       0.91      0.91      0.91       300
```

In the same way, you can generate a confusion matrix:

```
print(confusion_matrix(y_test_data, predictions))
# Output:
# [[134  14]
#  [ 12 140]]
```

Looking at these performance metrics, it looks like our model is already quite accurate. But it can still be improved. The next section will show how we can affect the performance of the K-nearest neighbors model by choosing a better value for `K`.
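As a side note, you can recover the accuracy directly from the confusion matrix: correct predictions sit on its diagonal. A quick check using the matrix values from this run:

```python
# Confusion matrix from the run above: rows are true classes,
# columns are predicted classes; the diagonal holds correct predictions.
matrix = [[134, 14],
          [12, 140]]

correct = matrix[0][0] + matrix[1][1]          # 134 + 140 = 274
total = sum(sum(row) for row in matrix)        # 300 test samples

print(round(correct / total, 2))  # 0.91, matching the report's accuracy
```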

### Choosing the optimal value for K using the “Elbow” method

In this section, we will use the elbow method to choose an optimal value of `K` for our K-nearest neighbors algorithm. The elbow method involves iterating over different values of `K` and choosing the value with the lowest error rate when applied to our test data. First, we create an empty list, `error_rates`. We will walk through the different `K` values and add their error rates to this list:

```
error_rates = []
```

Then we need to create a Python loop that goes through the different values of `K` we want to test, and at each iteration does the following:

- Creates a new instance of the `KNeighborsClassifier` class from scikit-learn.
- Trains this model using our training data.
- Makes predictions based on our test data.
- Calculates the fraction of incorrect predictions (the lower it is, the more accurate our model is).

Implement the described loop for `K` values from 1 to 100:

```
for i in np.arange(1, 101):
    new_model = KNeighborsClassifier(n_neighbors = i)
    new_model.fit(x_training_data, y_training_data)
    new_predictions = new_model.predict(x_test_data)
    error_rates.append(np.mean(new_predictions != y_test_data))
```

Let's visualize how the error rate changes with different values of `K` using matplotlib: `plt.plot(error_rates)`. As you can see from the plot, we reach the minimum error rate at a `K` value of approximately 35. This means that 35 is an appropriate choice for `K`, one that combines simplicity and prediction accuracy.

You can find all the code in the notebook on GitLab: https://gitlab.com/PythonRu/notebooks/-/blob/master/sklearn_kmeans_and_knn.ipynb
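If you prefer not to read the minimum off the plot, you can also pick the best `K` programmatically with `np.argmin`. A minimal, self-contained sketch; `make_classification` here is only a synthetic stand-in for the classified dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the classified dataset.
x, y = make_classification(n_samples=1000, n_features=10, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

# Same elbow loop as above, over K = 1..100.
error_rates = []
for i in range(1, 101):
    m = KNeighborsClassifier(n_neighbors=i).fit(x_tr, y_tr)
    error_rates.append(np.mean(m.predict(x_te) != y_te))

# argmin gives a 0-based position, so add 1 to recover K.
best_k = int(np.argmin(error_rates)) + 1
print(best_k, min(error_rates))
```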

## K-means clustering models

The K-means clustering algorithm is usually the first unsupervised machine learning model that students learn. It allows machine learning practitioners to create groups of data points with similar quantitative characteristics within a dataset. This is useful for tasks such as generating customer segments or identifying high-crime urban areas. In this section, you will learn how to build your first K-means clustering model in Python.

### The dataset used

In this tutorial, we will use a dataset generated with scikit-learn. Let's import the `make_blobs` function from scikit-learn to generate the necessary data. Open the Jupyter Notebook and start your Python script with the following instruction:

```
from sklearn.datasets import make_blobs
```

Now let's use the `make_blobs` function to get some dummy data! Specifically, here's how you can create a dataset of 200 samples that has 2 features and 4 cluster centers, with a standard deviation of 1.8 for each cluster:

```
raw_data = make_blobs(
    n_samples = 200,
    n_features = 2,
    centers = 4,
    cluster_std = 1.8
)
```

If you print the `raw_data` object, you will notice that it is actually a Python tuple. Its first element is a NumPy array with 200 observations, each containing the 2 features we specified in our `make_blobs` call. Now that our data has been created, we can move on to importing the other open-source libraries our script needs.
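A quick sketch to confirm the tuple's structure, using the same `make_blobs` call as above:

```python
from sklearn.datasets import make_blobs

raw_data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.8)

# make_blobs returns a (points, labels) tuple.
print(type(raw_data).__name__)  # tuple
print(raw_data[0].shape)        # (200, 2) - 200 observations, 2 features
print(raw_data[1].shape)        # (200,)   - one cluster index per observation
```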

### Imported libraries

This tutorial will use a number of popular open-source Python libraries, including pandas, NumPy, and matplotlib. Let's continue the script by adding the following imports:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
```

The first group of libraries in this block of code is for working with large datasets, while the second group is for visualizing the results. Now let's move on to creating a visual representation of our dataset.

### Visualization of dataset

In the `make_blobs` call, we specified that our dataset should have 4 cluster centers. The best way to make sure that this is indeed the case is to create a few simple scatter plots. To do this, we use the `plt.scatter` function, passing in all the values from the first column of our dataset as `X` and the corresponding values from the second column as `Y`: `plt.scatter(raw_data[0][:,0], raw_data[0][:,1])`.

Note: your dataset will differ from mine, since the data are randomly generated.

The resulting image seems to indicate that there are only three clusters in our dataset. This happens because two of the clusters are very close to each other. To fix this, we can refer to the second element of the `raw_data` tuple: a NumPy array containing the index of the cluster each observation belongs to. If we use a unique color for each cluster when plotting, we can easily distinguish the 4 groups of observations. Here is the code for this:

```
plt.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1]);
```

Now we can see that there are four unique clusters in our dataset. Let's move on to building our K-means model in Python!

### Creating and training a K-means clustering model

To start using the K-means method, import the appropriate class from scikit-learn. To do this, add the following command to your script:

```
from sklearn.cluster import KMeans
```

Then let's create an instance of the `KMeans` class with the parameter `n_clusters=4` and assign it to the variable `model`:

```
model = KMeans(n_clusters=4)
```

Now let's train our model by calling the `fit` method on it and passing in the first element of our `raw_data` tuple:

```
model.fit(raw_data[0])
```

In the next section, we will look at how to make predictions with the K-means clustering model.

Before moving on, I would like to point out one difference you may have noticed between the model-building process for the K-means method (which is an unsupervised clustering algorithm) and the supervised machine learning algorithms we worked with earlier in this course: we don't need to split the dataset into a training sample and a test sample. This is an important distinction, because you never need to split the dataset this way when building unsupervised machine learning models.

### Applying our K-means clustering model to get predictions

Machine learning professionals typically use clustering algorithms to make two types of predictions:

- Which cluster each data point belongs to.
- Where the center of each cluster is located.

Now that our model is trained, we can easily generate these predictions. First, let's predict which cluster each data point belongs to. To do this, let's access the `labels_` attribute of the `model` object using the dot operator:

```
model.labels_
```

In this way we get a NumPy array with a prediction for each sample:

```
array([3, 2, 1, 1, 1, 3, 2, 1, 0, 0, 0, 0, 0, 0, 3, 2, 1, 2, 1, 3, 3, 3, 3, 1,
       1, 1, 2, 2, 3, 1, 3, 2, 1, 0, 1, 3, 1, 1, 3, 2, 0, 1, 3, 2, 3, 3,
       0, 3, 2, 2, 3, 0, 0, 0, 1, 1, 2, 1, 2, 0, 1, 2, 2, 1, 2, 3, 0, 3,
       0, 2, 0, 0, 1, 1, 0, 3, 2, 3, 2, 0, 1, 2, 0, 2, 0, 3, 3, 0, 3, 3,
       0, 3, 2, 3, 2, 1, 2, 1, 3, 3, 2, 2, 0, 2, 0, 2, 0, 2, 1, 0, 0, 2,
       3, 2, 1, 2, 3, 0, 1, 1, 1, 3, 2, 2, 3, 3, 2, 1, 3, 0, 0, 3, 0, 1,
       1, 3, 1, 0, 1, 1, 0, 3, 2, 0, 3, 0, 1, 2, 1, 2, 1, 2, 2, 3, 2, 1,
       0, 2, 3, 3, 2, 0, 1, 3, 3, 2, 0, 0, 0, 3, 1, 2, 0, 2, 3, 3, 2, 2,
       3, 1, 0, 1, 2, 3, 1, 3, 1, 1, 0, 2, 1, 0, 2, 1, 3, 1, 3, 3, 1, 3,
       0, 3])
```

To find out where the center of each cluster is, access the `cluster_centers_` attribute in a similar way:

```
model.cluster_centers_
```

We get a two-dimensional NumPy array containing the coordinates of each cluster's center. It will look like this:

```
array([[ 5.2662658 , -8.20493969],
       [-9.39837945, -2.36452588],
       [ 8.78032251,  5.1722511 ],
       [ 2.40247618, -2.78480268]])
```

### Visualizing the accuracy of the model predictions

The last thing we will do in this tutorial is visualize the accuracy of our model. We can use the following code:

```
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(10,6))
ax1.set_title('Our predictions')
ax1.scatter(raw_data[0][:,0], raw_data[0][:,1], c=model.labels_)
ax2.set_title('Real values')
ax2.scatter(raw_data[0][:,0], raw_data[0][:,1], c=raw_data[1]);
```

It generates two scatter plots. The first shows the clusters based on the predictions made by our model, and the second uses the actual labels from our dataset. Although the coloring of the two plots differs, you can see that our model did a pretty good job of predicting the clusters in our dataset. You may also notice that the model is not perfect: data points at the edges of the clusters are sometimes misclassified.

One last thing worth mentioning about estimating the accuracy of our model: in this example, we knew which cluster each observation belonged to because we generated the dataset ourselves. This situation is extremely rare. The K-means method is usually applied when neither the number of clusters nor their inherent qualities are known, so machine learning experts use the algorithm to discover patterns in datasets they don't yet know anything about.

You can find all the code in the notebook on GitLab: https://gitlab.com/PythonRu/notebooks/-/blob/master/sklearn_kmeans_and_knn.ipynb
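Since the cluster numbers assigned by K-means rarely match the original labels (which is why the coloring of the two plots differs), eyeballing the plots is only a rough check. One way to quantify agreement is scikit-learn's `adjusted_rand_score`, which ignores the arbitrary numbering. A sketch on synthetic blobs; the hand-picked centers are purely illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Four well-separated, hand-picked centers so the result is easy to verify.
points, true_labels = make_blobs(
    n_samples=200,
    centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
    cluster_std=0.5,
    random_state=0,
)

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

# Cluster numbering is arbitrary, so compare assignments with a
# permutation-invariant score: 1.0 means a perfect match.
score = adjusted_rand_score(true_labels, model.labels_)
print(round(score, 2))
```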

## Final Thoughts

In this tutorial, you learned how to create machine learning models in Python using K-nearest neighbor and K-means methods. Here’s a summary of what you learned about K-nearest neighbor models in Python:

- How classified data is a common teaching tool for K-nearest neighbors problems.
- Why it is important to standardize the dataset when building K-nearest neighbors models.
- How to split a dataset into training and test samples using the `train_test_split` function.
- How to train your first K-nearest neighbors model and how to get its predictions.
- How to evaluate the effectiveness of a K-nearest neighbors model.
- The elbow method for choosing an optimal K value in a K-nearest neighbors model.

And here's a summary of what you learned about K-means clustering models in Python:

- How to generate dummy data in scikit-learn using `make_blobs`.
- How to create and train a K-means clustering model using scikit-learn.
- The fact that unsupervised ML methods don't require you to split the dataset into training and test data.
- How to visualize the effectiveness of the K-means algorithm when you have information about the clusters in advance.