# ML/DL model evaluation: error matrix, accuracy, precision, and recall

In computer vision, object detection is the problem of locating one or more objects in an image. Beyond traditional detection methods, deep learning models such as R-CNN and YOLO achieve impressive results for many types of objects: they take an image as input and return the coordinates of a bounding rectangle around each detected object. This tutorial discusses the error matrix and how the precision, recall, and accuracy metrics are calculated. Here we cover:

• Error matrix for binary classification.
• Error matrix for multiclass classification.
• Calculation of the error matrix using Scikit-learn.
• Accuracy, Precision and Recall.
• Precision or Recall?

## The error matrix for binary classification

In binary classification, each sample belongs to one of two classes. They are usually assigned labels such as 1 and 0, or Positive and Negative. More descriptive class labels can also be used: malignant or benign (e.g., in cancer classification), or pass or fail (e.g., when classifying student test scores). Suppose there is a binary classification problem with `positive` and `negative` classes. Here are the reference (ground-truth) labels for the seven samples used to evaluate the model.

``positive, negative, negative, positive, positive, positive, negative``

Such labels exist primarily to make it easier for us humans to distinguish between classes; the model works with numbers. Usually, when you feed new data to the model, it returns numerical scores rather than class labels. For example, when these seven samples are fed into the model, they receive the following scores:

``0.6, 0.2, 0.55, 0.9, 0.4, 0.8, 0.5``

Based on the resulting scores, each sample is assigned a class. This conversion of numerical scores to labels is done using a threshold. The threshold is a hyperparameter of the model and can be set by the user. For example, if the threshold is 0.5, any score greater than or equal to 0.5 gets the Positive label; otherwise, it gets the Negative one. Here are the classes predicted by the model:

``positive (0.6), negative (0.2), positive (0.55), positive (0.9), negative (0.4), positive (0.8), positive (0.5)``

Comparing the reference and predicted labels, we have 4 correct and 3 incorrect predictions. Note that changing the threshold affects the results: for example, raising it to 0.6 leaves only one incorrect prediction.

```
Reality:     positive, negative, negative, positive, positive, positive, negative
Predictions: positive, negative, positive, positive, negative, positive, positive
```
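This thresholding step can be sketched in a few lines of plain Python (the scores are the seven from this example; the helper name `to_labels` is ours, for illustration):

```python
# The model's raw scores for the seven samples.
scores = [0.6, 0.2, 0.55, 0.9, 0.4, 0.8, 0.5]

def to_labels(scores, threshold):
    # A score greater than or equal to the threshold is labeled positive.
    return ["positive" if s >= threshold else "negative" for s in scores]

print(to_labels(scores, 0.5))
print(to_labels(scores, 0.6))
```

With the threshold at 0.5, the borderline scores 0.5 and 0.55 become positive; raising the threshold to 0.6 flips them to negative.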

A confusion matrix (also called an error matrix) provides additional information about the model’s performance: it helps us visualize where the model “confuses” the two classes. As the following figure shows, it is a 2×2 matrix. The row names represent the reference labels, and the column names represent the predicted labels. The four elements of the matrix (cells in red and green) are four metrics that count the number of correct and incorrect predictions made by the model. Each element is given a label consisting of two words:

1. True or False.
2. Positive or Negative.

A prediction is True when the reference and predicted class labels match, and False when they do not. Positive and Negative refer to the predicted label. Thus, whenever a prediction is wrong, the first word in the cell is False; when it is correct, the first word is True. Our goal is to maximize the metrics with the word “True” (True Positive and True Negative) and minimize the other two (False Positive and False Negative). The four metrics in the error matrix represent the following:

1. Upper Left (True Positive): how many times did the model correctly classify Positive as Positive?
2. Upper Right (False Negative): how many times did the model misclassify Positive as Negative?
3. Bottom Left (False Positive): how many times did the model misclassify Negative as Positive?
4. Bottom Right (True Negative): how many times did the model correctly classify Negative as Negative?

We can calculate these four metrics for the seven predictions we used earlier. The resulting error matrix is shown in the following figure. This is how the error matrix for the binary classification problem is calculated. Now let’s see how to solve this problem for more classes.
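As a sketch, the four counts can also be computed by hand in plain Python (assuming the seven reference labels `positive, negative, negative, positive, positive, positive, negative` from this example and the predictions at threshold 0.5):

```python
y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative", "positive", "positive"]

pairs = list(zip(y_true, y_pred))
tp = sum(t == "positive" and p == "positive" for t, p in pairs)  # correct Positive
fn = sum(t == "positive" and p == "negative" for t, p in pairs)  # missed Positive
fp = sum(t == "negative" and p == "positive" for t, p in pairs)  # false alarm
tn = sum(t == "negative" and p == "negative" for t, p in pairs)  # correct Negative

print(tp, fn, fp, tn)  # 3 1 2 1
```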

## Error matrix for multi-class classification

What if we have more than two classes? How do we calculate these four metrics in the error matrix for a multi-class classification problem? Very simple! Suppose there are 9 samples, each belonging to one of three classes: White, Black, or Red. Here are the reference labels for the 9 samples:

``Red, Black, Red, White, White, Red, Black, Red, White``

And here are the labels predicted by the model:

``Red, White, Black, White, Red, Red, Black, White, Red``

For ease of comparison, they are placed side by side here.

```
Reality:     Red, Black, Red, White, White, Red, Black, Red, White
Predictions: Red, White, Black, White, Red, Red, Black, White, Red
```

Before calculating the error matrix, we must choose a target class. Let’s assign the Red class to this role. It will be marked as Positive, and all others will be marked as Negative.

```
Reality:     Positive, Negative, Positive, Negative, Negative, Positive, Negative, Positive, Negative
Predictions: Positive, Negative, Negative, Negative, Positive, Positive, Negative, Negative, Positive
```

After the substitution, only two classes (Positive and Negative) are left, which allows us to calculate the error matrix as shown in the previous section. Note that the resulting matrix applies to the Red class only. For the White class, replace each of its occurrences with Positive and the labels of all other classes with Negative. We get the following reference and predicted labels:

```
Reality:     Negative, Negative, Negative, Positive, Positive, Negative, Negative, Negative, Positive
Predictions: Negative, Positive, Negative, Positive, Negative, Negative, Negative, Positive, Negative
```

The following diagram shows the error matrix for the White class. In the same way, an error matrix for Black can be obtained.
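The per-class relabeling described above amounts to a one-vs-rest mapping; here is a minimal sketch (the helper name `one_vs_rest` is ours, not from any library):

```python
def one_vs_rest(labels, target):
    # The target class becomes Positive; every other class becomes Negative.
    return ["Positive" if label == target else "Negative" for label in labels]

y_true = ["Red", "Black", "Red", "White", "White", "Red", "Black", "Red", "White"]
print(one_vs_rest(y_true, "Red"))
print(one_vs_rest(y_true, "White"))
```

Applying the same helper to the predicted labels gives the second row of each comparison, after which the binary error matrix for that class can be computed as before.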

## Calculating the error matrix using Scikit-Learn

The popular Python library Scikit-learn has a `metrics` module that can be used to calculate metrics in the error matrix. The `confusion_matrix()` function is used for problems with two classes. We will pass the following parameters to the function:

1. `y_true`: reference labels.
2. `y_pred`: predicted labels.

The following code computes the error matrix for the binary classification example we discussed earlier.

```python
import sklearn.metrics

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative", "positive", "positive"]

r = sklearn.metrics.confusion_matrix(y_true, y_pred)
print(r)
```
```
[[1 2]
 [1 3]]
```

Note that the order of the metrics differs from that described above. For example, the True Positive metric is in the lower right corner, and the True Negative is in the upper left corner. To fix this, we can flip the matrix.

```python
import numpy

# Flip both axes so that True Positive ends up in the upper-left corner.
r = numpy.flip(r)
print(r)
```
```
[[3 1]
 [2 1]]
```

To calculate the error matrix for a problem with more than two classes, the `multilabel_confusion_matrix()` function is used, as shown below. In addition to the `y_true` and `y_pred` parameters, the third parameter `labels` takes a list of class labels.

```python
import sklearn.metrics
import numpy

y_true = ["Red", "Black", "Red", "White", "White", "Red", "Black", "Red", "White"]
y_pred = ["Red", "White", "Black", "White", "Red", "Red", "Black", "White", "Red"]

r = sklearn.metrics.multilabel_confusion_matrix(y_true, y_pred, labels=["White", "Black", "Red"])
print(r)
```
```
[[[4 2]
  [2 1]]

 [[6 1]
  [1 1]]

 [[3 2]
  [2 2]]]
```

The function calculates an error matrix for each class and returns all matrices. Their order corresponds to the order of labels in the `labels` parameter. To change the order of the metrics in the matrices, we will again use the function `numpy.flip()`.

```python
print(numpy.flip(r[0]))  # error matrix for the White class
print(numpy.flip(r[1]))  # error matrix for the Black class
print(numpy.flip(r[2]))  # error matrix for the Red class
```
```
# error matrix for the White class
[[1 2]
 [2 4]]
# error matrix for the Black class
[[1 1]
 [1 6]]
# error matrix for the Red class
[[2 2]
 [2 3]]
```

In the remainder of this text, we will focus on only two classes. The next section discusses three key metrics that are calculated from the error matrix.

## Accuracy, Precision, and Recall

As we have seen, the error matrix offers four individual metrics. From these, other metrics can be calculated that provide additional information about the behavior of the model:

1. Accuracy
2. Precision
3. Recall

The following subsections discuss each of these three metrics.

### Accuracy metric

Accuracy describes the overall performance of the model across all classes: the ratio of the number of correct predictions to the total number of predictions. It is especially useful when each class is equally important. Let’s calculate accuracy with Scikit-learn based on the previously obtained error matrix. The variable `acc` holds the result of dividing the sum of the True Positive and True Negative metrics by the sum of all values in the matrix. An accuracy of 0.5714 means that the model makes a correct prediction 57.14% of the time.

```python
import numpy
import sklearn.metrics

y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative", "positive", "positive"]

r = sklearn.metrics.confusion_matrix(y_true, y_pred)
r = numpy.flip(r)  # True Positive in the upper-left corner, True Negative in the lower-right

# accuracy = (True Positive + True Negative) / total
acc = (r[0][0] + r[-1][-1]) / numpy.sum(r)
print(acc)
# output: 0.5714285714285714
```

The sklearn.metrics module also has an `accuracy_score()` function that computes accuracy directly. It takes the reference and predicted labels as arguments.

``````acc = sklearn.metrics.accuracy_score(y_true, y_pred)
``````

Note that accuracy can be misleading. One such case is imbalanced data. Suppose we have 600 samples, 550 of which are Positive and only 50 Negative. Since most of the samples belong to one class, the accuracy for that class will dominate. If the model makes 530 correct predictions out of 550 for the Positive class, but only 5 out of 50 for the Negative class, then the overall accuracy is (530 + 5) / 600 = 0.8917, i.e., 89.17%. Relying on this value, you might think that for any sample, regardless of its class, the model makes a correct prediction 89.17% of the time. That conclusion is wrong, because the model performs very poorly on the Negative class.
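The arithmetic above is easy to verify, and computing per-class recall exposes what the overall accuracy hides (the counts are the ones from this example):

```python
# 550 Positive samples (530 predicted correctly), 50 Negative samples (5 correct).
correct_pos, total_pos = 530, 550
correct_neg, total_neg = 5, 50

accuracy = (correct_pos + correct_neg) / (total_pos + total_neg)
print(round(accuracy, 4))                  # 0.8917 -- looks impressive...
print(round(correct_pos / total_pos, 4))   # 0.9636 for the Positive class
print(round(correct_neg / total_neg, 4))   # 0.1 for the Negative class -- very poor
```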

### Precision

Precision is the ratio of the number of samples correctly classified as Positive to the total number of samples classified as Positive (both correctly and incorrectly). Precision measures how accurate the model is when it assigns the Positive class. Many incorrect Positive classifications increase the denominator and lower precision. Conversely, precision is high when:

1. The model makes many correct predictions of the Positive class (maximizes the True Positive metric).
2. The model makes fewer incorrect Positive classifications (minimizes False Positive).

Imagine a person who is universally trusted: when he predicts something, those around him believe him. The precision metric is similar. If it is high, you can trust the model when it labels a sample Positive. Thus, precision tells you how reliable the model is when it claims a sample belongs to the Positive class.

In the following image, a green mark indicates that a sample is classified as Positive, and a red cross indicates Negative. The model correctly recognized two Positive samples but incorrectly classified one Negative sample as Positive. Thus, True Positive is 2, False Positive is 1, and precision is 2 / (2 + 1) = 0.667. In other words, when the model says a sample is Positive, it is right 66.7% of the time.

The goal is to classify all Positive samples as Positive while avoiding false classification of Negative as Positive. According to the next figure, if all three Positive samples are predicted correctly but one Negative sample is misclassified, precision is 3 / (3 + 1) = 0.75: the model’s claims that a sample belongs to the Positive class are correct 75% of the time. The only way to reach 100% precision is to classify Positive samples as Positive without ever classifying a Negative sample as Positive.

In Scikit-learn, the sklearn.metrics module provides the `precision_score()` function, which takes the reference and predicted labels as arguments and returns precision. The `pos_label` parameter specifies the label of the Positive class (the default is 1).

```python
import sklearn.metrics

y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative", "negative"]

precision = sklearn.metrics.precision_score(y_true, y_pred, pos_label="positive")
print(precision)
```

The output is `0.6666666666666666`.

### Recall

Recall is calculated as the ratio of the number of Positive samples correctly classified as Positive to the total number of Positive samples. Recall measures the model’s ability to detect samples belonging to the Positive class: the higher the recall, the more Positive samples are found.

Recall cares only about how the Positive samples are classified. Unlike precision, it does not depend on how the Negative samples are predicted: if the model correctly classifies all Positive samples, recall is 100%, even if every member of the Negative class was misclassified as Positive.

Let’s look at some examples. The following image shows 4 different cases (A through D), all with the same recall of 0.667. The cases differ only in how the Negative samples are classified: in case A all Negative samples are correctly identified, while in case D the opposite is true. Regardless of how the model predicts the Negative class, recall concerns only the Positive samples. In each of the 4 cases, 2 of the Positive samples are correctly identified, so True Positive is 2, and False Negative is 1, because one Positive sample is classified as Negative. As a result, recall is 2 / (2 + 1) = 2/3 = 0.667. Since it does not matter how the Negative samples are predicted, it is simplest to ignore them, as the next diagram shows: only Positive samples are considered when calculating recall.

What does a high or low recall mean? If recall is high, the Positive samples are correctly classified, so the model can be trusted to detect members of the Positive class. In the following image, recall is 1.0 because all the Positive samples were classified correctly: True Positive is 3 and False Negative is 0, so recall is 3 / (3 + 0) = 1. This means that the model detected all Positive samples. Since recall ignores how members of the Negative class are predicted, there may still be many misclassified Negative samples (a high False Positive count).

On the other hand, recall is 0.0 if no Positive samples are detected: True Positive is 0 and False Negative is 3, so recall is 0 / (0 + 3) = 0, meaning the model found 0% of the Positive samples. When recall is between 0.0 and 1.0, it reflects the fraction of Positive samples the model classified correctly. For example, if there are 10 Positive samples and recall is 0.6, the model correctly identified 60% of them (0.6 * 10 = 6).

Like `precision_score()`, the `recall_score()` function from the sklearn.metrics module calculates recall. The following code block shows an example of its use.

```python
import sklearn.metrics

y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative", "negative"]

recall = sklearn.metrics.recall_score(y_true, y_pred, pos_label="positive")
print(recall)
```

The output is `0.6666666666666666`. Having defined precision and recall, let us briefly summarize:

• Precision measures the reliability of the model in classifying Positive samples, and recall measures how many Positive samples were correctly predicted by the model.
• Precision takes into account the classification of both Positive and Negative samples. Recall, on the other hand, uses only Positive samples. In other words, Precision depends on both Negative and Positive samples, but Recall only on Positive.
• Precision takes into account when a sample is defined as Positive, but does not care about correct classification of all objects of Positive class. Recall, in turn, takes into account the correctness of prediction of all Positive samples, but does not care about misclassification of Negative representatives as Positive.
• A model with high recall but low precision correctly detects most Positive samples but produces many false positives (Negative samples classified as Positive). A model with high precision but low recall makes highly reliable Positive predictions, but only a few of them.
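This trade-off can be illustrated by sweeping the threshold over the seven scores from the binary example earlier (a sketch in plain Python rather than sklearn; the labels and scores are the ones from that example):

```python
y_true = ["positive", "negative", "negative", "positive", "positive", "positive", "negative"]
scores = [0.6, 0.2, 0.55, 0.9, 0.4, 0.8, 0.5]

results = []
for threshold in (0.5, 0.6, 0.9):
    # Label each score using the current threshold.
    y_pred = ["positive" if s >= threshold else "negative" for s in scores]
    tp = sum(t == p == "positive" for t, p in zip(y_true, y_pred))
    fp = sum(t == "negative" and p == "positive" for t, p in zip(y_true, y_pred))
    fn = sum(t == "positive" and p == "negative" for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    results.append((threshold, round(precision, 2), round(recall, 2)))
    print(threshold, round(precision, 2), round(recall, 2))
```

Raising the threshold from 0.5 to 0.9 pushes precision from 0.6 up to 1.0 while recall falls from 0.75 to 0.25: fewer, but more trustworthy, Positive predictions.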

Some questions to check comprehension:

• If recall is 1.0, and there are 5 objects of class Positive in the dataset, how many Positive samples were correctly classified by the model?
• Given that recall is 0.3, when there are 30 Positive samples in the dataset, how many Positive samples will be predicted correctly?
• If recall is 0.0 and there are 14 Positive samples in the dataset, how many correct predictions of the Positive class were made by the model?

## Precision or Recall?

The choice between precision and recall depends on the type of problem you have. If the goal is to detect all Positive samples (without worrying about whether Negative samples are misclassified as Positive), use recall. Use precision when false positives are costly, i.e., when Negative samples misclassified as Positive matter. Imagine you are given an image and asked to detect all the cars in it. Which metric would you use? Since the goal is to find all the cars, use recall: the model may misclassify some objects as cars, but it will eventually detect all of them. Now suppose you are given a mammography scan and asked to determine the presence of cancer. Which metric would you use here? Since misidentifying a healthy image as malignant is costly, we want to be confident whenever the model classifies an image as Positive (that is, as cancerous). The preferred metric in this case is precision.

## Conclusion

This tutorial discussed the error matrix and how its four metrics (true/false positive/negative) are calculated for binary and multiclass classification problems. Using the metrics module of the Scikit-learn library, we saw how to compute the error matrix in Python. Building on these four metrics, we then discussed accuracy, precision, and recall, defining each one and applying it in several examples with sklearn.metrics.