Most people who are interested in data science have heard of the ROC curve or the AUC (area under the curve). But what exactly is the ROC curve, and why is the area under it a good metric for evaluating a classification model?
The theory of the ROC curve
ROC stands for Receiver Operating Characteristic. It was first developed for radar signal detection during World War II: the U.S. used it to improve the accuracy of radar detection of Japanese aircraft, which is where the name comes from. AUC, or area under the curve, is simply the area under the ROC curve.
Before we get to what the ROC curve is, we need to recall what an error matrix (more commonly called a confusion matrix) is. As you can see from the figure above, an error matrix is a combination of your prediction (1 or 0) and the actual value (1 or 0). Depending on the predicted class and whether the classification was correct, the matrix is divided into 4 parts. For example, true positives (TP) are the cases that you correctly classified as positive, and false positives (FP) are the cases that you incorrectly classified as positive.
The error matrix contains only absolute counts. From them, however, we can derive many other metrics expressed as percentages. True Positive Rate (TPR) and False Positive Rate (FPR) are two of them.
True Positive Rate (TPR): the percentage of all actual positives that the model correctly predicts as positive. TPR = TP / (TP + FN).
False Positive Rate (FPR): the percentage of all actual negatives that the model incorrectly predicts as positive. FPR = FP / (FP + TN).
Okay, now let’s move on to the ROC curve!
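To make the two formulas concrete, here is a minimal sketch that counts the four cells of the error matrix and derives TPR and FPR from them. The helper names (confusion_counts, rates) and the toy labels are made up for illustration and are not part of the original article.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def rates(y_true, y_pred):
    """Return (TPR, FPR) using TPR = TP / (TP + FN) and FPR = FP / (FP + TN)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


# Toy data (hypothetical): 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
tpr, fpr = rates(y_true, y_pred)
print('TP=%d FP=%d FN=%d TN=%d' % (tp, fp, fn, tn))
print('TPR=%.3f FPR=%.3f' % (tpr, fpr))
```

Here 3 of the 4 positives are found (TPR = 0.75) and 1 of the 6 negatives is flagged by mistake (FPR is about 0.167).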
What is the ROC curve?
As you can see in the graph, the ROC curve is simply a plot of TPR (vertical axis) against FPR (horizontal axis). So now everything is clear and we can wrap up, right? In all seriousness, there is a lot more to read from the chart.
The first question is this: a model's predictions give us only one pair of TPR and FPR values, so where do all the points on the chart come from? The answer follows from how a classification model works. Suppose you build a classification model, such as a decision tree, to determine whether a stock will rise or fall based on its inputs. The model first calculates the probability of a rise or a fall from the historical data you provide. It then decides, based on a threshold value, which outcome to predict.
Yes, the key word here is threshold. Different thresholds produce different TPR and FPR values, and each such pair is one point on the ROC curve. You can predict “Increase” whenever the probability of the stock rising, derived from historical data, is greater than 50%. You can also raise the threshold and predict “Increase” only when that probability is greater than 90%. With a threshold of 90% instead of 50%, you can be more confident that the stocks flagged as “Increase” will actually rise, but you may miss out on some potentially profitable options.
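The effect of the threshold can be sketched in a few lines. The function name tpr_fpr_at and the probabilities below are invented for illustration; the point is only that the same predicted probabilities yield a different (TPR, FPR) pair for each threshold.

```python
def tpr_fpr_at(y_true, scores, threshold):
    """Predict positive when score > threshold, then return (TPR, FPR)."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives


# Hypothetical predicted probabilities of "Increase" and the actual outcomes
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]

for threshold in (0.9, 0.5, 0.2):
    tpr, fpr = tpr_fpr_at(y_true, scores, threshold)
    print('threshold=%.1f  TPR=%.2f  FPR=%.2f' % (threshold, tpr, fpr))
```

Each printed line is one point of the ROC curve; sweeping the threshold over all values traces out the whole curve.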
What does the blue dashed line on the chart mean?
The larger the area under the curve (AUC), the better the classification. The ideal curve is a vertical line from (0, 0) to (0, 1) followed by a horizontal line to (1, 1), which gives an AUC of 1. This means the model can always distinguish between positive and negative cases. If instead you assign a class at random to each sample, TPR and FPR increase at the same rate. The blue dashed line shows that case: the curve you get when you randomly label each case positive or negative. For this diagonal line the AUC is 0.5.
What happens to TPR and FPR if you change the threshold value? Look at the two points on the ROC curve. The green dot has a very high threshold: you classify a case as positive only if you are 99% sure. The red dot has a somewhat lower threshold: you classify a case as positive if you are 90% sure. How do TPR and FPR change as you move from the green dot to the red dot?
Both increase. When you lower the threshold, the model flags more cases as positive, so TP increases, and with it TP / (TP + FN). At the same time, you inevitably misclassify some negative cases as positive, so FP and FP / (FP + TN) also increase. TPR and FPR therefore move in the same direction, and you need to balance maximizing the coverage of positive cases against minimizing the misclassification of negative cases.
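As a rough check of the two reference values, the sketch below computes the area under the ideal curve and under the diagonal with the trapezoidal rule, and then simulates a classifier whose scores are pure noise. The helpers area_under and roc_points are hypothetical names, the simulated data is random, and ties between scores are ignored for simplicity.

```python
import random


def area_under(points):
    """Trapezoidal area under (fpr, tpr) points ordered by increasing fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def roc_points(y_true, scores):
    """Walk through samples from highest to lowest score, one ROC point per sample."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


ideal_auc = area_under([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
diagonal_auc = area_under([(0.0, 0.0), (1.0, 1.0)])
print('ideal AUC=%.2f, diagonal AUC=%.2f' % (ideal_auc, diagonal_auc))

# Simulated random classifier: the scores carry no information about the labels
random.seed(0)
labels = [random.randint(0, 1) for _ in range(5000)]
noise_scores = [random.random() for _ in labels]
random_auc = area_under(roc_points(labels, noise_scores))
print('random classifier AUC=%.3f' % random_auc)
```

The simulated value will not be exactly 0.5, but it should land close to it, which is why the diagonal is used as the baseline.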
How to choose the optimal point on the ROC curve?
There is no universally optimal point, because the most appropriate threshold depends on what the model is used for. A common rule of thumb, however, is to maximize the difference TPR - FPR (known as Youden's J statistic), which on the graph is the vertical distance between the orange curve and the blue dashed line.
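The rule of maximizing TPR - FPR can be sketched as a simple search over candidate thresholds. This is only an illustration on invented data: best_threshold is a hypothetical helper, every distinct score is tried as a threshold, and a case is predicted positive when its score is at or above the threshold.

```python
def best_threshold(y_true, scores):
    """Return (threshold, TPR - FPR) for the candidate with the largest difference."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best = None
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        j = tp / positives - fp / negatives
        # Keep the first (highest) threshold that reaches the maximum difference
        if best is None or j > best[1]:
            best = (threshold, j)
    return best


# Hypothetical labels and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.50, 0.40, 0.35, 0.20, 0.10]

threshold, j = best_threshold(y_true, scores)
print('best threshold=%.2f with TPR - FPR=%.2f' % (threshold, j))
```

In practice the costs of false positives and false negatives are rarely equal, so this rule is a starting point rather than a final answer.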
Why is the area under the ROC curve a good metric for evaluating a classification model?
A good metric for a machine learning model should reflect the model's true, stable predictive ability. That is, the result should not change much when the test dataset changes. The ROC curve takes into account not only the final class labels but also the predicted probabilities behind them. For example, a case that is classified correctly on a 51% probability could easily be misclassified on a different test dataset, and a metric based only on labels would not show how fragile that prediction is. In addition, the ROC curve captures model performance at all thresholds rather than at a single one. This makes AUC a comprehensive metric for how well the model separates the classes.
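One way to see why AUC does not depend on any single threshold is its ranking interpretation, which is a standard result rather than something stated in this article: AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below uses a hypothetical helper pairwise_auc and invented data, and also shows that squaring the scores (which keeps their order) leaves the value unchanged.

```python
def pairwise_auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.50, 0.40, 0.35, 0.20, 0.10]

auc_original = pairwise_auc(y_true, scores)
# Squaring changes every probability but not the ordering of the cases
auc_squared = pairwise_auc(y_true, [s ** 2 for s in scores])
print('AUC=%.2f, AUC after squaring the scores=%.2f' % (auc_original, auc_squared))
```

Because only the ordering of the scores matters, AUC measures how well the classes are separated rather than how the model behaves at one particular cut-off.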
What is an acceptable AUC value for a classification model?
As shown earlier, assigning classes at random in a binary classification problem gives an AUC of 0.5. A useful binary classifier should therefore have an AUC above 0.5. An AUC above 0.9 is generally considered good, but what counts as acceptable depends heavily on the application.
How do I calculate the AUC and plot the ROC curve in Python?
If you just want to calculate AUC, you can use the metrics module of the sklearn library. If you want to plot the ROC curve for your model results, the scikit-learn documentation includes examples. Here is the code for plotting the ROC curve that I used in this article.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot as plt

# generate dataset for 2 classes
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# divide it into 2 samples
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# train the model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# get predictions
lr_probs = model.predict_proba(testX)
# store probabilities only for positive outcome
lr_probs = lr_probs[:, 1]
# calculate ROC AUC
lr_auc = roc_auc_score(testy, lr_probs)
print('LogisticRegression: ROC AUC=%.3f' % (lr_auc))
# calculate the roc-curve
fpr, tpr, thresholds = roc_curve(testy, lr_probs)
roc_auc = auc(fpr, tpr)
# plot
plt.plot(fpr, tpr, color='darkorange', label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Example ROC-curve')
plt.legend(loc='lower right')
plt.show()
The roc_curve function needs two inputs: the actual y values and the predicted probabilities. Note that it only requires the probability of the positive class, not of both classes. If you need to solve a multi-class classification problem, you can also use scikit-learn, and its documentation has an example of how to plot the curves.
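For the multi-class case, a minimal sketch (not the code used for this article) is to pass the full probability matrix to roc_auc_score with multi_class='ovr', which averages one-vs-rest AUC values over the classes. The dataset settings below, such as three classes and n_informative=4, are assumptions chosen only so that the example runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# generate a synthetic dataset with 3 classes (settings are illustrative)
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=4, random_state=1)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)

# train the model
model = LogisticRegression(max_iter=1000)
model.fit(trainX, trainy)

# keep the probabilities for ALL classes: shape (n_samples, n_classes)
probs = model.predict_proba(testX)

# one-vs-rest AUC, averaged over the three classes with equal weight
macro_auc = roc_auc_score(testy, probs, multi_class='ovr', average='macro')
print('One-vs-rest macro ROC AUC=%.3f' % macro_auc)
```

Unlike the binary case, here the function needs one probability column per class, so the [:, 1] slicing step is skipped.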