Accuracy is not a good metric in NLP

This week, I attended a talk where the speaker showed that their classifier achieved an accuracy of ~99%. Results like this can make people skeptical, for one of two reasons: (1) the problem being solved is easy, or (2) the wrong metric was used. This talk was not the first time I have seen accuracy used as a metric. In this post, I will explain why I think accuracy can be a misleading metric.

In the NLP community, precision, recall and F1 score are the most widely accepted metrics. However, it is not uncommon to see papers reporting their results in terms of accuracy. The problem with accuracy is that if the classes are not balanced (a situation commonly known as a skewed or imbalanced class distribution), the result does not necessarily measure how good the system is. Typical examples of such data sets are:

  1. Classification of sentences into neutral and polar in sentiment analysis. The proportion of neutral is close to 75%.
  2. Classification of emails into spam and non-spam. 98% of emails are believed to be spam.

Now, suppose we want to develop a classifier that sorts emails into spam and non-spam, and our classifier is a single line of code that always returns spam. If we test this classifier on 10 emails (9 spam and 1 non-spam), the accuracy will be 90%. Note that the accuracy is high only because the test set contains more spam. If we test the same classifier on 10 emails where only 1 is spam and 9 are non-spam, we get 10% accuracy. The problem is that accuracy only counts the proportion of correct predictions among all samples (here, just the true positives, since the classifier never predicts non-spam); it does not distinguish between false positive and false negative errors. A good metric should account for all the types of errors that a classifier makes. Precision, recall and F1 score address these errors and should be the standard for evaluating NLP systems.
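The example above can be sketched in a few lines of Python. This is only an illustration: the function names (always_spam, evaluate), the label strings, and the convention of reporting an undefined ratio as 0.0 are my own assumptions, not part of any library. It computes accuracy along with precision, recall and F1 for each class on the two 10-email test sets.

```python
def always_spam(email):
    # Hypothetical one-line classifier from the text: ignores its input.
    return "spam"


def evaluate(gold, predicted, positive):
    """Accuracy plus precision/recall/F1 for the class named `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# The two test sets described in the text.
test_sets = {
    "mostly spam": ["spam"] * 9 + ["non-spam"],
    "mostly non-spam": ["spam"] + ["non-spam"] * 9,
}

for name, gold in test_sets.items():
    predicted = [always_spam(email=None) for _ in gold]
    for label in ("spam", "non-spam"):
        print(name, label, evaluate(gold, predicted, positive=label))
```

On both test sets the non-spam class gets a recall and F1 of 0, which exposes the fact that this classifier never detects a non-spam email, whatever the accuracy figure says. Looking at the scores per class matters here: for the spam class alone on the first test set, precision and recall are as high as the accuracy.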

If we look at the formulas for precision, recall and F1 score, we have to be able to identify the false positives, false negatives and true positives of the system in order to do the computation correctly. An important question is whether it is possible to identify false positives, false negatives and true positives for every type of NLP problem. I leave the answer to the reader.
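For reference, these are the standard definitions of the three scores for a single class, written with the usual abbreviations TP (true positives), FP (false positives) and FN (false negatives). The abbreviations are my notation, not something fixed by this post.

```latex
\[
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}
           {\mathrm{precision} + \mathrm{recall}}
\]
```

Note that none of the three uses true negatives, so every term depends on being able to say which predictions count as positives for the class in question.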

 
