
Why is my model 90% accurate yet still not working?

Author | Meagvo

Translated by | Marco Wei

Planning | Liu Yan

In binary classification, the nature of the problem sometimes gives the two classes in the original dataset a highly unbalanced distribution of data points. For example, when predicting churn (users who stop using a company's products after some period of time), churned users usually make up a much smaller share than retained users. If the split in this example is eighty-twenty, only 20% of users stop doing business with the company, while the remaining 80% keep using its products.

But the problem is, this 20% churn can be very important to the company.

To take a more concrete example, suppose a gift company has 100,000 customers and each customer generates an average of $50 in value, for a total of $5,000,000. If 20% of those customers stopped buying, the company would lose $1,000,000! Losses on that scale, accumulated over time, would hurt even the largest e-commerce companies or brick-and-mortar retailers. That is why one of the marketing department's main tasks is to predict customer churn and intervene before it happens.

Machine learning for predicting customer churn

If your company has a great data science or data analytics team, congratulations: a good churn prediction model lets you gauge user loyalty one step ahead of the curve, act before customers abandon your products, and perhaps keep those customers for the company.

But when building such binary classification models, two classes with very unbalanced sample sizes often make things tricky, and the accuracy metric most data analysts rely on is not a panacea. To that end, this article reviews the metrics covered in another Towards Data Science article on machine learning performance evaluation by Koo Ping Shuang, and picks out the ones suited to unbalanced binary classification.

What is accuracy?

Accuracy = All correct predictions / All predictions

Accuracy is the proportion of all predictions that are correct. Intuitively that sounds perfectly reasonable, but with unbalanced datasets the situation gets complicated...

For example, say the marketing department gives you last year's customer churn data. There were 100,000 customers in total, of whom 20,000 churned. Now, if we simply predict that all 100,000 customers will still be around at the end of the year, our accuracy is 80,000/100,000, or 80%! Yet we have not predicted a single case of churn. If the split were more extreme, say 90/10 retention to churn, and we still predicted no churn at all, we would have a model with 90% accuracy that never catches a churning customer.
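As a quick illustration, here is a minimal Python sketch of that accuracy trap, assuming scikit-learn is available and using made-up labels that mirror the 80/20 split above:

import numpy as np
from sklearn.metrics import accuracy_score

# 100,000 customers: 20,000 churned (label 1), 80,000 retained (label 0)
y_true = np.array([1] * 20_000 + [0] * 80_000)

# A "model" that naively predicts no churn for everyone
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.8 -- 80% accuracy without catching a single churner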

In the end, we are left staring at a 90%-accurate model that tells us nothing useful.

So, how do you solve this problem?

Besides accuracy, we have other metrics for measuring model performance, and this article focuses on the following three:

Precision

Recall

F1 score

Precision = True Positives / (True Positives + False Positives)

Precision is less intuitive than accuracy, but it tells us how close the model's positive predictions are to the target: correct positive predictions add to the score, while wrong ones drag it down. So if we successfully predict all 20,000 churned users (20,000 true positives), but at the same time 20,000 customers who did not churn are also flagged by the model, that shows up in precision:

No false positives: 20,000 / (20,000 + 0) = 100%

20,000 false positives: 20,000 / (20,000 + 20,000) = 50%
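To make the two scenarios concrete, here is a small sketch computing precision with scikit-learn on synthetic labels that match the numbers above (the data itself is invented for illustration):

import numpy as np
from sklearn.metrics import precision_score

# 20,000 churned customers (1) followed by 80,000 retained customers (0)
y_true = np.array([1] * 20_000 + [0] * 80_000)

# Scenario 1: every churner flagged, no loyal customer flagged -> no false positives
y_pred_clean = np.array([1] * 20_000 + [0] * 80_000)
print(precision_score(y_true, y_pred_clean))   # 1.0

# Scenario 2: every churner flagged, but 20,000 loyal customers flagged as well
y_pred_noisy = np.array([1] * 40_000 + [0] * 60_000)
print(precision_score(y_true, y_pred_noisy))   # 0.5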

False positives are often called Type I errors: negative samples that the model predicts as positive. If you are dealing with an unbalanced dataset and need to keep false positives down, precision helps a lot. For example, suppose we want to give an aggressive treatment to patients diagnosed with cancer; we must be sure that everyone receiving the treatment is actually sick, because giving it to healthy people would be disastrous. In that case we want to minimize false positives, which means maximizing the model's precision.

Recall = True Positives / (True Positives + False Negatives)

If improving precision is about preventing false positives, then raising recall is about reducing false negatives. In statistics, false negatives are called Type II errors: cases predicted as negative that are actually positive. Using the earlier example again, if we successfully predict all the churned customers and miss none of them, we get:

No false negatives: 20,000 / (20,000 + 0) = 100%

If we miss 5,000 churned customers, recall drops, but the denominator of the formula stays the same:

5,000 false negatives: 15,000 / (15,000 + 5,000) = 75%
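The same number can be reproduced with scikit-learn's recall_score; the sketch below uses synthetic labels chosen to match the example:

import numpy as np
from sklearn.metrics import recall_score

# 20,000 churned customers (1) followed by 80,000 retained customers (0)
y_true = np.array([1] * 20_000 + [0] * 80_000)

# The model catches 15,000 churners and misses 5,000 (false negatives)
y_pred = np.array([1] * 15_000 + [0] * 5_000 + [0] * 80_000)

print(recall_score(y_true, y_pred))   # 0.75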

If the dataset you are working with is imbalanced and you need to catch every problem case, recall is a good criterion. In our churn example, we can use it to identify the customers most likely to stop buying and send them an email or notification ahead of time.

In that scenario a false positive just means sending a few extra emails, and you probably won't mind that five hundred perfectly loyal customers receive a redundant message, as long as the reminders help retain the customers who were actually about to churn.

F1 score

The meaning of the F1 score may be less intuitive, but it may well be the metric that suits you best.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1 is effectively a combination of precision and recall: it helps you judge model performance while weighing both false positives and false negatives. If you want to dig deeper, the breakdown of the formula on Wikipedia is a good reference.

If we manage to identify 15,000 of the 20,000 target samples, but 5,000 non-churners are wrongly flagged as positive (false positives) and 5,000 actual churners are missed (false negatives), then the F1 score looks like this:

F1 = 15,000 / (15,000 + 0.5 × (5,000 + 5,000)) = 75%
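The same result comes out of scikit-learn's f1_score; the sketch below uses synthetic labels chosen to produce exactly 15,000 true positives, 5,000 false positives, and 5,000 false negatives:

import numpy as np
from sklearn.metrics import f1_score

# 20,000 churned customers (1) followed by 80,000 retained customers (0)
y_true = np.array([1] * 20_000 + [0] * 80_000)

# 15,000 churners caught, 5,000 churners missed, 5,000 loyal customers wrongly flagged
y_pred = np.array([1] * 15_000 + [0] * 5_000 + [1] * 5_000 + [0] * 75_000)

print(f1_score(y_true, y_pred))   # 0.75 (precision and recall are both 0.75 here)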

The best thing about F1 is that it strikes a sensible balance between precision and recall.

Next steps

By now, this walk-through of an unbalanced dataset should make it clear that accuracy is not necessarily the best criterion. The extreme case is a model with 90% accuracy but zero recall or precision. Taking logistic regression in Python as an example, the following options are worth a look; both are sketched in the code after this list:

SMOTE. The imbalanced-learn package that provides it lets you oversample the minority class or undersample the majority class to balance the class counts.

Weighted logistic regression. By choosing a weight for each class, or balancing the weights automatically according to the class distribution, we control how heavily true positives, false positives, and false negatives count, and thus gain more control over the results.
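Here is a hedged sketch of both options, assuming the imbalanced-learn package (pip install imbalanced-learn) and scikit-learn are installed, and using a synthetic stand-in for a churn dataset since the real features are not shown in this article:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for a churn dataset: roughly 20% positive (churn) class
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

# Option 1: oversample the minority class with SMOTE, then fit as usual
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
model_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# Option 2: keep the data as-is and weight classes inversely to their frequency
model_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

Either way, evaluate the result with precision, recall, or F1 rather than accuracy alone.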

Summary

Training a machine learning algorithm, whether in R or Python, can be tricky when you face an imbalanced classification problem. Hopefully this article helps you spot this potential pitfall in your data analysis and avoid the logical trap. The three main takeaways are:

Accuracy is not a panacea

Identify your business goals

Find the right packages for balancing your data

Original article: https://towardsdatascience.com/why-my-model-with-90-accuracy-doesnt-work-685817a2b0e
