
Different Approaches to the Class Imbalance Issue

by Mark Patterson



Introduction

I was recently working on a project as part of my data science boot camp lesson on supervised machine learning. The dataset came from a competition on DataDriven.com covering water-points, wells, and other water sources in Tanzania. Safe drinking water is still not a guarantee in parts of the world, and that is the case in Tanzania and several other places in East Africa. This was a classification problem with the goal of classifying each water-point as working; not working; or working, but in need of repair. One of the issues I needed to address in this project was how to deal with the fact that these 3 classes of water-points (the target variable) had very different numbers of records: one class had 2 to 8 times as many records as the others.


What is Class Imbalance?

When an imbalance between classes exists, it can cause a number of problems. One potential issue is that the model learns to predict well only on the predominant class. With a large imbalance, this can produce “the accuracy paradox,” where accuracy looks high only because the model is almost always predicting the majority class.


Class imbalances are not uncommon. Think of any problem where you are trying to identify or detect something that doesn’t happen very often, like the incidence of cancer, credit card fraud, or customer churn. There are numerous ways to deal with a class imbalance, and in this blog post I am going to experiment with a few. For a more detailed explanation of some other possible approaches, see the reference provided at the end of this post.


The Context and Data

The original dataset consisted of 59.4K records and 41 variables. After some data cleaning, munging, and shaping, I ended up with a dataset of 35.3K records and 17 variables; after one-hot encoding the text and categorical variables, it was more like 75 variables (a quick illustration of that encoding step is at the end of this section). The target variable was the condition of each water-point (well): functional (class 0), non-functional (class 1), or functional but in need of repair (class 2). There was a class imbalance, as shown in the table below.

For my business case, I was primarily interested in accurately classifying class 1 (the non-functional water-points). Aside from overall accuracy, I was also concerned with recall for class 1, i.e., reducing how often the model predicts a water-point is working when in fact it is not, since that kind of miss could cause repair or replacement to be delayed or neglected.
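As an aside, here is a toy illustration of the one-hot encoding step mentioned above (the column names and values are made up for the example, not the project’s actual variables):

```python
import pandas as pd

# Toy data standing in for the cleaned dataset; real column names differ.
df = pd.DataFrame({
    "region": ["Arusha", "Mwanza", "Arusha"],
    "source_type": ["spring", "borehole", "river"],
    "gps_height": [1390, 280, 1120],
})

# Expand every text/categorical column into 0/1 dummy columns; this is how
# 17 variables can balloon to 75 or more.
categorical_cols = df.select_dtypes(include="object").columns
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print(df_encoded.columns.tolist())
```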


The Experiment

Initially, I did little in the way of pre-processing, left the class imbalance as it was, and ran 4 different classification models. An XGBoost model performed the best, with an accuracy of 0.75 and a recall on class 1 of 0.64. Not bad, but I couldn’t help but wonder how the models would perform if I handled the class imbalance differently. What if I used SMOTE to make the classes the same size? What if I eliminated the 3rd class and just made it a 2-class problem? So, I decided to find out.


The Approaches

I picked out 6 different ways I could handle the class imbalance and decided to test them all using the same classification model: XGBoost. I used the default settings for the model, did a train-test split on my data, fit the model on the training set, and looked at its performance on the test set (accuracy and recall for class 1). A sketch of that shared setup follows.
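To make the setup concrete, here is a minimal sketch of the evaluation harness, assuming the encoded features are in a DataFrame X and the 3-class target is in a Series y (the names, split size, and random seed are placeholders, not the project’s exact settings):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from xgboost import XGBClassifier

# X = encoded feature DataFrame, y = target Series with values 0, 1, 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = XGBClassifier()  # default settings
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
# average=None returns one recall per class; index 1 is class 1
# (the non-functional water-points)
print("recall, class 1:", recall_score(y_test, y_pred, average=None)[1])
```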


The first 3 approaches retained all 3 of the classes. (I must be writing this near dinner time as I have come up with food related names for each of the approaches.)


Approach 1: “all natural” – this was the baseline model mentioned earlier. The XGBoost model was run with the existing class imbalance amongst the 3 classes intact. I am calling this the “all natural” approach because the data is used “as is,” without transformations or fabrication. This approach achieved an accuracy of 0.75 and a recall of 0.64 for class 1.


Approach 2: “artificial additives” – here I used SMOTE (Synthetic Minority Over-sampling Technique) to bring both class 1 and class 2 up to the same number of records as the majority class, class 0 (note that this is done only for the training set). The concept of SMOTE makes me a little uncomfortable because we are “creating” new data (hence my name for this approach). Although it is mathematically grounded, I am still wrapping my head around the idea. This approach had slightly poorer accuracy, at 0.73, with a recall of 0.65 for class 1.
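For reference, the resampling step looks roughly like this with the SMOTE implementation from the imbalanced-learn package, reusing the training split from the harness above (the seed is again a placeholder):

```python
from imblearn.over_sampling import SMOTE

# Oversample only the training data; the test set is left untouched so the
# evaluation still reflects the real class distribution.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Classes 1 and 2 are now synthetically grown to match the size of class 0.
print(y_train_res.value_counts())
```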


Approach 3: “fewer calories” – in this approach, both class 0 and class 1 were “cut” down to the same number of records as the minority class, class 2 (so each class had just 1,800 records). Performance suffered considerably from the loss of so much data: accuracy was 0.45 and recall for class 1 was 0.50.
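One way to do this cut is with imbalanced-learn’s RandomUnderSampler, which randomly drops records from the larger classes until every class matches the smallest one (again a sketch, reusing the same training split and a placeholder seed):

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly discard records from classes 0 and 1 until all three classes are
# the same size as the minority class (class 2).
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)
print(y_train_res.value_counts())
```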


For the next 3 approaches, I decided to see what would happen if I simplified things down to a 2-class problem: either the water-point was working (class 0) or it wasn’t (class 1).


Approach 4: “2 for 1” – this approach grouped the original class 1 and class 2 together as one class. I think this makes sense, as both of these types of water-points need attention (just some sooner than others). This left class 0 with 55% of the records and the new combined class 1 with 45%, so the classes were close to balanced. The model accuracy was not bad at 0.77, and the recall for class 1 was 0.65.
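The relabeling itself is a one-liner, assuming the same target Series y as before:

```python
# Fold the "needs repair" class (2) into "non-functional" (1) so the target
# is binary: 0 = working, 1 = needs attention.
y_merged = y.replace({2: 1})
print(y_merged.value_counts(normalize=True))  # roughly 55% / 45%
```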


Approach 5: “half-off” – in this approach, I simply cut out class 2. That left class 0 with 59% and class 1 with 41% of the records, so a bit of imbalance remained, but I left it at that. Model performance improved slightly, to an accuracy of 0.80 and a recall for class 1 of 0.65.
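Dropping the class is just a filter, again sketched with the placeholder X and y from earlier:

```python
# Keep only the class 0 and class 1 records; class 2 is removed entirely.
mask = y != 2
X_two, y_two = X[mask], y[mask]
print(y_two.value_counts(normalize=True))  # roughly 59% / 41%
```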


Approach 6: “bonus pack” – one final approach was to address the remaining imbalance from Approach 5 with SMOTE (which brought class 1 up to the same size as class 0). Although the accuracy stayed at 0.80, the recall for class 1 increased to 0.68.
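Putting the last two pieces together, here is a compact end-to-end sketch of this approach, reusing X_two and y_two from the previous snippet (split size and seed are still placeholders):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Split the two-class data, oversample only the training portion with SMOTE,
# then fit and score the same default XGBoost model.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_two, y_two, test_size=0.25, random_state=42, stratify=y_two
)
X_tr_res, y_tr_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

model = XGBClassifier().fit(X_tr_res, y_tr_res)
y_pred = model.predict(X_te)

print("accuracy:", accuracy_score(y_te, y_pred))
print("recall, class 1:", recall_score(y_te, y_pred))  # binary recall, pos_label=1
```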


The results from each of the approaches are summarized in the table below. Approach 6 resulted in the best combination of recall and accuracy.

Conclusion

One of the things I have learned in my initial exposure to data science is that there is no “one way” or “best way” to do something. This can make data science concepts challenging to learn, but it also leaves you with a ton of tools and a lot of flexibility. Although one of the approaches I tried (Approach 6) had the best model performance here, the ultimate “best” answer also depends on the business context, the nature of the data (continuous, categorical, etc.), the amount of data, and countless other factors.


How to handle a class imbalance is just one piece of the puzzle in a classification problem. In my project there were other issues to address, like how to handle missing values, how to treat categorical variables, etc. But let’s save all that for a future blog post.


References:

Brownlee, J. (2015, August 19). 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset. Machine Learning Mastery. https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/


