
by Mark Patterson





Introduction

They say that a large part of data science is obtaining and cleaning data. After completing several projects as part of my data science bootcamp, I can say from experience this is true. The word “cleaning” makes it sound like a boring, laborious process, but I do not find that to be the case. It provides an opportunity to look at your data closely and consider alternative ways of reshaping and transforming it, so that it meets the requirements for eventual modeling and analysis and, as a result, is of the best quality possible.


On a recent project, I used a dataset representing about 27,000 responses to a public opinion survey related to the 2009 H1N1 pandemic (or swine flu). This survey included questions about behaviors, opinions, and demographics (36 variables in all). Some of the values were represented numerically, while others were categorical and represented by words or phrases (survey responses). Before I could get into classification modeling, I needed to ensure my dataset was in the correct format and of the best quality possible. So, the first step was data cleaning and transforming.


To walk you through the questions I asked, the options I considered, and the decisions I made, I have assembled a checklist of questions to help guide you through the process of cleaning and transforming your data.


These steps are in no particular order, and I’ve provided just a few ideas for ways to address these questions. The important thing is to go through the process of thinking about these potential issues and deciding if and how you will tackle them. Some data transformations will have a larger impact on your data quality than others, and as with many things in data science, there is no “right” way to do something. But I like to ensure that my data is as solid as possible before I spend time on analysis and modeling. And as the old saying goes, “garbage in, garbage out.”



Are there any duplicates in my records?

Let’s start off with an easy one. Sometimes you may find that your dataset has the exact same record listed multiple times. This may have happened because of data entry errors, problems merging data together, etc. The bottom line is that duplicate records are bad and should be removed. There are various ways to do this in Pandas, and you can specify whether you want to keep the first occurrence of the record or the last. This was not a large problem in my dataset, but I did need to remove a handful of duplicate records.
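
As a minimal sketch, dropping duplicates in Pandas might look like this (the file name and DataFrame name are placeholders, not from the original project):

```python
import pandas as pd

# Placeholder file name for the survey responses
df = pd.read_csv("h1n1_survey.csv")

# How many fully identical rows are there?
print(df.duplicated().sum())

# Keep the first occurrence of each duplicated record; keep="last" would
# retain the last occurrence instead
df = df.drop_duplicates(keep="first")
```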


Are my variables meaningful?

Sometimes you may find that your dataset contains variables that are not going to be helpful or are not needed for analysis purposes – for example, a user ID. In my recent project, the dataset included several employment variables (occupations, industries). They each contained about 20 different classes and were basically duplicative of each other. I decided to get rid of this duplication by dropping the occupations and keeping the industries. In Pandas this was a simple drop-column command.
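
For reference, the drop-column command might look something like this (the column names are hypothetical stand-ins):

```python
# axis=1 tells Pandas to drop columns rather than rows
df = df.drop(["respondent_id", "occupation"], axis=1)
```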


Another issue arose with some of the location variables in the dataset. The class values were obscured with random text strings. So even if my modeling and analysis indicated that location was an important factor, I wouldn’t really be able to tell where this was geographically, nor use it for any visualizations. I considered dropping these features, but ultimately decided to keep them, as I would at least be able to tell whether location in general was an important factor.


Are variables the correct datatype / format?

For classification modeling, I needed all of the variables to be in numeric format. However, about a third of the variables in my dataset were still categorical, listing the values as text strings. There are a number of ways to address this. Initially, I wanted to keep the number of variables down to retain the one-column-per-survey-question format of the dataset, so I decided to use the “Ordinal Encoder” to transform the classes into numbers. This method has been suggested as producing better results with tree-based classification models, as there is more “cardinality” for the model to consider (not just a 0 or 1).
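
A minimal sketch of ordinal encoding with scikit-learn’s OrdinalEncoder (the column name and category list below are illustrative stand-ins, and this assumes missing values are handled separately):

```python
from sklearn.preprocessing import OrdinalEncoder

# Listing the categories explicitly preserves their natural order; otherwise
# OrdinalEncoder assigns numbers alphabetically
edu_order = ["< High School", "High School", "Some College", "College Graduate"]
encoder = OrdinalEncoder(categories=[edu_order])
df[["education"]] = encoder.fit_transform(df[["education"]])
```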


"Ordinal Encoder" however is supposed to be used for “ordinal” variables – meaning variables with a meaningful order. For some of my variables this made sense – for example education level or age group. But for something like location or employment industry, there is no order to the various classes. So to address this I decided to go back to my dataset and use “One-Hot Encoding” which is a method that creates a new variable for each class, so that each class is now represented as a 0 or 1 (for example, if record #26 has location A as its value, it gets a 1, otherwise it gets a 0). Using “one-hot-encoding” does increase the size of the dataset (mine went from 36 to 73 variables), but it also can result in improved data quality.


Do any variables need to be re-coded?

In my dataset, I noticed that some of the variables representing opinion questions in the survey were on a 0 to 4 scale, while others were on a 1 to 5 scale. They both contain 5 classes, and they are both ordinal in nature – as they represent levels of agreement with a statement. I saw no reason that these should be different, so I changed the variables with the 1 to 5 range to 0 to 4 to match the other variables. A small change, but potentially a meaningful one.
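
The re-coding itself can be a one-liner (the column names here are hypothetical):

```python
# Shift the 1-to-5 opinion items down by one so every opinion variable
# shares the same 0-to-4 scale
one_to_five_cols = ["opinion_risk", "opinion_effectiveness"]
df[one_to_five_cols] = df[one_to_five_cols] - 1
```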


Do the variable classes make sense?

As mentioned above, sometimes entire variables don’t make sense for a particular analysis, but what about the actual classes within a variable? You may come upon a dataset that has an “other” or “miscellaneous” class. Sometimes it makes sense to combine classes like this if you have them. Another issue I faced with my dataset was that the employment industry variable had 23 different classes. Since these were not interpretable anyway (being just random strings), I decided to bucketize the lower-frequency classes into an “other” class. This kept the number of classes small, which helped when I decided to use one-hot encoding on the employment industry feature.
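
A sketch of the bucketing step (the frequency cutoff of 500 is arbitrary, not a value from the project):

```python
# Group every employment-industry class that appears in fewer than 500
# records into a single "other" class
counts = df["employment_industry"].value_counts()
rare = counts[counts < 500].index.tolist()
df["employment_industry"] = df["employment_industry"].replace(rare, "other")
```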


Are there any missing values?

This is an important consideration, and one that needs to be addressed, as missing values may prevent certain analyses and models from working. In my dataset, I had several features with approximately 50% of the values missing. All told, about half of my 36 variables had some missing values. One approach, which I call the “pure” or “organic” approach, would be to cut out all records with missing values. The drawback is that the dataset would immediately shrink… likely to about 25% of its original size. Since I only had 27,000 records to start with, I did not consider this a good option. What I chose to do instead was to “impute” values for the missing data using the KNN Imputer. This method looks at several records with similar values (the nearest neighbors) and creates a new value based on that calculation. It quickly solves the missing value problem, and while it does introduce some artificial values, they are informed values. There are other ways to “impute” values, but that will need to be covered in a future post.
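
A minimal sketch of imputation with scikit-learn’s KNNImputer (n_neighbors=5 is the library default, not necessarily what was used on the project):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# By this point all columns are numeric, so the imputer can fill each
# missing value based on the most similar records
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```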


Are variables of similar magnitude?

Although my data did not have huge differences in magnitude from feature to feature (with most ranging from 0 to 4), there were a few variables that had values up to 23. Still not that big of a difference. However, some machine learning models require that values be scaled in order to work correctly (typically models that use a distance-based algorithm, such as KNN or SVM). There are several methods that can be used for scaling; I used the Standard Scaler. This resulted in all the values in my dataset ranging from about -2 to +2. This scaling step was also needed for the unsupervised learning method of K-Means clustering, which I used on the dataset to look for patterns and underlying groupings of respondents.
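
A sketch with scikit-learn’s StandardScaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# StandardScaler centers each feature at 0 with unit variance, which is why
# most of the resulting values fall between roughly -2 and +2
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```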


Should I consider creating new features?

Sometimes you may decide that there is a better way to represent a variable or group of variables. For example, perhaps a listing of zip codes for a county could be better represented as groups of zip codes. This might lead you to consider “banding” a feature into a smaller set of buckets. Or you may think that some combination of different features could provide a meaningful feature of its own, which leads to what is called “feature engineering.” On my project I considered creating several new features, including one that would tally a respondent’s “contact level” with others – adding up values like the number of people in the household, whether they were employed, and whether they were a health worker. Some of these features may not be important to classification modeling on their own, but in combination they might provide a stronger signal for some models.
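
As a rough illustration, assuming hypothetical column names (the original post does not list the exact fields), the tally might look like this:

```python
# Add household size to 0/1 indicators for being employed and for being a
# health worker to form a rough "contact level" score
df["contact_level"] = (
    df["household_adults"]
    + df["household_children"]
    + df["employed"]
    + df["health_worker"]
)
```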



Conclusion

The next time you are faced with a new dataset, feel free to use these 8 questions as a guide to cleaning and shaping your data. It is important to fully explore your data and to think through the transformations it may need in order to provide the best analysis and modeling results. As with many aspects of data science, it is an iterative process. You may find yourself doing some modeling and then having to jump back and try a different way of preparing or transforming your base dataset. That is exactly what happened to me on my project: after tweaking how I handled missing values and making some changes to the number of classes, I proceeded with modeling and obtained better results. That won’t always be the case, but it is important to remember that better data quality can often yield better results.




by Mark Patterson



Introduction

I was recently working on a project as part of my data science boot camp lesson on supervised machine learning. It was a dataset from a competition on DataDriven.com of water-points, wells, and other water sources in Tanzania. Safe drinking water is still not a guarantee in certain places in the world, and that is the case in Tanzania and several other places in East Africa. This was a classification problem with the goal of classifying each of the water-points as working; not working; or working, but in need of repair. One of the issues I needed to address as part of this project was how to deal with the fact that these 3 classes of water-points (the target variable) had different numbers of records – one class had 2 to 8 times as many records as the others.


What is Class Imbalance?

When an imbalance between classes exists, it can cause a number of problems. One potential issue is that it can cause a model to predict well only on the predominant class. And if it is a large imbalance, this can result in “the accuracy paradox,” where accuracy might be high, but only because the model is predicting the majority class.


Class imbalances are not uncommon. Think of any problem where you are trying to identify or detect something that doesn’t happen very often, like the incidence of cancer, credit card fraud, or customer churn. There are numerous ways to deal with a class imbalance, and in this blog post I am going to experiment with a few. For a more detailed explanation of some other possible approaches check the reference provided at the end of this post.


The Context and Data

The original dataset consisted of 59.4K records and 41 variables. After some data cleaning, munging, and shaping I ended up with a dataset of 35.3K records and 17 variables. Well, actually after some one-hot encoding of text and categorical variables it was more like 75 variables. The target variable was the condition of each water-point (well). Each water-point was either functional (class 0), non-functional (class 1), or functional, but in need of repair (class 2). There was a class imbalance as shown in the table below.

For my business case, I was primarily interested in accurately classifying class 1 (the non-functional water-points). Aside from overall accuracy, I was also concerned with recall for class 1 (reducing the rate of predicting that a water-point was working when in fact it was not), since that kind of miss may cause repair or replacement to be delayed or neglected.


The Experiment

Initially, I did little in the way of pre-processing, leaving the class imbalance as it was, and ran 4 different classification models. An XGBoost model seemed to perform the best with an accuracy of 0.75 and a recall on class 1 of 0.64. Not bad, but I couldn’t help but wonder how the models would perform if I handled the class imbalance differently… what if I tried SMOTE to make the classes the same size? What if I eliminated the 3rd class and just made it a 2-class problem? So, I decided to find out.


The Approaches

I picked out 6 different ways I could handle the class imbalance and decided to test them all using the same classification model: XGBoost. I used the default settings for the model, conducted a train-test split on my data, and looked at model performance on the test set (accuracy and recall for class 1).
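
A sketch of that setup (the split proportion, random seed, and the X/y variable names are assumptions, not details from the original project):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

# X holds the predictors, y the 3-class water-point status
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = XGBClassifier()  # default settings
model.fit(X_train, y_train)

# classification_report shows overall accuracy and per-class recall
print(classification_report(y_test, model.predict(X_test)))
```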


The first 3 approaches retained all 3 of the classes. (I must be writing this near dinner time as I have come up with food related names for each of the approaches.)


Approach 1: “all natural” – this was the baseline model mentioned earlier. The XGBoost model was run with the existing class imbalance amongst the 3 classes. I am calling this the “all natural” approach because we are using the data “as it is” without transformations or fabrication. The performance of this approach was accuracy of 0.75 and recall of 0.64 for class 1.


Approach 2: “artificial additives” – here I used SMOTE to bring both class 1 and class 2 up to the same number of records as the majority class – class 0 (note that this is only done for the training set). The concept of SMOTE makes me a little uncomfortable, as we are “creating” new data (thus my name for this approach). Although it is mathematically based, I am still trying to wrap my head around the idea. This approach had slightly poorer performance, with accuracy of 0.73 and recall of 0.65 for class 1.
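
A sketch of that step using SMOTE from the imbalanced-learn library, reusing the training split from the earlier sketch (the random seed is an assumption):

```python
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Oversample the training split only, so the test set keeps the real-world
# class distribution
X_train_sm, y_train_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = XGBClassifier()
model.fit(X_train_sm, y_train_sm)
```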


Approach 3: “fewer calories” – in this approach both class 0 and class 1 were “cut” down to the same number of records as the minority class – class 2 (so each class had just 1,800 records). Performance suffered considerably from the loss of so much data: accuracy was 0.45 and recall for class 1 was 0.50.
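
The post does not say exactly how the larger classes were cut down, but random undersampling with imbalanced-learn is one way to do it:

```python
from imblearn.under_sampling import RandomUnderSampler

# Randomly drop records from class 0 and class 1 until every class matches
# the size of the minority class (class 2)
X_train_us, y_train_us = RandomUnderSampler(random_state=42).fit_resample(
    X_train, y_train
)
```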


For the next 3 approaches I decided to see what would happen if I simplified it down to a 2-class problem – either the water-point was working (class 0) or it wasn’t (class 1).


Approach 4: “2 for 1” – this approach grouped the original class 1 and class 2 together as one class. I think this makes sense, as both of these types of water-points need attention (just some sooner than others). This resulted in class 0 holding 55% of the records and the new combined class 1 holding 45%, so it came close to balancing the classes. The model accuracy was not bad at 0.77, and the recall for class 1 was 0.65.
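
The regrouping itself is a one-liner, assuming the target is stored in a pandas Series named y (as in the earlier sketch):

```python
# Fold the original class 2 ("functional, needs repair") into class 1
# ("non-functional") to make the target binary
y_binary = y.replace({2: 1})
```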


Approach 5: “half-off” – in this approach I simply cut out class 2. This resulted in class 0 having 59% and class 1 having 41% of the records, so a bit of imbalance, but in this approach, I left it at that. The model performance improved slightly up to an accuracy of 0.80 and recall for class 1 at 0.65.


Approach 6: “bonus pack” – one final approach was to address the imbalance in approach 5 with SMOTE (which brought class 1 up to the same size as class 0). Although the accuracy remained the same at 0.80, the recall for class 1 increased to 0.68.


The results from each of the approaches are summarized in the table below. Approach 6 resulted in the best performance for recall and accuracy.

Conclusion

One of the things I have learned in my initial exposure to data science is that there is no “one way” or “best way” to do something. This can make data science concepts challenging to learn, but the result is a ton of tools and flexibility in your toolset. Although one of the approaches I tried (Approach 6) had better model performance, the ultimate “best” answer also depends on the business context, the nature of the data (continuous, categorical, etc.), the amount of data, and countless other factors.


How to handle a class imbalance is just one piece of the puzzle in a classification problem. In my project there were other issues to address, like how to handle missing values, how to treat categorical variables, etc. But let’s save all that for a future blog post.


References:

Brownlee, J. (2015, August 19). 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset. Machine Learning Mastery. https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/


by Mark Patterson


Introduction

Back in September I enrolled in a data science boot camp. This is a 5-month program that teaches you all the necessities for becoming a data scientist, including programming in Python, statistics, math, modeling, and many other topics that I have been fearful of. One of the key aspects of the program is the real-world, large-scale projects where you get to apply what you have learned. The first phase of the boot camp focuses on learning computer programming using the Python language. After 3 weeks of intensive reading, discussion, hands-on labs, and good doses of fear and frustration, it was time to devote a week to my first project. In this post I share some of the tools and practices that got me through the ups and downs of this emotional roller coaster.



Setting project boundaries

Having to do one of these projects can at first seem like an insurmountable task. But I found that it helps to fully understand the assignment and expectations. I read the project brief and all provided documentation. I made sure to ask questions of my instructor and project teammate. The next step was to break the project into manageable chunks and figure out a general timeline. My teammate and I tried using an online scrum board – trello.com. Trello allows you to post “cards” on a virtual scrum board to denote what tasks have not been started, what is in progress, and what is completed. It was helpful to break the project up into pieces, but we did not end up using it much after the first couple of days.

The other thing that helps set the boundaries for a project like this is to come up with a well-defined business case and related questions you are trying to answer with your analysis. My teammate and I made sure to select questions we felt were within our capabilities and were doable. Then it was a matter of staying focused and not straying from the questions we were trying to answer.


Communicating with your teammate

For this first project, having a teammate helped. Your teammate is your first go-to when you have a question. Your teammate can help keep you on track. And your teammate can help you through the rough spots and cheer you on. From the start, my partner and I set up twice-daily check-ins via Zoom (morning and late afternoon). This helped ensure we were on track and not repeating work, and it gave us a chance to work through particularly difficult questions. As we got to the end of the project, coordination was key as we put together and posted all of our final deliverables. Go team!


Filling in the gaps with online resources

One of the key lessons they teach you from day one of boot camp is that you will need to “Google” for answers and consult documentation… a lot. And this project allowed me to practice doing that …a lot. The good news is that there are plenty of answers out there on the internet. The bad news is that they are not all clearly explained or demonstrated.

There were 2 online resources that I kept going back to and found particularly useful. I found that I picked up more understanding of concepts by watching YouTube videos. Sure, it takes more time, but I have found the payoff to be better. For example, I learned a lot from the Python Pandas Q&A Series by Data School. Basic concepts were clearly explained and demonstrated with varying levels of complexity, and alternative approaches were typically discussed. I also appreciated the format and clarity of the GeeksforGeeks.org website. It presents answers to specific questions, and this is even reflected in the left navigation for easy reference to related topics. I find this more closely matches my student mindset, as it is task-based as opposed to feature-based (like a lot of documentation). In general, I found that I gravitated to websites for answers when I just needed a reminder on the code.

Keeping organized

One of the marvels of our coding practice is the use of Jupyter Notebooks, a browser-based tool that combines code cells and text (markdown) cells, allowing you to easily provide labels, explanations, and other text alongside your code. It also does a great job of showing charts and plots inline without having to worry about excessive formatting. We made extensive use of Jupyter Notebooks for both working copies of our work and the final project. I got in the habit of creating a new notebook each day, so that things did not get too overwhelming. This did not completely replace my old-school ways of relying on paper and pen, as I still found it helpful to write notes about process steps and details about the progression of data frames used in my analysis.

My teammate and I also made use of Zoom for our check-in meetings and when we needed to talk. Otherwise we exchanged questions and attachments via Slack.


Taking breaks and turning off

One of the big surprises for me was how engaged I got in the project. I was eager to get up early and start working, and I often found myself working late into the night. There is something about the chase to solve a problem that can get the adrenaline flowing. Sometimes you are close to figuring something out and just want to keep going until you get it. Early in the program, our education coach cautioned us to be sure to take breaks and not wear ourselves out. During this project I learned just what she was talking about and had to force myself to take some afternoon walks to clear my head, and to watch a bit of Netflix before I could go to sleep at night. This project gave me a better understanding of the stereotype of the work-engrossed software engineer. It is an interesting phenomenon to get so immersed in your work.


Knowing when to stop

Periodically my teammate and I had to remind ourselves that “OK is good enough.” Although we would have liked to continue improving our analysis and try a few different things, we had to put on the brakes in order to get all of our deliverables completed by the end of the week. It is important to keep in mind that the project is about the journey (or the process), not the destination (the final analysis). As part of this project there were many learnings about using Jupyter Notebooks, ReadMe files, working with GitHub, creating a non-technical presentation, and working together as a team. Above all, there is the meta-goal of learning how to learn.


Conclusion

The good news is I made it through my first project. My teammate and I reached the end of project week with something to show that we were proud of – maybe not my best work ever, but I feel it does a good job of showing my skills at an early stage of the journey. Did I learn anything? Hell yes! I recall sitting down on the couch a few days before the end of the project and jotting down pages of notes as I reflected on the experience. I look forward to 4 more projects and expect to see improvements from repetition and new learnings. And one of the best benefits of all is that I have transformed my fear of a big project into excitement, and I embrace the opportunity to practice, reinforce, and sharpen my skills moving forward.
