
Improving Data Quality with Data Cleaning and Transforming

by Mark Patterson





Introduction

They say that a large part of Data Science is obtaining and cleaning data. After completing several projects as part of my data science bootcamp, I can say from experience this is true. The word “cleaning” makes it sound like a boring, laborious process, but I do not find that to be the case. It provides an opportunity to look at your data closely and consider alternative ways of reshaping and transforming it, so that it meets the requirements for eventual modeling and analysis and ends up being of the best quality possible.


On a recent project, I used a dataset representing about 27,000 responses to a public opinion survey related to the 2009 H1N1 pandemic (or swine flu). This survey included questions about behaviors, opinions, and demographics (36 variables in all). Some of the values were represented numerically, while others were categorical and represented by words or phrases (survey responses). Before I could get into classification modeling, I needed to ensure my dataset was in the correct format and of the best quality possible. So, the first step was data cleaning and transforming.


To walk you through the questions I asked, the options I considered, and the decisions I made, I assembled a checklist of questions to help guide you through the process of cleaning and transforming your data.


These steps are in no particular order, and I’ve provided just a few ideas for ways to address these questions. The important thing is to go through the process of thinking about these potential issues and deciding if and how you will tackle them. Some data transformations will have a larger impact on your data quality than others, and as with many things in data science, there is no one “right” way to do something. But I like to ensure that my data is as solid as possible before I spend time on analysis and modeling. As the old saying goes, “garbage in, garbage out.”



Are there any duplicates in my records?

Let’s start off with an easy one. Sometimes you may find that your dataset has the exact same record listed multiple times. This can happen because of data entry errors, problems merging data together, and so on. The bottom line is that duplicate records are bad and should be removed. There are various ways to do this in Pandas, and you can specify whether you want to keep the first occurrence of the record or the last. This was not a large problem in my dataset, but I did need to remove a handful of duplicate records.
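Here is a minimal sketch of how that might look in Pandas (the file name is just a placeholder, not the actual survey file):

```python
import pandas as pd

# Load the survey data; the file name here is a placeholder
df = pd.read_csv("survey_responses.csv")

# Count exact duplicate rows, then keep only the first occurrence of each
print(f"Duplicate rows found: {df.duplicated().sum()}")
df = df.drop_duplicates(keep="first")
```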


Are my variables meaningful?

Sometimes you may find that your dataset contains variables that are not going to be helpful or are not needed for analysis purposes – for example, a user ID. In my recent project, the dataset included several employment variables (occupations, industries). Each contained about 20 different classes, and they were essentially duplicative of each other. I decided to get rid of this duplication by dropping the occupations and keeping the industries. In Pandas this was a simple drop-column command.
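The drop itself is a one-liner in Pandas (the column name below is an assumed stand-in for the actual occupation field):

```python
# Drop the occupation column and keep the industry column;
# "employment_occupation" is an illustrative column name
df = df.drop(columns=["employment_occupation"])
```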


Another issue arose with some of the location variables in the dataset. The class values were obscured with random text strings. So even if my modeling and analysis indicated that location was an important factor, I wouldn’t really be able to tell where that was geographically, nor use it for any visualizations. I considered dropping these features, but ultimately decided to keep them, as I would at least be able to tell if location in general was an important factor.


Are variables the correct datatype / format?

For classification modeling, I needed all of the variables to be in numeric format. However, about a third of the variables in my dataset were still categorical, with the values listed as text strings. There are a number of ways to address this. Initially, I wanted to keep the number of variables down to retain the one-column-per-survey-question format of the dataset, so I decided to use “Ordinal Encoder” to transform the classes into numbers. This method was suggested as producing better results with tree-based classification models, as there is more “cardinality” for the model to consider (not just a 0 or 1).
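A rough sketch of how this might look with scikit-learn’s OrdinalEncoder (the column name and category order below are illustrative, not the dataset’s actual values):

```python
from sklearn.preprocessing import OrdinalEncoder

# Encode an ordered categorical column as integers (0, 1, 2, ...);
# assumes missing values in this column have already been handled.
# The category order is an illustrative example, not the real survey values.
education_order = [["< 12 Years", "12 Years", "Some College", "College Graduate"]]
encoder = OrdinalEncoder(categories=education_order)
df[["education"]] = encoder.fit_transform(df[["education"]])
```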


"Ordinal Encoder" however is supposed to be used for “ordinal” variables – meaning variables with a meaningful order. For some of my variables this made sense – for example education level or age group. But for something like location or employment industry, there is no order to the various classes. So to address this I decided to go back to my dataset and use “One-Hot Encoding” which is a method that creates a new variable for each class, so that each class is now represented as a 0 or 1 (for example, if record #26 has location A as its value, it gets a 1, otherwise it gets a 0). Using “one-hot-encoding” does increase the size of the dataset (mine went from 36 to 73 variables), but it also can result in improved data quality.


Do any variables need to be re-coded?

In my dataset, I noticed that some of the variables representing opinion questions in the survey were on a 0 to 4 scale, while others were on a 1 to 5 scale. They both contain 5 classes, and they are both ordinal in nature – as they represent levels of agreement with a statement. I saw no reason that these should be different, so I changed the variables with the 1 to 5 range to 0 to 4 to match the other variables. A small change, but potentially a meaningful one.
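In Pandas, shifting a 1-to-5 scale down to 0-to-4 can be as simple as subtracting one (the column names below are placeholders for the actual opinion questions):

```python
# Shift the 1-to-5 opinion columns down by one so they match the 0-to-4 columns;
# these column names are illustrative, not the real survey fields
one_to_five_cols = ["opinion_risk", "opinion_effectiveness"]
df[one_to_five_cols] = df[one_to_five_cols] - 1
```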


Do the variable classes make sense?

As mentioned above, sometimes a variable doesn’t make sense for a particular analysis, but what about the actual classes within a variable? You may come upon a dataset that has an “other” or “miscellaneous” class. Sometimes it makes sense to combine classes like this if you have them. Another issue I faced with my dataset was that the employment industry variable had 23 different classes. Since these were not interpretable anyway (being just random strings), I decided to bucket the lower-frequency classes into an “other” class. This helped keep the number of classes small, which in turn helped when I decided to use one-hot encoding on the employment industry feature.
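A sketch of one way to bucket low-frequency classes into an “other” class (the frequency threshold and column name are illustrative choices, not the values I used):

```python
# Find classes that appear infrequently and collapse them into a single "other" class;
# the threshold of 500 and the column name are illustrative choices
counts = df["employment_industry"].value_counts()
rare_classes = counts[counts < 500].index.tolist()
df["employment_industry"] = df["employment_industry"].replace(rare_classes, "other")
```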


Are there any missing values?

This is an important consideration, and one that needs to be addressed, as missing values may prevent certain analyses and models from working. In my dataset, I had several features with approximately 50% of their values missing. All told, about half of my 36 variables had some missing values. One approach, which I call the “pure” or “organic” approach, would be to drop all records with missing values. The drawback is that the dataset would immediately shrink, likely to about 25% of its original size. Since I only had 27,000 records to start with, I did not consider this a good option. What I chose to do instead was to “impute” values for the missing data using the KNN Imputer. This method looks at several records with similar values (the nearest neighbors) and creates a new value based on them. It quickly solves the missing value problem; it does introduce some artificial values, but they are informed values. There are other ways to “impute” values, but those will need to be covered in a future post.
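A minimal sketch using scikit-learn’s KNNImputer, assuming every column is numeric at this point (the number of neighbors here is just an example setting):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Replace each missing value with one derived from the 5 most similar records;
# assumes all columns have already been converted to numeric format
imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```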


Are variables of similar magnitude?

Although my data did not have huge differences in magnitude from feature to feature (with most values ranging from 0 to 4), there were a few variables that had values up to 23. Still not that big of a difference. However, some machine learning models require that values be scaled in order to work correctly (typically models that use a distance-based algorithm, such as KNN and SVM). There are several methods that can be used for scaling; I used the Standard Scaler. This resulted in all the values in my dataset ranging from about -2 to +2. This scaling step was also needed for the unsupervised K-Means clustering that I used on the dataset to look for patterns and underlying groupings of respondents.
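Here is roughly what that step looks like with scikit-learn’s StandardScaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Center each feature at 0 with unit variance so distance-based models
# (KNN, SVM, K-Means) weigh all features comparably
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```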


Should I consider creating new features?

Sometimes you may decide that there is a better way to represent a variable or group of variables. For example, perhaps a listing of zip codes for a county could be better represented as groups of zip codes. This might lead you to consider “banding” a feature into a smaller set of buckets. Or you may think that some combination of different features could provide a meaningful feature of its own, which leads to what is called “feature engineering.” On my project I considered creating several new features, including one that would tally a respondent’s “contact level” with others by adding up values like the number of people in the household, whether they were employed, and whether they were a health worker. Some of these features may not be important to classification modeling on their own, but summed together they might provide a stronger signal for some models.
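A sketch of what a summed “contact level” feature might look like (the column names are hypothetical stand-ins, not the actual survey fields):

```python
# Combine several related columns into one engineered "contact level" feature;
# these column names are illustrative placeholders for the real survey fields
df["contact_level"] = (
    df["household_adults"]
    + df["household_children"]
    + df["is_employed"]       # assumed 0/1 indicator
    + df["is_health_worker"]  # assumed 0/1 indicator
)
```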



Conclusion

The next time you are faced with a new dataset, feel free to use these 8 questions as a guide to cleaning and shaping your data. It is important to fully explore your data and to think through the transformations it may need in order to provide the best analysis and modeling results. As with many aspects of data science, it is an iterative process. You may find yourself doing some modeling and then having to jump back to try a different way of preparing or transforming your base dataset. That is exactly what happened on my project: after tweaking how I handled missing values and making some changes to the number of classes, I proceeded with modeling and obtained better results. That won’t always be the case, but it is important to remember that better data quality can often yield better results.


