The Great Dashboard Series: Yes, you need clean data. Yes, we can show you how to clean your data
Cleaning data is a crucial part of your data analysis. Bad quality data can impact your entire analysis. It can even make the best algorithm fail.
Data cleaning is the process of preparing data sets for further analysis by removing incorrect entries, duplicate data and incomplete values, and by fixing improper formats.
Data cleaning is a vital step in generating accurate, valid, and complete data. And if you don’t take my word for it, here’s what our team had to say when we asked them for their opinion on data cleaning:
Phuong: “As unsexy as it sounds, having clean, accurate data will transform your organization more than having “big data” — it’s the building block for all your KPIs and metrics. If you don’t have clean data, you don’t have reliable metrics… and nothing’s unsexier than having a report/dashboard that is completely wrong and pointing your company into the unknown abyss.”
Ben: “Clean Data = (1) data that can be used for its intended purpose and (2) data whose acquisition, storage, and management are sustainable.”
(I bet you can guess who is in charge of Design and who is in charge of Data Analytics in our company *wink, wink*)
How to Clean Data?
So now that you know why data cleaning is important, we have a five-step guide that you can follow whenever you’re in doubt. Below is our how-to-clean-data guide:
1. Remove Irrelevant and Duplicate Values
Data that does not make sense in the context of your analysis should be removed. Anything that is useless or irrelevant should be deleted, e.g. if you’re looking at staff turnover, you will not need staff phone numbers. Additionally, check that you have not entered the same data twice. If you have, delete the duplicates, because they will skew your results.
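To make this step concrete, here is a minimal sketch in plain Python. The staff records and field names (`staff_id`, `department`, `left_company`, `phone`) are made up for illustration, and we assume a duplicate means a row whose relevant fields all match an earlier row.

```python
# Hypothetical staff records; the field names are illustrative only.
records = [
    {"staff_id": 1, "department": "Sales", "left_company": True, "phone": "555-0101"},
    {"staff_id": 2, "department": "Design", "left_company": False, "phone": "555-0102"},
    {"staff_id": 1, "department": "Sales", "left_company": True, "phone": "555-0101"},  # entered twice
]

# For a turnover analysis the phone number is irrelevant, so it is left out.
RELEVANT_FIELDS = ("staff_id", "department", "left_company")


def drop_irrelevant_and_duplicates(rows, fields):
    """Keep only the given fields and the first occurrence of each row."""
    seen = set()
    cleaned = []
    for row in rows:
        trimmed = tuple(row[f] for f in fields)
        if trimmed in seen:
            continue  # same values as a row we already kept
        seen.add(trimmed)
        cleaned.append(dict(zip(fields, trimmed)))
    return cleaned


cleaned_records = drop_irrelevant_and_duplicates(records, RELEVANT_FIELDS)
```

Here `cleaned_records` ends up with two rows and no `phone` field. Whether "same relevant fields" is the right definition of a duplicate depends on your data, so treat it as a starting point.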
2. Fix Structural Errors
If you’re combining data from different sources, you can easily end up with typos, inconsistent capitalization and mismatched naming conventions.
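As a rough illustration, the sketch below lines up field names from two sources before combining them. The two exports (`crm_rows`, `billing_rows`) and their field names are hypothetical, and the lower-case-with-underscores convention is just one assumption; any convention works as long as you apply it everywhere.

```python
# Two hypothetical exports that describe the same thing with different field names.
crm_rows = [{"Customer Name": "Ada", "City": "New York"}]
billing_rows = [{"customer_name": "Grace", "city ": "Boston"}]


def normalise_key(key):
    """Trim a field name, lower-case it, and join its words with underscores."""
    return "_".join(key.strip().lower().split())


def harmonise(rows):
    """Return copies of the rows with every field name normalised."""
    return [{normalise_key(k): v for k, v in row.items()} for row in rows]


combined = harmonise(crm_rows) + harmonise(billing_rows)
```

After this, every row in `combined` uses the same two field names, `customer_name` and `city`, so the sources can be analysed together.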
3. Correct Typos
Typos happen because of human error and they can have a large impact on the way your algorithm works. For example, if an algorithm is case sensitive, it will think that “New York” and “New york” are two different entries - giving you incorrect results.
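The “New York” versus “New york” problem can be sketched like this. The city list and the `KNOWN_TYPOS` lookup are invented for the example; in practice you would build such a lookup by reviewing your own data.

```python
cities = ["New York", "New york", " new york ", "Boston", "Bostn"]

# Hypothetical lookup of known misspellings, keyed by the lower-cased value.
KNOWN_TYPOS = {"bostn": "boston"}


def canonical(value):
    """Trim, collapse repeated whitespace, lower-case, then apply known typo fixes."""
    key = " ".join(value.split()).lower()
    return KNOWN_TYPOS.get(key, key)


# Count entries per city after cleaning; title() restores display capitalization.
counts = {}
for city in cities:
    name = canonical(city).title()
    counts[name] = counts.get(name, 0) + 1
```

Without the cleaning step these five entries would count as five different cities. With it, `counts` holds just two: New York and Boston.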
4. Account for Outliers
If you notice an entry that is way off compared to most of the data, you’re dealing with an outlier. If you’re certain that it is an improper entry, you can remove it. Just check first how removing it changes your results, because a genuine but extreme value may be exactly what your analysis needs to capture.
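One common way to spot candidates is the interquartile range (IQR) rule, sketched below. This is a widely used convention rather than the only option, and the order counts and the multiplier `k=1.5` are assumptions for illustration. The sketch only flags values; deciding whether to remove them is still up to you.

```python
import statistics

# Hypothetical daily order counts; 500 looks suspicious next to the rest.
values = [52, 48, 50, 51, 49, 53, 47, 500]


def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in data if x < low or x > high]


flagged = iqr_outliers(values)

# Compare the average with and without the flagged values to see the impact.
mean_with = statistics.mean(values)
mean_without = statistics.mean([x for x in values if x not in flagged])
```

Comparing `mean_with` and `mean_without` shows how much a single entry can move your average, which is a quick check before you delete anything.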
5. Take Care of Missing Values
Missing data can influence the accuracy of your data set. You can deal with missing data in one of the following ways:
Impute missing values: you can fill in the missing data either by using a statistical approach to estimate an approximate value (linear regression, an arithmetic average, etc.) or by using data from a similar dataset.
Highlight missing data: you can flag your data so that your algorithm knows it’s dealing with a missing value, e.g. by filling numeric fields with 0 and marking those rows, so that the placeholder is not mistaken for a real measurement.
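Both options can be combined, as in the minimal sketch below: missing values are filled with the arithmetic average of the known ones, and an extra field records which rows were filled. The survey rows and the `_was_missing` suffix are hypothetical, and we assume `None` marks a missing value.

```python
import statistics

# Hypothetical survey rows; None marks a missing age.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": None},
    {"name": "Alan", "age": 42},
]


def impute_mean_with_flag(records, field):
    """Fill missing values with the mean of the known ones and flag the filled rows."""
    known = [r[field] for r in records if r[field] is not None]
    fill = statistics.mean(known)
    result = []
    for r in records:
        missing = r[field] is None
        filled = dict(r)  # copy, so the original rows stay untouched
        filled[field] = fill if missing else r[field]
        filled[field + "_was_missing"] = missing
        result.append(filled)
    return result


completed = impute_mean_with_flag(rows, "age")
```

Keeping the flag means anyone reading the result can still tell an estimated value from a recorded one, and can exclude the estimates if they matter for a particular report.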
And since your analysis is only as good as your data, you need to make sure that you are using clean and properly formatted data. But beyond making sure you don’t submit an incorrect report to your bosses, we suggest you start building a data-driven culture in your organization.
You can read our suggestions about this approach in our article: How to Align Your Business Strategy With Data.
The Great Dashboard Series is a tribute to the dashboards we have scrapped and to the countless hours we spent creating complex but beautiful graphs, only to have people still ask what we were trying to show.