You’ve found the data you want to analyze and have brought it into a data mart. You know how to use visualization software to turn the data into informative pictures. You know the kind of things you want to find out from your data.
Even so, you’re not quite ready to go forward with your analysis.
Before you get started breaking your data down, you need to understand what you have and get it into good shape. Otherwise, you won’t be analyzing anything. You’ll just be sitting in front of your computer, frustrated that your visualizations aren’t telling much of a story.
This is called data cleansing. And though it may seem like an obstacle on the road to turning your data into information, you won’t be able to perform an analysis without doing it.
Here are some steps to follow in order to make sure your data is ready to go.
- Figure out the fields in your data. What does each row contain, and are there any problems with it? Each entry should have a unique key. Is that key duplicated anywhere? Were many fields left empty? Is there any data there at all? Duplicated keys and records riddled with nulls can cause headaches in your analysis. Likewise, a database that was started years ago but never updated probably isn’t worth the effort to analyze. You can head off problems with unexpected missing values by substituting terms like “unknown” or “not applicable” in their place. That way, a record won’t be dropped from your counts just because data is missing, and you’ll have more context for analyzing what is there.
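The key and null checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full profiling tool; the record layout and the `id` field name are hypothetical.

```python
from collections import Counter

# Hypothetical member records; "id" is meant to be the unique key.
rows = [
    {"id": 1, "name": "Ada", "city": "Austin"},
    {"id": 2, "name": "Grace", "city": None},
    {"id": 2, "name": "Grace", "city": "Boston"},  # duplicated key
]

# Flag any key that appears more than once.
key_counts = Counter(r["id"] for r in rows)
dupes = [k for k, n in key_counts.items() if n > 1]

# Replace missing values with an explicit placeholder so the record
# still shows up in counts and groupings.
for r in rows:
    for field, value in r.items():
        if value is None or value == "":
            r[field] = "unknown"

print(dupes)            # → [2]
print(rows[1]["city"])  # → 'unknown'
```

Running checks like these before building any visualization tells you up front whether the key is trustworthy and how much of the data is actually populated.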
- How is your data formatted? This is one place where you need to be a stickler for detail. Are all dates written the same way, or does your data source use “March 21, 2016” and “3-21-16” interchangeably? Are addresses and zip codes entered consistently? Are there suspicious values – like birth dates of “1900-01-01,” quantities of “999999,” and zip codes of “00000”? If items don’t make sense or aren’t entered consistently in a way your visualization software can recognize, the data is worthless for your analysis. You won’t be able to use date or geographical visualization functions, and obviously incorrect outliers will throw off your results.
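A simple way to tame inconsistent formats is to normalize every variant to one canonical form and to keep a list of sentinel values that usually mean “missing” rather than real data. The sketch below handles the two date formats from the example using only the standard library; the format list and the sentinel table are assumptions you would tailor to your own source.

```python
from datetime import datetime

def normalize_date(raw):
    """Try several common formats and return an ISO 8601 date, or None."""
    for fmt in ("%B %d, %Y", "%m-%d-%y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None

# Sentinel values that usually signal "missing," not real measurements.
SUSPECT = {
    "birth_date": {"1900-01-01"},
    "quantity": {"999999"},
    "zip": {"00000"},
}

def is_suspect(field, value):
    """True if the value is a known placeholder rather than real data."""
    return value in SUSPECT.get(field, set())

print(normalize_date("March 21, 2016"))  # → '2016-03-21'
print(normalize_date("3-21-16"))         # → '2016-03-21'
print(is_suspect("zip", "00000"))        # → True
```

Once every date lands in one canonical format, your visualization software’s date functions will work, and the suspect-value table lets you exclude or flag the fake outliers before they skew a chart.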
- Can you validate the data? Before you work hard to get the data into a format that supports better visualizations and tells a better story, make sure the data itself is correct. If you’re analyzing membership data, it makes sense to talk to someone from that department who knows the basics of what the data should say. Were there really 325 new members added in January? If the raw data doesn’t make sense to the people most familiar with it, go back and correct the problem before moving on to further analysis. Hidden issues – such as counting records instead of distinct values, reporting based on outdated data snapshots, or changes in historical business processes – can make reconciling the data mart information challenging. Also, processes that change data may not be reflected everywhere: scripts written to update individual member status values might not align with the date ranges of orders and subscriptions.
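The record-versus-distinct pitfall mentioned above is easy to demonstrate. In this hypothetical example, one member generated two records in January, so a naive row count overstates the new-member total:

```python
# Hypothetical January new-member records; member "M-102" appears twice.
january_members = ["M-101", "M-102", "M-103", "M-102"]

record_count = len(january_members)          # counts every row
distinct_count = len(set(january_members))   # counts unique members

print(record_count, distinct_count)  # → 4 3
```

When the department insists January had 3 new members and your report says 4, a discrepancy like this is often the reason – reconcile the definition of “one member” before trusting either number.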
- Is the data consistently and correctly updated? Even if you fix everything in the data mart, technical processes are not always perfect. You need to monitor the data coming into the data mart for consistency and correctness. Maintain log tables that record when the data was updated and capture any problems that were encountered. You should also use visualization and analysis techniques to monitor ongoing data quality, tracking potential issues like incorrect record counts and missing values. Measure accuracy on a regular basis to make sure all changes are coming through.
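A load-log check of the kind described above might look like the following sketch. The log structure, field names, and the 50% drop threshold are all assumptions – the point is simply to compare each refresh against the previous one and flag anomalies automatically.

```python
from datetime import date

# Hypothetical load log: one entry per data-mart refresh.
load_log = [
    {"run_date": date(2016, 3, 20), "rows_loaded": 10250, "errors": 0},
    {"run_date": date(2016, 3, 21), "rows_loaded": 4980,  "errors": 0},
]

def check_latest_load(log, max_drop=0.5):
    """Flag a refresh that reported errors or whose row count fell sharply."""
    prev, latest = log[-2], log[-1]
    issues = []
    if latest["errors"]:
        issues.append(f"{latest['errors']} errors on {latest['run_date']}")
    if latest["rows_loaded"] < prev["rows_loaded"] * max_drop:
        issues.append("row count dropped more than 50% since last load")
    return issues

print(check_latest_load(load_log))
# → ['row count dropped more than 50% since last load']
```

Running a check like this after every refresh – and charting the logged row counts over time – turns data-quality monitoring into a routine step instead of a surprise during analysis.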