Messy Data? These Tools May Help Clean It Up

by Drew Skau

As we recently wrote, one downside to the big open data movement is that many data providers don’t release their data in clean, usable formats. When that happens, several tools and methods can help you clean it up.

Find and Replace
The most basic of these tools is find and replace in any text editor. With some carefully crafted replacements, it is possible to get data fairly clean and into a good format. Look for patterns and repetition in a file. Work around the parts that don’t repeat, and use the existing structure of the data to your advantage.
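
If you would rather script it than click through a dialog, the same idea takes only a few lines of Python. This is a minimal sketch; the file name and the literal search-and-replace strings are made-up examples, not from any particular data set:

# Minimal sketch of scripted find and replace; the file name and the
# literal search/replace strings are hypothetical examples.
with open("counties_raw.txt", encoding="utf-8") as f:
    text = f.read()

# Chain simple, literal replacements that exploit repetition in the file.
text = text.replace("N/A", "")   # drop a repeated placeholder
text = text.replace(";", ",")    # normalize the delimiter

with open("counties_clean.txt", "w", encoding="utf-8") as f:
    f.write(text)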

Regular Expressions
Sometimes you see patterns in a file, but there aren’t exact character matches. In cases like this, Regular Expressions, or regexes, are a great tool. One example where regexes would be useful is the following JSON-like excerpt, where the County field sometimes holds a quoted numeric code instead of a county name:

{
	2010: 10,
	2011: 5,
	2012: 4,
	County: "8762"
},
{
	2010: 0,
	2011: 1,
	2012: 0,
	County: "Alexander"
},
{
	2010: 0,
	2011: 0,
	2012: 0,
	County: "0913"
},
{
	2010: 0,
	2011: 3,
	2012: 0,
	County: "Anson"
},

This is a manufactured example, but let’s say you wanted to replace the numerals in quotes with the word “Unknown” to indicate that the county name is missing. With plain find and replace, each numeric string would have to be replaced individually, but a regex can capture the pattern we see. The regex to find a run of numerals in quotes would be:

"[0-9]*"

The quotes are matched literally, and they are important so that we don’t also match the unquoted numerals elsewhere in the file. The brackets define a character class that matches any single character listed inside them, here any digit from 0 to 9, and the asterisk means that class may repeat zero or more times, so it matches a run of digits of any length. The replace string should be:

"Unknown"

The easiest way to use regexes is probably a text editor that supports them in its find and replace function. Sublime Text 2 is a great cross-platform editor that can find and replace using regexes across multiple files simultaneously. For more information and tutorials on how to use regexes, Regular-Expressions.info has some good resources, and a quick search will turn up plenty of others.

Spreadsheets
Text editors, find and replace, and regexes are powerful tools for manipulating data, but sometimes you need the ability to calculate or perform operations on a per-column basis. For this, any spreadsheet software is the tool to turn to. Spreadsheets offer the capability to easily sort data, calculate new columns, move and delete columns, and aggregate data. One of the major advantages of a spreadsheet is the ability to do math and logic with the existing values.
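
The same per-column math can also be scripted. This is a rough sketch of the idea in Python with pandas, not a spreadsheet itself; the file name and the column names ("2010", "2011", "2012", "County") are assumptions borrowed from the excerpt above:

import pandas as pd

# Rough sketch of spreadsheet-style column work; the file name and
# column names are assumptions based on the earlier excerpt.
df = pd.read_csv("counties.csv")

# Calculate a new column from existing ones, like a spreadsheet formula.
df["Total"] = df["2010"] + df["2011"] + df["2012"]

# Sort rows by the new column and drop one we no longer need.
df = df.sort_values("Total", ascending=False).drop(columns=["2012"])

# Aggregate: total counts per county.
print(df.groupby("County")["Total"].sum())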

Data Wrangler
Another tool for many of these tasks is Data Wrangler, by the Stanford Visualization Group. Data Wrangler has a fantastic interface that automatically finds patterns in your data based on what you select, and it suggests what to do with those patterns. It also learns over time, so the suggestion system is constantly improving; the more people use it, the better it gets. The best way to get started is to watch their intro video.

Data Wrangler is also great for really large data sets. You can use the GUI on a small subset of your data, and then use the generated scripts to run later on the full set. This lets you deal with huge quantities of data that are beyond the capacity of many text editors and spreadsheet programs.
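
You can reproduce that subset-first idea by hand as well: prototype a transform on a few rows, then stream it over the full file so the whole data set never has to sit in memory. This is a rough sketch of the workflow in Python, not Data Wrangler’s actual generated script, and the file names are hypothetical:

import re

# Prototype this transform on a small sample first, then run it here
# over the full file, one line at a time.
def clean_line(line):
    return re.sub(r'"[0-9]*"', '"Unknown"', line)

with open("huge_raw_file.txt", encoding="utf-8") as src, \
     open("huge_clean_file.txt", "w", encoding="utf-8") as dst:
    for line in src:  # streams line by line, never loads the whole file
        dst.write(clean_line(line))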

Flow
These tools overlap: Data Wrangler and spreadsheets can both do column operations easily, text editors and spreadsheets can both find and replace, and all of them support some form of regex for finding structural patterns in the data. Each tool may have a better interface for a given operation, though, so picking the right one for each step can simplify your life.

Oftentimes, a combination of all of these tools is the best way to clean your data. Copying and pasting between spreadsheets and text editors is easy, and it lets you perform each step in whichever tool handles it most efficiently.

There are more complex tools for doing data cleaning and validation, and if you are dealing with intensely serious data that people’s lives depend on, it is probably wise to go with those options. But the majority of us just need to fix some crummy formatting or dirty data, and for that, a piecemeal suite of tools is often the most efficient.

 
Drew Skau is a cleansed PhD Computer Science Visualization student at UNCC, with an undergraduate degree in Architecture.