For many of the projects we do at Visually, data comes from a source that has already done some aggregation. This is both a blessing and a curse. Aggregation definitely simplifies the analysis and visualization process, but it can also greatly reduce the visualization and analysis options. This is because aggregation often destroys connections in data.
Survey data is one of the most common places where connections get destroyed by aggregation. The typical aggregated output we see from survey data looks something like this (data is manufactured):
|Do you like:|
And that can produce visualizations like this:
But this data is missing a lot of valuable information. For example, how likely is it that people who enjoy blueberries also enjoy raspberries? Is there correlation between people who like citrus and people who like berries? The data that can answer these questions is gone from the aggregated summary.
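To make the loss concrete, here is a minimal sketch in Python. The two tiny surveys are invented purely for illustration (they are not the data behind the table above), and the function names are hypothetical. Both surveys collapse to exactly the same totals, yet they tell opposite stories about who likes both fruits.

```python
# Hypothetical raw answers, invented for illustration only.
# Each tuple is one respondent: (likes_blueberry, likes_raspberry).
survey_a = [(True, True), (True, True), (False, False), (False, False)]
survey_b = [(True, False), (True, False), (False, True), (False, True)]

def aggregate(rows):
    """Collapse raw rows into per-fruit 'yes' totals, as a summary table would."""
    return {
        "blueberry": sum(b for b, _ in rows),
        "raspberry": sum(r for _, r in rows),
    }

def likes_both(rows):
    """Count respondents who said yes to both fruits (only possible with raw rows)."""
    return sum(b and r for b, r in rows)

# The aggregated summaries are indistinguishable...
print(aggregate(survey_a) == aggregate(survey_b))
# ...but the overlap between the two groups is completely different.
print(likes_both(survey_a), likes_both(survey_b))
```

Once only the totals are kept, there is no way to tell which of these two surveys produced them.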
If we look at the original survey data (manufactured), we can see all of that information. Of the people who enjoy blueberries, 66.67% also enjoy raspberries. There is no significant correlation between liking citrus and liking berries. But these questions were answered through statistical methods only. Using visualizations built from the original data, we can see the answers to several of these questions.
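As a rough sketch of the kind of statistics involved, the Python below computes a conditional share and a phi coefficient (one common correlation measure for two yes/no answers) from raw rows. The rows are made up for this example and are not the survey data discussed here, so the numbers differ from the ones quoted above. The choice of phi is an assumption, since the article does not say which measure was used.

```python
from math import sqrt

# Hypothetical respondents, invented for illustration only.
responses = [
    {"blueberry": True,  "raspberry": True,  "orange": False},
    {"blueberry": True,  "raspberry": True,  "orange": True},
    {"blueberry": True,  "raspberry": True,  "orange": False},
    {"blueberry": True,  "raspberry": False, "orange": True},
    {"blueberry": False, "raspberry": True,  "orange": True},
    {"blueberry": False, "raspberry": False, "orange": False},
    {"blueberry": False, "raspberry": False, "orange": True},
    {"blueberry": False, "raspberry": False, "orange": False},
]

def conditional_share(rows, given, target):
    """Share of respondents who like `target` among those who like `given`."""
    given_rows = [r for r in rows if r[given]]
    if not given_rows:
        return None  # nobody liked `given`, so the share is undefined
    return sum(r[target] for r in given_rows) / len(given_rows)

def phi_coefficient(rows, a, b):
    """Phi coefficient for two yes/no columns, from the 2x2 table of counts."""
    n11 = sum(r[a] and r[b] for r in rows)
    n10 = sum(r[a] and not r[b] for r in rows)
    n01 = sum((not r[a]) and r[b] for r in rows)
    n00 = sum((not r[a]) and (not r[b]) for r in rows)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

print(conditional_share(responses, "blueberry", "raspberry"))
print(phi_coefficient(responses, "blueberry", "raspberry"))
print(phi_coefficient(responses, "blueberry", "orange"))
```

Neither function can be written against the aggregated totals alone, because both need to know which answers belong to the same person.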
There are several different visualization techniques that open up once we have the original data. Euler diagrams are one of the most basic examples. There are complex set relationships that exist in the original data, and Euler diagrams are good at showing those.
This chart shows us much more detail. We know what proportion of people like blueberries compared to what proportion like raspberries, but we also know the proportions of the subsets of those groups. We can see the proportion of people who like both, and the proportions that like just one but not the other.
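The region sizes in a two-set Euler diagram come straight from set operations on the raw responses. Here is a small Python sketch, assuming respondents are tracked by ID. The IDs and the `euler_regions` helper are hypothetical, and the code only computes the proportions rather than drawing anything.

```python
# Hypothetical respondent IDs, invented for illustration only.
all_respondents = set(range(1, 11))
likes_blueberry = {1, 2, 3, 4, 5, 6}
likes_raspberry = {4, 5, 6, 7, 8}

def euler_regions(a, b, everyone):
    """Proportion of respondents falling in each region of a two-set diagram."""
    counts = {
        "only_a": len(a - b),               # like the first fruit only
        "both": len(a & b),                 # the overlapping region
        "only_b": len(b - a),               # like the second fruit only
        "neither": len(everyone - (a | b)), # outside both circles
    }
    total = len(everyone)
    return {name: count / total for name, count in counts.items()}

regions = euler_regions(likes_blueberry, likes_raspberry, all_respondents)
for name, share in regions.items():
    print(f"{name}: {share:.0%}")
```

The four shares always sum to one, which makes a handy sanity check before handing the numbers to a charting tool.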
Another visualization that can show more complex connections in data is parallel sets. It is best experienced as an interactive version, although static versions can still be informative. The visualization shows the breakdown of fruit preferences as they are connected to each other. The width of the bar at each fruit shows the overall percentages for that fruit, while the width of each band leaving a bar shows the size of the subgroup of people who like or don’t like a particular combination of fruits. The categories can be re-ordered, and the yes/no segments can be flipped.
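Under the hood, the band widths are just counts of each yes/no combination, taken in the order the categories are displayed. The Python below is a loose sketch of that bookkeeping, not the D3 implementation. The responses and the `band_shares` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical respondents, invented for illustration only.
responses = [
    {"blueberry": True,  "raspberry": True,  "orange": False},
    {"blueberry": True,  "raspberry": True,  "orange": False},
    {"blueberry": True,  "raspberry": False, "orange": True},
    {"blueberry": False, "raspberry": False, "orange": True},
]

def band_shares(rows, order):
    """Share of respondents following each yes/no path through `order`."""
    counts = Counter(tuple(row[fruit] for fruit in order) for row in rows)
    total = len(rows)
    return {path: n / total for path, n in counts.items()}

# Bands for the full ordering blueberry -> raspberry -> orange.
full = band_shares(responses, ["blueberry", "raspberry", "orange"])
for path, share in full.items():
    print(path, f"{share:.0%}")

# Re-ordering the categories regroups the same respondents into new bands.
reordered = band_shares(responses, ["orange", "blueberry"])
for path, share in reordered.items():
    print(path, f"{share:.0%}")
```

Passing a single category in `order` gives the bar widths for that fruit, so the same function covers both the bars and the bands.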
(Credit to Jason Davies for the D3 port of Parallel Sets.)
The extra information that can be obtained from these visualizations is important to gaining a full understanding of the data, and it can lead to a much more interesting story, as well as far cooler visualizations.
If you’re gathering data about something, remember to dig deeper into it, and don’t just blindly aggregate it. Data contains lots of important connections that can provide knowledge beyond a simple average or total.
Drew Skau is a Visualization Architect at Visual.ly and a PhD Computer Science Visualization student at UNCC with an undergraduate degree in Architecture. You can follow him on Twitter @SeeingStructure.