Technology has made it easier to collect and store vast quantities of data, but how do companies take full advantage of it? Big Data Discovery is designed to help organizations transform, explore, and analyze data to find insights for their business.
Data scientists spend 80% of their time massaging data - finding the right data, cleaning it, and wrangling it into a usable form - and only 20% modelling it to get to business insights. How could we help them flip that ratio?
The design team consisted of four product designers (myself included), a visual designer, and our design director.
I led design for the explore and discover phases of the product, collaborating with the VP of Big Data Discovery, the design director, the Explore product manager, and the development team in Shanghai. I also worked with two other designers and the front-end development team in Boston on the visual design system.
Meet the data scientist: a highly sought-after expert who is part statistician, part programmer, part domain expert, and all unicorn, according to numerous posts trying to define these mythical beings. Like most people, they want to shorten the tedious parts of their job so they can get to the interesting stuff.
Data scientists are used to working with powerful programming languages, spreadsheets, and statistical tools. However, we believed that a visual interface could surface patterns and insights in the data more quickly, especially in the initial exploratory part of the analysis. Visualizing the data at this point was less about aesthetics and storytelling, and more about understanding the shape and quality of the data.
The two questions I wanted to help our users answer right away were: what attributes (variables) were in the data set, and were they in a useful form (was the data clean and in the right format)? I imagined the data scientist as a chef opening up a fridge to assess what ingredients were inside, and whether any had mold on them, before composing a dish.
Unlike a normal design moodboard, in this case we were inspired by R, a popular programming language for statistical computing. This is how R summarizes a data set:
This view allows you to quickly see what columns are in a data set, as well as the distribution of each. What if we could take this information, but present it visually, so it was even faster to see the distribution of data?
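The same idea can be approximated outside of R. Below is a rough, dependency-free sketch of the kind of per-column summary R's `summary()` produces - min/median/max for numeric columns, value counts for categorical ones. The cookie data and variable names here are invented for illustration.

```python
# A minimal sketch of an R-style per-column summary, using only the
# Python standard library. The cookie data below is invented.
from collections import Counter
from statistics import median

boxes_sold = [12, 40, 7, 23, 31]                         # numeric attribute
cookie_type = ["mint", "choc", "mint", "sugar", "choc"]  # categorical attribute

# Numeric columns: the range and center tell you the shape of the data.
numeric_summary = {
    "min": min(boxes_sold),
    "median": median(boxes_sold),
    "max": max(boxes_sold),
}

# Categorical columns: counts per value, most common first.
categorical_summary = Counter(cookie_type).most_common()

print(numeric_summary)   # {'min': 7, 'median': 23, 'max': 40}
print(categorical_summary)
```

Even this text form answers the two questions above at a glance; the product's contribution was rendering the same information as small visualizations.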
Just like wine and food, certain types of data and certain types of charts just go better together. This had the additional benefit of making the dataset easier to scan - for example, if the users were looking for geographical data in the set, they could easily see if there were any maps present.
ex. Boxes of cookies sold. Histogram shows distribution, while min and max help users check the range looks normal.
ex. Type of cookie. Bar chart with top values ranked by count, so most common types are first.
ex. Cookie 5-star rating. Bar chart shown with the data in the order it comes in.
ex. Date cookies were sold. Line charts are best for showing changes over time.
ex. Where cookies are sold. Map with color showing count (choropleth).
"Quality" bar to surface missing or incorrectly formatted values.
Here's an example of how a user was able to quickly discover a data anomaly in a log file. Read about it on Nodalpoint's site.
For most visualization tools, the workflow goes something like this:
1) Choose the chart type (bar, line, etc.) 2) Choose the data attributes for different parts of the chart 3) Change chart settings until things look good
By making the user choose the chart first, this workflow assumes the user knows the right chart to choose from the very beginning. Just like with a single variable, there are best practices for pairing visualizations with different combinations of data variables. Why should the user have to think about what form the data needs to take before they even decide what data they are interested in?
Our workflow for selecting data was updated to:
1) Choose the data attribute(s) of interest to visualize 2) The best visualization is automatically displayed 3) Toggle between all valid visualizations 4) Visualization updates instantly as attributes are added or removed
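The workflow above can be sketched as a rule table keyed on the combination of selected attribute types: the first chart in each list is shown automatically, and the rest become the valid toggles. The rules below are simplified, invented stand-ins for the actual rule set we researched and wrote.

```python
# A hedged sketch of "pick attributes, get the best chart": a rule table
# keyed on the combination of attribute types. These rules are
# simplified stand-ins, not the product's real rule set.
from typing import Dict, List, Tuple

# Combination of attribute types -> valid charts, ranked best-first.
RULES: Dict[Tuple[str, ...], List[str]] = {
    ("numeric",): ["histogram", "box_plot"],
    ("categorical",): ["bar"],
    ("categorical", "numeric"): ["bar", "pie"],
    ("datetime", "numeric"): ["line", "area", "bar"],
    ("numeric", "numeric"): ["scatter", "heatmap"],
}

def valid_charts(attribute_types: List[str]) -> List[str]:
    """All valid charts for the selected attributes, best first."""
    key = tuple(sorted(attribute_types))  # selection order doesn't matter
    return RULES.get(key, ["table"])      # unknown combination: raw table

def best_chart(attribute_types: List[str]) -> str:
    """The chart displayed automatically when attributes are picked."""
    return valid_charts(attribute_types)[0]
```

Because adding or removing an attribute just re-runs the lookup, the instant updates in step 4 come essentially for free.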
By automating the selection of the data visualization, we were able to prevent user error and allow the user to focus on the data itself. We did a great deal of research into best visualization practices to write the set of rules to automate the choice of data visualization. This was a super quick overview of the process, so contact me to learn more!
Big Data Discovery was adopted by CERN, home to the Large Hadron Collider, to help them with their data analysis.