Oracle's Big Data Discovery

Helping data scientists get better acquainted with their datasets


Technology has made it easier to collect and store vast quantities of data, but how do companies take full advantage of it? Big Data Discovery is designed to help organizations transform, explore, and analyze data to find insights for their business.









The problem: modeling over massaging

Data scientists spend 80% of their time massaging the data - finding the right data, cleaning it, wrangling it into a usable form - and only 20% modeling it to get to business insights. How could we help them flip these ratios?

My role

I was one of four product designers on a design team that also included a visual designer and our design director.

Explore

I led design for the explore and discover phases of the product by collaborating with the VP of Big Data Discovery, the design director, the Explore product manager, and the development team in Shanghai. I also worked with two other designers and the front-end development team in Boston on the visual design system.

The target user

Meet the data scientist, a highly sought-after expert who is part statistician, part programmer, part domain expert, and all unicorn, according to numerous posts trying to define these mythical beings. Like most people, they want to shorten the tedious parts of their job so they can get to the interesting stuff.

A visualization is worth 1,000 words

Data scientists are used to working with powerful programming languages, spreadsheets, and statistical tools. However, we believed that a visual interface could surface patterns and insights in the data more quickly, especially in the initial exploratory part of the analysis. Visualizing the data at this point was less about aesthetics and storytelling, and more about understanding the shape and quality of the data.

The two questions I wanted to help our users answer right away were: what attributes (variables) are in the data set, and are they in a useful form (is the data clean and in the right format)? I imagined the data scientist as a chef opening up a fridge to assess what ingredients were inside, and whether any of them had mold on them, before composing a dish.

Designing a data set sniff test

Unlike a normal design moodboard, in this case we were inspired by R, a popular programming language for statistical computing. This is how R summarizes a data set:

[Image: R's summary output for a data set]

This view allows you to quickly see what columns are in a data set, as well as the distribution of each. What if we could take this information, but present it visually, so it was even faster to see the distribution of data?
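To make the idea concrete, here is a minimal sketch of that kind of per-column summary. It is written in Python rather than R, and it is not the product's code: the function name `summarize_column` and the cookie data are made up for illustration.

```python
import statistics
from collections import Counter


def summarize_column(values):
    """Summarize one column: a numeric five-number summary plus mean,
    or the most common values for anything else. None counts as missing."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)

    if present and all(isinstance(v, (int, float)) for v in present):
        # Needs at least two values; quartiles use linear interpolation.
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        return {
            "min": min(present),
            "q1": q1,
            "median": median,
            "mean": statistics.mean(present),
            "q3": q3,
            "max": max(present),
            "missing": missing,
        }

    # Non-numeric column: report the three most frequent values.
    return {"top_values": Counter(present).most_common(3), "missing": missing}


# Hypothetical data set: one numerical and one categorical attribute.
cookies = {
    "boxes_sold": [10, 20, None, 30, 40, 50],
    "cookie_type": ["mint", "mint", "caramel", None, "mint", "lemon"],
}

for name, column in cookies.items():
    print(name, summarize_column(column))
```

Even in this tiny example the output is a wall of numbers per column, which is exactly what we wanted to turn into something you could read at a glance.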

Playing data sommelier

Just like wine and food, certain types of data and certain types of charts just go better together, so we paired each type of attribute with the chart that suits it best. This had the additional benefit of making the data set easier to scan - for example, if users were looking for geographical data in the set, they could easily see if there were any maps present.





Numerical (measures)

ex. Boxes of cookies sold. Histogram shows distribution, while min and max help users check the range looks normal.

Categorical (dimensions)

ex. Type of cookie. Bar chart with top values ranked by count, so most common types are first.

Ordinal

ex. Cookie 5 star rating. Bar chart shown with data in the order it comes in.

Time data

ex. Date cookies were sold. Line charts are best for showing changes over time.

Geographical

ex. Where cookies are sold. Map with color showing count (choropleth).

For all types

"Quality" bar to surface missing or incorrectly formatted values.
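The pairings above boil down to a lookup from attribute type to default chart, plus a count of what feeds the quality bar. Here is a rough sketch in Python, assuming a made-up `DEFAULT_CHART` table and helper names (`quality_counts`, `is_iso_date`); the real product's rules and validation were more involved than this.

```python
from datetime import date

# Hypothetical lookup that mirrors the pairings listed above.
DEFAULT_CHART = {
    "numerical": "histogram",
    "categorical": "bar chart ranked by count",
    "ordinal": "bar chart in natural order",
    "time": "line chart",
    "geographical": "choropleth map",
}


def is_iso_date(value):
    """Illustrative validity check: does the value parse as YYYY-MM-DD?"""
    try:
        date.fromisoformat(value)
        return True
    except (ValueError, TypeError):
        return False


def quality_counts(values, is_valid):
    """Count valid, invalid, and missing values for a quality bar.
    None and the empty string are treated as missing (an assumption)."""
    missing = sum(1 for v in values if v is None or v == "")
    invalid = sum(
        1 for v in values if v is not None and v != "" and not is_valid(v)
    )
    return {
        "valid": len(values) - missing - invalid,
        "invalid": invalid,
        "missing": missing,
    }


# Hypothetical "date cookies were sold" column with some dirty values.
sold_on = ["2016-01-04", "2016-01-05", "", "Jan 6th", None, "2016-01-08"]

print(DEFAULT_CHART["time"])
print(quality_counts(sold_on, is_iso_date))
```

For the sample column this reports three valid values, one invalid value ("Jan 6th"), and two missing ones, which is the kind of breakdown the quality bar shows as proportions.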




A real-life use case

Here's an example of how a user was able to quickly discover a data anomaly in a log file. Read about it on Nodalpoint's site.

Explore use case

A new flow for visualizing data

For most visualization tools, the workflow goes something like this:

1) Choose the chart type (bar, line, etc.)
2) Choose the data attributes for different parts of the chart
3) Change chart settings until things look good

Making the user choose the chart first assumes that the user knows the right chart to choose from the very beginning. Just like with a single variable, there are best practices for pairing visualizations with different combinations of data variables. Why should the user have to think about what form the data needs to take before they even decide what data they are interested in?

Our workflow for selecting data was updated to:

1) Choose the data attribute(s) of interest to visualize
2) The best visualization is automatically displayed
3) Toggle between all valid visualizations
4) Visualization updates instantly as attributes are added or removed
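The attribute-first flow above can be sketched as a small rule table keyed by the types of the selected attributes. This is only an illustration in Python: the `RULES` table, the `Selection` class, and the fallback to a table view are all assumptions, not the rules we actually shipped.

```python
# Hypothetical rules: selected attribute types -> valid charts, best first.
RULES = {
    ("numerical",): ["histogram"],
    ("categorical",): ["bar chart"],
    ("time",): ["line chart"],
    ("geographical",): ["choropleth map"],
    ("categorical", "numerical"): ["bar chart", "box plot"],
    ("numerical", "numerical"): ["scatter plot"],
    ("numerical", "time"): ["line chart", "bar chart"],
    ("geographical", "numerical"): ["choropleth map", "bar chart"],
}


def valid_charts(attribute_types):
    """Return every chart that suits this combination of types, best first."""
    key = tuple(sorted(attribute_types))
    return RULES.get(key, ["table"])  # assumed fallback when no rule matches


class Selection:
    """Tracks the attributes the user has picked (steps 1 and 4)."""

    def __init__(self, schema):
        self.schema = schema  # attribute name -> attribute type
        self.selected = []

    def toggle(self, name):
        """Add the attribute if absent, remove it if already selected."""
        if name in self.selected:
            self.selected.remove(name)
        else:
            self.selected.append(name)

    def charts(self):
        """All valid charts for the current selection (step 3)."""
        return valid_charts(self.schema[n] for n in self.selected)

    def best(self):
        """The chart shown by default (step 2)."""
        return self.charts()[0]


# Hypothetical cookie data set.
selection = Selection(
    {"boxes_sold": "numerical", "sold_on": "time", "cookie_type": "categorical"}
)
selection.toggle("boxes_sold")
print(selection.best())    # one numerical attribute
selection.toggle("sold_on")
print(selection.charts())  # numerical + time
```

The point of the design is that the user only ever touches the selection; the chart is derived from it, so there is no way to end up with a chart type that does not fit the data.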

Explore gif

By automating the selection of the data visualization, we were able to prevent user error and allow the user to focus on the data itself. We did a great deal of research into best visualization practices to write the set of rules to automate the choice of data visualization. This was a super quick overview of the process, so contact me to learn more!

Industry response

Big Data Discovery was adopted by CERN, home to the Large Hadron Collider, to help them with their data analysis.

Big Data Discovery may have been one of the factors that helped Oracle get back onto the Gartner BI Magic Quadrant in 2017.











Want to work together or just say hi? Get in touch at dianaye@gmail.com.