  • 00:15

    MARTIN KRZYWINSKI: When you think of data visualization, what may come to mind are a scatterplot, a bar chart, a boxplot, or a network diagram. These are all data encodings, methods that relate data values to the position, sizes, and shapes of lines or symbols that appear in your figure or on the screen.

  • 00:36

    MARTIN KRZYWINSKI [continued]: And there are many data encodings, so which do you choose? How do you answer this question? First, the encoding has to answer relevant questions about the data that are difficult or impossible to answer by staring at the data itself. The encoding itself may be of the data or some transformation of the data that addresses your questions.

  • 00:57

    MARTIN KRZYWINSKI [continued]: Just because you have a network doesn't mean you should automatically draw a force-directed hairball. Second, it should accommodate uncertainty in your data. How do you incorporate uncertainty in a scatterplot? That's easy-- error bars. What about a pie chart? Well, we'll come back to that later.

  • 01:18

    MARTIN KRZYWINSKI [continued]: And third, it should be flexible enough to address questions that you haven't thought of yet. This sounds vague. I realize that. What I mean is that if the encoding warps the data or doesn't at least try to limit occlusion, this phenomenon where points overlap and hide behind each other, it's likely to be less useful.

  • 01:39

    MARTIN KRZYWINSKI [continued]: Here's the traditional encoding of a train schedule. Everyone knows how to read it. It makes a few things really easy-- what time is my train, what track is it on. I want to go to Brussels. Any trains leaving soon? It makes some things really hard, though. How fast is my train? When will I get to my destination? What's the most efficient way to get to Brussels? Now, here's an encoding of a train schedule

  • 02:00

    MARTIN KRZYWINSKI [continued]: that can answer all these questions pretty easily. Stations are represented by horizontal lines, time by vertical lines. The trains are the criss-cross lines. Want to go from Paris to Lyon? Here's one way. Leave Paris at 11:00 to get to Dijon at 5:30, then take the 6:00 P.M. train to Lyon to arrive at 10:00.

  • 02:20

    MARTIN KRZYWINSKI [continued]: But is that the fastest way? Well, if you saw this train on the schedule, then by the steep angle, you'd know that it's a lot faster than the previous choice. It would be worth waiting until 1:00 to spend only three hours traveling. You'd get to Lyon at 4:00. It gets more interesting. Suppose you're working for the train company, and you have the budget to add a line.
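
As a rough check of the two itineraries described above, the travel times can be worked out directly. This is only a sketch: it assumes the spoken times read on a 24-hour clock as 11:00, 17:30, 18:00 and 22:00 for the slower route and 13:00 and 16:00 for the faster one, and the helper names are illustrative.

```python
def to_minutes(hhmm):
    """Convert an 'HH:MM' string on a 24-hour clock to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def duration(depart, arrive):
    """Elapsed minutes between a departure and a same-day arrival."""
    return to_minutes(arrive) - to_minutes(depart)


# Assumed 24-hour readings of the times mentioned in the talk.
slow_total = duration("11:00", "22:00")  # Paris -> Dijon -> Lyon, wait included
slow_wait = duration("17:30", "18:00")   # time spent waiting in Dijon
fast_total = duration("13:00", "16:00")  # the steeper line, Paris -> Lyon

# Both routes cover the same distance, so the fast train's average
# door-to-door speed is higher by the ratio of the two durations.
speed_ratio = slow_total / fast_total
```

Under these assumptions the slower itinerary takes 11 hours against 3, so leaving two hours later still gets you to Lyon six hours earlier.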

  • 02:41

    MARTIN KRZYWINSKI [continued]: What destinations would you pick? Well, if you look at Dijon, there are no trains arriving here between 5:30 and 10:00. That's a big service blackout. By putting a train from Paris to arrive at about 8:00 in the middle of this blackout window, you'd have a rational reason for your answer. If the encoding included more information,

  • 03:02

    MARTIN KRZYWINSKI [continued]: such as the average number of people who travel on each leg encoded by, say, the thickness of each line, you could address congestion. Here's a similar example using an encoding you're likely to come across more often in your work than train schedules, unless you work for the train company. What can you say about the two pie charts? Clearly you're meant to compare them, but that's hard.

  • 03:25

    MARTIN KRZYWINSKI [continued]: There are a lot of numbers here, totals and intersections. There is even a non-category of data that didn't make it into the pie chart. Pretty much the only thing you can quickly get is that the yellow circle, SAC, is quite a bit smaller in the right pie chart. How much smaller? Good luck with that. We're pretty bad at judging areas.

  • 03:46

    MARTIN KRZYWINSKI [continued]: Ah, but the authors did include the values, 31% and 10%. That's helpful, but emphasizes the fact that the graphical encoding isn't very quantitative. It would be great to get these numbers, or at least the 3-to-1 ratio, graphically. This is the so-called UpSet encoding of these two pie charts.

  • 04:07

    MARTIN KRZYWINSKI [continued]: The counts in each intersection are shown by the vertical bar plot. The horizontal bar plot shows the category totals. So now how much smaller is the SAC circle? Trivial to tell. You can read this off the bar chart. This encoding presents data visually, but in a way that allows you to judge and compute by looking at lengths

  • 04:28

    MARTIN KRZYWINSKI [continued]: of the shapes, the bars. This is great and the very purpose of data encoding. But it allows you to do more. You can now reorder the categories by descending count in one of the scenarios. On the right is the UpSet encoding sorted this way for the task-modulated units shown in light gray. This is obviously impossible to do for the Venn diagram.
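
The two bar plots described above come from simple counts, which can be sketched as follows. The membership data and the category names A, B and C are made up for illustration and are not the values from the figure in the talk.

```python
from collections import Counter

# Hypothetical data: each entry is the set of categories one item belongs to.
memberships = [
    {"A"}, {"A", "B"}, {"A", "B"}, {"B"},
    {"A", "B", "C"}, {"C"}, {"A"}, {"A"},
]

# Vertical bars: how many items fall in each exact combination of categories.
intersection_counts = Counter("&".join(sorted(m)) for m in memberships)

# Horizontal bars: how many items each category contains in total.
category_totals = Counter(c for m in memberships for c in m)

# Reorder the intersections by descending count (ties broken by label),
# which is the kind of sorting a Venn diagram cannot express.
rows = sorted(intersection_counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because every quantity ends up as a bar length, both the intersections and the totals can be compared without judging areas.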

  • 04:51

    MARTIN KRZYWINSKI [continued]: Oh, but it gets better. Look at this. Or as I would suggest in some cases, don't look at this. It's a four-way Venn diagram. Trying to find a good use for this is like figuring out what to do with spoiled food. I certainly wouldn't feed it to my dog. It looks like something from an Attenborough documentary about sexual selection and mating

  • 05:12

    MARTIN KRZYWINSKI [continued]: plumage of Amazonian birds. The UpSet encoding easily scales to this data. It doesn't even have to try. As before, we can sort by decreasing count. Look, we even get a distribution from the thing.

  • 05:41

    MARTIN KRZYWINSKI [continued]: Let's back up all the way to the mother of all encodings, the table. Arguably, this is as simple as data encoding gets-- more of a presentation, really. There are no graphics, only aligned numbers. Sometimes a table is all that you need. If you're comparing a small number of values, you don't need a plot. To make comparing the numbers easier,

  • 06:02

    MARTIN KRZYWINSKI [continued]: always align them, one of the strategies that we'll cover in the segment about design. Let's explore how we can graphically represent the values in this table. Bars powerfully communicate magnitude because they use length, and thus the amount of ink on the page, to encode magnitude. This looks pretty good. But look how much more we're showing on the page,

  • 06:24

    MARTIN KRZYWINSKI [continued]: just to present five numbers. All this stuff shown here, is it really necessary? How would you go about thinking about answering this? One concept to always keep in mind is this so-called data-to-ink ratio. Ask yourself what ink on the page is directly related to data values and what ink is used for labels, grids, navigational components,

  • 06:49

    MARTIN KRZYWINSKI [continued]: and other design elements. Then consider how you maximize the ink used for the data and minimize the ink used for everything else. And typically, the total amount of ink on the page should be controlled, too, as far as you can without loss of accuracy, precision, and clarity.
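
One way to make the data-to-ink idea concrete is to treat it as a simple fraction: ink that encodes data values divided by all ink on the page. The sketch below assumes ink can be tallied in arbitrary units (for example, pixel counts); the function name and the numbers are hypothetical.

```python
def data_ink_ratio(data_ink, other_ink):
    """Fraction of all ink on the page that directly encodes data values."""
    total = data_ink + other_ink
    if total <= 0:
        raise ValueError("a figure needs some ink to have a ratio")
    return data_ink / total


# Hypothetical tallies for the same bars drawn two ways.
with_grid_and_ticks = data_ink_ratio(data_ink=40, other_ink=60)
labels_only = data_ink_ratio(data_ink=40, other_ink=10)

# Total ink also drops, which is the second quantity worth controlling.
ink_saved = (40 + 60) - (40 + 10)
```

The ratio alone does not say whether the figure is still clear, so it is a guide for trimming rather than a score to maximize blindly.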

  • 07:10

    MARTIN KRZYWINSKI [continued]: Now, sometimes you're going to need to use a little bit of extra ink to clarify or avoid confusion. That's fine. Do this. The speed at which your readers understand your message and the depth of this understanding are part of the data-to-ink ratio. So think data and its understanding-to-ink ratio.

  • 07:31

    MARTIN KRZYWINSKI [continued]: Now, not all the data is going to be relevant. If you can figure out which is, that's the holy grail. In fact, if you knew this, you might as well compute the data and bypass the visualization altogether. But nevertheless, I encourage you to think about this idea of actionable data-to-ink ratio.

  • 07:52

    MARTIN KRZYWINSKI [continued]: If I turn the bar plot into a scatterplot, I'm more efficiently encoding the values. Using position instead of length, I'm using less ink. However, now the proportion of ink used for the data and other elements is more equal, and the data has a weaker presence on the page. Here by presence, I don't mean visibility or legibility-- everything you draw should always satisfy those--

  • 08:15

    MARTIN KRZYWINSKI [continued]: but visual salience. The scatter points don't stand out as much as the bars at first sight. Telling absolute differences between values in the scatterplot is easy. There's only one place to look-- the point. In a bar plot, each bar has two ends, obviously, which may seem redundant, because the left end merely anchors the bar to the axis.

  • 08:35

    MARTIN KRZYWINSKI [continued]: However, this anchoring or alignment-- another design strategy-- powerfully establishes a reference point that makes relative judgments accurate. I just talked about the usefulness of aligning the bars. The same goes for the numbers. Instead of putting the numbers at the end of the bar, they're better positioned beside your labels and, as in the table, right aligned.
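
The layout just described, labels first, right-aligned numbers beside them, then bars that all start from a common edge, can be mocked up in plain text. This is a minimal sketch with made-up labels and values; it assumes all values are positive and uses `#` characters as a stand-in for drawn bars.

```python
def table_with_bars(rows, bar_width=20):
    """Return text lines: left-aligned label, right-aligned value, anchored bar."""
    label_w = max(len(label) for label, _ in rows)
    value_w = max(len(str(value)) for _, value in rows)
    largest = max(value for _, value in rows)
    lines = []
    for label, value in rows:
        # Bar length is proportional to the value; every bar starts in the
        # same column, so lengths can be compared against a shared anchor.
        bar = "#" * round(bar_width * value / largest)
        lines.append(f"{label:<{label_w}}  {value:>{value_w}}  {bar}")
    return lines


# Hypothetical values, not the numbers from the talk.
lines = table_with_bars([("alpha", 120), ("beta", 60), ("gamma", 12)])
print("\n".join(lines))
```

Right-aligning the numbers lines up their digits by place value, so the column can be scanned like a table while the bars carry the pattern.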

  • 08:58

    MARTIN KRZYWINSKI [continued]: Notice that now, the graphic looks both like a table on the left and a plot on the right. This kind of ancillary tabulation is very helpful in showing data. For example, you might encode the data with some graphical methods in the center of the figure and then around it, in rows and columns, place aggregates and statistics that precisely answer

  • 09:18

    MARTIN KRZYWINSKI [continued]: questions about magnitudes. The idea is that the graphic component should give you a sense of patterns, and then the numerical counts act as a lookup for the underlying data. Always be as concise as possible, verbally or graphically. In this example, the vertical axis ticks and tick labels

  • 09:39

    MARTIN KRZYWINSKI [continued]: are unnecessary. We almost have as many tick labels as values. By removing these elements, the bars aren't obstructed. Here, it's important to balance the width of the bar with the distance between them, another design strategy we'll come to later. You can make the graphic more compact by placing the values inside the bars.

  • 10:00

    MARTIN KRZYWINSKI [continued]: Here I've aligned the numbers to the left against the start of the bar to avoid awkward space between the start of the bar and the number. Let's add more data to our table. When I add another column-- now we have sample a and sample b-- patterns become much harder to assess. There are many more possibilities and many

  • 10:21

    MARTIN KRZYWINSKI [continued]: more numbers to compare. Like before, we can annotate the table with graphical elements. In this case, I think the horizontal axes and their labels are needed because they establish the vertical direction as the axis. Without them, this would be less obvious. And remember, it's important to make what you're doing as obvious and clear to your audience

  • 10:42

    MARTIN KRZYWINSKI [continued]: as possible, even if it costs you a little bit of ink. If you have enough data points, eventually you'll run into some overlap, like the number of duplications and SNPs in this example. That occlusion is a real problem and one of the reasons why 3D plotting is so difficult and often ineffective. Data hides behind data, and making everything visible

  • 11:05

    MARTIN KRZYWINSKI [continued]: while rationally placed may actually be impossible. In this example, it's reasonable to jiggle the SNP point a little to the right to avoid it overlapping with the duplication point. There are numerous studies that explore how accurately we judge things like positions, lengths, areas, angles, and so on.

  • 11:27

    MARTIN KRZYWINSKI [continued]: We're good at lengths if they're aligned. We're terrible at areas, largely because we use length as a proxy. Area grows much faster with length than we intuitively interpret, so we tend to underestimate areas all the time. Don't even get me started about volumes. One such study looked at the so-called 45-degree banking

  • 11:49

    MARTIN KRZYWINSKI [continued]: and suggested that there is some benefit to setting the aspect ratio of a plot so that the average angle of the elements in it, such as our lines between the a and b samples, is close to 45 degrees. There is mild controversy about the methodological limitations of this study, but it's thought to be a good place to start.
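
A rough sketch of how such an aspect ratio could be chosen is given below. It uses the median absolute slope as a simple stand-in for the "average angle" mentioned above, which is an assumption rather than the exact procedure of the study; the function name and the sample values are hypothetical.

```python
from statistics import median


def bank_to_45(segments):
    """Return a height/width ratio that draws the median segment at 45 degrees.

    segments: list of ((x0, y0), (x1, y1)) pairs in data units.
    """
    xs = [x for seg in segments for x, _ in seg]
    ys = [y for seg in segments for _, y in seg]
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    slopes = [abs((y1 - y0) / (x1 - x0))
              for (x0, y0), (x1, y1) in segments if x1 != x0]
    if x_range == 0 or not slopes or median(slopes) == 0:
        raise ValueError("need non-vertical segments with some slope")
    # On screen a data slope s is drawn as s * (height / y_range) / (width / x_range).
    # Setting that to 1 for the median slope gives the ratio below.
    return y_range / (x_range * median(slopes))


# Hypothetical lines joining sample a (x = 0) to sample b (x = 1).
segments = [((0, 10), (1, 30)), ((0, 20), (1, 25)), ((0, 5), (1, 45))]
aspect = bank_to_45(segments)
```

With these made-up values the plot would be drawn twice as tall as it is wide, which puts the median line at 45 degrees.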

  • 12:10

    MARTIN KRZYWINSKI [continued]: And if you deviate from it, have your reasons and know them well. We've started with some simple examples. There's a good reason for this. Always appreciate and use to your advantage fundamental principles of data visualization, and our samples and examples embody these. Once your data size grows, a popular technique

  • 12:32

    MARTIN KRZYWINSKI [continued]: in visualization is called small multiples. Here you break your data down into a large number of smaller sets and represent each with the same encoding. So you might have a matrix of scatter plots or bar plots. Each of these individual small plots, like our simple examples here, must be handled with care.
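
The bookkeeping behind small multiples can be sketched without any plotting library: split the records by group, arrange the panels in a grid, and give every panel the same axis limits so that the encoding really is identical across panels. The record layout, group names and helper names below are assumptions made for illustration.

```python
import math
from collections import defaultdict


def split_into_panels(records):
    """Group (group, x, y) records into one list of points per panel."""
    panels = defaultdict(list)
    for group, x, y in records:
        panels[group].append((x, y))
    return dict(panels)


def grid_shape(n_panels, n_cols):
    """Rows and columns needed to lay out n_panels in a matrix."""
    return math.ceil(n_panels / n_cols), n_cols


def shared_limits(panels):
    """Axis limits common to all panels, so positions compare across them."""
    xs = [x for points in panels.values() for x, _ in points]
    ys = [y for points in panels.values() for _, y in points]
    return (min(xs), max(xs)), (min(ys), max(ys))


# Hypothetical records: (group, x, y).
records = [
    ("s1", 0, 2.0), ("s1", 1, 3.5),
    ("s2", 0, 1.0), ("s2", 1, 4.0),
    ("s3", 0, 2.5), ("s3", 1, 2.0),
    ("s4", 0, 0.5), ("s4", 1, 1.5),
    ("s5", 0, 3.0), ("s5", 1, 5.0),
]
panels = split_into_panels(records)
n_rows, n_cols = grid_shape(len(panels), n_cols=3)
x_limits, y_limits = shared_limits(panels)
```

Sharing the limits is a design choice: it keeps positions comparable between panels at the cost of some wasted space in panels with a narrow range.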

  • 12:55

    MARTIN KRZYWINSKI [continued]: So at no point do you ever reject these principles. I don't care how big your data set is. Sure, they might be more nuanced or interrelated if you're showing a lot of stuff on the page. But fundamentally, you're still thinking about the same thing. Am I using ink responsibly? Am I drawing shapes whose position and size

  • 13:15

    MARTIN KRZYWINSKI [continued]: can be accurately judged? Am I making comparisons easy to make? Am I visually emphasizing the things that are relevant, and are things clear and beyond doubt? [MUSIC PLAYING]

Video Info

Series Name: Essentials of Data Visualization

Publisher: University of Sydney & Canada's Michael Smith Genome Sciences Centre

Publication Year: 2017

Video Type: Tutorial

Methods: Data visualization

Keywords: communication aids; encoding; ink; visual communication

Segment Info

Segment Num.: 1

Persons Discussed:

Events Discussed:



Martin Krzywinski explains the fundamental principles of data visualization using simple examples. As data grows, Krzywinski introduces the small multiples technique to achieve the best visual results.


Data Encoding
