Channel: Infragistics Community

Jitter - Another Solution to Overplotting


Back when I discussed tricks for coping with overplotting I omitted (at least) one popular "solution": jittering the data. Jittering is the process of adding random noise to data points so that, when they are plotted, they are less likely to occupy the same space. It is most commonly used when the data being plotted is discrete. In such cases, in the absence of jitter, it's not just that the edges of data-point markers overlap; the markers actually sit perfectly on top of each other. No reduction in the size of the markers can remove this problem.

While jitter may be added to points in, for example, a box plot, it's most frequently used in 2D scatter plots. Chart A in the graphic below shows a contrived example dataset with no attempt to deal with overplotting. Both the x and y variables take only integer values. The dataset actually contains 2000 points, but only 780 of the (x, y) pairs are unique. Chart B shows the same data but with the addition of jitter. Specifically, for each point a random number drawn from the continuous uniform distribution between -0.5 and 0.5 is added to the x coordinate, and another random number drawn from the same distribution is added to the y coordinate. As noted previously, another potential solution is to make points semi-transparent (chart C) and, of course, these two options can be combined (chart D).
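As a rough illustration of the jitter used for chart B, here is a minimal Python sketch using only the standard library. The function name `jitter`, its `extent` and `seed` parameters, and the toy data are assumptions made for this example; they are not taken from the code that produced the charts.

```python
import random

def jitter(points, extent=1.0, seed=None):
    """Return a copy of points with uniform noise added to each coordinate.

    extent is the full width of the uniform distribution, so the default
    of 1.0 draws offsets between -0.5 and 0.5, as described in the text.
    """
    rng = random.Random(seed)
    half = extent / 2.0
    return [
        (x + rng.uniform(-half, half), y + rng.uniform(-half, half))
        for x, y in points
    ]

# Hypothetical integer-valued data containing exact duplicates.
points = [(1, 2), (1, 2), (3, 4), (3, 4), (3, 4)]
jittered = jitter(points, seed=0)
```

Each coordinate gets its own independent draw, so two points that started at the same integer location almost certainly end up in different places, while never moving more than half a unit in either direction.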

What conclusions are there to be drawn from these plots? Because there are only 780 unique pairs of values, 1220 (61%) of the data points are completely obscured in chart A. With jitter alone, each point gets its own position, but the dots used to represent the points still overlap to some degree. Making the points translucent certainly helps show that there are more than the 780 points visible, but it's not always an acceptable option.
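The arithmetic behind that figure is simply 2000 - 780 = 1220 hidden points, and 1220 / 2000 = 61%. The same check can be run on any dataset of coordinate pairs. The sketch below is illustrative only: `hidden_points` is a hypothetical helper and the sample data is made up.

```python
def hidden_points(points):
    """Count points that are drawn exactly on top of another point."""
    total = len(points)
    unique = len(set(points))   # distinct (x, y) locations
    hidden = total - unique     # points with nowhere of their own to show
    return unique, hidden, hidden / total

# Toy data: 5 points at 2 distinct locations.
sample = [(1, 2), (1, 2), (3, 4), (3, 4), (3, 4)]
unique, hidden, fraction = hidden_points(sample)
print(unique, hidden, fraction)  # 2 3 0.6
```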

Because translucency and other alternatives (2D histograms, for example) aren't always acceptable solutions, it is worth thinking about what other issues can arise when jittering data. One key concern may be that of integrity. If we move the points away from their "true" positions, are we deliberately distorting the data? While chart A above may seem like the more "correct" way to plot the data, chart B is better at showing approximately where most of the data is. In chart A, all points are in exactly the right place but they are not all equally representative of the distribution of data; one visible dot can mark the position of anything from 1 to 12 data points. Without translucency, color, or some other encoding, there's no way of knowing which is which.
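To see how uneven that representation is in a given dataset, one can tabulate how many data points sit behind each visible dot. This is a small sketch under the assumption that the data is a list of coordinate tuples; `stack_depths` is a hypothetical name.

```python
from collections import Counter

def stack_depths(points):
    """Map each stack depth to the number of plotted locations with that depth."""
    per_location = Counter(points)          # location -> number of points there
    return Counter(per_location.values())   # depth -> number of locations

# Toy data: one location holds 2 points, one holds 3, one holds a single point.
sample = [(1, 2), (1, 2), (3, 4), (3, 4), (3, 4), (5, 6)]
depths = stack_depths(sample)
deepest = max(depths)  # the largest number of points hidden behind one dot
```

A wide spread of depths is a sign that the unjittered plot gives a misleading impression of where the data is concentrated.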

Despite this, I'd like to point out once again that you should consider your audience. Will they be confused by non-integer values being plotted for something they know can only be integer-valued, for example? What about points at the extremes that, in one dimension, are no longer even in the permissible range?
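One possible way to keep jittered values inside the permissible range is to draw each offset from an interval that is cut off at the limits. This is not something the charts in this article do; it is only a sketch, and `bounded_jitter` and its parameters are hypothetical names chosen for the example.

```python
import random

def bounded_jitter(value, low, high, half_width=0.5, rng=random):
    """Jitter value uniformly, but never outside the range [low, high]."""
    lo = max(low, value - half_width)
    hi = min(high, value + half_width)
    return rng.uniform(lo, hi)

rng = random.Random(0)
values = [0, 0, 128, 255, 255]  # e.g. 8-bit channel values, including both extremes
bounded = [bounded_jitter(v, 0, 255, rng=rng) for v in values]
```

The trade-off is that points at the extremes are pushed inward on average, which is a small distortion of its own, so this is a judgment call rather than a clear improvement.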

I think it's also worth studying some real-world data that will look familiar if you've been reading my other articles here. The GIF below shows a 2D histogram of RGB image data from an 8-bit PNG image (precise details and the image from which it is extracted can be found here). As the animation progresses, the width of the uniform distribution from which jitter values are drawn (the "Jitter Extent") increases in both dimensions. Here the use of jitter does allow us to see more detail about which of the value pairs occur most frequently. Because the data remains in square blocks, there is still the sense of there being only a modest number of discrete values in the underlying data.

In that previous article I also looked at the distribution of blue and green values for low- and high-quality JPEGs of the same initial image. The GIF below shows the effect of adding jitter to these. Aside from the points spilling out beyond the confines of the axes (which looks weird if nothing else), the clear differences between the two scatter plots disappear as the Jitter Factor increases. This is highly undesirable.

As with many things in data visualization, there's no clear answer to the simple question: "Should I jitter data points?". Jitter can help clarify where the bulk of the data lies but it can also distort important patterns. Where appropriate I prefer to use translucency, but sometimes — e.g. when the color of points already tells us something important — that isn't an option.


