The Numbers Matter

Last week I blogged about Google’s public data visualizations, and suggested that a lack of transparency about the data prevented those visualizations from being used critically. Daniel Tunkelang pointed out that the source of the data is linked to the visualization. And indeed it is. So what’s the problem?

Let’s consider Google’s first example, the unemployment rate in Santa Clara county. When you do the search, it provides a nice graph with data from 1990 to 2009 for states and counties in the US, and for the US as a whole. Great!

bls-unemployment-sc-google

More Info

Where did the data come from? Well, there is a link, right at the top of the visualization, labeled “More info »”. In this case, the link takes you to the Bureau of Labor Statistics web site. Right away, there is a problem. The link is not to the data, but to a general discussion of methodology. The discussion is important for understanding the data, but it is not the data itself. It does, however, contain a section called “Where can people find the data?” that provides several potentially useful links:

Each month, summary statistics on unemployment and employment are published in a news release titled The Employment Situation.

Detailed information also is published in tables online and in a periodical called Employment and Earnings.

On an irregular basis, special labor force topics are addressed in articles published in the Monthly Labor Review, in a series of briefs called Issues in Labor Statistics, in a variety of special reports, and in other BLS publications.

But which one did Google use? We don’t know. The “tables online” link above takes you to yet another page with more links to other pages. Here you find links to lots of statistics, including text- and PDF-formatted files (e.g. “Employment status of the civilian noninstitutional population, 1940 to date”), but there is no mention of a breakdown by state or by county.

Try again: Data source

If, on the other hand, you had clicked on the “Data source: U.S. Bureau of Labor Statistics” link at the bottom of the chart, you would wind up at the Overview of BLS Statistics on Unemployment page. This page, too, has many links on it, the second of which is helpfully labeled “State and Local Unemployment Rates.” That page starts off with a short introduction, followed by an announcement:

bls-announcement

So the official data is not fixed, but may change over time as errors are found and corrected. But is this the data that Google is displaying? Without a more persistent reference, we cannot know for sure what we’re looking at.

But back to our specific example: the column on the right has links associated with each state. Clicking on California produces a nicely laid out page that allows me to select date ranges for the charts, shows me a variety of related variables (labor force, employment, unemployment, and unemployment rate), and backs this up with tabular data. I can also filter the data by month, look at net and percent changes over 1-, 3-, and 12-month intervals, and have the system generate text or HTML data. So this is quite useful for making serious use of the data, but there is no breakdown of data by county.
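For readers who want to reproduce the interval comparisons on their own copy of the data, here is a minimal Python sketch of how net and percent changes over a given number of months could be computed from a monthly series. The function name `interval_changes` and the sample values are illustrative placeholders, not BLS code or BLS figures. Note that for a rate series the net change is in percentage points, while the percent change is relative to the earlier value.

```python
from typing import List, Optional, Tuple


def interval_changes(
    series: List[float], months: int
) -> List[Optional[Tuple[float, Optional[float]]]]:
    """Compare each monthly value with the value `months` earlier.

    Returns one entry per month: None when there is no earlier value to
    compare with, otherwise (net change, percent change). The percent
    change is None when the earlier value is zero.
    """
    if months < 1:
        raise ValueError("months must be at least 1")
    results = []
    for i, value in enumerate(series):
        if i < months:
            results.append(None)
            continue
        earlier = series[i - months]
        net = value - earlier
        percent = net / earlier * 100.0 if earlier != 0 else None
        results.append((net, percent))
    return results


# Placeholder monthly values, NOT real BLS figures.
sample = [4.0, 5.0, 6.0, 8.0]
for span in (1, 3):
    print(span, interval_changes(sample, span))
```

The same function covers the 12-month case once the series holds at least 13 months of values.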

Lower on the same page, there are more tools:

bls-databases

Clicking on the green “one-screen data search” button launches a Java applet that allows me to select a state, find one or more areas, select whether the data is adjusted seasonally, and then generate a web page that is similar to the state-wide one I described above.

So finally we have the data, and can display the same thing that Google generated for us in the very beginning:

bls-unemployment-sc2

What’s the point?

So have we just taken all these steps to return to the beginning? Not exactly: although we have graphed the same data, we have also found the data itself, and a means for selecting and manipulating it. This is important for several reasons: by capturing the data, you can perform analyses on it, you can generate different statistics on it, and you can later check if the data has changed. In short, you can trust the data more, and you can use it in ways that make sense to you.
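As a rough illustration of the last point, checking later whether the data has changed, here is a minimal Python sketch that fingerprints a saved snapshot and lists the rows that differ between two snapshots. It assumes the data has been saved as CSV text with the period in the first column; the function names and the sample rows are hypothetical placeholders, not an actual BLS export format or real BLS figures.

```python
import csv
import hashlib
import io
from typing import List


def normalized_rows(csv_text: str) -> List[List[str]]:
    """Parse CSV text, trimming whitespace and skipping blank lines."""
    reader = csv.reader(io.StringIO(csv_text))
    return [[cell.strip() for cell in row] for row in reader if row]


def fingerprint(csv_text: str) -> str:
    """SHA-256 digest of the normalized rows of one snapshot."""
    canonical = "\n".join(",".join(row) for row in normalized_rows(csv_text))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_periods(old_csv: str, new_csv: str) -> List[str]:
    """Periods (first column) that were added, removed, or revised."""
    old = {row[0]: row[1:] for row in normalized_rows(old_csv)}
    new = {row[0]: row[1:] for row in normalized_rows(new_csv)}
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Placeholder snapshots, NOT real BLS figures.
earlier_snapshot = "period,rate\n2009-01,1.0\n2009-02,2.0\n"
later_snapshot = "period,rate\n2009-01,1.0\n2009-02,2.5\n"

if fingerprint(earlier_snapshot) != fingerprint(later_snapshot):
    print("Data changed for:", changed_periods(earlier_snapshot, later_snapshot))
```

Recording the digest alongside the download date gives you the kind of persistent reference that a bare link to a live page cannot.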

So Google does us a great service by locating the data in such public data sources (thereby making the data more socially useful) and by providing an elegant visualization interface, but it also does us a disservice by not revealing the data it actually displayed. Contrast this with the Many Eyes visualization site, which requires the raw data to be published along with its visualizations. Google should provide the data it is using through its Google Docs interface, and should also provide more direct links to the original sources. Publishing data is an important standard of academic rigor that fosters trust and intellectual accountability. It would be good to see that from a company whose goal is to organize the world’s information.

2 Comments

  1. Point taken. I am curious if Wolfram Alpha is or will be more rigorous when it comes to data provenance. That said, I suspect that only academic researchers and STM (Scientific, Technical, and Medical) professionals will pay attention to this issue. Google historically plays for general consumers, who will certainly be oblivious to the subtleties of data provenance, at least if it’s mostly right, most of the time.

  2. My concern is that the media also fall into the “general consumers” category, and may base their reporting on the easy-to-get-at Google data rather than the possibly more up-to-date or more comprehensive primary sources. Anyway, since Google has the data, why not publish it?
