Book Review: Data Visualization: a successful design process by Andy Kirk

My next review is of Andy Kirk’s book Data Visualization: a successful design process. Those of you on Twitter might know him as @visualisingdata, where you can follow his progress around the world as he delivers training. He also blogs at Visualising Data.

Previously in this area, I’ve read Tufte’s book The Visual Display of Quantitative Information and Nathan Yau’s Visualize This. Tufte’s book is based around a theory of effective visualisation, whilst Visualize This is a more practical guide featuring detailed code examples. Kirk’s book fits between the two: it contains some material on the more theoretical aspects of effective visualisation as well as an annotated list of software tools, but the majority of the book covers the end-to-end design process.

Data Visualization introduced me to Anscombe’s Quartet. The Quartet is four small datasets, eleven (x,y) coordinate pairs in each. It is constructed so that the common statistical properties of each set (e.g. the means of x and y, their standard deviations, and the linear regression coefficients) are identical, yet when plotted the four sets look very different. The numbers are shown in the table below.

Anscombe Quartet Data

Plotted they look like this:

Anscombe’s Quartet

Aside from set 4, the numbers look unexceptional. However, the plots look strikingly different. We can easily classify their differences visually, despite the sets having the same gross statistical properties. This highlights the power of visualisation. As a scientist, I am constantly plotting the data I’m working on to see what is going on and as a sense check: eyeballing columns of numbers simply doesn’t work. Kirk notes that the design criteria for such exploratory visualisations are quite different from those for visualisations highlighting particular aspects of a dataset, for more abstract “data art” presentations, or for interactive visualisations prepared for others to use.
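
If you want to check the claim for yourself, a minimal Python sketch will do it (assuming pandas, SciPy and seaborn are installed; seaborn can fetch a copy of the Quartet for you):

```python
# Compute the summary statistics for each of the four Anscombe sets.
import seaborn as sns
from scipy import stats

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, group in anscombe.groupby("dataset"):
    fit = stats.linregress(group.x, group.y)
    print(f"set {name}: mean x={group.x.mean():.2f}, mean y={group.y.mean():.2f}, "
          f"var x={group.x.var():.2f}, var y={group.y.var():.2f}, "
          f"y = {fit.slope:.2f}x + {fit.intercept:.2f}, r^2 = {fit.rvalue**2:.2f}")
```

All four sets print essentially the same summary line, which is the point: the summary statistics can’t tell the sets apart, but the plots can.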

In contrast to the books by Tufte and Yau, this book is much more about how to do data visualisation as a job. It talks pragmatically about taking briefs from clients and dealing with their demands. I suspect much of this would apply to any design work.

I liked Kirk’s “Eight Hats of data visualisation design” metaphor, which names the skills a visualiser requires: Initiator, Data Scientist, Journalist, Computer Scientist, Designer, Cognitive Scientist, Communicator and Project Manager. In part, this covers what you will require to do data visualisation yourself, but it also gives you an idea of whom you might turn to for help: someone with the right hat.

The book is scattered with examples of interesting visualisations, alongside a comprehensive taxonomy of chart types. Unsurprisingly, the chart types are classified in much the same way as statistical methods: in terms of the variable categories to be displayed (i.e. continuous, categorical and subdivisions thereof). There is a temptation here though: I now want to make a Sankey diagram… even if my data doesn’t require it!

In terms of visualisation creation tools, there are no real surprises. Kirk cites Excel first, but this is reasonable: it’s powerful, ubiquitous, easy to use and produces decent results as long as you don’t blindly accept defaults or get tempted into using 3D pie charts. He also mentions the use of Adobe Illustrator or Inkscape to tidy up charts generated in more analysis-oriented packages such as R. With a programming background, the temptation is to fix problems with layout and design programmatically, which can be immensely difficult. Listed under programming environments is the D3 JavaScript library, a system I’m interested in using, having had some fun with Protovis, a D3 predecessor.

Data Visualization works very well as an ebook. The figures are in colour (unlike the printed book) and references are hyperlinked from the text. It’s quite a slim volume which I suspect complements Andy Kirk’s in-person courses well.

Fine set of graphs at the Office for National Statistics

It’s difficult to keep up. I’ve just noticed a set of interesting interactive graphs over at the Office for National Statistics (UK).

If the world is about people, then the most fundamental dataset of all must be: Where are the people? And: What stage of life are they living through?

A Population Pyramid is a straightforward way to visualize the data, like so:
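
If you’d rather roll your own pyramid than use the ONS widget, a rough matplotlib sketch of the idea looks something like this (the counts below are invented purely for illustration, not real ONS figures):

```python
# Minimal population pyramid: male counts drawn to the left, female to the right.
import matplotlib.pyplot as plt
import numpy as np

age_bands = ["0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"]
males = np.array([310, 330, 360, 340, 350, 330, 280, 200])    # thousands, made up
females = np.array([300, 320, 350, 345, 355, 340, 300, 260])  # thousands, made up

y = np.arange(len(age_bands))
fig, ax = plt.subplots()
ax.barh(y, -males, color="steelblue", label="Male")     # negative widths extend left
ax.barh(y, females, color="indianred", label="Female")
ax.set_yticks(y)
ax.set_yticklabels(age_bands)
ax.set_xlabel("Population (thousands)")
ax.legend()
plt.show()
```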

A pyramid like this is sufficient for determining what needs to be supplied (e.g. more children means more schools and toy-shops), but it doesn’t explain why.

The “why?” and “what’s going on?” questions are much more interesting, but are pretty much guesswork because they refer to layers in the data that you cannot see. For example, the number of people in East Devon of a particular age is the sum of those who have moved into the area at various times, minus those who have moved away (temporarily or permanently), plus those who were already there and have grown older but not yet died. For any bulge, you don’t know which layer it belongs to.

In this 2015 population pyramid there are bulges at ages 28 and 50, a pronounced spike at 68, and dips at 14 and 38. In terms of birth years, the bulges correspond to 1987 and 1965, the spike to 1947, and the dips to 2001 and 1977.

You can pretend they correspond to recessions, economic boom times and second wave feminism, but the 1947 post-war spike when a mass of men-folk were demobilized from the military is a pretty clean signal.

What makes this data presentation especially lovely is that it is localized, so you can see the population pyramid per city:

Cambridge, as everyone knows, is a university town, which explains the persistent spike at age 20.

And, while it looks like there is gender equality for 20 year old university students, there is a pretty hefty male lump up to the age of 30, possibly corresponding to folks doing higher degrees. Is this because fewer men are leaving town at the appropriate age to become productive members of society, or is there an influx of foreign grad students from places where there is less gender equality? The dataset of student origins and enrollments would give you the story.

As to the pyramid on the right hand side, I have no idea what is going on in Camden to account for that bulge in 30 year olds. What is obvious, though, is that the bulge in infants must be related. In fact, almost all the children between the ages of 0 and 16 years will have corresponding parents higher up the same pyramid. Also, there is likely to be a pairwise cross-gender correspondence between individuals of the same generation living together.

These internal links, external data connections, sub-cohorts and new questions that arise the more you look mean that it is impossible to create a single all-purpose visualization application that could serve them all. We can wonder whether an interface which worked via JavaScript-generated SQL calls (rather than Flash and server-side queries) would have enabled someone with the right skills to roll their own queries and, for example, immediately find out which city and age group has the greatest gender disparity, and whether all spikes at the 20-year-old age bracket can be accounted for by universities.

For more, see An overview of ONS’s population statistics.

As it is, someone is going to have to download/scrape, parse and load at least one year of source data into a data hub of their choice in order to query this (we’ve started on 2010’s figures here on ScraperWiki – take a look). Once that’s done, you’d be able to sort the cities by the greatest ratio between number of 20 year olds and number of 16 year olds, because that’s a good signal of student influx.
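
Once the data is in a queryable form, that sort of question becomes a few lines of code. Here’s a sketch in pandas, assuming the scraped table has one row per (area, age) with a population count; the file and column names are guesses, not the actual ScraperWiki schema:

```python
# Rank areas by the ratio of 20 year olds to 16 year olds, a rough signal of student influx.
import pandas as pd

pop = pd.read_csv("ons_population_2010.csv")  # hypothetical export of the scraped figures

by_area = pop.pivot_table(index="area", columns="age", values="count", aggfunc="sum")
student_influx = (by_area[20] / by_area[16]).sort_values(ascending=False)
print(student_influx.head(10))
```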

I don’t have time to get onto the population projection models, where it really gets interesting. There you have all the clever calculations based on guesstimates of migration, mortality and fertility.
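
The bones of such a model are simple enough to sketch, though. A toy cohort-component projection in Python might look like this (every rate below is an invented placeholder, not an ONS assumption):

```python
# Toy cohort-component projection: age everyone a year, apply survival rates,
# add net migration, and create a new cohort of births from fertility rates.
import numpy as np

def project_one_year(pop, survival, net_migration, fertility):
    """pop[a] is the number of people aged a; returns next year's population by age."""
    nxt = np.zeros_like(pop)
    nxt[1:] = pop[:-1] * survival[:-1]   # survivors move up one year of age
    nxt += net_migration                 # net in/out migration per age
    nxt[0] = (pop * fertility).sum()     # births from age-specific fertility rates
    return nxt

ages = np.arange(100)
pop = np.full(100, 1000.0)                                   # flat starting population
survival = np.linspace(0.999, 0.90, 100)                     # made-up survival by age
net_migration = np.zeros(100)
fertility = np.where((ages >= 20) & (ages < 40), 0.05, 0.0)  # made-up fertility by age

for _ in range(10):
    pop = project_one_year(pop, survival, net_migration, fertility)
print(pop.sum())
```

Tinkering with those input arrays is the point: small changes to the fertility or migration assumptions propagate through every later year.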

What I would really like to see are these calculations done live and interactively, as well as combined with economic data. Is the state pension system going to go bankrupt because of the “baby boomers”? Who knows? I know someone who doesn’t know: someone whose opinion does not rely (even indirectly) on something approaching a dynamic data calculation. I mean, if the difference between solvency and bankruptcy is within the margin of error in the estimate of the fertility rate, or 0.2% in the tax base, then that’s not what I’d call bankrupt. You can only find this out by tinkering with the inputs with an element of curiosity.

Privatized pensions ought to be put into the model as well, to give them the macro-economic context that no pension adviser I’ve ever known seems capable of understanding. I mean, it’s evident that the stock market (in which private pensions invest) does happen to yield a finite quantity of profit each year. Ergo it can support a finite number of pension plans. So a national policy which demands more such pension plans than this finite number is inevitably going to leave people hungry.

Always keep in mind the long term vision of data and governance. In the future it will all come together like transport planning, or the procurement of adequate rocket fuel to launch a satellite into orbit; a matter of measurements and predictable consequences. Then governance will be a science, like chemistry, or the prediction of earthquakes.

But don’t forget: we can’t do anything without first getting the raw data into a usable format. Dave McKee’s started on 2010’s data here … fancy helping out?

Take a Look at This

Here at ScraperWiki we’re not just about scraping but also about viewing the work you’ve scraped. In this case, ‘The Julian’ didn’t even scrape! He got the data from The Guardian Data Blog, which posted an animated history of UK aid 1960-2009, mapped; Spanish design house Bestiario produced it using OECD aid data. All ‘The Julian’ had to do was download the data and get playing with Protovis to see what he could conjure up. And here it is. The Y-axis is £M, so you can see the changing total amount but also the change in the proportions of that amount each country receives.

The original looks like so, with time bar and country click isolation.

The main part of the information is the treemap, in which you can drill down to the individual countries. ‘The Julian’ turned this on its head by putting all the data onto the graph (I think it looks much prettier!). Here, the amount is the thickness of the line, so you can see how it adds up to the totals over time and which country gets the largest proportion at each moment in time. Here is the UK aid flow:

Check out which countries are which line on the original. What this type of visualization allows over the Bestiario one is to compare the aid being given by different countries (rather than to them). The view at the top is aid given by both the UK and France, the French aid appearing in red. So the visual leans towards finding out about the givers and not just the receivers, as you can see quite easily from which colour is more predominant. This more geographical map shows how data can be spun along different angles.
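
If you wanted to reproduce the stacked “thickness is the amount” idea outside Protovis, a rough matplotlib analogue looks something like this (the file and column names are guesses at a tidy export of the OECD figures, not the actual schema):

```python
# Stacked area chart of one donor's aid, split by recipient country over time.
import pandas as pd
import matplotlib.pyplot as plt

aid = pd.read_csv("oecd_aid.csv")  # hypothetical tidy export: year, donor, recipient, amount
uk = aid[aid.donor == "United Kingdom"]

by_recipient = uk.pivot_table(index="year", columns="recipient",
                              values="amount", aggfunc="sum").fillna(0)
plt.stackplot(by_recipient.index, by_recipient.T.values)  # each band's thickness is an amount
plt.ylabel("Aid (£M)")
plt.show()
```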

Here at ScraperWiki we’re all about finding the different angles to the data!
