600 Lines of Code, 748 Revisions = A Load of Bubbles
https://blog.scraperwiki.com/2011/03/600-lines-of-code-748-revisions-a-load-of-bubbles/
Tue, 08 Mar 2011 16:22:45 +0000

When Channel 4’s Dispatches came across 1,100 pages of PDFs, known as the National Asset Register, they knew they had a problem on their hands. All that data, caged in a pixelated prison.

So ScraperWiki let loose ‘The Julian’. What ‘The Stig’ is to Top Gear, ‘The Julian’ is to ScraperWiki. That and our CTO.

‘The Julian’ did not like the PDFs. After scraping 10 pages of Defence assets, he got angry. The register may as well have been glued together by trolls. The five-year-old data, copied and pasted by Luddites from the previous Government, was worse than useless.

So the ScraperWiki team set about rebuilding the register. Using good old-fashioned manpower (i.e. me) and a PDF cropper, we built a database of names, values and hierarchies that link directly to the PDFs.
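To give a rough idea of the shape of that database, here is a minimal sketch using Python’s built-in sqlite3. The table and column names are made up for illustration (the real ScraperWiki datastore looked different), but the principle is the same: every asset row keeps a pointer back to the PDF page it was cropped from.

    import sqlite3

    # Hypothetical cut-down schema: each asset keeps its name, value, its
    # parent in the hierarchy, and the PDF page it was cropped from.
    conn = sqlite3.connect("asset_register.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS assets (
            id        INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            value     INTEGER,            -- pounds
            parent_id INTEGER REFERENCES assets(id),
            pdf_url   TEXT,
            pdf_page  INTEGER
        )
    """)
    conn.execute(
        "INSERT INTO assets (name, value, parent_id, pdf_url, pdf_page) "
        "VALUES (?, ?, ?, ?, ?)",
        ("Example asset", 12000000, None, "http://example.gov.uk/nar.pdf", 42),
    )
    conn.commit()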

Then Julian set about coding; 600 lines and 748 revisions! He made the bubbles the size of the asset values and got them to orbit around their various parent bubbles. This required such functions as ‘MakeOtherBranchAggregationsRecurse(cluster)’.
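The gist of that recursion, sketched here in Python on a toy hierarchy: each branch’s bubble gets the sum of everything beneath it, and bubbles are sized by area so the radius grows with the square root of the value. The function name above is Julian’s real one; everything below is an illustration, not his actual code.

    import math

    # A toy version of the hierarchy: each node has a name, possibly a value
    # of its own, and child nodes. The real register data was messier.
    tree = {
        "name": "Defence",
        "children": [
            {"name": "Ships", "value": 12000000000, "children": []},
            {"name": "Land and buildings", "value": 3500000000, "children": []},
        ],
    }

    def aggregate(node):
        """Recursively sum a node's own value plus everything beneath it."""
        total = node.get("value", 0)
        for child in node.get("children", []):
            total += aggregate(child)
        node["total"] = total
        return total

    def bubble_radius(value, scale=1e-9):
        """Size bubbles by area, so radius grows with the square root of value."""
        return math.sqrt(value * scale)

    aggregate(tree)
    print(tree["total"], bubble_radius(tree["total"]))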

This scared our designer Zarino a little, but he nevertheless made it much more user-friendly. This is where ScraperWiki’s powers of viewing live edits, chatting and collaborating became useful. The result was rounds of debugging interspersed with a healthy dose of cursing.

We then tried using it ourselves. We wanted the data to carry its provenance back to the source. We wanted to give users the ability to explore the data. We wanted them to be able to see the bubbles that were too small to appear. We prodded ‘The Julian’.

He hard-coded the smaller bubbles into a ‘More…’ bubble orbit. This made the whole thing a lot clearer for the piece on Channel 4 News, and changed the navigation from jumping between orbits to drilling down and finding out which assets are worth a similar amount.
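A minimal sketch of that folding step, again in Python: children whose value falls below a threshold get swept into a single synthetic ‘More…’ bubble. The 2% threshold here is invented; the real code hard-coded its own cut-off.

    # Fold children whose value is below a fraction of the whole into a
    # single synthetic 'More…' bubble. The 2% threshold is invented.
    def fold_small_children(children, threshold_fraction=0.02):
        total = sum(c["value"] for c in children)
        keep, small = [], []
        for child in children:
            (keep if child["value"] >= threshold_fraction * total else small).append(child)
        if small:
            keep.append({"name": "More…",
                         "value": sum(c["value"] for c in small),
                         "children": small})
        return keep

    children = [{"name": "HMS Example", "value": 900},
                {"name": "Spare anchors", "value": 8},
                {"name": "Paperclips", "value": 1}]
    print([c["name"] for c in fold_small_children(children)])
    # ['HMS Example', 'More…']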

He then got it to drill down to the source PDFs. ‘The Julian’ outdid himself and stayed up all night making a PDF annotator of the data. We have plans for this.

Oh, and we also made a brownfield map on the Channel 4 News site. The scraper can be found here, and the code for the visual here. The 25,000 data points came in Excel form and so were much easier to work with: nice data with lots of fields. The result is a very friendly application that lets users type in a postcode and see what land their local authority has up for redevelopment. But because of the new government coming in, the Homes and Communities Agency has not yet finished collecting the 2009 data.
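The postcode lookup itself is no more complicated than filtering rows on their postcode prefix. A minimal sketch, with invented column names standing in for the real NLUD spreadsheet fields:

    # Hypothetical rows standing in for the NLUD spreadsheet; the real data
    # had many more fields (site area, planning status, and so on).
    sites = [
        {"postcode": "L1 4DQ", "authority": "Liverpool", "site": "Old bus depot"},
        {"postcode": "M1 2AB", "authority": "Manchester", "site": "Disused warehouse"},
    ]

    def sites_near(postcode_prefix, rows=sites):
        """Return every brownfield site whose postcode starts with the given prefix."""
        prefix = postcode_prefix.replace(" ", "").upper()
        return [r for r in rows
                if r["postcode"].replace(" ", "").upper().startswith(prefix)]

    for row in sites_near("L1"):
        print(row["authority"], "-", row["site"])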

NAR and NLUD – you’ve been ScraperWikied!
