Diggers and Dinosaurs – Scraping at the Mozilla Festival
https://blog.scraperwiki.com/2011/10/diggers-and-dinosaurs-scraping-at-the-mozilla-festival/
Mon, 17 Oct 2011

In a complete paradigm shift of the epic battle between Godzilla and Mothra, we are turning our backs on the old claymation medium and embracing the digital age, where dinosaurs and diggers (yes, I am aware we are a machine and not a moth) can roam free across the lawless plains of web 2.0.

Both can be found at the Mozilla Festival park in London on 4-6 November. If you’re lucky you might even spot a wily firefox. There will be an enclosure on the Friday from 18:00, where our tamed digger driver, Francis Irving, can give you some driving lessons.

As part of the Data Journalism Workshop on the Saturday, 10:00-17:00, we’ll be hosting a ‘Scraping 101’ session. There will be a host of data trackers to guide you through the web wilderness, including the Open Knowledge Foundation’s Jonathan Gray and the European Journalism Centre’s Liliana Bounegru. There will be herds of other data/web beasts roaming the plains, so we suggest you stay inside or close to your digger.

If you’re interested in a close encounter of the data kind, sign up for the event here.

So watch out Mozilla Festival – you’re being ScraperWikied!

600 Lines of Code, 748 Revisions = A Load of Bubbles
https://blog.scraperwiki.com/2011/03/600-lines-of-code-748-revisions-a-load-of-bubbles/
Tue, 08 Mar 2011

When Channel 4’s Dispatches came across 1,100 pages of PDFs, known as the National Asset Register, they knew they had a problem on their hands. All that data, caged in a pixelated prison.

So ScraperWiki let loose ‘The Julian’. What ‘The Stig’ is to Top Gear, ‘The Julian’ is to ScraperWiki. That and our CTO.

‘The Julian’ did not like the PDFs. After scraping 10 pages of Defence assets, he got angry. The register may as well have been glued together by trolls. The five-year-old data, copied and pasted by Luddites from the previous government, was worse than useless.

So the ScraperWiki team set about rebuilding the register. Using good old-fashioned manpower (i.e. me) and a PDF cropper, we built a database of names, values and hierarchies that link directly to the PDFs.
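The post doesn’t spell out the schema, but a minimal sketch of the kind of row we stored might look like this, using the classic ScraperWiki Python library. The field names and example values are our own illustrative assumptions, not the actual National Asset Register schema:

    import scraperwiki

    # Illustrative only: field names and values are assumptions, not the
    # real NAR schema. Each row records its place in the hierarchy and a
    # pointer back to the PDF it came from, so every bubble can cite its source.
    asset = {
        'name': 'Example Depot',           # hypothetical asset name
        'value': 12500000,                 # asset value in pounds
        'parent': 'Ministry of Defence',   # position in the hierarchy
        'source_url': 'http://example.gov.uk/nar.pdf#page=42',  # hypothetical
    }
    scraperwiki.sqlite.save(unique_keys=['parent', 'name'], data=asset)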

Then Julian set about coding: 600 lines and 748 revisions! He sized the bubbles by asset value and got them to orbit their various parent bubbles. This required such functions as ‘MakeOtherBranchAggregationsRecurse(cluster)’.
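The real visual is 600 lines; here is a minimal sketch of just the sizing idea, under the usual bubble-chart assumption (ours, not necessarily Julian’s) that the area of a bubble, not its radius, should be proportional to the asset value:

    import math

    def bubble_radius(value, max_value, max_radius=80.0):
        # Area grows linearly with asset value, so radius grows with its
        # square root. Scaling the radius linearly would exaggerate big assets.
        return max_radius * math.sqrt(value / max_value)

    # An asset worth a quarter of the largest gets half the radius:
    print(bubble_radius(1000000, 4000000))  # 40.0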

This scared our designer Zarino a little, but he nevertheless made it much more user-friendly. This is where ScraperWiki’s powers of viewing live edits, chatting and collaborating became useful. The result was rounds of debugging interspersed with a healthy dose of cursing.

We then tried using it. We wanted the data to link back to its source, so it carried provenance. We wanted to give users the ability to explore the data. We wanted them to be able to find the bubbles that were too small to see. We prodded ‘The Julian’.

He hard-coded the smaller bubbles into a ‘More…’ bubble orbit. This made the whole Channel 4 News article a lot clearer, and changed the navigation from jumping between orbits to drilling down and finding out which assets are worth a similar amount.
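A sketch of that kind of grouping, with a threshold we made up for illustration (the real cut-off was hard-coded in the visual):

    def fold_into_more(assets, threshold=0.01):
        # assets: list of (name, value) pairs for one orbit. Anything under
        # 1% of the orbit's total is collapsed into a single 'More...' bubble.
        total = sum(value for _, value in assets)
        big = [(name, value) for name, value in assets
               if value >= threshold * total]
        small_sum = total - sum(value for _, value in big)
        if small_sum > 0:
            big.append(('More...', small_sum))
        return big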

He then got it to drill down to the source PDFs. ‘The Julian’ outdid himself and stayed up all night making a PDF annotator of the data. We have plans for this.

Oh, and we also made a brownfield map on the Channel 4 News site. The scraper can be found here, and the code for the visual here. The 25,000 data points were in Excel form, and so much easier to work with. This was nice data with lots of fields. The result: a very friendly application that lets users type in a postcode and see what land their local authority has up for redevelopment. But with the new government coming in, the Homes and Communities Agency has not yet finished collecting the 2009 data.
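As a toy sketch of the lookup idea, assuming each scraped row carries a postcode field (a hypothetical schema; a real application would more likely geocode the postcode and search by distance):

    def sites_for_postcode(sites, postcode):
        # sites: list of dicts from the scraper, each with a 'postcode'
        # field. Matches on the outward code, i.e. the district part
        # before the space ('L1' in 'L1 4HB').
        district = postcode.strip().upper().split()[0]
        return [site for site in sites
                if site.get('postcode', '').upper().startswith(district)]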

NAR and NLUD – you’ve been ScraperWikied!
