Meet the User – Ben Harris
https://blog.scraperwiki.com/2011/07/meet-the-user-ben-harris/
Fri, 22 Jul 2011

Ben Harris is one of the few ScraperWikians I've come across who actually codes for a day job (I'm sure there are lots – send them pizza). He's a sysadmin (surprised a ninja/evangelist isn't in some way attached to the title), which means he writes quite a lot of little hacky scripts to do useful things (just what we like!). A couple of years ago he started trying to find (and publish) the traffic regulation orders applying to Cambridge, and this led him into the world of Freedom of Information. Now he helps maintain the list of public authorities on WhatDoTheyKnow, which involves pulling together information from legislation, official registers and lots of websites. They have a wiki, FOIwiki, to help them keep track of things.

He says:

The concept of a wiki for running code is brilliant, and ScraperWiki does a pretty good job of implementing it.  While the scraping facilities are obviously good, the important thing is having somewhere public to keep (and link to) runnable code and the databases that go with it – Ben Harris

One of his experiments has been to write a ScraperWiki view that suggests updates to a page on FOIwiki based on the output of a scraper, automatically generating links to WhatDoTheyKnow where they already know about authorities.  The idea here is that an FOIwiki page would mirror a ScraperWiki dataset but with added notes, links and general explanation. He says it needs a lot more work, but it’s already proved useful for spotting new NHS Foundation Trusts.
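He hasn't shared the code for that view here, but the core idea – turning scraper rows into wiki markup, linking to WhatDoTheyKnow where a body is already known – might look something like this sketch (the field names, helper and sample rows are assumptions for illustration, not his actual code):

```python
# Hypothetical sketch: render scraper rows as MediaWiki bullet lines,
# with a WhatDoTheyKnow link where the body's URL slug is already known.
def wiki_line(row):
    if row.get("wdtk_slug"):
        return "* [https://www.whatdotheyknow.com/body/%s %s]" % (
            row["wdtk_slug"], row["name"])
    return "* %s ''(not yet on WhatDoTheyKnow)''" % row["name"]

rows = [
    {"name": "Example NHS Foundation Trust",
     "wdtk_slug": "example_nhs_foundation_trust"},
    {"name": "Another Trust", "wdtk_slug": None},
]
for row in rows:
    print(wiki_line(row))
```

A human editor would then paste the suggested lines into the FOIwiki page and add the notes and explanation by hand.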

A long-term project is to scrape contact details for English parish councils from various district council websites.  As for the data landscape along his ScraperWiki digger road trip, surprise surprise, there’s very little consistency! He’s written scrapers that ingest HTML, PDF, CSV and UK Legislation XML.  So having them all in the same kind of table in ScraperWiki means that he can then apply a common set of tools to the results.  A consequence of the multiplicity of sources is that they often use variant names for the same public authority, so he’s trying to improve our fuzzy matching rules.
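As a rough illustration of the kind of fuzzy matching involved, here's a minimal sketch using Python's standard-library difflib; the normalisation and the similarity threshold are assumptions for the example, not ScraperWiki's actual rules:

```python
import re
from difflib import SequenceMatcher

def normalise(name):
    # Lower-case, strip punctuation and collapse whitespace so that
    # trivially variant spellings of the same authority compare equal.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name.lower())).strip()

def best_match(name, candidates, threshold=0.85):
    # Return the most similar candidate, or None if nothing is close enough.
    scored = [(SequenceMatcher(None, normalise(name), normalise(c)).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

authorities = ["Cambridge City Council",
               "South Cambridgeshire District Council"]
print(best_match("Cambridge City Counci.", authorities))
# A misspelt or differently punctuated name still finds its authority.
```

Real matching rules would also need to handle renamed and merged authorities, which a pure string-similarity score can't catch.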

Sadly, a lot of the information he wants is only available on paper, and ScraperWiki can’t help.  In consequence, he’s spent hours in the offices of Cambridgeshire County Council with a scanner, and dug through the collections of legislation in Cambridge University Library, the British Library, and the National Archives.

He says the most accessible information is probably that from the Scottish Information Commissioner, who provides a big CSV file with the email addresses of most of the Schedule 1 Scottish Public Authorities.  Of course he’s imported it into ScraperWiki.
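A minimal sketch of that kind of import – reading a CSV of authority contact details into an SQLite table, much as ScraperWiki's datastore does under the hood; the column names and sample rows here are made up for illustration, and the real file's layout may well differ:

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for the Commissioner's CSV;
# the placeholder addresses are not real contact details.
SAMPLE_CSV = """name,email
Aberdeen City Council,foi@example.org
Accountant in Bankruptcy,enquiries@example.org
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS authority (name TEXT PRIMARY KEY, email TEXT)")

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # INSERT OR REPLACE keyed on name, so re-running the import
    # updates existing rows instead of duplicating them.
    conn.execute(
        "INSERT OR REPLACE INTO authority (name, email) VALUES (?, ?)",
        (row["name"], row["email"]))
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM authority").fetchone()[0])
```

Keying the table on the authority name is what makes the import repeatable: the scraper can be re-run whenever the Commissioner publishes an updated file.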

Strangely enough, many of the other things he spends his free time on also involve collecting and cataloguing things, be they grid squares, kilometres cycled, or railway stations (a friend’s project, but he helps out).

For being a collecting, cataloguing, council-document-scanning-crazed ScraperWikian (there's a title!), we salute you, Ben Harris! (*honking of digger horn*)
