An Ouseful Person to Know – Tony Hirst
https://blog.scraperwiki.com/2011/07/an-ouseful-person-to-know-tony-hirst/
Fri, 15 Jul 2011

We love to teach at ScraperWiki, but we love people who teach even more. With that in mind, may I introduce you to Tony Hirst, Open University lecturer and owner of the ‘ouseful’ blog. He teaches in the Department of Communication and Systems and has recently worked on course units relating to information skills, data visualisation and game design.

What are your particular interests when it comes to collecting data?

I spend a lot of time trying to make sense of the web, particularly how we can appropriate and combine web applications in interesting and useful ways, as well as trying to identify applications and approaches that might be relevant to higher and distance education on the one hand, and “data journalism” and civic engagement on the other. I’m an open data and open education advocate, and I try to make what contribution I can by identifying tools and techniques that lower the barrier to entry to public open data for people who aren’t blessed with the time to try such things out. I guess what I’m trying to do is help make data, and its analysis, more accessible than it often tends to be.

How have you found using ScraperWiki and what do you find it useful for?  

ScraperWiki removed the overhead of having to set up and host a scraping environment, and an associated data store, and provided me with a perfect gateway for creating my own scrapers (I use the Python route). I haven’t (yet) started playing with ScraperWiki views, but that’s certainly on my to-do list. I find it really useful for “smash and grab” raids, though I have a few scrapers that I really should tweak to run as scheduled scrapers. Being able to browse other scrapers, as well as put out calls for help to the ScraperWiki community, is a great way to bootstrap problem solving when writing scrapers, though I have to admit that as often as not I resort to StackOverflow (and on occasion GetTheData) for my Q&A help. [Disclaimer: I helped set up GetTheData with the OKF‘s Rufus Pollock]
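For the curious, a “smash and grab” scraper on the classic ScraperWiki platform can be as short as the sketch below. The URL, table layout and column names are purely illustrative placeholders, not taken from Tony’s scrapers; it just shows the fetch–parse–save shape of a quick one-off grab of an HTML table into the datastore.

```python
import scraperwiki   # classic ScraperWiki helper library
import lxml.html

# Hypothetical target: a page with one simple two-column table.
URL = "http://www.example.com/some-report.html"

html = scraperwiki.scrape(URL)           # fetch the page
root = lxml.html.fromstring(html)

for row in root.cssselect("table tr"):
    cells = [td.text_content().strip() for td in row.cssselect("td")]
    if len(cells) < 2:
        continue                         # skip header and empty rows
    # Save into the ScraperWiki datastore, keyed on the first column
    scraperwiki.sqlite.save(unique_keys=["name"],
                            data={"name": cells[0], "value": cells[1]})
```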

Are there any data projects you’re working on at the moment?

I have an idea in mind for a database of audio/podcast/radio interviews and discussions around books, which would allow a user to look up a (probably recent) book by ISBN and then find book talks and author interviews associated with it. I’ve started working on several different scrapers that separately pull book and audio data from various sites: Tech Nation (IT Conversations), Authors@Google (YouTube), and various BBC programmes (though I’m not sure of the rights issues there!). I now really need to revisit them all to see if I can come up with some sort of normalised view over the data I might be able to get from each of those sources, and a set of rules for parsing out book and author data from the free text descriptions that contain that information. [Read his blog post here]
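To give a feel for what such a normalised view might involve, here is a rough sketch: each source scraper yields its own raw record, and a small mapping function coerces them into one common shape keyed on ISBN. The field names below are illustrative guesses, not the actual fields Tony’s scrapers collect.

```python
# Illustrative only: the raw field names per source are assumptions.
def normalise(source, raw):
    """Map a source-specific raw record onto one common shape."""
    if source == "itconversations":
        return {
            "isbn": raw.get("isbn"),
            "title": raw.get("book_title"),
            "author": raw.get("author"),
            "audio_url": raw.get("mp3_url"),
            "source": "Tech Nation (IT Conversations)",
        }
    if source == "authors_at_google":
        return {
            "isbn": raw.get("isbn"),
            "title": raw.get("video_title"),
            "author": raw.get("speaker"),
            "audio_url": raw.get("youtube_url"),
            "source": "Authors@Google (YouTube)",
        }
    raise ValueError("unknown source: %s" % source)
```

Looking a book up by ISBN then becomes a simple filter over the combined table of normalised records.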

I’m also tempted to start scraping university course prospectus web pages to try and build up a catalogue of courses from UK universities, in part because UCAS seem reluctant to release their aggregation of this data, and in part because the universities seem to be taking such a long time to get round to releasing course data in a structured way using the XCRI course marketing information XML standard.

Anything else you’d like to add? A little about your passions and hobbies?

I’ve started getting into the world of motorsport data, and have built a set of scripts to parse the FIA/Formula 1 timing and results sheets (PDFs). Looking back over the scraper code, I wish I’d documented it… I think I must have been “in the flow” when I wrote it! Every couple of weeks, I go in and run each script separately by hand. I really should automate it to give me a one-click dump of everything, but I take a guilty pleasure in scraping each document separately! I copy each set of data as a Python array by hand and put it into a text file, which I then process using a series of other Python scripts and ultimately dump into CSV files. I’m not sure why I don’t just process the data in ScraperWiki and pop it into the database there… Hmmm…?!
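For anyone wondering what that last step looks like, dumping a hand-copied Python array into a CSV file takes only a few lines. The driver codes and lap times below are made-up stand-ins, not real timing data from the FIA sheets.

```python
import csv

# Stand-in for one hand-copied timing array; the real sheets have far more rows.
laptimes = [
    ["HAM", 1, "1:32.105"],
    ["VET", 1, "1:32.380"],
]

with open("laptimes.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["driver", "lap", "time"])   # header row
    writer.writerows(laptimes)                   # one CSV row per entry
```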
