The Guardian – ScraperWiki: Extract tables from PDFs and scrape the web

Which witch is which? Wed, 02 Nov 2011 17:35:50 +0000 I hope you all had a spooky Halloween and have gorged yourselves on candy. In keeping with the festive cheer, The Guardian Datastore has hopped aboard and, with our digger in residence, Chris Blower, scraped the court cases of Scottish witches.

Thanks to Lisa Evans we can now see that the most common reason for a court case against a suspected witch was that the defendant had been implicated by another woman convicted as, or suspected to be, a witch. Shocking! (Note: our digger is gender neutral!)

So hats off (pun intended) to the digger, the digital journalist and the data!

A Bonny Wee Hack Day at #hhhglas Mon, 28 Mar 2011 16:42:47 +0000 For our first venture to Scotland, where better to be than BBC Scotland? We had 8 teams of hacks and hackers digging around the Scottish data beat. For this very special occasion the ScraperWiki digger donned tartan! With this special digger, fire incidents, planning applications, publicly-owned property and gifts councillors received have been mined. Here’s a word from our own Francis Irving:


Now check out the projects:

Fire Bugs – This project scrapes the data from the Central Scotland Fire Service’s Recorded Incidents log, creating an alert when new incidents are logged. It also retrieves historic data.

The team consisted of 1 hack (Chris Sleight, from BBC Scotland) and 2 hackers (Ben Lyons and Paul Miller, from IRISS).

Central Scotland Fire Service put a lot of data on their website but, as is usual, not in a very useful form. Only 60 incidents are shown on the site, but if you dig down you get over 15,000 buried records. In one day’s work, Fire Bugs scraped the records and decided to look at malicious false alarms. Luckily for them, the language and structure of the records were consistent. They found that 3.5% of all calls were malicious false alarms. They even made a tree map on ScraperWiki using Protovis. Fire Bugs have clearly opened up a huge amount of potential with this data.
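The Fire Bugs approach can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual scraper: the markup below is invented (the real incidents log was structured differently), but the idea is the same — walk the table cells of the log and compute what share of incidents are malicious false alarms.

```python
# Minimal sketch of the Fire Bugs idea: parse an incidents log
# (hypothetical markup, not the real Central Scotland Fire Service
# pages) and compute the share of malicious false alarms.
from html.parser import HTMLParser

SAMPLE_LOG = """
<table>
<tr><td>2011-03-01</td><td>Dwelling fire</td></tr>
<tr><td>2011-03-02</td><td>Malicious false alarm</td></tr>
<tr><td>2011-03-03</td><td>Road traffic collision</td></tr>
<tr><td>2011-03-04</td><td>Malicious false alarm</td></tr>
</table>
"""

class IncidentParser(HTMLParser):
    """Collects the text of every table cell, in document order."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

def malicious_share(html):
    parser = IncidentParser()
    parser.feed(html)
    # Pair cells back up into (date, incident type) rows.
    rows = list(zip(parser.cells[0::2], parser.cells[1::2]))
    malicious = sum(1 for _, kind in rows if "malicious" in kind.lower())
    return malicious / len(rows)

print(f"{malicious_share(SAMPLE_LOG):.1%}")  # → 50.0% on the sample rows
```

On the real 15,000-record log the same loop runs over every page of results rather than one embedded table, which is where the alerting on newly logged incidents comes in.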

Edinburgh Planning App Map – This is Edinburgh’s first automated map of local planning applications! This is a popular theme for our hack days and on ScraperWiki in general. Open Australia are using ScraperWiki for their planning alerts.

This team consisted of 1 hack (Michael MacLeod, beatblogger for Guardian Edinburgh) and 1 hacker (Robert McWilliam, from Blueflow).

As Michael MacLeod pointed out, people don’t know how to use the local council website. You can’t just type in your postcode to find applications near you. There’s a map online but it’s truly awful! The team scraped the site and made a map which updates every day, rather than just every week like the council site. Michael used this new tool to take a closer look at his beat and found a planning application for urban paintball. What he duly noted was that the Facebook page was trying to be secretive about the location! Using the map, he found it was going to be right behind a block of flats. He will be talking to residents!

Hide by the Clyde – This project creates a map that lets the user compare exam results in different areas and correlate them with measures of social deprivation.

The team consisted of 1 hack (Bruce Munro from BBC Scotland) and 3 hackers (Nicola Osborne from Edina, Sean Carroll from BBC Scotland and Bob Kerr from Open Street Map).

They looked at data from Learning and Teaching Scotland and scraped the search-for-schools form. Here is the map. From this freed data they were able to make a heat map of free school meals registration in Scotland and compare education statistics between, for example, Glasgow and Argyll & Bute. A major project would be to put all this information on one site in a user-friendly format.

Public Buildings for Sale – This is a tool to show all publicly-owned property that is for sale or rent. It will be a Scottish sister to ScraperWiki’s brownfield sites map. The project aims to answer the question: how much public land is being sold without our knowledge?

This team consisted of 1 hack (Peter Mackay from BBC Scotland) and 1 hacker (Martyn Inglis from The Guardian).

The data they wanted is on a horrible website covering property sales and lettings from Scotland’s public sector. Its nested HTML tables are very difficult to scrape. They managed to scrape this far and plan to remodel the data to make it searchable by postcode. From this, they want to glean more information about councils’ buying and selling strategies.
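Why are nested tables so awkward? A naive scraper that grabs every `<td>` picks up the outer layout cells as well as the data. One workaround is to track table nesting depth and keep only the innermost cells. Here is a sketch of that technique — the markup and postcodes are made up, not the real public-sector property site:

```python
# Nested HTML tables defeat naive cell extraction: keep only cells at
# the innermost nesting level by counting <table> depth. (Invented
# markup for illustration, not the real property-sales site.)
from html.parser import HTMLParser

SAMPLE = """
<table><tr><td>
  <table>
    <tr><td>EH1 1AA</td><td>Office, for sale</td></tr>
    <tr><td>G2 3BB</td><td>Depot, to let</td></tr>
  </table>
</td></tr></table>
"""

class NestedTableParser(HTMLParser):
    def __init__(self, target_depth=2):
        super().__init__()
        self.depth = 0            # current <table> nesting level
        self.target_depth = target_depth
        self.in_td = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "tr" and self.depth == self.target_depth:
            self.rows.append([])
        elif tag == "td" and self.depth == self.target_depth:
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth -= 1
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.rows[-1].append(data.strip())

parser = NestedTableParser()
parser.feed(SAMPLE)
# Index the inner rows by postcode, as the team planned to do.
by_postcode = {row[0]: row[1] for row in parser.rows if row}
print(by_postcode["EH1 1AA"])  # → Office, for sale
```

The same depth-counting trick generalises to any fixed nesting level; the hard part in practice is that real sites are rarely consistent about which level the data lives at.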

Crash Test Dummies – This project takes three separate approaches to looking at Scottish road accident data.

  1. What accidents get reported?
  2. Are you safe on the roads?
  3. What effect do road safety measures have on your journey?

The team consisted of 2 hacks (David Eyre and Brendon Crowther from BBC Scotland) and 2 hackers (Ali Craigmile and Mo McRoberts from BBC Scotland).

In just one day they managed to build a prototype for a “Mind how you go!” BBC Scotland site. They used road traffic accident reports based on 2005–2009 data to create a form that showed how likely you were to survive your journey depending on your age, sex and where you’re going! They built a spreadsheet from even more data so that the site had the potential to go beyond the records. They also scraped Google searches of reported road traffic accidents and mapped the reports from BBC Scotland from 2010.

BME Scotland – This project aims to find out the effects of the recession on education! Is education a route to the ghetto? It aims to compare BME educational achievement with unemployment statistics to find out which areas of Scotland are economic no-gos.

The team consisted of 1 hack (Fin Wycherley) and 1 hacker (Paul McNally).

The lesson learnt here was that sometimes there’s not enough data to go around. What they do know is that the African population is doing exceedingly well in education in Scotland. However, it also has a relatively high level of unemployment. The result was a call for better data collection, as none of the information fitted together in a way that would help answer the question: why?

Edinburgh Council – This project searches Edinburgh councillors’ gifts and expenses!

The team consisted of 2 hacks (Paola Di Maoi and Anand Ramkissoon) and 1 hacker (James Baster).

As it turned out, the Council website is easy to scrape. The structure of the site is consistent and clean. ScraperWiki likes this! And so here is the scraper. As James pointed out, the data needs to be double-checked for misspelled entries, etc. But the preliminary data shows that Lothian Buses gave the most gifts and Phil Wheeler received the most gifts.
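The “who gave most / who received most” tally is a one-liner once the register is scraped. A sketch with `collections.Counter` — the donor and recipient names match the findings above, but the individual rows and counts here are invented for illustration:

```python
# Tally scraped gift records by donor and by recipient. The rows below
# are invented sample data, not the real Edinburgh Council register.
from collections import Counter

gifts = [
    {"donor": "Lothian Buses", "recipient": "Phil Wheeler"},
    {"donor": "Lothian Buses", "recipient": "Phil Wheeler"},
    {"donor": "Edinburgh Zoo", "recipient": "Phil Wheeler"},
    {"donor": "Lothian Buses", "recipient": "Another Councillor"},
]

top_donor, _ = Counter(g["donor"] for g in gifts).most_common(1)[0]
top_recipient, _ = Counter(g["recipient"] for g in gifts).most_common(1)[0]
print(top_donor, "/", top_recipient)  # → Lothian Buses / Phil Wheeler
```

As James noted, this only works once the entries are normalised — “Lothian Buses” and a misspelled “Lothian Busses” would otherwise count as two different donors.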

Magners Cider – This project aims to scrape the Magners League rugby scores. The team consisted of a hack/hacker pair of Paul McNally (again, we love eager hackers!) and Tony Sinclair of BBC Scotland (who had to keep up the day job and so was not around for a picture). Previously, a graphics operator had to input the information from the site by hand into the graphics system to produce the league tables you see on screen. Seeing as the graphics software can access spreadsheets, Tony thought, “Why not automate the process by scraping?” And that is what they did. So the scores have gone from ScraperWiki to TV!
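The hand-off to the graphics system is just the last step of the pipeline: dump the scraped standings to a spreadsheet-readable file. A sketch with Python’s `csv` module — the team names and figures are invented, and the real scraper’s output format may have differed:

```python
# Final step of the Magners Cider pipeline, sketched: write scraped
# league standings to CSV so spreadsheet-aware graphics software can
# load them. (Invented teams and figures, for illustration only.)
import csv
import io

standings = [
    ("Munster", 18, 14, 60),
    ("Leinster", 18, 13, 58),
    ("Ulster", 18, 11, 50),
]

buf = io.StringIO()  # a real scraper would open a file instead
writer = csv.writer(buf)
writer.writerow(["Team", "Played", "Won", "Points"])
for row in sorted(standings, key=lambda r: r[3], reverse=True):
    writer.writerow(row)

print(buf.getvalue())
```

Run on a schedule, this replaces the graphics operator’s manual re-keying entirely — which is exactly the gain Tony was after.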

And the winners are… (drum roll please)

  • 1st Prize: Edinburgh Planning Map App
  • 2nd Prize: Fire Bugs
  • 3rd Prize: Magners Cider
  • Best Scraper: Fire Bugs

A big shout out

Our judges: Jon Jacob from the BBC College of Journalism, Allan Donald from STV and Huw Owen, editor of Good Morning Scotland.

Our sponsors: BBC Scotland, the BBC College of Journalism and The Guardian Open Platform.

Edinburgh planning applications, fire incidents and Rugby scores – you’ve been ScraperWikied!

The winners and the judges

The Web Data Revolution – a new future for journalism Fri, 12 Nov 2010 20:42:55 +0000 This event was hosted by The Guardian. They say:

“The web not only gives easy access to billions of statistics on every matter – from MP’s expenses to the location of every public convenience in the UK – but also provides the tools to visualise said information, giving a clarity of voice and an equality of access to stories that pre-web could never have been told on such a scale.

But the data revolution has also brought with it the risk of confusion, misinterpretation and inaccessibility. How do you know where to look? What is credible or up to date? Official documents are often published as uneditable pdf files for example – useless for analysis except in ways already done by the organisation itself.”

The discussion features an expert panel (people I know) consisting of David McCandless of ‘Information is Beautiful’ fame, Heather Brooke of FOI fame, Simon Rogers of Guardian DataBlog fame and Richard Pope of ScraperWiki fame.

Data journalism: our five point guide – Simon Rogers

None of this is new – you need to visualize data to make a point. There was a data table in the Guardian in May 1981 – data has always been around, and has always been needed to know the truth. If you don’t know what’s going on, how can you change things in society?

Now, public spending visualizations. Beautiful, but a lot of work. But then government requests it. Now we all have the tools – a lot of this doesn’t even involve hardcore programming. You need to be inspired by telling stories; the story needs to drive the editorial need to use data.

Only computers will know what to ask, e.g. with the Wikileaks data. Technical skills and design are needed, but they can be built upon. Not all data is interesting. You need a nose for data to learn what will make a good data-driven story. Raw data is just numbers without the design to make it beautiful.

It’s about sharing. Data needs to be made as open as possible! People out there have much better knowledge than journalists sitting in the office. We need to harness that knowledge.

Information is Beautiful – David McCandless

You need to see patterns and connections that matter in the data. That is data journalism. You need to orientate your audience, take them on a journey.

Data is abstract. You need to contextualize it to understand what it means. You need to make it relevant. If you make it beautiful/interesting, everyone will love it. Looking at a graph of the most common break-up time according to Facebook.

We’re saturated with data. Data is the new soil. Visualizations are the earthy blossoms!

We are saturated by data, but if we use the right journalistic inkling we can grow beautiful stories. Our fears visualized using Google Insights. Check it out: are the Columbine shooting and violent video games co-dependent?

Data as a prism – use it to correct your vision. You can take all the other top-ten military budgets and fit them into America’s. But America is a vastly rich country – it can fit in all the other four top economies. So military budget as a % of GDP? Myanmar is the biggest. Biggest army = China. But as a % of population = North Korea.

The internet is a visualization design medium. We’ve been drenched in it. We’re constantly hunting for patterns in a sea of information. We’ve all been trained by our use of the web. We’re all information curious.

Heather Brooke

“The only way I could get answers to my questions to public bodies was through data.” Police in her local area were not turning up, and she wanted to know: was it just her? The only way to tell was through official logs, not their word.

Once you ask, data starts trickling out. But it needed around 50 requests! And it came in the form of a complex spreadsheet, riven with factual inaccuracies. Data is only as good and usable as the person who gathers and inputs it. “The public can’t be trusted with the raw data” – the attitude she got from public bodies. Hence the need for the Freedom of Information Act.

Open data needs to start from the top – MPs’ expenses. Citizens of a democratic state have a right to openness. We need true open data.

MPs’ expenses shifted everyone’s notion of who the government was actually working for. MPs felt their expenses were their data, not ours.

Simon Jefferies

Different structured forms are needed for different data. The structure gives it power. Data within data, within context. Very rich stories. A new way of journalism. Allow users to interrogate data themselves. Information architecture!

You have to be sure your fact is right!

Richard Pope – ScraperWiki

Data is rarely useable by journalists. It is rarely collected with journalists or the public interest in mind. ScraperWiki wants to make data useable and collaborative.

There’s a blend of skills needed to do data journalism. We need to democratise those skills to break a story.

These are early days but we can see that journalism is changing. A computer is just another tool. When a journalist makes a call it’s not called ‘telephone-assisted reporting’. This isn’t new; we just need to learn to use more and more data. And we need to understand it.

This will not be a specialised area, it will just be reporting! It all comes down to asking the right questions.
