Hacks/Hackers London
https://blog.scraperwiki.com/2010/11/hackshackers-london/ – 24 November 2010

First of all, the Iraq War Logs:

Round One – The Cleaning

Documents, records and words are all hugely intimidating in their vastness. Tools that help include MySQL, UltraEdit and Google Refine, but this stage is still incredibly frustrating.
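
As a rough illustration (not the Bureau's actual pipeline), a first cleaning pass over a CSV export of the reports might look like this in Python; the file name and column names are placeholders:

```python
# Minimal cleaning sketch, assuming the raw reports have been exported to a
# hypothetical CSV file (reports_raw.csv) with 'date' and 'category' columns.
import csv
from datetime import datetime

def clean_row(row):
    """Trim stray whitespace and normalise the report date to ISO format."""
    cleaned = {k: v.strip() for k, v in row.items()}
    # The raw dumps mix date formats; try a couple of likely ones.
    for fmt in ("%d/%m/%Y %H:%M", "%Y-%m-%d %H:%M:%S"):
        try:
            cleaned["date"] = datetime.strptime(cleaned["date"], fmt).isoformat()
            break
        except ValueError:
            continue
    return cleaned

with open("reports_raw.csv", newline="") as src, \
     open("reports_clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(clean_row(row))
```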

Round Two – The Problem

How do you tackle the different types of document? There were even small PDF files. We had to build a basic web interface for everyday queries. It needed multiple fields, and this part is extremely difficult, especially when you need to explain it to an editor. You have to have a healthy mistrust of the data. Asking the right questions is crucial; asking something the data is not structured to answer is the real problem.
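
To give a sense of what "multiple fields" means in practice, here is a hedged sketch of the kind of query such an interface has to build, using sqlite3 and an assumed 'reports' table; it is not the interface the team actually built:

```python
# Build a query from whichever fields the journalist filled in.
# Table and column names ('reports', 'category', 'region', 'date') are assumptions.
import sqlite3

def search_reports(conn, category=None, region=None, date_from=None, date_to=None):
    clauses, params = [], []
    if category:
        clauses.append("category = ?")
        params.append(category)
    if region:
        clauses.append("region = ?")
        params.append(region)
    if date_from:
        clauses.append("date >= ?")
        params.append(date_from)
    if date_to:
        clauses.append("date <= ?")
        params.append(date_to)
    sql = "SELECT * FROM reports"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql, params).fetchall()
```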

Round Three – What We Did

Looked at key incidents and names of interest which the media had previously reported. The trick was to try and find what we didn't know. Start by looking at categories of deaths over time. We found that it was murders rather than weapons fire that killed the most people: much of it was civilian in-fighting. We used Tableau, with up to 100,000 records. We also had to get researchers to sift through reports and manually verify what the data meant. If you do that, make sure you organise a system that everyone uses to categorise, calculate and tabulate. You can then use Excel and filter; it is quicker with Access.
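
For illustration only, a "categories of deaths by time" tally can be done in a few lines of Python; the column names here are assumptions, not the real schema:

```python
# Count reports per (month, category) from the cleaned CSV produced above.
import csv
from collections import Counter

by_month_category = Counter()
with open("reports_clean.csv", newline="") as f:
    for row in csv.DictReader(f):
        month = row["date"][:7]            # e.g. "2006-11"
        by_month_category[(month, row["category"])] += 1

# Which category accounts for the most reports overall?
totals = Counter()
for (month, category), n in by_month_category.items():
    totals[category] += n
for category, n in totals.most_common(5):
    print(category, n)
```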

Data was used as part of the research, not just to make loads of charts. Visual maps tell a story and are quite powerful to an audience. Maps can also be used for newsgathering: we asked journalists which areas they were interested in and sent them the reports, geocoded, so they could read all the reports for the area they were heading to on the plane. You can also link a big story to its log and prove it to be true; the log can validate a report, so you can use it.
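
A minimal sketch of that kind of geocoded filtering, assuming each record carries lat/lon fields (the field names and the example coordinates are made up):

```python
# Pull every report inside a rough bounding box for one area a journalist
# is travelling to. Column names 'lat' and 'lon' are assumptions.
import csv

def reports_in_box(path, min_lat, max_lat, min_lon, max_lon):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                lat, lon = float(row["lat"]), float(row["lon"])
            except (KeyError, ValueError):
                continue
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
                yield row

# e.g. everything in an approximate box around one city
area_reports = list(reports_in_box("reports_clean.csv", 30.2, 30.8, 47.5, 48.1))
```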

What Did it Take?

10 weeks. 25 people. 30,000 reports. 5,000 reports manually recounted. More than one 18-hour day.

ScraperWiki

A lot of really useful information is not easily available on the web. Writing a web scraper not only makes searching and viewing that information easier, it can bring stories to light which were hidden in the mass of digital structures.
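
For readers who have never written one, here is a minimal scraper sketch in the spirit of a ScraperWiki script, using only the Python standard library; the URL and the table layout are placeholders, not a real dataset:

```python
# Fetch a page and pull every table row into a list of cell values.
from urllib.request import urlopen
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

html = urlopen("http://example.org/some-public-register").read().decode("utf-8", "replace")
parser = RowCollector()
parser.feed(html)
print(parser.rows[:5])
```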

When you do this you become data journalists in a strange sense, in that you're making data, e.g. The Public Whip. MPs have written in complaining. You then need discussion.

What is journalism now? Look at PlanningAlerts.com. You can make it (journalism?) with scrapers. Scrapers are difficult to maintain, so ScraperWiki can be the backend to these sites, with a community maintaining them: crowdsource your coding and site maintenance. For data journalism of this sort there's a kind of commitment that you don't find when reporting stories.

Hacks/Hackers London meetup to discuss Iraq War logs
https://blog.scraperwiki.com/2010/10/hackshackerslondon/ – 26 October 2010

ScraperWiki will be supporting the November Hacks/Hackers London meetup at 7pm on Wednesday 24th November 2010 at The Irish Club, 2-4 Tudor Street, EC4Y 0AA, London. A few tickets are still available, but places are filling fast.

Schedule

  • 7.00pm: The data journalism behind the Iraq War Logs – James Ball, Bureau of Investigative Journalism

James, Development Producer for the Bureau of Investigative Journalism and Chief Data Analyst on the TBIJ/Channel 4 Dispatches investigation into the Iraq War Logs, will explain how data journalism powered the process.

  • 7.30pm: TBC
  • 8pm: Social!