Hacks/Hackers London
https://blog.scraperwiki.com/2010/11/hackshackers-london/
Wed, 24 Nov 2010

First of all, the Iraq War Logs:

Round One – The Cleaning

The documents, records and words are hugely intimidating in their vastness. Tools that help include MySQL, UltraEdit and Google Refine, but this stage is incredibly frustrating.
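
As a rough illustration of the sort of cleaning involved (a minimal sketch, not the team's actual pipeline), the snippet below loads a hypothetical raw CSV of reports into SQLite, normalising whitespace and dates as it goes. The file name, column names and date formats are all assumptions.

```python
# Minimal cleaning sketch: raw CSV -> SQLite, with whitespace and date cleanup.
# "raw_reports.csv" and its columns are hypothetical placeholders.
import csv
import re
import sqlite3
from datetime import datetime

conn = sqlite3.connect("warlogs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS reports
                (report_id TEXT PRIMARY KEY, report_date TEXT,
                 category TEXT, summary TEXT)""")

def clean_text(value):
    """Collapse runs of whitespace and strip the leftovers."""
    return re.sub(r"\s+", " ", value or "").strip()

def clean_date(value):
    """Try a couple of assumed date formats; return ISO 8601 or None."""
    if not value:
        return None
    for fmt in ("%d/%m/%Y", "%Y-%m-%d %H:%M"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable dates for manual review

with open("raw_reports.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        conn.execute("INSERT OR REPLACE INTO reports VALUES (?, ?, ?, ?)",
                     (clean_text(row["id"]), clean_date(row["date"]),
                      clean_text(row["category"]), clean_text(row["summary"])))
conn.commit()
```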

Round Two – The Problem

How do you tackle the different types of documents? There were even small PDF files. We had to build a basic web interface for everyday queries. It needed multiple fields, and this part is extremely difficult, especially when you need to explain it to an editor. You have to have a healthy mistrust of the data. Asking the right questions is crucial; asking something the data is not structured to answer is the real problem.
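
A query interface with multiple fields mostly comes down to combining whichever fields were filled in into one filter. Below is a hedged sketch of that idea; the table and column names (including "region") are assumptions, not the team's actual schema.

```python
# Multi-field query helper of the kind a basic web interface could sit on.
import sqlite3

def search_reports(db_path, category=None, region=None,
                   date_from=None, date_to=None):
    """Build a WHERE clause from only the fields that were supplied."""
    clauses, params = [], []
    if category:
        clauses.append("category = ?")
        params.append(category)
    if region:
        clauses.append("region = ?")
        params.append(region)
    if date_from:
        clauses.append("report_date >= ?")
        params.append(date_from)
    if date_to:
        clauses.append("report_date <= ?")
        params.append(date_to)
    sql = "SELECT report_id, report_date, category, summary FROM reports"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()

# e.g. search_reports("warlogs.db", category="MURDER", date_from="2006-01-01")
```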

Round Three – What We Did

We looked at key incidents and names of interest which the media had previously reported. The trick was to try and find what we didn't know. Start by looking at categories of deaths over time. We found that it was murders rather than weapons fire that killed the most; it was civilian in-fighting. We used Tableau, on up to 100,000 records. We also had to get researchers to sift through reports and manually verify what the data meant. If you do that, make sure you organise a system that everyone uses to categorise, calculate and tabulate. You can then use Excel and filter, though it's quicker with Access.
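
The "categories of deaths by time" step is essentially a GROUP BY. A small sketch of that, against the same assumed SQLite table as above (table and column names are placeholders):

```python
# Count reports per category per month, assuming ISO-format report_date values.
import sqlite3

conn = sqlite3.connect("warlogs.db")
rows = conn.execute("""
    SELECT strftime('%Y-%m', report_date) AS month,
           category,
           COUNT(*) AS reports
    FROM reports
    GROUP BY month, category
    ORDER BY month, reports DESC
""").fetchall()

for month, category, count in rows:
    print(month, category, count)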

The data was used as part of research, not just to make loads of charts. Visual maps tell a story and are quite powerful to an audience. Maps can also be used for newsgathering: we asked journalists which areas they were interested in and sent them the geocoded reports, so they could read up on the plane on all the reports in the area they were heading to. You can also link a big story to its log and prove it to be true; the log can validate a report, so you can use it.
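
A hedged sketch of the geocoded-reports idea: pull everything inside a bounding box for the area a journalist asked about and dump it to CSV for offline reading. The latitude/longitude columns and the example coordinates are assumptions for illustration.

```python
# Export all reports inside a bounding box to a CSV a journalist can take away.
import csv
import sqlite3

def reports_for_area(db_path, out_csv, min_lat, max_lat, min_lon, max_lon):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT report_id, report_date, category, latitude, longitude, summary
           FROM reports
           WHERE latitude BETWEEN ? AND ? AND longitude BETWEEN ? AND ?
           ORDER BY report_date""",
        (min_lat, max_lat, min_lon, max_lon)).fetchall()
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "date", "category", "lat", "lon", "summary"])
        writer.writerows(rows)
    conn.close()

# e.g. a rough bounding box around Basra:
# reports_for_area("warlogs.db", "basra_reports.csv", 30.2, 30.8, 47.5, 48.1)
```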

What Did it Take?

10 weeks. 25 people. 30,000 reports. 5,000 reports manually recounted. More than one 18-hour day.

ScraperWiki

A lot of really useful information is not easily available on the web. Writing a web scraper not only makes searching and viewing information better, but can also bring to light stories which were hidden in the mass of digital structures.
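
For a sense of what such a scraper looks like, here is a minimal sketch: fetch a page, pull rows out of an HTML table, and save them as structured data. The URL and table layout are placeholders, and BeautifulSoup is used just as one convenient parsing library.

```python
# Minimal scraper sketch: HTML table -> CSV.  example.org is a placeholder.
import csv
import urllib.request

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

URL = "https://example.org/some-public-register"

html = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

with open("register.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```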

When you do this you become a data journalist in a strange sense, in that you're making data, e.g. The Public Whip. MPs have written in complaining, and you then need discussion.

What counts as journalism now? Look at PlanningAlerts.com: you can make it (journalism?) with scrapers. Scrapers are difficult to maintain, so ScraperWiki can be the backend to these sites, with a community maintaining them, crowdsourcing your coding and site maintenance. For data journalism of this sort there's a kind of commitment that you don't find when reporting stories.
