data journalism – ScraperWiki
https://blog.scraperwiki.com
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

“the impact on our industry only begins this weekend” says Susan E McGregor, Professor at the world’s foremost school of journalism
https://blog.scraperwiki.com/2012/02/the-impact-on-our-industry-only-begins-this-weekend-says-susan-mcgregor-professor-at-the-worlds-foremost-school-of-journalism/
Wed, 01 Feb 2012 18:15:23 +0000

This is a guest blog post by Susan E. McGregor, Assistant Professor at the Tow Center for Digital Journalism, Columbia University.

The Tow Center for Digital Journalism at Columbia University Graduate School of Journalism is proud to be partnering with Knight News Challenge winner ScraperWiki this Friday and Saturday for their first Journalism Data Camp in the U.S. This event provides us with an opportunity to host a wide range of programmers, journalists and educators interested in expanding access to essential data sets, while connecting those communities to one another. We are also looking forward to extending the impact of this weekend’s activities by working in conjunction with our colleagues at the Stabile Center for Investigative Journalism and The New York World to further pursue those stories related to New York accountability issues that may be touched on during this weekend’s data “liberation” activities.

As an online tool, ScraperWiki is an innovative technical platform that allows users to build, test, and execute programmatic “scrapers” that transform web pages and PDFs into more accessible, usable data formats. As an online archive and repository, ScraperWiki helps improve access to scraped data sets by making them collectively available on its website. Finally, as a web-based collaboration space, ScraperWiki helps convene journalists and programmers around projects of shared interest, in addition to fostering peer-to-peer training and support.

Each of the above features of the ScraperWiki platform resonates closely with the Tow Center’s own priorities for data journalism. Making data available in formats that can be easily parsed, analyzed, and distributed is an essential part of data transparency, and the accountability journalism it serves. Providing a public access point for that data allows both journalists and their audiences to fact-check and elaborate upon the work that their peers have done, leveraging it against future projects and creating more comprehensive resources. And of course, the knowledge sharing and collaboration that takes place between programmers and journalists through ScraperWiki echoes the Tow Center’s mandate to educate and innovate at the intersection of computer science and journalism, both through its own dual-degree program in computer science and journalism, and through such public events as this one.

While we are certain that ScraperWiki will find ready adoption in cities and newsrooms throughout the country in the months to come, we look forward to growing an ongoing relationship with ScraperWiki and its contributors here in the New York area. By hosting this event we hope to introduce many of our students and colleagues to a truly remarkable tool, one whose impact on our industry only begins this weekend.

D (ata) + J (ournalism) + Camp 2011 = #djcamp2011
https://blog.scraperwiki.com/2011/04/d-ata-j-ournalism-camp-2011-djcamp2011/
Mon, 04 Apr 2011 14:02:24 +0000

Here at ScraperWiki we like to learn, and we also relish the opportunity to teach. Be it scraping or viewing, Ruby or Python or PHP: we want to spread the data and the scraping knowledge.

So it’ll come as no big surprise that our head professor, Francis Irving, will be lending a scraping hand at #djcamp2011.

“What is #djcamp2011?”, you ask. Here’s what you need to know:

So for the many hacks we’ve met at our Hacks and Hackers Hack Days, this is a brilliant opportunity to learn some of the hacker trade secrets! Sign up here.

The Web Data Revolution – a new future for journalism
https://blog.scraperwiki.com/2010/11/the-web-data-revolution-a-new-future-for-journalism/
Fri, 12 Nov 2010 20:42:55 +0000

This event is hosted by The Guardian. They say:

“The web not only gives easy access to billions of statistics on every matter – from MP’s expenses to the location of every public convenience in the UK – but also provides the tools to visualise said information, giving a clarity of voice and an equality of access to stories that pre-web could never have been told on such a scale.

But the data revolution has also brought with it the risk of confusion, misinterpretation and inaccessibility. How do you know where to look? What is credible or up to date? Official documents are often published as uneditable pdf files for example – useless for analysis except in ways already done by the organisation itself.”

This discussion will be chaired by an expert panel (people I know) consisting of David McCandless of ‘Information is Beautiful’ fame, Heather Brooke of FOI fame, Simon Rogers of Guardian DataBlog fame and Richard Pope of ScraperWiki fame.

Data journalism: our five point guide – Simon Rogers

None of this is new – we have always needed to visualize data to make a point. There was a table in the Guardian in May 1821 – data has always been around and has always been needed to know the truth. If you don’t know what’s going on, how can you change things in society?

Now, public spending visualizations. Beautiful but a lot of work. But then government requests it. Now we all have the tools. A lot doesn’t even involve hard core programming. Need to be inspired by telling stories. Story needs to drive the editorial need to use data.

Only computers will know what to ask e.g. Wikileaks data. Technical skills and design needed but can be built upon. Not all data is interesting. Need to have a nose for data to learn what will be good for a data driven story. Raw data is just numbers without the design to make it beautiful.

It’s about sharing. Data needs to be made as open as possible! People out there have much better knowledge than journalists sitting in the office. We need to harness that knowledge.

Information is Beautiful – David McCandless

You need to see patterns and connections that matter in the data. That is data journalism. You need to orientate your audience, take them on a journey.

Data is abstract. You need to contextualize it to understand what it means. Need to make it relevant. If you make it beautiful/interesting everyone will love it. Looking at a graph of the most common break-up times according to Facebook.

We’re saturated with data. Data is the new soil. Visualizations are the earthy blossoms!

We are saturated by data but if we use the right journalistic inkling we can grow beautiful stories. Our fears visualized using Google Insights. Check it out at www.informationisbeautiful.net. Columbine shooting and violent video games co-dependent?

Data as a prism – use it to correct your vision. You can take all the other top ten military budgets and fit them into America’s. But it’s a vastly rich country: its economy can fit in the other four top economies. So military budget as % of GDP? Myanmar is the biggest. Biggest army = China. But as % of population = North Korea.
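The point about changing the denominator can be sketched in a few lines of Python. This is only an illustration: the three countries and every figure below are invented placeholders, not real statistics, and `top_by` is a hypothetical helper name. It shows how the "biggest" country can differ depending on whether you rank by absolute totals or by a share of GDP or population.

```python
# Hypothetical figures for illustration only; these are NOT real statistics.
# budget and gdp share one arbitrary unit; soldiers and population share another.
countries = {
    "Country A": {"budget": 600.0, "gdp": 15000.0, "soldiers": 1.5, "population": 300.0},
    "Country B": {"budget": 100.0, "gdp": 5000.0, "soldiers": 2.2, "population": 1300.0},
    "Country C": {"budget": 5.0, "gdp": 20.0, "soldiers": 1.1, "population": 24.0},
}


def top_by(metric):
    """Return the name of the country with the highest value of metric(row)."""
    return max(countries, key=lambda name: metric(countries[name]))


# Absolute rankings.
biggest_budget = top_by(lambda r: r["budget"])
biggest_army = top_by(lambda r: r["soldiers"])

# The same data, normalised by the size of the economy or the population.
biggest_share_of_gdp = top_by(lambda r: r["budget"] / r["gdp"])
biggest_army_per_capita = top_by(lambda r: r["soldiers"] / r["population"])

print("Biggest budget:", biggest_budget)
print("Biggest budget as % of GDP:", biggest_share_of_gdp)
print("Biggest army:", biggest_army)
print("Biggest army as % of population:", biggest_army_per_capita)
```

With these made-up numbers the absolute and the normalised rankings name different countries, which is the "prism" effect described in the talk.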

The internet is a visualization design medium. We’ve been drenched in it. We’re constantly hunting for patterns in a sea of information. We’ve all been trained by our use of the web. We’re all information curious.

Heather Brooke

“The only way I could get answers to my questions to public bodies was through data”. Police in her local area were not turning up, and she wanted to know whether it was just her. The only way to tell was through officials’ logs, not their word.

Once you ask, data starts trickling out. But it needed around 50 requests! And it came in the form of a complex spreadsheet, riddled with factual inaccuracies. Data is only as good and usable as the person who gathers/inputs it. “The public can’t be trusted with the raw data” was the attitude she got from public bodies. Need Freedom of Information Act.

Open data needs to start from the top – MPs expenses. A democratic state has a right to openness. We need true open data.

MPs expenses shifted everyone’s notion of who the government were actually working for. MPs felt their expenses were their data, not ours.

Simon Jefferies

Different structured forms are needed for different data. The structure gives it power. Data within data within context. Very rich stories. A new way of journalism. Allow users to interrogate data themselves. Information architecture!

You have to be sure your fact is right!

Richard Pope – ScraperWiki

Data is rarely useable for journalists. Data is rarely collected with journalists or the public interest in mind. ScraperWiki wants to make data useable and collaborative.

A blending of skills is needed to do data journalism. We need to democratise these skills to break a story.

These are early days but we can see that journalism is changing. A computer is another tool. When a journalist makes a call it’s not called ‘telephone-assisted-reporting’. It’s not new; we just need to learn to use more and more data. And we need to understand it.

This will not be a specialised area, it will just be reporting! It all comes down to asking the right questions.
