

Journalism Data Camp NY potential data sets

Here is a review of some of the datasets that have been submitted for the Columbia Journalism Data Camp this Friday.

This list is only a backup in case people don’t bring enough ideas of their own on the day (this never happens, but it’s always a fear).

1. Iowa accident reports

The site http://accidentreports.iowa.gov contains all the police roadside reports of accidents. It’s easy to scrape because the database IDs are consecutive numbers, so every report can be reached by counting through them.


And it contains thousands of rinky-dink diagrams of the incidents.

The first step is to copy all the HTML from each page into one database. The second step is to scan through all these pages and progressively extract more and more data from them.
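The two steps above can be sketched roughly as follows. This is only an illustration: the `fetch` function is a stand-in supplied by the caller (the real report URL pattern is not given here, so none is assumed), the table names are made up, and the page title is used as an example of a first field to extract.

```python
import re
import sqlite3


def store_raw_pages(conn, fetch, first_id, last_id):
    """Step 1: save the unparsed HTML for each consecutive report id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages "
        "(report_id INTEGER PRIMARY KEY, html TEXT)"
    )
    for report_id in range(first_id, last_id + 1):
        html = fetch(report_id)  # hypothetical: wraps the real HTTP request
        if html is None:  # no page for this id, so skip it
            continue
        conn.execute(
            "INSERT OR REPLACE INTO raw_pages VALUES (?, ?)", (report_id, html)
        )
    conn.commit()


def extract_titles(conn):
    """Step 2: re-read the stored HTML and pull out one field (the <title>)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports "
        "(report_id INTEGER PRIMARY KEY, title TEXT)"
    )
    rows = conn.execute("SELECT report_id, html FROM raw_pages").fetchall()
    for report_id, html in rows:
        match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        title = match.group(1).strip() if match else None
        conn.execute(
            "INSERT OR REPLACE INTO reports VALUES (?, ?)", (report_id, title)
        )
    conn.commit()


def fake_fetch(report_id):
    # Stand-in for a real download; pretends that report 3 does not exist.
    if report_id == 3:
        return None
    return "<html><title>Report %d</title></html>" % report_id


conn = sqlite3.connect(":memory:")
store_raw_pages(conn, fake_fetch, 1, 4)
extract_titles(conn)
titles = conn.execute(
    "SELECT report_id, title FROM reports ORDER BY report_id"
).fetchall()
print(titles)
```

Keeping the raw HTML means the second step can be rerun with new extraction rules without downloading anything again.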

Contrast this with the dataset of accidents available for the UK.

2. South Dakota state budget information

An apparently complete set of expenditures, contracts and revenues is disclosed on http://open.sd.gov/ in a form that is easy to scrape (some datasets even allow CSV download). Many states do this, with varying degrees of success.

Use this case to learn how to restructure and analyse financial accounting flow information. Can you find any contracts that have suddenly been dropped in favour of another supplier?
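One possible way to look for dropped suppliers, once the payments are in rows, is to compare who was paid for each service from one year to the next. The field names `service`, `year` and `supplier` below are assumptions for illustration and are not taken from the open.sd.gov downloads; the sample rows are invented.

```python
from collections import defaultdict


def find_supplier_switches(payments):
    """Return (service, year, old_supplier, new_supplier) for each change."""
    # Map each service to the set of suppliers paid in each year.
    by_service = defaultdict(lambda: defaultdict(set))
    for row in payments:
        by_service[row["service"]][int(row["year"])].add(row["supplier"])

    switches = []
    for service, years in sorted(by_service.items()):
        ordered = sorted(years)
        # Compare each year with the next year that has data for this service.
        for prev_year, year in zip(ordered, ordered[1:]):
            dropped = years[prev_year] - years[year]
            added = years[year] - years[prev_year]
            for old in sorted(dropped):
                for new in sorted(added):
                    switches.append((service, year, old, new))
    return switches


# Invented sample rows; real ones could come from csv.DictReader.
payments = [
    {"service": "Road salt", "year": "2010", "supplier": "Acme"},
    {"service": "Road salt", "year": "2011", "supplier": "Bolt"},
    {"service": "Printing", "year": "2010", "supplier": "Inky"},
    {"service": "Printing", "year": "2011", "supplier": "Inky"},
]
switches = find_supplier_switches(payments)
print(switches)
```

A result here is only a lead to check by hand, since a supplier can also disappear because a contract simply ended.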

3. New York School budgets

The site schools.nyc.gov/AboutUs/funding/schoolbudgets/GalaxyAllocationFY2012 requires a school code. Try “M411”.

Apparently there is this spreadsheet of school codes.

Is there anything interesting to plot across all schools, such as the PSAL SNAPPLE FUNDS?

4. New York Lobbying registers

Lobbying at the state and city level. Some of this is challenging.

Is there a cross-over between the jurisdictions? Can you uniquely identify the corporate interests and relate them to the legislative or regulatory program?
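A first rough pass at the cross-over question could normalise the client names from each register and intersect them. The suffix list and the sample names below are assumptions for illustration; real registers will need more careful matching than this.

```python
import re

# Common company suffixes to ignore when comparing names (an assumed list).
SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company",
            "llc", "ltd", "lp"}


def normalize_name(name):
    """Lower-case, strip punctuation and drop trailing company suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def crossover(state_clients, city_clients):
    """Return the normalised names that appear in both registers."""
    state = {normalize_name(name) for name in state_clients}
    city = {normalize_name(name) for name in city_clients}
    return sorted((state & city) - {""})


# Invented names standing in for rows from the two registers.
state_clients = ["Example Holdings, Inc.", "Sample Widgets Co."]
city_clients = ["EXAMPLE HOLDINGS INC", "Other Services LLC"]
shared = crossover(state_clients, city_clients)
print(shared)
```

Exact matching on normalised names will miss subsidiaries and trade names, so it does not by itself uniquely identify a corporate interest.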

5. Court case information

Go to https://www.dccourts.gov/cco/ and try searching for “Lockheed”. It is not obvious where the information is.

The New York City courts are behind a captcha. Maybe better luck with the New York State courts.

Court datasets are usually very difficult to obtain and jealously protected. The legal process resists modernization and is universally paper-based. Electronic documents (contracts, settlements, filings) almost always turn out to be image scans of paper.

6. New York City Police crime data

There are weekly PDFs for each police precinct. These are taken down and replaced by the next one, so there is no historical record.

Luckily someone has scraped the data since 2010, though the numbers may need some processing before you map them.
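Because each weekly PDF is replaced by the next one, building a historical record means saving every copy as it appears. The sketch below shows one way to do that, assuming the download itself happens elsewhere; the table layout and the precinct label are made up for illustration.

```python
import hashlib
import sqlite3


def archive_snapshot(conn, precinct, fetched_on, pdf_bytes):
    """Store the PDF unless an identical copy for this precinct is saved.

    Returns True if a new snapshot was stored, False if it was a duplicate.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots "
        "(precinct TEXT, sha256 TEXT, fetched_on TEXT, pdf BLOB, "
        "PRIMARY KEY (precinct, sha256))"
    )
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO snapshots VALUES (?, ?, ?, ?)",
        (precinct, digest, fetched_on, pdf_bytes),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
results = [
    archive_snapshot(conn, "001", "2012-01-30", b"week 1 pdf"),
    archive_snapshot(conn, "001", "2012-01-31", b"week 1 pdf"),  # unchanged
    archive_snapshot(conn, "001", "2012-02-06", b"week 2 pdf"),
]
print(results)
```

Hashing the file contents means the scraper can run daily and still keep only one copy of each weekly report.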

7. New York State gas drilling permits

These are available but don’t seem to have been updated recently. What’s going on?

Wouldn’t it be nice to make another twitterbot to be friends with NorthSeaOil1?

Don’t forget to read the Well ownership transfers.

2 Responses to “Journalism Data Camp NY potential data sets”

  1. pallih February 3, 2012 at 8:44 pm #

    A while back I wrote a scraper for accident reports in Iceland:


