Journalism Data Camp NY potential data sets
Here is a review of some of the datasets that have been submitted for the Columbia Journalism Data Camp this Friday.
This list is only a backup in case people don’t turn up with enough ideas on the day (it never happens, but it’s always a fear).
1. Iowa accident reports
The site http://accidentreports.iowa.gov contains all the police roadside reports of accidents. It’s easy to scrape because the database ids are consecutive numbers:
http://accidentreports.iowa.gov/index.php?pgname=IDOT_IOR_MV_Accident_details&id=50070
And it contains thousands of rinky-dink diagrams of the incidents.
The first step is to copy the raw HTML of every page into one database. The second step is to scan through the stored pages and progressively extract more and more data from them.
Contrast this with the dataset of accidents available for the UK.
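The first step can be sketched roughly as follows. This is only a minimal outline, assuming the ids really are consecutive and that a local SQLite file is an acceptable "one database"; the table name raw_pages, the function names and the one-second pause are my own choices, not anything the site prescribes.

```python
import sqlite3
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://accidentreports.iowa.gov/index.php"
PAGE_NAME = "IDOT_IOR_MV_Accident_details"


def report_url(report_id):
    """Build the detail-page URL for one database id."""
    return BASE_URL + "?" + urlencode({"pgname": PAGE_NAME, "id": report_id})


def fetch_html(url):
    """Download one page as text, replacing any bytes that fail to decode."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def save_raw_pages(conn, ids, fetch=fetch_html, pause_seconds=1.0):
    """Store the untouched HTML for each id; ids already stored are skipped.

    `raw_pages` is a hypothetical table name. Keeping the raw HTML means the
    extraction step can be rerun later without hitting the site again.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_pages "
        "(id INTEGER PRIMARY KEY, url TEXT, html TEXT)"
    )
    saved = 0
    for report_id in ids:
        row = conn.execute(
            "SELECT 1 FROM raw_pages WHERE id = ?", (report_id,)
        ).fetchone()
        if row is not None:
            continue  # already downloaded on an earlier run
        url = report_url(report_id)
        html = fetch(url)
        conn.execute(
            "INSERT INTO raw_pages (id, url, html) VALUES (?, ?, ?)",
            (report_id, url, html),
        )
        conn.commit()
        saved += 1
        time.sleep(pause_seconds)  # be gentle with the server
    return saved
```

Something like save_raw_pages(sqlite3.connect("iowa.db"), range(50070, 50080)) would fetch ten pages and can be rerun safely, since stored ids are skipped. The second step (pulling fields out of the HTML) then works entirely from the database.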
2. South Dakota state budget information
An apparently complete set of expenditures, contracts and revenues is disclosed on http://open.sd.gov/ in a form that is easy to scrape (some datasets even allow CSV download). Many states do this, with varying degrees of success.
Use this case to learn how to restructure and analyse accounting data on financial flows. Can you find any contracts that have suddenly been dropped in favour of another supplier?
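One way to go looking for dropped suppliers, once the payments are in rows, is sketched below. The field names (agency, category, year, vendor, amount) are assumptions of mine; the real column names on open.sd.gov will differ and need mapping first. The rule used here is deliberately crude: flag a budget line when its biggest vendor changes from one year to the next and the old vendor receives nothing at all in the later year.

```python
from collections import defaultdict


def find_supplier_switches(payments):
    """Return (line, old_year, new_year, old_vendor, new_vendor) tuples.

    `payments` is an iterable of dicts with the hypothetical keys
    'agency', 'category', 'year', 'vendor' and 'amount'.
    """
    # (agency, category) -> year -> vendor -> total paid
    per_line = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for p in payments:
        line = (p["agency"], p["category"])
        per_line[line][int(p["year"])][p["vendor"]] += float(p["amount"])

    switches = []
    for line in sorted(per_line):
        years = per_line[line]
        ordered = sorted(years)
        # Compare each year with the next year for which we have data.
        for prev, curr in zip(ordered, ordered[1:]):
            old_top = max(years[prev], key=years[prev].get)
            new_top = max(years[curr], key=years[curr].get)
            # Only flag a clean break: the old top vendor has vanished.
            if old_top != new_top and old_top not in years[curr]:
                switches.append((line, prev, curr, old_top, new_top))
    return switches
```

Anything this turns up is a lead to check by hand, not a finding: vendors get renamed, merged or recoded all the time.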
3. New York School budgets
The site schools.nyc.gov/AboutUs/funding/schoolbudgets/GalaxyAllocationFY2012 requires a school code. Try “M411”.
Apparently there is this spreadsheet of school codes.
Is there anything interesting to plot across all schools, such as the PSAL SNAPPLE FUNDS?
4. New York Lobbying registers
Lobbying at the state and city level. Some of this is challenging.
Is there a cross-over between the jurisdictions? Can you uniquely identify the corporate interests and relate them to the legislative or regulatory program?
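A first stab at the cross-over question might look like the sketch below. It assumes you already have two plain lists of client names, one per register, and it matches them on a crudely normalised form of the name. The suffix list and the function names are mine, and name matching like this will produce both false matches and misses, so treat the output as candidates to verify.

```python
import re

# Legal-form words stripped from the end of a name (an incomplete, assumed list).
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation",
    "co", "company", "llc", "llp", "ltd", "lp",
}


def normalise_name(name):
    """Lower-case, drop punctuation and trailing legal-form words."""
    text = name.lower().replace("&", " and ")
    words = re.sub(r"[^a-z0-9 ]+", " ", text).split()
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def cross_over(state_names, city_names):
    """Map each shared normalised name to the original spellings in each register."""
    def index(names):
        grouped = {}
        for name in names:
            grouped.setdefault(normalise_name(name), set()).add(name)
        return grouped

    state, city = index(state_names), index(city_names)
    shared = sorted(key for key in state.keys() & city.keys() if key)
    return {key: (sorted(state[key]), sorted(city[key])) for key in shared}
```

For example, "Acme Widgets, Inc." in one register and "ACME WIDGETS INCORPORATED" in the other both reduce to "acme widgets" and so come out as one candidate match.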
5. Court case information
Go to https://www.dccourts.gov/cco/ (try “Lockheed”). It’s not obvious where the information is.
The New York City courts are behind a captcha. Maybe better luck with the New York State courts.
Court datasets are usually very difficult to obtain and jealously protected. The legal process resists modernization and is universally paper based. Electronic documents (contracts, settlements, filings) almost always turn out to be image scans of papers.
6. New York City Police crime data
There are weekly PDFs for each police precinct. These are taken down and replaced by the next one, so there is no historical record.
Luckily someone has scraped the data since 2010, though the numbers may need some processing before you map them.
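For any source that overwrites its files like this, the cure is to keep your own dated copies from now on. Below is a minimal sketch of that idea, assuming you have already downloaded the PDF bytes by some other means; the folder layout, the file naming and the precinct label are all my own inventions.

```python
import hashlib
from datetime import date
from pathlib import Path


def archive_snapshot(pdf_bytes, archive_dir, precinct, today=None):
    """Save a dated copy of a PDF unless an identical copy is already archived.

    Returns the path written, or None when the content has not changed.
    """
    today = today or date.today()
    # A short content hash lets us spot an unchanged file whatever its date.
    digest = hashlib.sha256(pdf_bytes).hexdigest()[:12]
    folder = Path(archive_dir) / f"precinct-{precinct}"
    folder.mkdir(parents=True, exist_ok=True)
    if any(folder.glob(f"*-{digest}.pdf")):
        return None
    target = folder / f"{today.isoformat()}-{digest}.pdf"
    target.write_bytes(pdf_bytes)
    return target
```

Run on a schedule, this builds up exactly the historical record the official site throws away, and the dates in the filenames show when each version was first seen.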
7. New York State gas drilling permits
These are available but don’t seem to have been updated recently. What’s going on?
Wouldn’t it be nice to make another twitterbot to be friends with NorthSeaOil1?
Don’t forget to read the Well ownership transfers.
A while back I wrote a scraper for accident reports in Iceland:
https://scraperwiki.com/scrapers/umferdarslys/