html – ScraperWiki
Extract tables from PDFs and scrape the web
Tue, 09 Aug 2016 06:10:13 +0000

5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories)
Fri, 25 May 2012 07:13:33 +0000

It’s 9:54am in Torquay on a Wednesday morning:

One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head.

On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head.

Crews used vaseline and the finger kit to remove the seat from the childs head to leave her uninjured.

A couple of different interests directed me to scrape the latest incidents of the Devon and Somerset Fire and Rescue Service. The scraper that has collected the data is here.

Why does this matter?

Everybody loves their public safety workers — Police, Fire, and Ambulance. They save lives, give comfort, and are there when things get out of hand.

Where is the standardized performance data for these incident response workers? Real-time, rich data would revolutionize their governance and administration. It would give real evidence of whether there are too many or too few police, fire or ambulance personnel, vehicles or stations in any locale, and it would enable imaginative and realistic policies that deliver major efficiency and resilience improvements all through the system.

For those of you who want to skip all the background discussion, just head directly over to the visualization.

A rose diagram showing incidents handled by the Devon and Somerset Fire Service

The easiest method to monitor the needs of the organizations is to see how much work each employee is doing, and add or take away staff depending on their workloads. The problem is that an emergency service that exists on standby for unforeseen events needs a level of idle capacity in the system. Also, there will be a degree of unproductive make-work in any organization. Indeed, a lot of form filling currently happens around the place, despite there being no accessible data at the end of it.

The second easiest method of oversight is to compare one area with another. I have an example from California City Finance, where the Excel spreadsheet of fire spending by city even has a breakdown of the spending per capita and as a percentage of the total city budget. The city to look at is Vallejo, which entered bankruptcy in 2008. Many of its citizens blamed this on the exorbitant salaries and benefits of its firefighters and police officers. I can’t quite see it in this data, and the journalism on the story doesn’t provide an unequivocal picture.

The best method for determining the efficient and robust provision of such services is to have an accurate and comprehensive computer model on which to run simulations of the business and experiment with different strategies. This is what Tesco or Walmart or any large corporation would do in order to drive up its efficiency and monitor and deal with threats to its business. There is bound to be a dashboard in Tesco HQ monitoring the distribution of full fat milk across the country, and they would know to three decimal places what percentage of the product was being poured down the drain because it got past its sell-by date, and, conversely, whenever too little of the substance had been delivered such that stocks ran out. They would use the data to work out what circumstances caused changes in demand. For example, school holidays.

I have surveyed many of the documents within the Devon & Somerset Fire & Rescue Authority website, and have come up with no evidence of such data or its analysis anywhere within the organization. This is quite a surprise, and perhaps I haven’t looked hard enough, because the documents are extremely boring and strikingly irrelevant.

Under the hood – how it all works

The scraper itself has gone through several iterations. It currently operates through three functions: MainIndex(), MainDetails(), MainParse(). Data for each incident is put into several tables joined by the IncidentID value derived from the incident’s static url, eg:

MainIndex() operates their incident search form, grabbing 10 days at a time and saving the URL of each individual incident page into the table swdata.

MainDetails() downloads each of those incident pages, parses the obvious metadata, and saves the remaining HTML content of the description into the database. (This used to attempt to parse the text, but I then had to move that into the third function so I could develop it more easily.) A good way to find the list of URLs that have not yet been downloaded and saved into the swdetails table is to use the following SQL statement:

select swdata.IncidentID, swdata.urlpage 
from swdata 
left join swdetails on swdetails.IncidentID=swdata.IncidentID 
where swdetails.IncidentID is null 
limit 5

We then download the HTML from each of the five urlpages, save it into the table under the column divdetails and repeat until no more unmatched records are retrieved.
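As a rough sketch of that loop (not the scraper's actual code), the version below uses Python's standard sqlite3 module in place of the scraperwiki library. The fetch_html() helper, the example.org URLs and the two-column table layout are all assumptions made so the example can run on its own.

```python
import sqlite3

def fetch_html(url):
    # Hypothetical stand-in for the real HTTP download of an incident page
    return "<div>details for %s</div>" % url

def download_missing(conn, fetch=fetch_html, batch_size=5):
    """Fetch pages in small batches until no swdata row lacks a swdetails row."""
    query = """
        select swdata.IncidentID, swdata.urlpage
        from swdata
        left join swdetails on swdetails.IncidentID = swdata.IncidentID
        where swdetails.IncidentID is null
        limit ?
    """
    total = 0
    while True:
        rows = conn.execute(query, (batch_size,)).fetchall()
        if not rows:
            break  # every index row now has a matching details row
        for incident_id, urlpage in rows:
            conn.execute(
                "insert into swdetails (IncidentID, divdetails) values (?, ?)",
                (incident_id, fetch(urlpage)),
            )
        conn.commit()
        total += len(rows)
    return total

# Tiny in-memory demonstration with twelve made-up incidents
conn = sqlite3.connect(":memory:")
conn.execute("create table swdata (IncidentID text, urlpage text)")
conn.execute("create table swdetails (IncidentID text, divdetails text)")
conn.executemany(
    "insert into swdata values (?, ?)",
    [(str(i), "http://example.org/incident/%d" % i) for i in range(12)],
)
downloaded = download_missing(conn)
remaining = conn.execute(
    "select count(*) from swdata left join swdetails "
    "on swdetails.IncidentID = swdata.IncidentID "
    "where swdetails.IncidentID is null"
).fetchone()[0]
```

Because the query only ever returns rows that have no match yet, the loop can be interrupted and restarted at any point without downloading anything twice.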

MainParse() performs the same progressive operation on the HTML contents of divdetails, saving the results into the table swparse. Because I was developing this function experimentally to see how much information I could obtain from the free-form text, I frequently had to drop the table and recreate just enough of it for the join command to work:

scraperwiki.sqlite.execute("drop table if exists swparse")
scraperwiki.sqlite.execute("create table if not exists swparse (IncidentID text)")

After marking the text down (by replacing the <p> tags with linefeeds), we have text that reads like this (emphasis added):

One appliance from Holsworthy was mobilised to reports of a motorbike on fire. Crew Commander Squirrell was in charge.

On arrival one motorbike was discovered well alight. One hose reel was used to extinguish the fire. The police were also in attendance at this incident.

We can get who is in charge and what their rank is using this regular expression:

re.findall(r"(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)", details, re.I)

You can see the whole table here including silly names, misspellings, and clear flaws within my regular expression such as not being able to handle the case of a first name and a last name being included. (The personnel misspellings suggest that either these incident reports are not integrated with their actual incident logs where you would expect persons to be identified with their codenumbers, or their record keeping is terrible.)
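Here is a minimal sketch of how that pattern might be wrapped in a helper, with the whitespace and word classes written as \s and \w and the case-insensitive flag passed as an argument. The find_officers() name and the dictionary keys are my own inventions for illustration. It still has the single-word-name limitation mentioned above.

```python
import re

# Rank level, role, then one word (letters, digits, hyphens) taken as the name
RANK_RE = re.compile(
    r"(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)",
    re.IGNORECASE,
)

def find_officers(details):
    """Return one dict per rank mention found in the incident text."""
    return [
        {"level": level.lower(), "role": role.lower(), "name": name}
        for level, role, name in RANK_RE.findall(details)
    ]

sample = (
    "One appliance from Holsworthy was mobilised to reports of a motorbike on fire. "
    "Crew Commander Squirrell was in charge."
)
officers = find_officers(sample)
```

On the Holsworthy text this picks out a crew commander named Squirrell.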

For detecting how many vehicles were in attendance, I used this algorithm:

import re

# details is the marked-down incident text; data is the dict of fields for this incident
appliances = re.findall(r"(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)(?: from ([A-Za-z]+))?", details, re.I)
nvehicles = 0
for scount, fire, engine, town in appliances:
    # Remember the first station town mentioned
    if town and "town" not in data:
        data["town"] = town.lower()
    # Turn the word before the vehicle noun into a number (0 if unrecognised)
    if re.match("one|1|an?|another", scount, re.I):  count = 1
    elif re.match("two|2", scount, re.I):            count = 2
    elif re.match("three", scount, re.I):            count = 3
    elif re.match("four", scount, re.I):             count = 4
    else:                                            count = 0
    nvehicles += count
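The same idea can be sketched as a self-contained function. This is a variation rather than the code the scraper runs: it swaps the chain of re.match calls for an exact dictionary lookup, so a word such as "additional" no longer counts as one vehicle just because it starts with "a". The count_vehicles() name and the second sample sentence are made up for illustration.

```python
import re

APPLIANCE_RE = re.compile(
    r"(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)"
    r"(?: from ([A-Za-z]+))?",
    re.IGNORECASE,
)

# Words that may precede the vehicle noun, mapped to a vehicle count
COUNT_WORDS = {
    "one": 1, "1": 1, "a": 1, "an": 1, "another": 1,
    "two": 2, "2": 2, "three": 3, "four": 4,
}

def count_vehicles(details):
    """Return (number of vehicles, first station town mentioned or None)."""
    nvehicles = 0
    town = None
    for scount, _kind, _noun, from_town in APPLIANCE_RE.findall(details):
        if from_town and town is None:
            town = from_town.lower()
        # Unrecognised words contribute nothing to the total
        nvehicles += COUNT_WORDS.get(scount.lower(), 0)
    return nvehicles, town

single = count_vehicles(
    "One appliance from Holsworthy was mobilised to reports of a motorbike on fire."
)
several = count_vehicles(
    "Two fire appliances from Torquay and one rescue tender from Paignton were mobilised."
)
```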

And now onto the visualization

It’s not good enough to have the data. You need to do something with it. See it and explore it.

For some reason I decided that I wanted to graph the hour of the day each incident took place, and produced this time rose, which is a polar bar graph with one sector showing the number of incidents occurring each hour.

You can filter by the day of the week, the number of vehicles involved, the category, year, and fire station town. Then click on one of the sectors to see all the incidents for that hour, and click on an incident to read its description.
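The counting behind a time rose is simple enough to sketch. The example below assumes each incident carries a timestamp string in "YYYY-MM-DD HH:MM" form, which may not be how the scraper stores it, and the sample times are invented.

```python
from collections import Counter
from datetime import datetime

def incidents_per_hour(timestamps):
    """Return a list of 24 counts, one per hour of the day (index 0 = midnight)."""
    counts = Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M").hour for ts in timestamps
    )
    return [counts.get(hour, 0) for hour in range(24)]

# Invented sample timestamps
sample_times = ["2012-05-23 09:54", "2012-05-23 09:10", "2012-05-24 17:30"]
sectors = incidents_per_hour(sample_times)
```

Each of the 24 values becomes the length of one sector of the polar bar graph. Filtering by weekday or station would just mean narrowing the list of timestamps before counting.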

Now, if we matched our stations against the list of all stations, and geolocated the incident locations using the Google Maps API (subject to not going OVER_QUERY_LIMIT), then we would be able to plot a map of how far the appliances were driving to respond to each incident. Even better, I could post the start and end locations into the Google Directions API, and get journey times and an idea of which roads and junctions are the most critical.

There’s more. What if we could identify when the response did not come from the closest station, because it was over capacity? What if we could test whether closing down or expanding one of the other stations would improve the performance in response to the database of times, places and severities of each incident? What if each journey time was logged to find where the road traffic bottlenecks are? How about cross-referencing the fire service logs for each incident with the equivalent logs held by the police and ambulance services, to identify the Total Response Cover for the whole incident? That information is otherwise balkanized and duplicated among the three historically independent services.
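To give a flavour of the first of those questions, here is a small sketch that flags an incident where the responding station was not the nearest one. It uses straight-line (haversine) distance as a crude stand-in for the road journey times a directions service would give, and the station coordinates are rough illustrative values, not surveyed locations.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def closest_station(stations, incident):
    """Return the name of the station nearest to the incident location."""
    return min(stations, key=lambda name: haversine_km(*stations[name], *incident))

# Rough, illustrative coordinates only
stations = {
    "torquay": (50.46, -3.53),
    "paignton": (50.43, -3.56),
    "exeter": (50.72, -3.53),
}
incident = (50.47, -3.52)   # hypothetical incident location
responding = "paignton"     # hypothetical station named in the report

nearest = closest_station(stations, incident)
not_from_closest = responding != nearest
```

Run over the whole database, a check like this would pick out the incidents worth a closer look, although real journey times would change the answer wherever the roads are indirect.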

Sometimes it’s also enlightening to see what doesn’t appear in your datasets. In this case, one incident I was specifically looking for strangely doesn’t appear in these Devon and Somerset Fire logs: On 17 March 2011 the Police, Fire and Ambulance were all mobilized in massive numbers towards Goatchurch Cavern – but the Mendip Cave Rescue service only heard about it via the Avon and Somerset Cliff Rescue. Surprise surprise, the event’s missing from my Fire logs database. No one knows anything of what is going on. And while we’re at it, why are they separate organizations anyway?

Next up, someone else can do the Cornwall Fire and Rescue Service and see if they can get their incident search form to work.

Scraping guides: Parsing HTML using CSS selectors
Mon, 03 Oct 2011 22:20:58 +0000

We’ve added a new scraping copy-and-paste guide, so you can quickly get the lines of code you need to parse an HTML file using CSS selectors. Get to it from the documentation page:

The HTML parsing guide is available in Ruby, Python and PHP. Just as with all documentation, you can choose which at the top right of the page.

While the library used varies (lxml in Python, Nokogiri in Ruby, Simple HTML DOM in PHP), the principle is the same. You pull the text out of the page in the same way as you use CSS to style a page.

It’s a popular technique – for example, around 30% of Python scrapers on ScraperWiki use lxml.

Quickly get an HTML table
Wed, 06 Jul 2011 10:37:17 +0000

As I said before, Julian sometimes adds simple things to ScraperWiki, and nobody even notices them.

He obviously learnt, as he ticketed in BitBucket that we needed to release this one.

The External API can output data in several formats. jsondict is the standard JSON format, using a dictionary for each row. jsonlist is an alternative JSON format, which has a simple array for each row. csv is for loading into spreadsheets, or for when you just want a simple row-based format.
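To make the difference concrete, here is a sketch of the three shapes produced from the same rows. The rows are invented, and the exact envelope around jsonlist (a "keys" list alongside the "data" arrays) is an assumption; the API's real output may differ in detail.

```python
import csv
import io
import json

# Invented rows standing in for a datastore query result
rows = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

def as_jsondict(rows):
    """One dictionary per row, so column names repeat on every row."""
    return json.dumps(rows)

def as_jsonlist(rows):
    """Column names once, then one plain array per row (assumed envelope)."""
    keys = list(rows[0]) if rows else []
    return json.dumps({"keys": keys, "data": [[row[k] for k in keys] for row in rows]})

def as_csv(rows):
    """Header line followed by one comma-separated line per row."""
    keys = list(rows[0]) if rows else []
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(keys)
    writer.writerows([row[k] for k in keys] for row in rows)
    return buf.getvalue()
```

The array-per-row form is more compact on large results because the column names are not repeated, while the dictionary form is easier to consume row by row.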

The new one is called htmltable. When you use it in the API explorer you get a convenient, easy-to-read table. You can also open the API URL directly in your browser.

Here are a couple of examples from Nike’s innovative GreenXChange database (scraped here), which is a place for large companies to share patents that improve sustainability.

This API result page shows how many entries there are in the database for each contributing company. And this is all the innovations about gloves. Notice that both pages are just URLs from the API, and contain editable SQL queries.
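Since the SQL lives in the URL, building such a link is just a matter of URL-encoding the query. In the sketch below the base address is a placeholder, and the parameter names (format, name, query), the scraper name and the company column are all assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholder base address, not the real API endpoint
API_BASE = "https://example.org/api/datastore/sqlite"

def api_url(scraper_name, sql, fmt="htmltable"):
    """Build a shareable API link with the SQL query encoded in the URL."""
    return API_BASE + "?" + urlencode({"format": fmt, "name": scraper_name, "query": sql})

url = api_url(
    "example_scraper",  # hypothetical scraper name
    "select company, count(*) from swdata group by company",
)

# Decoding the link again shows the query survives the round trip
params = parse_qs(urlparse(url).query)
```

Anyone who receives the link can edit the SQL in the address bar and reload to get a different table.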

Job advert: Product / UX lover
Mon, 14 Feb 2011 15:35:38 +0000

ScraperWiki is a Silicon Valley style startup, but based in the UK. We’re changing the world of open data, and how programming is done together on the Internet.

We’re looking for a web product designer who is…

  • Able to make design decisions to launch features by themselves.
  • Capable of writing CSS and HTML, and some Javascript.

Other bits…

  • Loves to balance colour, size, order and prominence on websites.
  • Knows what a web scraper is, and would like to learn to write one.
  • Thinks that data can change the world, but only if we use it right.
  • Either good at working remotely, or willing to relocate to the North West.
  • Desirable – able to make igloos.

To apply – send the following:

  • An example of a website you’ve made that you’re proud of
  • If you have one, a visualisation you’ve made of some data (any data!)
  • Oh, and I guess we’d better see your CV

Along to with the word swjob2 in the subject.

Job advert: Web designer/programmer
Wed, 05 Jan 2011 11:29:30 +0000

Care about oil spills, newspapers or lost cats?

ScraperWiki is a Silicon Valley style startup, but in the North West of England, in Liverpool. We’re changing the world of open data, and how programming is done together on the Internet.

We’re looking for a web designer/programmer who is…

  • Capable of writing standards compliant CSS and HTML, and some Javascript.
  • Loves to balance colour, size, order and prominence on websites.
  • Good enough at Photoshop to make any mockups and icons required.
  • Likes to talk to and track users, and then do what’s needed to make their experience better.
  • Server-side coding (Python) a plus but not essential.
  • Knows what a web scraper is, and would like to learn to write one.
  • Thinks that data can change the world, but only if we use it right.
  • Desirable – able to make igloos.

Some practical things…

  • We’re early stage, spending our seed funding. So be aware things will go either way – we’ll crash and burn, or you’ll be a key, senior person in a growing company.
  • We’d like this to end up a permanent position, but if you prefer we’re happy to do individual contracts to start with.
  • Must be willing to either relocate to Liverpool, or able to work from home and travel here regularly (once a week). So somewhere nearby preferred.

To apply – send the following:

  • An example of a website you’ve made that you’re proud of
  • If you have one, a visualisation of some data (any data!)

Along to with the word swjob1 in the subject.