Julian Todd – ScraperWiki
https://blog.scraperwiki.com

My time at the Autocloud
https://blog.scraperwiki.com/2013/01/my-time-at-the-autocloud/
Mon, 21 Jan 2013 09:20:01 +0000

The global CADCAM behemoth known as Autodesk hoovers up another small company every two weeks, a process unlikely to diminish following a $750 million bond issue last month. (Well, what else are they going to do with that money?)

It was only a matter of time before this happened to me on account of my machine tooling software — shortly before I took delivery of my corporate coloured hang-glider. What could go wrong?

Not only are corporations not people, they are also not like governments whose citizens have a right of access to information. There is very little data to go round, and no one except the too-busy-with-everything-else CEO has the authority to request it. So all we can do is webscrape the press releases.

I wonder what time of day they get put out?

SELECT substr(date, 12, 5) as time, count(*) as c
FROM parsed GROUP BY time ORDER BY c desc


Clearly the most popular times are 8am, 8:30am and 9am Eastern Standard Time, exactly on the half-hour marks, not just sometime between 8 and 9am. You can imagine people in the office on the day having a very firm idea about when is the right time to put it out for the benefit of New York Wall Street casinos so that they have as little information as possible when they start placing bets on the share price.
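The query above relies on the timestamp being a fixed-width string, so that the five characters starting at position 12 hold the hour and minute. Here is a minimal sketch of the same grouping against an in-memory SQLite table. The sample rows and the ISO-style date format are assumptions for illustration, not the scraped data:

```python
import sqlite3

# Hypothetical sample rows; the real `parsed` table holds the scraped press releases.
rows = [
    ("2012-12-05T08:30:00",),
    ("2012-12-06T08:30:00",),
    ("2012-12-07T09:00:00",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parsed (date TEXT)")
conn.executemany("INSERT INTO parsed VALUES (?)", rows)

# substr() is 1-indexed in SQLite: position 12, length 5 gives "HH:MM".
times = conn.execute(
    "SELECT substr(date, 12, 5) AS time, count(*) AS c "
    "FROM parsed GROUP BY time ORDER BY c DESC"
).fetchall()
print(times)
```

If the scraped dates were stored in a different layout, the two numbers passed to substr() would need to change to match.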

In big companies it’s all about the finance.

Let’s look at an item from the more detailed SEC financial filings, where my 20 years of efforts in the industry have been filed under “Other”, in favour of these schmancy far-less-work-to-build internet thingies, Vela, Socialcam and Qontext. Humph.


Basically, it’s clear to me that nobody in the primary finance system is using data or quantitative financial models to do their instant trading in response to the disclosures — or else these press releases and quarterly reports would look a whole lot different. They would be machine readable, wouldn’t they? With unique company number identifiers corresponding to structured assets of statistically approximated business models held by the investment houses that would enable the financial analysts to (a) instantly merge the known assets of the taken over company into the purchasing company, and (b) automatically value the sale-price assets of the taken over company in relation to its peers in the same market in the microseconds necessary to respond to the disclosure for the purpose of rapidly dumping or acquiring stock until the price is statistically close to the calculated fair value.

I’m a garbage-in garbage-out kind of guy. If the information ain’t effectively getting into the finance system, it ain’t going to emerge from it. The traders may be rich from shifting around your pension money for a fee, but they do not have psi powers. And their determinations are as informed as the position of Mozart on the weekly top of the charts.

Meanwhile, I have graphed the occurrences of the word “cloud” in their press releases:


It’s rapidly approaching 100% of press releases. The company says it’s moving into the internet technology.
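The percentage behind a graph like this can be computed in a few lines. The sketch below assumes the press releases are available as (year, text) pairs, which is a hypothetical shape rather than the scraper's actual table layout:

```python
import re
from collections import defaultdict

def cloud_fraction_by_year(releases):
    """releases: iterable of (year, text) pairs; returns {year: fraction mentioning 'cloud'}."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for year, text in releases:
        total[year] += 1
        if re.search(r"\bcloud\b", text, re.IGNORECASE):
            hits[year] += 1
    return {year: hits[year] / total[year] for year in total}

# Made-up example texts, not real press releases.
sample = [
    (2009, "New release of the design suite"),
    (2012, "Design suite moves to the Cloud"),
    (2012, "Quarterly results announced"),
]
fractions = cloud_fraction_by_year(sample)
print(fractions)
```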

I was then trying to do some analysis of the About Autodesk boilerplate section of each press release, as it evolved from this:

Autodesk, Inc. is a Fortune 1000 company, wholly focused on ensuring that great ideas are turned into reality. With seven million users, Autodesk is the world’s leading software and services company for the manufacturing, infrastructure, building, media and entertainment, and wireless data services fields. Autodesk’s solutions help customers create, manage and share their data and digital assets more effectively. As a result, customers turn ideas into competitive advantage, become more productive, streamline project efficiency and maximize profits.

to this

Autodesk, Inc. is a leader in 3D design, engineering and entertainment software. Customers across the manufacturing, architecture, building, construction, and media and entertainment industries — including the last 17 Academy Award winners for Best Visual Effects — use Autodesk software to design, visualize and simulate their ideas. Since its introduction of AutoCAD software in 1982, Autodesk continues to develop the broadest portfolio of state-of-the-art software for global markets. For additional information about Autodesk, visit www.autodesk.com.

… but there were too many typos in it, even though you would have thought that it would have been cut-and-pasted from the previous press release almost every single day!

When you think about it, the press release stream is actually just a very poor quality blog specifically designed to be plagiarized by over-worked journalists. I did a detailed analysis of one single press release here.

The company also hosts about 40 proper blogs, which make more interesting reading.

This large corporate adventure has given me a different perspective. At Scraperwiki I was used to making datasets from aggregations of corporate data (finance statements, government contracts, product recalls, etc), but I never really felt it was real. Behind them there are stories and deals taking place and lives turning upside down at the stroke of a CEO’s pen. How do we know if the data that has been captured contains the essence of what is going on, or if it has been carefully cleansed for the purpose of disclosing nothing of any significance to the stock market for the first half hour of the day? For all we know these datasets could be as significant as a list of sock colours for showing what is going on out there.

On-line directory tree webscraping
https://blog.scraperwiki.com/2012/09/on-line-directory-tree-webscraping/
Fri, 14 Sep 2012 16:01:39 +0000

As you surf around the internet (particularly in the old days) you may have seen web-pages like this:

or this:

The former image is generated by Apache SVN server, and the latter is the plain directory view generated for UserDir on Apache.

In both cases you have a very primitive page that allows you to surf up and down the directory tree of the resource (either the SVN repository or a directory file system) and select links to resources that correspond to particular files.

Now, a file system can be thought of as a simple key-value store for these resources burdened by an awkward set of conventions for listing the keys where you keep being obstructed by the ‘/‘ character.

My objective is to provide a module that makes it easy to iterate through these directory trees and produce a flat table with the following helpful entries:

Fields: abspath, fname, name, ext, svnrepo, rev, url

abspath: /Charterhouse/2010/PotOxbow/PocketTopoFiles/
fname:   PocketTopoFiles/
name:    PocketTopoFile
ext:     /
svnrepo: CheddarCatchment
rev:     19
url:     http://www.cave-registry.org.uk/svn/CheddarCatchment/Charterhouse/2010/PotOxbow/PocketTopoFiles/

abspath: /mmmmc/rawscans/MossdaleCaverns/PolishedFloor-drawnup1of3.jpg
fname:   PolishedFloor-drawnup1of3.jpg
name:    PolishedFloor-drawnup1of3
ext:     .jpg
svnrepo: Yorkshire
rev:     2383
url:     http://cave-registry.org.uk/svn/Yorkshire/mmmmc/rawscans/MossdaleCaverns/PolishedFloor-drawnup1of3.jpg

Although there is clearly redundant data between the fields url, abspath, fname, name, ext, having them in there makes it much easier to build a useful front end.
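As an illustration of how the redundant columns relate to one another, here is a minimal sketch that derives fname, name and ext from abspath for a file entry. The function name path_fields is hypothetical and this is not the scraper's own code. Directory entries (paths ending in a slash) would need separate handling:

```python
import posixpath

def path_fields(abspath):
    """Split an absolute repository path into the fname, name and ext columns."""
    fname = posixpath.basename(abspath)
    name, ext = posixpath.splitext(fname)
    return {"abspath": abspath, "fname": fname, "name": name, "ext": ext}

row = path_fields("/mmmmc/rawscans/MossdaleCaverns/PolishedFloor-drawnup1of3.jpg")
print(row)
```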

The function code (which I won’t copy in here) is at https://scraperwiki.com/scrapers/apache_directory_tree_extractor/. This contains the functions ParseSVNRevPage(url) and ParseSVNRevPageTree(url), both of which return dicts of the form:

{'url', 'rev', 'dirname', 'svnrepo', 
 'contents':[{'url', 'abspath', 'fname', 'name', 'ext'}]}

I haven’t written the code for parsing the Apache Directory view yet, but for now we have something we can use.

I scraped the UK Cave Data Registry with this scraper which simply applies the ParseSVNRevPageTree() function to each of the links and glues the output into a flat array before saving it:

lrdata = ParseSVNRevPageTree(href)
ldata = [ ]
for cres in lrdata["contents"]:
    # copy the repository-level fields onto each file record
    cres["svnrepo"], cres["rev"] = lrdata["svnrepo"], lrdata["rev"]
    ldata.append(cres)
scraperwiki.sqlite.save(["svnrepo", "rev", "abspath"], ldata)

Now that we have a large table of links, we can make the cave image file viewer based on the query:
select abspath, url, svnrepo from swdata where ext='.jpg' order by abspath limit 500

By clicking on a reference to a jpg resource on the left, you can preview what it looks like on the right.

If you want to know why the page is muddy, a video of the conditions in which the data was gathered is here.

Image files are usually the most immediately interesting out of any unknown file system dump. And they can be made more interesting by associating meta-data with them (given that there is no convention for including interesting information in the EXIF sections of their file formats). This meta-data might be floating around in other files dumped into the same repository, eg in the form of links to them from html pages which relate to picture captions.

But that is a future scraping project for another time.

Three hundred thousand tonnes of gold
https://blog.scraperwiki.com/2012/07/tonnes-of-gold/
Wed, 04 Jul 2012 20:17:18 +0000

On 2 July 2012, the US Government debt to the penny was quoted at $15,888,741,858,820.66. So I wrote this scraper to read the daily US government debt for every day back to 1996. Unfortunately such a large number has more digits than the double precision floating point notation in the database displays, so this same number gets expressed as 1.58887418588e+13.
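One plausible explanation for the shortened figure (an assumption, not something verified against the ScraperWiki database) is that the value was printed with twelve significant digits, not that the double itself lost the pennies. A short sketch of that effect, plus one way of keeping the exact amount:

```python
from decimal import Decimal

debt = 15888741858820.66

# Twelve significant digits reproduces the shortened form quoted above.
short_form = "%.12g" % debt
print(short_form)

# The double itself still round-trips through its full repr.
assert float(repr(debt)) == debt

# Parsing the quoted text with Decimal and storing whole cents keeps every penny.
exact = Decimal("15888741858820.66")
cents = int(exact * 100)
print(cents)
```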

Doesn’t matter for now. Let’s look at the graph over time:

It’s not very exciting, unless you happen to be interested in phrases such as “debasing our fiat currency” and “return to the gold standard”. In truth, one should really divide the values by the GDP, or the national population, or the cumulative inflation over the time period to scale it properly.

Nevertheless, I decided also to look at the gold price, which can be seen as a graph (click the [Graph] button, then [Group Column (x-axis)]: “date” and [Group Column (y-axis)]: “price”) on the Data Hub. They give this dataset the title: Gold Prices in London 1950-2008 (Monthly).

Why does the data stop in 2008 just when things start to get interesting?

I discovered a download url in the metadata for this dataset:


which is somewhere within the github™ as part of the repository https://github.com/datasets/gold-prices in which there resides a 60 line Python scraper known as process.py.

Aha, something I can work with! I cut-and-pasted the code into ScraperWiki as scrapers/gold_prices and tried to run it. Obviously it didn’t work as-is — code always requires some fiddling about when it is transplanted into an alien environment. The module contained three functions: download(), extract() and upload().

The download() function didn’t work because it tries to pull from the broken link:


This is one of the unavoidable failures that can befall a webscraper, and was one of the motivations for hosting code in a wiki so that such problems can be trivially corrected without an hour of labour checking out the code in someone else’s favourite version control system, setting up the environment, trying to install all the dependent modules, and usually failing to get it to work if you happen to use Windows like me.

After some looking around on the Bundesbank website, I found the Time_series_databases (Click on [Open all] and search for “gold”.) There’s Yearly average, Monthly average and Daily rates. Clearly the latter is the one to go for as the other rates are averages and likely to be derivations of the primary day rate value.

I wonder what a “Data basket” is.

Anyways, moving on. Taking the first CSV link and inserting it into that process.py code hits a snag in the extract() function:

downloaded = 'cache/bbk_WU5500.csv'
outpath = 'data/data.csv'
def extract():
    reader = csv.reader(open(downloaded))
    # trim junk from files
    newrows = [ [row[0], row[1]] for row in list(reader)[5:-1] ]

    existing = []
    if os.path.exists(outpath):
        existing = [ row for row in csv.reader(open(outpath)) ]

    starter = newrows[0]
    for idx,row in enumerate(existing):
        if row[0] == starter[0]:
            del existing[idx:]

    # and now add in new data
    outrows = existing + newrows
    csv.writer(open(outpath, 'w')).writerows(outrows)

ScraperWiki doesn’t have persistent files, and in this case they’re not helpful anyway, because all these lines of code are basically replicating the features of scraperwiki.sqlite.save(). The whole merge can be replaced by the following two lines:

    ldata = [ { "date":row[0], "value":float(row[1]) }  for row in newrows  if row[1] != '.' ]
    scraperwiki.sqlite.save(["date"], ldata)
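To show why a keyed save makes the file-merging logic unnecessary, here is a rough sketch of an insert-or-replace save written against the standard sqlite3 module. It is not the ScraperWiki implementation. The save() helper and the prices are hypothetical, and the column names are interpolated directly into the SQL, so it is only suitable for trusted keys:

```python
import sqlite3

def save(conn, unique_keys, rows, table="swdata"):
    """Insert rows (dicts with identical keys), replacing any row that matches on unique_keys."""
    columns = list(rows[0].keys())
    conn.execute(
        "CREATE TABLE IF NOT EXISTS %s (%s, UNIQUE (%s))"
        % (table, ", ".join(columns), ", ".join(unique_keys))
    )
    conn.executemany(
        "INSERT OR REPLACE INTO %s (%s) VALUES (%s)"
        % (table, ", ".join(columns), ", ".join("?" for _ in columns)),
        [tuple(row[c] for c in columns) for row in rows],
    )

conn = sqlite3.connect(":memory:")
save(conn, ["date"], [{"date": "2012-07-02", "value": 1598.0}])   # illustrative prices
# Saving the same date again overwrites the earlier row instead of duplicating it.
save(conn, ["date"], [{"date": "2012-07-02", "value": 1600.5},
                      {"date": "2012-07-03", "value": 1617.0}])
saved = conn.execute("SELECT date, value FROM swdata ORDER BY date").fetchall()
print(saved)
```

Re-running the scraper over an overlapping date range is therefore harmless, which is what the hand-written merge in extract() was trying to achieve.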

And now your always-up-to-date gold price graph is yours to have at the cost of select date, value from swdata order by date –> google annotatedtimeline.

But back to the naked github disclosed code. Without its own convenient database.save feature, this script must use its own upload() function.

def upload():
    import datastore.client as c
    dsurl = 'http://datahub.io/dataset/gold-prices/resource/b9aae52b-b082-4159-b46f-7bb9c158d013'
    client = c.DataStoreClient(dsurl)

Ah, we have another problem: a dependency on the undeclared datastore.client library, which was probably so seamlessly available to the author on his own computer that he didn’t notice its requirement when he committed the code to the github where it could not be reused without this library. The library datastore.client is not available in the github/datasets account; but you can find it in the completely different github/okfn account.

I tried calling this client.py code by cut-and-pasting it into the ScraperWiki scraper, and it did something strange that looked like it was uploading the data to somewhere, but I can’t work out what’s happened. Not to worry. I’m sure someone will let me know what happened when they find a dataset somewhere that is inexplicably more up to date than it used to be.

But back to the point. Using the awesome power of our genuine data-hub system we can take the us_debt_to_the_penny, and attach the gold_prices database to perform the combined query to scale ounces of gold into tonnes:

    AS debt_gold_tonnes
FROM swdata AS debt
LEFT JOIN gold_prices.swdata as gold
  ON gold.date = debt.date
WHERE gold.date is not null
ORDER BY debt.date

and get the graph of US government debt expressed in terms of tonnes of gold.
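The conversion behind such a query can be sketched with two in-memory tables. The column name totaldebt, the gold price of 1600 and the second debt row are illustrative assumptions. The sketch also assumes the gold price is in US dollars per troy ounce:

```python
import sqlite3

TROY_OUNCES_PER_TONNE = 32150.7  # 1,000,000 g per tonne / 31.1035 g per troy ounce

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE debt (date TEXT, totaldebt REAL)")  # hypothetical column names
conn.execute("CREATE TABLE gold (date TEXT, value REAL)")      # USD per troy ounce (assumed)
conn.execute("INSERT INTO debt VALUES ('2012-07-02', 15888741858820.66)")
conn.execute("INSERT INTO debt VALUES ('2012-07-04', 15880000000000.0)")  # made-up row, no gold price
conn.execute("INSERT INTO gold VALUES ('2012-07-02', 1600.0)")            # illustrative price

# Dollars / (dollars per ounce) = ounces; ounces / (ounces per tonne) = tonnes.
rows = conn.execute(
    "SELECT debt.date, debt.totaldebt / gold.value / ? AS debt_gold_tonnes "
    "FROM debt LEFT JOIN gold ON gold.date = debt.date "
    "WHERE gold.date IS NOT NULL ORDER BY debt.date",
    (TROY_OUNCES_PER_TONNE,),
).fetchall()
print(rows)
```

Days without a gold price (weekends and holidays on one side or the other) drop out through the `IS NOT NULL` filter.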

So that looks like good news for all the gold-bugs: the US government debt in the hard currency of gold has been going steadily down by a factor of two since 2001 to around 280 thousand tonnes. The only problem with that is that there’s only 164 thousand tonnes of gold in the world according to the latest estimates.

Other fun charts people find interesting such as gold to oil ratio can be done once the relevant data series is loaded and made available for joining.

PDF table extraction of paginated table
https://blog.scraperwiki.com/2012/06/pdf-table-extraction-of-a-table/
Mon, 25 Jun 2012 11:27:32 +0000

Got PDFs you want to get data from?
Try our web interface and API over at PDFTables.com!

The Isle of Man aircraft registry (in PDF form) has long been a target of mine waiting for the appropriate PDF parsing technology. The scraper is here.

Setting aside the GetPDF() function, which deals with copying out each new pdf file as it is updated and backing it up into the database as a base64 encoded binary blob for quicker access, let’s have a look at what the PDF itself looks like. I have snipped out the top left and top right hand corners of the document so you can see it more clearly.

I have selected some of the text in one of the rows (in blue). Notice how the rectangle containing the first row of text crosses the vertical line between two cells of the table. This can make it rather tricky and requires you to analyze it at the character level.

It is essential to use a PDF extracting tool that gives you access to those dividing lines between the cells of the table. The only one I have found that does it is pdfminer, which is a pdf interpreter that is entirely written in Python.

Picking out the dividing lines

Extracting the dividing lines of the table is an unusual requirement (most applications simply want the raw text), so for the moment it looks like quite a hack. Fortunately that’s not a problem in ScraperWiki, and we can access the lower level components of the pdfminer functionality by importing these classes:

import StringIO
from pdfminer.pdfparser import PDFParser, PDFDocument, PDFNoOutlines, PDFSyntaxError
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import (
  LAParams, LTTextBox, LTTextLine, LTFigure, LTImage, LTTextLineHorizontal,
  LTTextBoxHorizontal, LTChar, LTRect, LTLine, LTAnon)
from binascii import b2a_hex
from operator import itemgetter

Next you need some boilerplate code for running the PDFParser, which takes a file stream as input and breaks it down into separate page objects through a series of transformative steps. Internally the PDF file has an ascii part containing some special PDF commands which must be interpreted, followed by a binary blob which must be unzipped in order for its PostScript contents to be tokenized and then interpreted and rendered into the page space as if simulating a plotter.

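cin = StringIO.StringIO()
cin.write(pdfbin)   # pdfbin: raw bytes of the PDF (hypothetical name; this step is missing from the listing)
cin.seek(0)

parser = PDFParser(cin)
doc = PDFDocument()
parser.set_document(doc)   # link parser and document (older pdfminer API; reconstructed, not from the scraper)
doc.set_parser(parser)
doc.initialize("")

rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

for n, page in enumerate(doc.get_pages()):
    interpreter.process_page(page)   # render the page into the aggregator device
    layout = device.get_result()
    ParsePage(layout)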

We’re now going to look into the ParsePage(layout) function where all the rendered components from the postscript are in layout containers, of which the top level is the page. We want to pull out just the horizontal and vertical lines, as well as the text. Sometimes what looks like a horizontal or vertical line is actually a thin rectangle. It all depends on the way the software that generated the PDF chooses to draw it.

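# branch bodies below are a reconstruction from the surrounding description
xset, yset = set(), set()
tlines = [ ]
objstack = list(reversed(layout._objs))
while objstack:
    b = objstack.pop()
    if type(b) in [LTFigure, LTTextBox, LTTextLine, LTTextBoxHorizontal]:
        objstack.extend(reversed(b._objs))  # put contents of aggregate object into stack
    elif type(b) == LTTextLineHorizontal:
        tlines.append(b)                    # keep the text lines for slicing up later
    elif type(b) == LTLine:
        if b.x0 == b.x1:
            xset.add(b.x0)                  # vertical dividing line
        elif b.y0 == b.y1:
            yset.add(b.y0)                  # horizontal dividing line
        else:
            print "sloped line", b
    elif type(b) == LTRect:
        if b.x1 - b.x0 < 2.0:
            xset.add(b.x0)                  # thin rectangle standing in for a vertical line
        else:
            yset.add(b.y0)                  # otherwise treat it as a horizontal line
    else:
        assert False, "Unrecognized type: %s" % type(b)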

To make it clearer, I am writing this code differently to how it appears in the scraper. As you build up this function you will discover how the generator of the PDF document chose to lay it out. You will adapt the code based on this information.

This is what we do: we interactively build up the code to efficiently target one particular type of document — where “type” means a document series all rendered the same way. If we can generalize this to some nearby types, then all the better. What we do not attempt to do at this stage is develop a general purpose PDF table extractor that works on everything.

As programmers who have confidence in our intelligence and abilities, this is always our instinct. But it doesn’t work in this domain. There are too many exceptional cases and layout horrors in these types of files that you will always be caught out. What began as your beautifully functional table extracting system will keep coming back with failures and more failed cases over and over again that you hack and fix until it looks like a Christmas turkey at the end of a meal. Even in this extremely simple table there are problems you cannot see. For example:

The phrase “Kingdom of Saudi Arabia” exists if you copy and paste it to text, but it has managed to be truncated by the edge of the table. Why hasn’t it been wordwrapped like every other item of text in the table? I don’t know.

Slicing the text up between the lines

Moving on through the process, we now convert our dividing lines into ordered arrays of x and y values. We’re taking advantage of the fact that the table occupies the entirety of each page, mainly to keep this tutorial simple because you want to know the techniques, not the specifics of this particular example. There are other tricks that work for more haphazard tables.

xlist = sorted(list(xset))
ylist = sorted(list(yset))
# initialize the output array of table text boxes
boxes = [ [ [ ]  for xl in xlist ]  for yl in ylist ]

We also have the function which, given an x or y coordinate, returns the index of the dividing line we are between. (The boundary case of text going beyond the boundaries, like that “Kingdom of Saudi Arabia” example can be handled by discarding the last value in wlist.) This function uses a binary search.

def Wposition(wlist, w):
    # binary search for the index ilo with wlist[ilo] <= w < wlist[ilo+1]
    ilo, ihi = 0, len(wlist)
    while ilo < ihi -1:
        imid = (ilo + ihi) // 2
        if w < wlist[imid]:
            ihi = imid
        else:
            ilo = imid
    return ilo

Normally we aren’t very interested in efficiency and would have simply done this with a for-loop, like so:

def Wposition(wlist, w):
    # linear scan giving the same answer as the binary search
    for i, wl in enumerate(wlist):
        if w < wl:
            return max(i - 1, 0)
    return len(wlist) - 1

However, we’re going to need to call this function a lot of times (at least once for each character in the PDF file), so we’re likely to run out of CPU time.
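In current Python the same lookup can lean on the standard library's bisect module so the search loop need not be hand-written. The sketch below is an alternative to the function above, not the scraper's code. The names wposition_bisect and wposition_linear are made up for the comparison:

```python
from bisect import bisect_right

def wposition_bisect(wlist, w):
    """Index i of the last dividing line with wlist[i] <= w, clamped to the valid range."""
    i = bisect_right(wlist, w) - 1
    return min(max(i, 0), len(wlist) - 1)

def wposition_linear(wlist, w):
    """Reference implementation: a plain scan, O(n) per lookup."""
    for i, wl in enumerate(wlist):
        if w < wl:
            return max(i - 1, 0)
    return len(wlist) - 1

xlist = [10.0, 50.0, 120.0, 300.0]
# Check agreement on values below, on, between and beyond the dividing lines.
for w in [0.0, 10.0, 49.9, 50.0, 299.0, 300.0, 999.0]:
    assert wposition_bisect(xlist, w) == wposition_linear(xlist, w)
```

Both versions clamp out-of-range coordinates to the first or last index, which is the behaviour the boundary case mentioned earlier depends on.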

Here is the way I put the text from each text-line into the appropriate box, splitting up the characters when that text box happens to run across a vertical table boundary.

for lt in tlines:
    y = (lt.y0+lt.y1)/2
    iy = Wposition(ylist, y)
    previx = None
    for lct in lt:
        if type(lct) == LTAnon:
            continue  # a junk element in LTTextLineHorizontal
        x = (lct.x0+lct.x1)/2
        ix = Wposition(xlist, x)
        if previx != ix:
            boxes[iy][ix].append([])  # begin new chain of characters
            previx = ix
        boxes[iy][ix][-1].append(lct.get_text())  # add the character to the current chain

Now the boxes contain lists of lists of characters. For example, the first two boxes in our two dimensional array of boxes are going to look like:

boxes[0][0] = [['N', 'u', 'm', 'b', 'e', 'r']] 
boxes[0][1] = [['C', 'e', 'r', 't', 'i', 'f', 'i', 'c', 'a', 't', 'e'], ['n', 'u', 'm', 'b', 'e', 'r']]
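The splitting step can be tried out without a PDF by feeding in synthetic characters. In this sketch Char is a hypothetical stand-in for pdfminer's LTChar, and the coordinates are invented so that one text line straddles a vertical dividing line:

```python
from bisect import bisect_right
from collections import namedtuple

# Stand-in for pdfminer's LTChar: a character with its horizontal extent (hypothetical type).
Char = namedtuple("Char", "text x0 x1")

def split_line(chars, xlist):
    """Group the characters of one text line into {column index: [chains of characters]}."""
    cells = {}
    previx = None
    for ch in chars:
        x = (ch.x0 + ch.x1) / 2.0                 # classify by the character's midpoint
        ix = max(bisect_right(xlist, x) - 1, 0)   # column whose left edge is at or before x
        if ix != previx:
            cells.setdefault(ix, []).append([])   # begin a new chain in this cell
            previx = ix
        cells[ix][-1].append(ch.text)
    return cells

xlist = [0.0, 30.0, 60.0]   # vertical dividing lines
# "ABCD" laid out 10 units per character, straddling the line at x=30.
line = [Char(c, 5.0 + 10 * i, 15.0 + 10 * i) for i, c in enumerate("ABCD")]
cells = split_line(line, xlist)
print(cells)
```

Using the midpoint of each character means a glyph that overlaps a dividing line is assigned to whichever side holds most of it.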

The next stage is to collapse these boxes down and extract the headers row from the table to check we have it right.

for iy in range(len(ylist)):
    for ix in range(len(xlist)):
        boxes[iy][ix] = [ "".join(s)  for s in boxes[iy][ix] ]
headers = [ " ".join(lh.strip()  for lh in h).strip()  for h in boxes.pop() ]
assert headers == ['Number', 'Certificate number', 'Registration Mark', 'Date Registered', 'Manufacturer', 'Type', 'Serial Number', 'Mode S Code', 'Change of Reg. owner', 'Registered Owner', 'Article 3 (4) or (5)', 'Previous Registration', 'De-registered', 'New State of registry']

From here on we’re just interpreting a slightly dirty CSV file and there’s not much more to learn from it. Once a few more versions of the data come in I’ll be able to see what changes happen in this table across the months and maybe find something out from that.

The exploration and visualization of the data is equally important as (if not more than) the scraping, although it can only be done afterwards. However, one often runs out of steam having reached this stage and doesn’t get round to finishing the whole job. There’s always the question: “So why the heck did I start scraping this data in the first place?” … in this case, I have forgotten the answer! I think it’s because I perceived that the emerging global aristocracy tends to jet between their tax-havens, such as the Isle of Man, in private planes that were sometimes registered there.

There are disappointingly fewer registrations than I thought there were – about 360, and only 15 (in the table grouped by manufacturer) are Learjets. I should have estimated the size of the table first.

At least we’re left with some pretty efficient table-parsing code. Do you have any more interesting PDF tables you’d like to run through it..?

5 yr old goes ‘potty’ at Devon and Somerset Fire Service (Emergencies and Data Driven Stories)
https://blog.scraperwiki.com/2012/05/5-yr-old-goes-potty/
Fri, 25 May 2012 07:13:33 +0000

It’s 9:54am in Torquay on a Wednesday morning:

One appliance from Torquays fire station was mobilised to reports of a child with a potty seat stuck on its head.

On arrival an undistressed two year old female was discovered with a toilet seat stuck on her head.

Crews used vaseline and the finger kit to remove the seat from the childs head to leave her uninjured.

A couple of different interests directed me to scrape the latest incidents of the Devon and Somerset Fire and Rescue Service. The scraper that has collected the data is here.

Why does this matter?

Everybody loves their public safety workers — Police, Fire, and Ambulance. They save lives, give comfort, and are there when things get out of hand.

Where is the standardized performance data for these incident response workers? Real-time and rich data would revolutionize its governance and administration, would give real evidence of whether there are too many or too few police, fire or ambulance personnel/vehicles/stations in any locale, and would enable the implementation of imaginative and realistic policies resulting in major efficiency and resilience improvements all through the system.

For those of you who want to skip all the background discussion, just head directly over to the visualization.

A rose diagram showing incidents handled by the Devon and Somerset Fire Service

The easiest method to monitor the needs of the organizations is to see how much work each employee is doing, and add more or take away staff depending on their workloads. The problem is, for an emergency service that exists on standby for unforeseen events, there needs to be a level of idle capacity in the system. Also, there will be a degree of unproductive make-work in any organization. Indeed, a lot of form filling currently happens around the place, despite there being no accessible data at the end of it.

The second easiest method of oversight is to compare one area with another. I have an example from California City Finance where the Excel spreadsheet of Fire Spending By city even has a breakdown of the spending per capita and as a percentage of the total city budget. The city to look at is Vallejo which entered bankruptcy in 2008. Many of its citizens blamed this on the exorbitant salaries and benefits of its firefighters and police officers. I can’t quite see it in this data, and the story journalism on it doesn’t provide an unequivocal picture.

The best method for determining the efficient and robust provision of such services is to have an accurate and comprehensive computer model on which to run simulations of the business and experiment with different strategies. This is what Tesco or Walmart or any large corporation would do in order to drive up its efficiency and monitor and deal with threats to its business. There is bound to be a dashboard in Tesco HQ monitoring the distribution of full fat milk across the country, and they would know to three decimal places what percentage of the product was being poured down the drain because it got past its sell-by date, and, conversely, whenever too little of the substance had been delivered such that stocks ran out. They would use the data to work out what circumstances caused changes in demand. For example, school holidays.

I have surveyed many of the documents within the Devon & Somerset Fire & Rescue Authority website, and have come up with no evidence of such data or its analysis anywhere within the organization. This is quite a surprise, and perhaps I haven’t looked hard enough, because the documents are extremely boring and strikingly irrelevant.

Under the hood – how it all works

The scraper itself has gone through several iterations. It currently operates through three functions: MainIndex(), MainDetails(), MainParse(). Data for each incident is put into several tables joined by the IncidentID value derived from the incident’s static url, eg:


MainIndex() operates their search incidents form grabbing 10 days at a time and saving URLs for each individual incident page into the table swdata.

MainDetails() downloads each of those incident pages, parsing the obvious metadata, and saving the remaining HTML content of the description into the database. (This used to attempt to parse the text, but I then had to move it into the third function so I could develop it more easily.) A good way to find the list of urls that have not been downloaded and saved into the swdetails is to use the following SQL statement:

select swdata.IncidentID, swdata.urlpage 
from swdata 
left join swdetails on swdetails.IncidentID=swdata.IncidentID 
where swdetails.IncidentID is null 
limit 5

We then download the HTML from each of the five urlpages, save it into the table under the column divdetails and repeat until no more unmatched records are retrieved.
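The fetch-a-batch-and-repeat loop described above can be sketched against an in-memory database. The fetch_html() placeholder and the example.org URLs are stand-ins for the real download step and incident pages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (IncidentID TEXT, urlpage TEXT)")
conn.execute("CREATE TABLE swdetails (IncidentID TEXT, divdetails TEXT)")
conn.executemany(
    "INSERT INTO swdata VALUES (?, ?)",
    [(str(i), "http://example.org/incident/%d" % i) for i in range(12)],  # placeholder URLs
)

def fetch_html(url):
    """Placeholder for the real download step."""
    return "<p>details for %s</p>" % url

PENDING = (
    "SELECT swdata.IncidentID, swdata.urlpage FROM swdata "
    "LEFT JOIN swdetails ON swdetails.IncidentID = swdata.IncidentID "
    "WHERE swdetails.IncidentID IS NULL LIMIT 5"
)

batches = 0
while True:
    pending = conn.execute(PENDING).fetchall()
    if not pending:
        break   # every index row now has a matching details row
    for incident_id, urlpage in pending:
        conn.execute("INSERT INTO swdetails VALUES (?, ?)", (incident_id, fetch_html(urlpage)))
    batches += 1

done = conn.execute("SELECT count(*) FROM swdetails").fetchone()[0]
print(batches, done)
```

Because the query only ever returns rows that are still missing, the loop can be interrupted and restarted without keeping any separate record of progress.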

MainParse() performs the same progressive operation on the HTML contents of divdetails, saving it into the table swparse. Because I was developing this function experimentally to see how much information I could obtain from the free-form text, I had to frequently drop and recreate enough of the table for the join command to work:

scraperwiki.sqlite.execute("drop table if exists swparse")
scraperwiki.sqlite.execute("create table if not exists swparse (IncidentID text)")

After marking the text down (by replacing the <p> tags with linefeeds), we have text that reads like this (emphasis added):

One appliance from Holsworthy was mobilised to reports of a motorbike on fire. Crew Commander Squirrell was in charge.

On arrival one motorbike was discovered well alight. One hose reel was used to extinguish the fire. The police were also in attendance at this incident.

We can get who is in charge and what their rank is using this regular expression:

re.findall(r"(?i)(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)", details)
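As a quick check of the rank-and-name pattern, here is a small sketch that runs it over the sample report quoted above. The whitespace and word classes are written as \s and \w, and the case-insensitive flag is passed to re.compile instead of being embedded in the pattern. The name RANK_RE is made up:

```python
import re

RANK_RE = re.compile(
    r"(crew|watch|station|group|incident|area)\s+(commander|manager)\s*([\w-]+)",
    re.IGNORECASE,
)

details = ("One appliance from Holsworthy was mobilised to reports of a motorbike on fire. "
           "Crew Commander Squirrell was in charge.")
officers = RANK_RE.findall(details)
print(officers)
```

Only a single word is captured after the rank, so a first name followed by a surname would yield just the first name.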

You can see the whole table here including silly names, misspellings, and clear flaws within my regular expression such as not being able to handle the case of a first name and a last name being included. (The personnel misspellings suggest that either these incident reports are not integrated with their actual incident logs where you would expect persons to be identified with their codenumbers, or their record keeping is terrible.)

For detecting how many vehicles were in attendance, I used this algorithm:

lappliances = re.findall(r"(?i)(\S+) (?:(fire|rescue) )?(appliances?|engines?|tenders?|vehicles?)(?: from ([A-Za-z]+))?", details)
nvehicles = 0
for scount, fire, engine, town in lappliances:
    if town and "town" not in data:
        data["town"] = town.lower()
    if re.match("(?i)one|1|an?|another", scount):  count = 1
    elif re.match("(?i)two|2", scount):            count = 2
    elif re.match("(?i)three", scount):            count = 3
    elif re.match("(?i)four", scount):             count = 4
    else:                                          count = 0
    nvehicles += count

And now onto the visualization

It’s not good enough to have the data. You need to do something with it. See it and explore it.

For some reason I decided that I wanted to graph the hour of the day each incident took place, and produced this time rose, which is a polar bar graph with one sector showing the number of incidents occurring each hour.
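The numbers behind such a time rose are just 24 counts, one per hour. A minimal sketch, assuming the incident times are available as "YYYY-MM-DD HH:MM" strings (the real column format may differ, and the four timestamps here are invented):

```python
from collections import Counter
from datetime import datetime

# hypothetical incident timestamps; the real ones would come from the scraped table
timestamps = ["2011-03-17 08:15", "2011-03-17 08:50",
              "2011-03-18 23:05", "2011-03-19 08:01"]

def incidents_per_hour(stamps):
    # one count per hour of the day, in order from 0 to 23
    counts = Counter(datetime.strptime(s, "%Y-%m-%d %H:%M").hour for s in stamps)
    return [counts.get(h, 0) for h in range(24)]

sectors = incidents_per_hour(timestamps)
```

Each entry of sectors then gives the radius of one sector of the polar bar graph.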

You can filter by the day of the week, the number of vehicles involved, the category, year, and fire station town. Then click on one of the sectors to see all the incidents for that hour, and click on an incident to read its description.

Now, if we matched our stations against the list of all stations, and geolocated the incident locations using the Google Maps API (subject to not going OVER_QUERY_LIMIT), then we would be able to plot a map of how far the appliances were driving to respond to each incident. Even better, I could post the start and end locations into the Google Directions API, and get journey times and an idea of which roads and junctions are the most critical.

There’s more. What if we could identify when the response did not come from the closest station, because it was over capacity? What if we could test whether closing down or expanding one of the other stations would improve the performance in response to the database of times, places and severities of each incident? What if each journey time was logged to find where the road traffic bottlenecks are? How about cross-referencing the fire service logs for each incident with the equivalent logs held by the police and ambulance services, to identify the Total Response Cover for the whole incident – information that’s otherwise balkanized and duplicated among the three different historically independent services.

Sometimes it’s also enlightening to see what doesn’t appear in your datasets. In this case, one incident I was specifically looking for strangely doesn’t appear in these Devon and Somerset Fire logs: On 17 March 2011 the Police, Fire and Ambulance were all mobilized in massive numbers towards Goatchurch Cavern – but the Mendip Cave Rescue service only heard about it via the Avon and Somerset Cliff Rescue. Surprise surprise, the event’s missing from my Fire logs database. No one knows anything of what is going on. And while we’re at it, why are they separate organizations anyway?

Next up, someone else can do the Cornwall Fire and Rescue Service and see if they can get their incident search form to work.

Fine set of graphs at the Office of National Statistics https://blog.scraperwiki.com/2012/03/fine-set-of-graphs-at-the-office-of-national-statistics/ Thu, 22 Mar 2012 11:47:01 +0000 http://blog.scraperwiki.com/?p=758216643 It’s difficult to keep up. I’ve just noticed a set of interesting interactive graphs over at the Office of National Statistics (UK).

If the world is about people, then the most fundamental dataset of all must be: Where are the people? And: What stage of life are they living through?

A Population Pyramid is a straightforward way to visualize the data, like so:

This image is sufficient for determining what needs to be supplied (eg more children means more schools and toy-shops), but it doesn’t explain why.

The “why?” and “what’s going on?” questions are much more interesting, but are pretty much guesswork because they refer to layers in the data that you cannot see. For example, the number of people in East Devon of a particular age is the sum of those who have moved into the area at various times, minus those who have moved away (temporarily or permanently), plus those who were already there and have grown older but not yet died. For any bulge, you don’t know which layer it belongs to.

In this 2015 population pyramid there are bulges at 28, 50 and a pronounced spike at 68, as well as dips at 14 and 38. In terms of birth years, these correspond to 1987, 1965 and 1947 (spike), and dips at 2001 and 1977.

You can pretend they correspond to recessions, economic boom times and second wave feminism, but the 1947 post-war spike when a mass of men-folk were demobilized from the military is a pretty clean signal.

What makes this data presentation especially lovely is that it is localized, so you can see the population pyramid per city:

Cambridge, as everyone knows, is a university town, which explains the persistent spike at the age 20.

And, while it looks like there is gender equality for 20 year old university students, there is a pretty hefty male lump up to the age of 30, possibly corresponding to folks doing higher degrees. Is this because fewer men are leaving town at the appropriate age to become productive members of society, or is there an influx of foreign grad students from places where there is less gender equality? The data set of student origins and enrollments would give you the story.

As to the pyramid on the right hand side, I have no idea what is going on in Camden to account for that bulge in 30 year olds. What is obvious, though, is that the bulge in infants must be related. In fact, almost all the children between the ages of 0 and 16 years will have corresponding parents higher up the same pyramid. Also, there is likely to be a pairwise cross-gender correspondence between individuals of the same generation living together.

These internal links, external data connections, sub-cohorts and new questions raised the more you look at it mean that it is impossible to create a single all-purpose visualization application that could serve all of these. We can wonder as to whether an interface which worked via javascript-generated SQL calls (rather than flash and server-side queries) would have enabled someone with the right skills to roll their own queries and, for example, immediately find out which city and age group has the greatest gender disparity, and whether all spikes at the 20-year-old age bracket can be accounted for by universities.

For more, see An overview of ONS’s population statistics.

As it is, someone is going to have to download/scrape, parse and load at least one year of source data into a data hub of their choice in order to query this (we’ve started on 2010’s figures here on ScraperWiki – take a look). Once that’s done, you’d be able to sort the cities by the greatest ratio between number of 20 year olds and number of 16 year olds, because that’s a good signal of student influx.
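Once the figures are loaded, the student-influx ranking is a single GROUP BY. The sketch below assumes a hypothetical population table with columns (city, age, people); the four rows are invented purely to show the shape of the query, not real ONS figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table population (city text, age integer, people integer)")
# made-up numbers, for illustration only
conn.executemany("insert into population values (?, ?, ?)", [
    ("Cambridge", 16, 1000), ("Cambridge", 20, 3000),
    ("East Devon", 16, 1500), ("East Devon", 20, 1200),
])

# ratio of 20 year olds to 16 year olds per city, highest first
rows = conn.execute("""
    select city,
           1.0 * sum(people * (age = 20)) / sum(people * (age = 16)) as ratio
    from population
    where age in (16, 20)
    group by city
    order by ratio desc""").fetchall()
```

A ratio well above 1 would suggest a city that imports 20 year olds, which is what you would expect of a university town.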

I don’t have time to get onto the Population projection models, where it really gets interesting. There you have all the clever calculations based on guesstimates of migration, mortality and fertility.

What I would really like to see are these calculations done live and interactively, as well as combined with economic data. Is the state pension system going to go bankrupt because of the “baby boomers”? Who knows? I know someone who doesn’t know: someone whose opinion does not rely (even indirectly) on something approaching a dynamic data calculation. I mean, if the difference between solvency and bankruptcy is within the margin of error in the estimate of fertility rate, or 0.2% in the tax base, then that’s not what I’d call bankrupt. You can only find this out by tinkering with the inputs with an element of curiosity.

Privatized pensions ought to be put into the model as well, to give them the macro-economic context that no pension adviser I’ve ever known seems capable of understanding. I mean, it’s evident that the stock market (in which private pensions invest) does happen to yield a finite quantity of profit each year. Ergo it can support a finite number of pension plans. So a national policy which demands more such pension plans than this finite number is inevitably going to leave people hungry.

Always keep in mind the long term vision of data and governance. In the future it will all come together like transport planning, or the procurement of adequate rocket fuel to launch a satellite into orbit; a matter of measurements and predictable consequences. Then governance will be a science, like chemistry, or the prediction of earthquakes.

But don’t forget: we can’t do anything without first getting the raw data into a usable format. Dave McKee’s started on 2010’s data here … fancy helping out?

The Data Hob https://blog.scraperwiki.com/2012/03/the-data-hob/ https://blog.scraperwiki.com/2012/03/the-data-hob/#comments Tue, 06 Mar 2012 03:59:55 +0000 http://blog.scraperwiki.com/?p=758216528 Keeping with the baking metaphor, a hob is a projection or shelf at the back or side of a fireplace used for keeping food warm. The central part of a wheel into which the spokes are inserted looks kind of like a hob, and is called the hub (etymology).

Lately there has been a move to refer to certain websites as data hubs.

But what does this mean?

In transport terminology, cities such as Chicago or Paris are known as the hubs of their rail networks because all the lines run out from them to all over their respective countries, like the spokes of a wheel.

Back in the virtual world, the Open Knowledge Foundation has decided to rebrand their CKAN system as the Data Hub, describing it as a “community-run catalogue of useful sets of data on the Internet”. It also contains 3290 datasets that you can “browse, learn about and download”.

At the same time, and apparently entirely independently, Microsoft have a prototype service called Data Hub. It’s billed as an “online service for data discovery, distribution and curation”. I can’t tell how many datasets it has, as nothing is visible unless you ask them if you can register.

What about the other word? What does data mean?

I do have a working definition of “data” — it is a representation which allows for aggregations. That is, a set of information on which you can perform the SQL GROUP BY operation.

There are other crucial SQL operations, such as the WHERE and ORDER BY clauses, but these don’t creatively transform things in the way that the GROUP BY operation does.

If we go back to the UN peacekeepers database (which I talked at length about here), each record in the swdata table is:

 (country text, mission text, people integer, month text)

We can find the number of people from each country sent to all the missions in the month of June 2010 using the following query:

SELECT swdata.country,
       sum(people) as tpeople
FROM swdata
WHERE month='2010-06'
GROUP BY country
ORDER BY tpeople desc

The ORDER BY clause gives the highest numbers first:

country tpeople
Pakistan 10692
Bangladesh 10641
India 8920
Nigeria 5732
Egypt 5461
Nepal 5148
Ghana 3748
Rwanda 3654
Jordan 3599
Uruguay 2566

Of course it’s no surprise that Pakistan is on this table above Belgium, it’s got 17 times the population. What we need to see here is the per capita table.

This obviously requires the table of populations per country. Unfortunately there wasn’t one on Scraperwiki yet, so I had to create one from the Wikipedia article.

Now I would have loved to have derived it from the editable source of the Wikipedia article, as I described elsewhere, but that is impossible to do because it is insanely programmatic:

|2||align=left|{{IND}}||1,210,193,422||March 1, 2011
||{{#expr: 1210193000/{{worldpop}}*100 round 2}}%
|3||align=left|{{USA}}||{{data United States of America|poptoday 1}}
|16||align=left|{{EGY}}||{{formatnum: {{#expr: (79602+4.738*{{Age in days|2011|1|1}}) round 0}}000}}
||{{#expr: (79602000+4738*{{Age in days|2011|1|1}})/{{worldpop}}*100 round 2}}%

As you can see, some rows contain the numbers properly, some have the numbers transcluded from a different place, and some are expressed in terms of a mathematical formula.

The Wikipedia template programming language is dauntingly sophisticated, except that it cannot produce the ranking numbers automatically, over 200 of which would have needed to be retyped following the creation of South Sudan. Oh, and the table has to be updated in over 30 different languages.

Country identifications are tricky because they change (like Sudan) and do have a lot of different spellings.

(One thing I have never understood is why countries have different names in different languages. I mean, something like the train system around San Francisco is called BART, short for Bay Area Rapid Transit, and everyone calls it that, because that is the name it has given itself. Even the Spanish call it Bart. So why do they insist on calling the US, the United States, Estados Unidos? What does that shorten to? The EU?)

My population scraper should really have been derived in some way from the UNstats demographics data. To make it match up with the country spellings in the peacekeeper data I had to add some extra alternative spellings:

countryalts = { u"C\xf4te d'Ivoire":["Cote d Ivoire"],
    "United States":["United States of America"],
    "Democratic Republic of the Congo":["DR Congo"],
    "South Korea":["Republic of Korea"] }

This allows us to perform a join between the two data sets to create the per-capita peacekeepers table per country:

attach: country_populations

SELECT swdata.country,
      as permillionpop,
    sum(people) as peacekeepers
FROM swdata
LEFT JOIN poptable on swdata.country=poptable.Country
WHERE month='2010-06'
GROUP BY swdata.country
ORDER BY permillionpop desc
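For readers who want to try the join locally, here is a minimal sketch using Python's sqlite3 module. The Population column name and the per-million expression are my assumptions (neither is shown above), and the countries and figures are invented so that the arithmetic is easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table swdata (country text, mission text, people integer, month text)")
conn.execute("create table poptable (Country text, Population integer)")  # Population is assumed
# made-up figures, for illustration only
conn.executemany("insert into swdata values (?, ?, ?, ?)", [
    ("Aland", "M1", 100, "2010-06"), ("Aland", "M2", 50, "2010-06"),
    ("Bland", "M1", 1000, "2010-06"), ("Bland", "M1", 999, "2010-05"),
])
conn.executemany("insert into poptable values (?, ?)",
                 [("Aland", 1000000), ("Bland", 100000000)])

# peacekeepers per million inhabitants for one month, highest first
rows = conn.execute("""
    SELECT swdata.country,
           sum(people) * 1000000.0 / poptable.Population AS permillionpop,
           sum(people) AS peacekeepers
    FROM swdata
    LEFT JOIN poptable ON swdata.country = poptable.Country
    WHERE month = '2010-06'
    GROUP BY swdata.country
    ORDER BY permillionpop DESC""").fetchall()
```

The small country comes out on top even though it sends fewer people, which is the whole point of the per-capita table. Because this is a LEFT JOIN, a country with no match in poptable gets a NULL permillionpop, a quick way to spot spellings that still need an alternative.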

…which gives a much more representative country ranking that puts Uruguay, Jordan and Rwanda in the top three places.

It would have been even better to use a more fine grained demographics database so as to select only for the population of a country between the ages of 20 and 40, say.

Now, suppose I had a third data set, which was the financial contribution by each country to the UN peacekeeping efforts, and I was able to join that into the query in some meaningful way that tested the relationship. For example, do countries that contribute fewer peacekeepers contribute more money to make up for deficit, or are some nations just naturally more generous than others — in proportion of their GDP? (This looks like a fourth data set to me.)

This kind of operation only works if all the datasets have been imported into the same database or complex.

What is this starting to feel like we are doing?

Does it feel like we are fitting wooden spokes into some kind of a central object that joins them into a rigid entity which acts as a single object pivoting on an axis?

Maybe we should call it a data windmill.

The UN peacekeeping mission contributions mostly baked https://blog.scraperwiki.com/2012/02/the-un-peacekeeping-mission-contributions-mostly-baked/ https://blog.scraperwiki.com/2012/02/the-un-peacekeeping-mission-contributions-mostly-baked/#comments Wed, 22 Feb 2012 17:04:47 +0000 http://blog.scraperwiki.com/?p=758216402 Many of the most promising webscraping projects are abandoned when they are half done. The author often doesn’t know it. “What do you want? I’ve fully scraped the data,” they say.

But it’s not good enough. You have to show what you can do with the data. This is always very hard work. There are no two ways about it. In this example I have definitely done it.

For those of you who are not so interested in the process, the completed work is here. And if you don’t think it’s done very well, come back and read what I have to say.

[By the way, the raw data is here, for you to download and — quote — “Do whatever you want with it.”]

Phase One: The scraping

At the Columbia event I was quite pleased to create a database of un_peacekeeping_statistics from a set of zip files of pdfs containing the monthly troop contributions to the various UN peacekeeping missions.

Each pdf document was around 40 pages and in a format like this:

This table says that during the month of January 2012 the government of Argentina sent 3 of its citizens to the United Nations Mission for the Referendum in Western Sahara, 741 people to the United Nations Stabilization Mission in Haiti, 265 to the United Nations Peacekeeping Force in Cyprus, and so forth.

I was very lucky with this: the format is utterly consistent because it is spat out of their database. I was able to complete it with about 160 lines of code.

After getting the code working, I cleaned it up by removing all the leftover print statements until the only thing that would be produced at runtime was a message when a new month became available in the database. The email generating code is on line 34 and it has so far worked once by sending me an email which looked like:

Subject: UN peacekeeping statistics for 2012-01

Dear friend,
There are 788 new records in the database for

after month 2011-12 to month 2012-01

Who gets this email? Those who are listed as doing so in the editors list (see the image above). Maybe if you are a journalist with international conflicts on your beat, you ought to get on this list. The emailer technology was outlined in an earlier blog-post. There is no UI for it, so it can only be enabled by request [send your request to feedback].

Phase Two: The analysis

What we have now is a table of over 86000 records stretching back to January 2003. The important columns in the table are:

month text,
country text,
mission text,
people integer

It turns out there are hundreds of relevant timeline graphs which you can make from this data with a little bit of SQL.

For example, what are the three top countries in terms of maximum deployment to any mission? Find it using:

SELECT country, max(people) as max_people
FROM swdata
GROUP BY country
ORDER BY max_people desc

The answer is India, Bangladesh and Pakistan.

To which missions do these three neighbouring, sometimes-at-war, rival countries predominantly send their troops?

Query this by executing:

SELECT mission, sum(people) as people_months
FROM swdata
WHERE country='India' or country='Bangladesh' or country='Pakistan'
GROUP BY mission
ORDER BY people_months desc

The answer is MONUC, UNMIL, UNMIS and UNOCI.

[The reporter who encouraged me to scrape this dataset had a theory that these peacekeeping missions are a clever way for nations to get their troops battle-hardened before the inevitable conflict on their own territory. In other words, they also serve as war-training missions.]

Now let’s have a look at just the deployment of peacekeepers from India, Bangladesh and Pakistan to MONUC (United Nations Organization Mission in the Democratic Republic of the Congo) over time.

[There is no easy way to embed Google’s dynamic JavaScript timeline object into a blog, so I have to present a bitmap image, which is quite annoying.]

As you can see, the pattern of deployment tends to remain at a constant quota over many years, with sudden jumps, probably due to requirements on the ground. Pakistan appeared to supply both of these peacekeeping surges, once in 2003 and once in 2005, while Bangladesh surged at one and India surged at the other.

The picture for UNOCI (United Nations Operation in Côte d’Ivoire) is different:

There is none from India, but a fixed contingent between Bangladesh and Pakistan; 600 peacekeepers were swapped between them in August 2006.

The SQL code for producing these timeline graphs goes like this:

SELECT month||'-15',  -- concatenate a day to make a valid date format
    sum(people*(country='India')) as people_india,
    sum(people*(country='Bangladesh')) as people_bangladesh,
    sum(people*(country='Pakistan')) as people_pakistan,
    sum(people) as people_all FROM swdata
WHERE mission='UNOCI'
GROUP BY month
ORDER BY month
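The trick in this query is that a comparison such as country='India' evaluates to 1 or 0 in SQLite, so multiplying by it keeps only the matching rows in each sum. A small sketch with invented rows (not the real deployment numbers) shows the resulting one-row-per-month shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table swdata (country text, mission text, people integer, month text)")
# made-up rows, for illustration only
conn.executemany("insert into swdata values (?, ?, ?, ?)", [
    ("Bangladesh", "UNOCI", 10, "2006-07"), ("Pakistan", "UNOCI", 4, "2006-07"),
    ("Bangladesh", "UNOCI", 7, "2006-08"), ("Pakistan", "UNOCI", 7, "2006-08"),
    ("Pakistan", "MONUC", 99, "2006-08"),
])

rows = conn.execute("""
    SELECT month || '-15' AS day,   -- append a day to make a full date
           sum(people * (country = 'India')) AS people_india,
           sum(people * (country = 'Bangladesh')) AS people_bangladesh,
           sum(people * (country = 'Pakistan')) AS people_pakistan,
           sum(people) AS people_all
    FROM swdata
    WHERE mission = 'UNOCI'
    GROUP BY month
    ORDER BY month""").fetchall()
```

Each tuple is one point on the timeline: the date followed by one column per country and the mission total.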

Now, you could ask who are the other countries which make up the bulk of this mission, and you could answer the question by developing the necessary SQL statement yourself, but it’s a little unfair to expect everyone who is interested in this data to already have mastered SQL, isn’t it?

Phase Three: Presentation

This is the very hard part, and is usually the point where most promising projects get abandoned, because “someone else better than me at design will come along and finish it.”

Except they never ever do.

As you’re really the only one in the world who comprehends the contents and the potential of this dataset, it is your job to prove it.

Here is my attempt at a user interface for generating graphs of the queries that people might be interested in. It has taken me two hard hacking sessions to get it into this form — or twice as long as it took to write the original scraper.

It is almost as time-consuming as producing video marketing.

This is also usually the phase where all those design geniuses come out of the woodwork and start getting critical and disparaging of your efforts, so you can’t blame programmers who don’t go this far. It’s like sweating all month learning to play a new piece of music on the piano, only to get reminded again and again that you don’t have the talent.

This used to bug me big time. Until I realized that it’s actually a positive sign.

What’s infinitely worse than criticism is no criticism at all because nobody has any idea about what you are trying to achieve.

Now they think they know what you are trying to do — which is why they can be critical.

The next step is for them to actually know what you are trying to do. This ought to be a small step, and if they can’t make it, and don’t even try to make it, then by definition they cannot be very good designers at all.

Look, you have just got all this way starting from nothing, from finding something out in the world, to recognizing its potential, all the way to pulling in and transforming the original raw data and struggling for a way to analyse it. It’s like you have prospected for the diamonds, found them in the earth, cut a mine tunnel to it with your bare hands, separated it from the rock, roughed out its edges, glued it onto a steel washer for a ring, and oh, it doesn’t look very professional and polished now does it? Come on, give us a break! We’ve applied bags of essential skills which hardly anyone else is capable of, so why should we expect to be especially good at this phase? Does your horse have table manners? No. But it works for its hay doesn’t it?

So anyway, here is what the current result looks like:

[Question: Does the Nepalese deployment react to events that were reported in the news during the course of the Haiti mission?]

When the page initializes there are three ajax call-backs to the database to obtain the lists of countries, missions, and top contributions from countries to specific missions. You can multiple select from the countries and the missions lists to create timeline graphs of numbers of people. If you select only from the countries list it shows the troop contributions from those countries to all UN missions. If you additionally select a single mission as well it will graph those country contributions to that specific mission. And it works the other way, vice versa, for lists of missions v countries. The top contributors table helps identify who are the top countries (or missions), so you know which ones to select to make an interesting graph that is not all zeros. (eg no point in graphing the number of Italians deployed to Nepal, because there aren’t any.)

Where do the Italians go? You can find that out by selecting “Italy” from the “Contributor nations” column and clicking on the “Refresh” button on the “Top contributions” column. And you can also click on “Make timeline” to discover that Italy never sent anyone anywhere until late 2006, when they suddenly started deploying two to three thousand peacekeepers to Lebanon. What happened then? Did something change in Italian politics around that point? Maybe people who write Italian newspapers ought to know.

Okay, the user interface is not great, but it achieves the objective of facilitating the formulation of relevant questions, and answering them — which is more than can be said of a lot of artistic user interfaces that crop up around the place, like so many empty bottles of wine.

Phase four: Publishing and promoting

There is no point in doing all this work if the people who would be interested never get to see it.

This bit I cannot do at all, so I don’t even try. I do know that throwing up a long rambling technical blog about the project does not constitute effective publication. In fact, according to the news rules, “once it’s told, it’s old”, so I have just completely ruined everything, because it can now never get onto the New York Times or The Guardian on their data blog section for its 15 hours of fame, before being lost into the past archive where no one is interested at all while it steadily goes out of date through the coming months and years.

Except this dataset, with the infrastructure behind it, is different, because it remains in date for the foreseeable future. So it really ought to have a home somewhere, like those stock market indicators, ever present on the business pages, like the daily crossword or cartoon.

Who knows how to get this done? It’s not my bag and I am quite exhausted.

What I do know is that I had to keep looking up what all those acronyms mean until I decided I should copy them down in the code and use them for tool-tips. It took quite a bit of work, and was repetitive, and maybe should have been scraped from somewhere. But it was probably well worth doing, so I am repeating it here.

missiontips = { UNMIS:"United Nations Mission in Sudan", UNMIL:"United Nations Mission in Liberia",
  UNAMID:"African Union/United Nations Hybrid operation in Darfur", UNOCI:"United Nations Operation in Côte d'Ivoire",
  MINUSTAH:"United Nations Stabilization Mission in Haiti", MONUC:"United Nations Organization Mission in the Democratic Republic of the Congo",
  UNMISS:"United Nations Mission in the Republic of South Sudan", UNMIK:"United Nations Interim Administration Mission in Kosovo",
  MONUSCO:"United Nations Organization Stabilization Mission in the Democratic Republic of the Congo",
  MINURCAT:"United Nations Mission in the Central African Republic and Chad",
  ONUCI:"Opération des Nations Unies en Côte d'Ivoire", UNMEE:"United Nations Mission in Ethiopia and Eritrea",
  ONUB:"United Nations Operation in Burundi", UNIFIL:"United Nations Interim Force in Lebanon",
  UNMIT:"United Nations Integrated Mission in Timor-Leste", UNMISET:"United Nations Mission of Support in East Timor",
  UNAMSIL:"United Nations Mission in Sierra Leone", MINURSO:"United Nations Mission for the Referendum in Western Sahara",
  UNOMIG:"United Nations Observer Mission in Georgia", UNMIN:"United Nations Mission in Nepal",
  UNAMA:"United Nations Assistance Mission in Afghanistan", UNIKOM:"United Nations Iraq-Kuwait Observation Mission",
  UNFICYP:"United Nations Peacekeeping Force in Cyprus", UNISFA:"United Nations Interim Security Force for Abyei",
  UNTSO:"United Nations Truce Supervision Organization", UNOTIL:"United Nations Office in East Timor",
  MINUCI:"United Nations Mission in Côte d'Ivoire", UNIOSIL:"United Nations Integrated Office in Sierra Leone",
  BINUB:"Bureau Intégré des Nations Unies au Burundi", UNAMI:"United Nations Assistance Mission for Iraq",
  UNDOF:"United Nations Disengagement Observer Force", UNMOGIP:"United Nations Military Observer Group in India and Pakistan",
  binub:"Bureau Intégré des Nations Unies au Burundi", BNUB:"United Nations Office in Burundi",
  UNMA:"United Nations Mission in Angola" };

I’ll sign off with an image of what normally stands for an interactive index to the list of missions on the official UN website, and imagine I have done enough for someone to take it on from here.

Big fat aspx pages for thin data https://blog.scraperwiki.com/2012/02/big-fat-aspx-pages-for-thin-data/ https://blog.scraperwiki.com/2012/02/big-fat-aspx-pages-for-thin-data/#comments Tue, 07 Feb 2012 06:06:03 +0000 http://blog.scraperwiki.com/?p=758216287 My work is more with the practice of webscraping, and less in the high-faluting business plans and product-market-fit leaning agility. At the end of the day, someone must have done some actual webscraping — and the harder it is the better.

During the final hours of the Columbia University hack day, I got to work on a corker in the form of the New York State Joint Committee on Public Ethics Lobbying filing system.

This is an aspx website which is truly shocking. The programmer who made it should be fired — except it looks like he probably got it to a visibly working stage, and then simply walked away from the mess he created without finding out why it was running so slowly.

1. Start on this page.
2. Click on 2. Client Query – Click here to execute Client Query.
3. Select Registration Year: 2011
4. Click the [Search] button

[ Don’t try to use the browser’s back button as there is a piece of code on the starting page that reads: <script language="javascript">history.forward();</script> ]

A page called LB_QReports.aspx will be downloaded, which is the same as the previous page, except it is 1.05MB long and renders a very small table which looks like this:

If you are able to look at the page source you will find thousands of lines of the form:

<div id="DisplayGrid_0_14_2367">
	<a id="DisplayGrid_0_14_2367_ViewBTN"

Followed by a very long section which begins like:

window.DisplayGrid = new ComponentArt_Grid('DisplayGrid');
DisplayGrid.Data = [[5400,2011,'N','11-17 ASSOCIATES, LLC','11-17 
'APR',40805,'January - June','11201',],[6596,2011,'N','114 KENMARE 
ASSOCIATES, LLC','NEW YORK','NY',11961,'APR',41521,'January - 
June','10012',],[4097,2011,'N','1199 SEIU UNITED HEALTHCARE 
40344,'January - June','10036',],...

This DisplayGrid object is thousands of lists long. So although you only get 15 records in the table at a time, your browser has been given the complete set of data at once for the javascript to paginate.

Great, I thought. This is easy. I simply have to parse out this gigantic array as json and poke it into the database.

Unfortunately, although it can be interpreted by the javascript machine, it’s not valid json. The quotes are of the wrong type, there are trailing commas, and we need to deal with the escaped apostrophes.

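import json
import re

# pull the javascript array literal out of the page source
mtable = re.search(r"(?s)DisplayGrid\.Data =\s*(\[\[.*?\]\])", html)
jtable = mtable.group(1)
jtable = jtable.replace("\\'", ";;;APOS;;;")  # park the escaped apostrophes
jtable = jtable.replace("'", '"')             # single quotes -> double quotes
jtable = jtable.replace(";;;APOS;;;", "'")    # put the apostrophes back
jtable = jtable.replace(",]", "]")            # drop the trailing commas
jdata = json.loads(jtable)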
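Here is the same clean-up as a self-contained sketch that can be run without the website. The html fragment and the parse_display_grid name are hypothetical; the fragment only imitates the shape of the real blob, with an escaped apostrophe and trailing commas.

```python
import json
import re

# hypothetical page fragment in the same shape as the real DisplayGrid.Data blob
html = ("window.DisplayGrid = new ComponentArt_Grid('DisplayGrid');\n"
        "DisplayGrid.Data = [[1,2011,'N','O\\'NEIL ASSOCIATES','NY',],"
        "[2,2011,'N','EXAMPLE, LLC','NY',]];\n")

def parse_display_grid(page):
    m = re.search(r"(?s)DisplayGrid\.Data =\s*(\[\[.*?\]\])", page)
    jtable = m.group(1)
    jtable = jtable.replace("\\'", ";;;APOS;;;")   # park escaped apostrophes
    jtable = jtable.replace("'", '"')              # single -> double quotes
    jtable = jtable.replace(";;;APOS;;;", "'")     # restore the apostrophes
    jtable = jtable.replace(",]", "]")             # drop trailing commas
    return json.loads(jtable)

records = parse_display_grid(html)
```

This string surgery is fragile: a double quote or the characters ",]" inside a company name would break it, so it is a quick hack for this one page rather than a general javascript parser.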

Then it’s a matter of working out the headers of the table and storing it into the database.

(Un)Fortunately, there’s more data about the lobbying disclosure than is present in this table if you click on those View links on each line, such as person names, addresses, amounts of money, and what was lobbied.

If you hover your mouse above one of these links you will see it’s of the form: javascript:__doPostBack('DisplayGrid_0_14_2134$ViewBTN','').

At this point it’s worth a recap on how to get along with an asp webpage, because that is what this is.

[The scraper I am working on is ny_state_lobby, if you want to take a look.]

Here is the code for getting this far, to the point where we can click on these View links:

import mechanize

cj = mechanize.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# assumption: br1 (used further down) is a second browser sharing the same cookies
br1 = mechanize.Browser()
br1.set_cookiejar(cj)

# get to the form (with its cookies)
response = br.open("https://apps.jcope.ny.gov/lrr/Menu_reports_public.aspx")
link = None
for a in br.links():
    if a.text == 'Click here to execute Client Query':
        link = a
response = br.follow_link(link)

# fill in the form
br.select_form(nr=0)   # assumption: the query form is the first form on the page
br["ddlQYear"] = ["2011"]
response = br.submit()
print response.read()  # this gives massive sets of data

The way to do those clicks onto “DisplayGrid_0_14_%d$ViewBTN” (View) buttons is with the following function that does the appropriate __doPostBack action.

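def GetLobbyGrid(d, year):
    # year is not used here: it was already chosen when the form was submitted
    br1 = br  # assumption: br1 is the same mechanize session as br above
    dt = 'DisplayGrid_0_14_%d$ViewBTN' % d
    br.select_form(nr=0)             # assumes a single form on the results page
    br.form.set_all_readonly(False)  # hidden fields are read-only by default
    br["__EVENTTARGET"] = dt
    br["__EVENTARGUMENT"] = ''
    br.find_control("btnSearch").disabled = True  # don't send the Search button
    request = br.click()
    response1 = br1.open(request)
    print response1.read()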

…And you will find you have got exactly the same page as before, including that 1MB fake json data blob.

Except it’s not quite exactly the same. There is a tiny new little section of javascript in the page, right at the bottom. (I believe I discovered it by watching the network traffic on the browser when following the link.)

<script language=javascript>var myWin;myWin=window.open(
=no,titlebar=no,location=center,directories=no, status=no,menubar=

This contains the secret new link you have to click on to get the lobbyist information.

    # needs: import re, urlparse, lxml.html
    html1 = response1.read()
    root1 = lxml.html.fromstring(html1)
    for s in root1.cssselect("script"):
        if s.text:
            # the new script starts with the window.open call holding the link
            ms = re.match(r"var myWin;myWin=window\.open\('(LB_HtmlCSR\.aspx\?.*?)',", s.text)
            if ms:
                loblink = ms.group(1)
    uloblink = urlparse.urljoin(br1.geturl(), loblink)  # make the link absolute
    response2 = br1.open(uloblink)
    print response2.read()   # this is the page you want

So, anyway, that’s where I’m up to. I’ve started the ny_state_lobby_parse scraper to work on these pages, but I don’t have time to carry it on right now (too much blogging).

The scraper itself is going to operate very slowly because for each record it needs to download 1MB of uselessly generated data to get the individual link to the lobbyist. And I don’t have reliable unique keys for it yet. It’s possible I could make them by associating the button name with the corresponding record from that DisplayGrid table, but that’s for later.

For now I’ve got to go and do other things. But at least we’re a little closer to having the picture of what is being disclosed into this database. The big deal, as always, is finishing it off.

Journalism Data Camp NY potential data sets
https://blog.scraperwiki.com/2012/02/journalism-data-camp-ny-potential-data-sets/
Thu, 02 Feb 2012 15:20:48 +0000

Here is a review of some of the datasets that have been submitted for the Columbia Journalism Data Camp this Friday.

This list is only for backup in case not enough ideas show up with people on the day (never happens, but it’s always a fear).

1. Iowa accident reports

The site http://accidentreports.iowa.gov contains all the police roadside reports of accidents. It’s easy to scrape because the database ids are consecutive numbers.


And it contains thousands of rinky-dink diagrams of the incidents.

First step is to copy all the html from each page into one database. Second step is to scan through all these pages and progressively extract more and more data from them.
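The first step can be sketched roughly as below: walk the consecutive ids and keep the raw html of each page in one table, so the parsing can be redone later without downloading anything again. This is only a sketch under assumptions: the function `archive_pages`, the table layout and the `fake_fetch` stand-in are all made up, and the real URL pattern for a report id is not given here, so the download is passed in as a function.

```python
import sqlite3


def archive_pages(conn, fetch, first_id, last_id):
    """Store the raw html for each report id, skipping ids already saved."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (id INTEGER PRIMARY KEY, html TEXT)")
    saved = 0
    for report_id in range(first_id, last_id + 1):
        row = conn.execute("SELECT 1 FROM pages WHERE id = ?", (report_id,)).fetchone()
        if row is not None:
            continue  # already downloaded on an earlier run
        html = fetch(report_id)
        if html is None:
            continue  # missing or failed page; it will be retried next run
        conn.execute("INSERT INTO pages (id, html) VALUES (?, ?)", (report_id, html))
        saved += 1
    conn.commit()
    return saved


def fake_fetch(report_id):
    """Stand-in for a real HTTP download (hypothetical; id 3 pretends to be missing)."""
    if report_id == 3:
        return None
    return "<html><body>Report %d</body></html>" % report_id


conn = sqlite3.connect(":memory:")
first_run = archive_pages(conn, fake_fetch, 1, 5)
second_run = archive_pages(conn, fake_fetch, 1, 5)
```

The second step then reads from the pages table instead of the website, which is what lets you extract more and more fields progressively without hammering the server.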

Contrast with the dataset of accidents available for the UK.

2. South Dakota state budget information

Apparently complete set of expenditures, contracts and revenues disclosed on http://open.sd.gov/ in a form that is easy to scrape (some datasets even allow CSV download). Many states do this, with varying degrees of success.

Use this case to learn how to restructure and analyse financial accountancy flow information. Can you find any contracts that have suddenly been dropped in favour of another supplier?

3. New York School budgets

The site schools.nyc.gov/AboutUs/funding/schoolbudgets/GalaxyAllocationFY2012 requires a school code. Try “M411”.

Apparently there is this spreadsheet of school codes.

Is there anything interesting to plot across all schools, such as the PSAL SNAPPLE FUNDS?

4. New York Lobbying registers

Lobbying at the state and city level. Some of this is challenging.

Is there a cross-over between the jurisdictions? Can you uniquely identify the corporate interests and relate them to the legislative or regulatory program?

5. Court case information

Go to https://www.dccourts.gov/cco/ (Try “Lockheed”). Not obvious where the information is.

The New York City courts are behind a captcha. Maybe better luck with the New York State courts.

Court datasets are usually very difficult to obtain and jealously protected. The legal process resists modernization and is universally paper based. Electronic documents (contracts, settlements, filings) almost always turn out to be image scans of papers.

6. New York City Police crime data

There are weekly PDFs for each police precinct. These are taken down and replaced by the next one, so there is no historical record.

Luckily someone has scraped the data since 2010, though the numbers may need some processing before you map them.

7. New York State gas drilling permits

These are available but don’t seem to have been updated recently. What’s going on?

Wouldn’t it be nice to make another twitterbot to be friends with NorthSeaOil1?

Don’t forget to read the Well ownership transfers.
