Fingal County Council of Ireland recently published a number of sets of Open Data, in nice clean CSV, XML and KML formats.
Unfortunately, the one set of Open Data that was difficult to obtain was the list of sets of open data itself. That’s because the list was split across four separate pages.
The important thing to observe is that the Next >> link is no ordinary link. You can see something is wrong when you hover your cursor over it: instead of a URL, it points at a JavaScript call, __doPostBack('lnkNext',''), in the HTML source code.
So what it’s doing is putting the two arguments from the function call (in this example 'lnkNext' and '') into two hidden fields of the form called “form1”, __EVENTTARGET and __EVENTARGUMENT, and then submitting the form back to the server as a POST request.
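As a minimal sketch of that first step, the two arguments can be pulled out of the link's href with a regular expression. The sample href below is an assumption modelled on the call described above, and parse_postback is a hypothetical helper name, not part of mechanize.

```python
import re

# Matches __doPostBack('target','argument') with single-quoted arguments.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical href, assumed to look like the one behind the Next >> link.
sample_href = "javascript:__doPostBack('lnkNext','')"
print(parse_postback(sample_href))
```

The first value is what goes into __EVENTTARGET and the second into __EVENTARGUMENT.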
Let’s try to look at the form. Here is some Python code which ought to do it.
import mechanize

br = mechanize.Browser()
br.open("http://data.fingal.ie/ViewDataSets/")
br.select_form("form1")  # tries to select the form by name
print(br.form)
Unfortunately this doesn’t work, because the form has no name, only an id. Here is how it appears in the HTML:
<form method="post" action="" id="form1">
<div class="aspNetHidden">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMjA4MT... insanely long ascii string />
...the entire rest of the webpage...
</form>
Selecting the form by its position on the page works instead:

import mechanize

br = mechanize.Browser()
br.open("http://data.fingal.ie/ViewDataSets/")
br.select_form(nr=0)  # first form on the page
print(br.form)
What do we get?
<POST http://data.fingal.ie/ViewDataSets/ application/x-www-form-urlencoded
  <HiddenControl(__VIEWSTATE=/wEPDwUKMjA4... and so on ) (readonly)>
  <HiddenControl(__EVENTVALIDATION=/wEWVQK... and so on ) (readonly)>
  <TextControl(txtSearch=Search DataSets)>
  <TextControl(txtSearch=Search DataSets)>
  <SubmitControl(btnSearch=Search) (readonly)>
  <SelectControl(ddlOrder=[*Title, Agency, Rating])>>
Oh dear. What has happened to the __EVENTTARGET and __EVENTARGUMENT which I am going to have to put values in when I am simulating the __doPostBack() function?
I don’t really know.
What I do know is that if you insert the following line:
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
just before the line that says br.open() to include some headers that are recognized by the Microsoft server software, then you get them back:
<POST http://data.fingal.ie/ViewDataSets/ application/x-www-form-urlencoded
  <HiddenControl(__EVENTTARGET=) (readonly)>
  <HiddenControl(__EVENTARGUMENT=) (readonly)>
  <HiddenControl(__LASTFOCUS=) (readonly)>
  ...
Right, so all we need to do to get to the next page is fill in their values and submit the form, like so:
br["__EVENTTARGET"] = "lnkNext"
br["__EVENTARGUMENT"] = ""
response = br.submit()
print(response.read())
Whoops, that doesn’t quite work, because those two controls are readonly. Luckily there is a function in mechanize to make this problem go away: call br.form.set_all_readonly(False) before assigning the values.
How about it?
It still doesn’t quite work! This stops at two pages, but you know there are four.
What is the problem?
The problem is this SubmitControl in the list of controls in the form:
<SubmitControl(btnSearch=Search) (readonly)>
You think you are submitting the form, when in fact you are clicking on the Search button, which then takes you to a page you are not expecting that has no Next >> link on it.
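To see why the server reacts this way, it can help to look at what ends up in the POST body. The sketch below uses only the standard library to encode two sets of form fields. The field values are placeholders, and it assumes that the button's name=value pair travels along with the hidden fields whenever that button counts as the one clicked.

```python
from urllib.parse import urlencode

# Placeholder values; the real __VIEWSTATE is a very long string.
hidden_fields = [
    ("__EVENTTARGET", "lnkNext"),
    ("__EVENTARGUMENT", ""),
    ("__VIEWSTATE", "PLACEHOLDER"),
]

# Body sent when the Search button is treated as clicked.
with_button = urlencode(hidden_fields + [("btnSearch", "Search")])

# Body sent once the Search button has been disabled.
without_button = urlencode(hidden_fields)

print(with_button)
print(without_button)
```

Only the second body looks to the server like a plain __doPostBack() call.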
If you disable that particular SubmitControl before submitting the form
br.find_control("btnSearch").disabled = True
then it works.
From here on it’s plain sailing. All you need to do is parse the html, follow the normal links, and away you go!
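For that last parsing step, here is a small sketch using Python's standard html.parser. The sample markup and the dataset URL in it are invented for illustration, so the real page layout may well differ. Links driven by javascript: are skipped, because they need the postback treatment instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of ordinary <a> links, skipping javascript: ones."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


# Invented sample markup, not copied from the real page.
sample_html = (
    '<a href="/datasets/example.csv">Example</a>'
    '<a id="lnkNext" href="javascript:__doPostBack(\'lnkNext\',\'\')">Next</a>'
)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```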
2. Clicking the link involves copying the arguments from __doPostBack() into the __EVENTTARGET and __EVENTARGUMENT HiddenControls.
3. You must set readonly to False so you can even write to those values.
4. You must set the User-agent header or the server software doesn’t know what browser you are using and returns something that can’t possibly work.
5. You must disable all extraneous SubmitControls in the form before calling submit().
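Put together, the steps listed above might be wrapped in one helper. This is a sketch rather than a tested scraper: click_postback is a hypothetical name, and it assumes br is a mechanize.Browser that has already opened the page with a User-agent header set, and that the form of interest is the first one on the page.

```python
def click_postback(br, target, argument=""):
    """Simulate __doPostBack(target, argument) on the first form of the page."""
    br.select_form(nr=0)             # the form has an id but no name
    br.form.set_all_readonly(False)  # hidden controls start out readonly
    br["__EVENTTARGET"] = target
    br["__EVENTARGUMENT"] = argument
    # Disable every submit button so none of them is sent with the form.
    for control in br.form.controls:
        if control.type == "submit":
            control.disabled = True
    return br.submit()
```

After br.open(...), a call such as click_postback(br, "lnkNext") would then stand in for clicking the Next >> link.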
Some of these tricks have taken me a day to learn and resulted in me almost giving up for good. So I am passing on this knowledge in the hope that it can be used. There are other tricks of the trade I have picked up regarding ASP pages that there is no time to pass on here because the examples are even more involved and harder to get across.
What we need is an ASP pages working group among the ScraperWiki diggers who take on this type of work. Anyone who is faced with one of these jobs should be able to bring it to the team and we’ll take a look at it as a group of experts with the knowledge. I expect problems that would cost someone who hasn’t done it before a week, or make them give up before they’ve even got started, to be disposed of within half an hour.
This is how we can produce results.