ScraperWiki blog – Extract tables from PDFs and scrape the web
https://blog.scraperwiki.com

ScraperWiki – Professional Services
https://blog.scraperwiki.com/2013/07/scraperwiki-professional-services/
Mon, 15 Jul 2013 16:38:52 +0000

How would you go about collecting, structuring and analysing 100,000 reports on Romanian companies?

You could use ScraperWiki to write and host your own computer code that carries out the scraping you need, and then use our other self-service tools to clean and analyse the data.

But sometimes writing your own code is not a feasible solution. Perhaps your organisation does not have the coding skills in the required area. Or maybe an internal team needs support to deploy their own solutions, or lacks the time and resources to get the job done quickly.

That’s why, alongside our new platform, ScraperWiki also offers a professional service specifically tailored to corporate customers’ needs.

Recently, for example, we acquired data from the Romanian Ministry of Finance for a client. Our expert data scientists wrote a computer program to ingest and structure the data, which was fed into the client’s private datahub on ScraperWiki. The client was then able to make use of ScraperWiki’s ecosystem of built-in tools, to explore their data and carry out their analysis…

Like the “View in a table” tool, which lets them page through the data, sort it and filter it:

[Screenshot: the Romanian company data in the “View in a table” tool]

The “Summarise this data” tool, which gives them a quick overview of their data by looking at each column and making an intelligent decision about how best to portray it:

[Screenshot: the “Summarise this data” overview of the Romanian data]

And the “Query with SQL” tool, which allows them to ask sophisticated questions of their data using the industry-standard database query language, and then export the live data to their own systems:

[Screenshot: a “Query with SQL” query over the Romanian data]
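To give a flavour of the kind of question the SQL tool can answer, here is a minimal sketch that runs a similar query with Python’s built-in sqlite3 module against a downloaded copy of the data. The table and column names are hypothetical, invented purely for illustration.

import sqlite3

# Hypothetical: a local SQLite export of the Romanian companies dataset,
# with invented table and column names.
conn = sqlite3.connect("romanian_companies.sqlite")

query = """
    SELECT county, COUNT(*) AS firms, SUM(turnover) AS total_turnover
    FROM companies
    GROUP BY county
    ORDER BY total_turnover DESC
    LIMIT 10
"""

for county, firms, total_turnover in conn.execute(query):
    print county, firms, total_turnover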

Not only that: ScraperWiki’s data scientists also handcrafted custom visualisations and analysis for the client. In this case we made a web page which pulled in data directly from the dataset and presented the results as a living report. For other projects we have written and presented analysis using R, a very widely used open source statistical analysis package, and Tableau, the industry-leading business intelligence application.

The key advantage of using ScraperWiki for this sort of project is that there is no software to install locally on your computer. Inside corporate environments the desktop PC is often locked down, meaning custom analysis software cannot be easily deployed. Ongoing maintenance presents further problems. Even in more open environments, installing custom software for just one piece of analysis is not something users find convenient. Hosting data analysis functionality on the web has become an entirely practical proposition; storing data on the web has long been commonplace and it is a facility which we use every day. More recently, with developments in browser technology it has become possible to build a rich user experience which facilitates online data analysis delivery.

Combine these technologies with our expert data scientists and you get a winning solution to your data questions – all in the form of ScraperWiki Professional Services.

A visit from a minister
https://blog.scraperwiki.com/2013/03/a-visit-from-a-minister/
Fri, 08 Mar 2013 16:47:58 +0000

[Photo: the ScraperWiki team and Nick Hurd]

You may have heard on Twitter that last Wednesday Nick Hurd, the Cabinet Minister for Civil Society, paid a visit to ScraperWiki HQ. Nick has been looking into government data and transparency as part of his remit, and asked if he could come and have a chat with us in Liverpool.

Joined by Sophie and Laura from the Cabinet Office, the minister spoke with the team about government transparency, and was—Francis tells me—amused to meet the makers of TheyWorkForYou! Nick asked about scraping, and about startup-life in the Northwest.

He also asked how the Government is doing with open data. Our answer was basically that the very first stage of relatively easy wins is done—the hard work now is releasing tougher datasets (e.g. the text of Government contracts), and changing Civil Service culture to structure data better and publish by default.

To illustrate open data in use, Zarino gave a demonstration of our project with Liverpool John Moores University scraping demographic data and using it to analyse ambulance accidents.

Lindsay Sharples from Open Labs at LJMU also joined us, and commented:

“We were delighted to be part of the Minister’s visit to Liverpool. Open Labs has been working with ScraperWiki to support its customer development programme and it was great to see such high level recognition of its ground breaking work.”

Nick Hurd summarised his visit:

“It was fascinating to see how the data-cleaning services of companies like ScraperWiki are supporting local and central government, business and the wider community – making government data more accessible to ordinary citizens and allowing developers and entrepreneurs to identify public service solutions and create new data-driven businesses.

“Initiatives of this sort underline the vital contribution the private sector can make to realising the potential of open data – which this government is a world leader in releasing – for fuelling social and economic growth.”

The state of Twitter: Mitt Romney and Indonesian Politics
https://blog.scraperwiki.com/2012/07/the-state-of-twitter/
Mon, 23 Jul 2012 09:16:53 +0000

It’s no secret that a lot of people use ScraperWiki to search the Twitter API or download their own timelines. Our “basic_twitter_scraper” is a great starting point for anyone interested in writing code that makes data do stuff across the web. Change a single line, and you instantly get hundreds of tweets that you can then map, graph or analyse further.

So, anyway, Tom and I decided it was about time to take a closer look at how you guys are using ScraperWiki to draw data from Twitter, and whether there’s anything we could do to make your lives easier in the process!

Getting under the hood at scraperwiki.com

As anybody who’s checked out our source code will know, we store a truck-load of information about each scraper, and about every run it’s ever made, in a MySQL database. Of the 9,727 scrapers that had run since the beginning of June, 601 accessed a twitter.com URL. (Our database only stores the first URL that each scraper accesses on any particular run, so it’s possible there are scripts that accessed Twitter, just not as the first URL.)

Twitter API endpoints

Getting more specific, these 601 scrapers accessed a number of different Twitter endpoints, usually via the official API. We removed the querystring from each of the URLs and then looked for commonly accessed endpoints.
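The counting itself is nothing exotic. Roughly speaking it looks like this; a sketch rather than our exact analysis code, and it assumes you already have the list of first-accessed URLs in hand:

from urlparse import urlparse
from collections import Counter

# Assumed input: the first URL accessed by each twitter-related scraper run,
# pulled out of our MySQL database beforehand.
urls = [
    "http://search.twitter.com/search.json?q=%23ddj",
    "https://api.twitter.com/1/followers/ids.json?screen_name=scraperwiki",
    "http://twitter.com/mittromney",
]

# Strip the querystring: keep only the host and path of each URL.
endpoints = Counter()
for url in urls:
    parsed = urlparse(url)
    endpoints[parsed.netloc + parsed.path] += 1

for endpoint, count in endpoints.most_common(10):
    print count, endpoint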

It turns out that search.json is by far the most popular entry point for ScraperWiki users to get Twitter data – probably because it’s the method used by the basic_twitter_scraper that has proved so popular on scraperwiki.com. It takes a search term (like a username or a hashtag) and returns a list of tweets containing that term. Simple!
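If you have never peeked inside it, the request it makes is roughly this; a sketch of the old unauthenticated search endpoint as it worked at the time, not the exact basic_twitter_scraper code:

import json
import urllib

# The old v1 search API: pass a query string, get back a page of matching
# tweets as JSON. (Twitter has since retired it in favour of authenticated APIs.)
term = "#ddj"
url = "http://search.twitter.com/search.json?" + urllib.urlencode({"q": term})
results = json.loads(urllib.urlopen(url).read())

for tweet in results.get("results", []):
    print tweet["from_user"], "-", tweet["text"]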

The next most popular endpoint – followers/ids.json – is a common way to find interesting user accounts to then scrape more details about. And, much to Tom’s amusement, the third endpoint, with 8 occurrences, was http://twitter.com/mittromney. We can’t quite tell whether that’s a good or bad sign for his 2012 candidacy, but if it makes any difference, only one solitary scraper searched for Barack Obama.

Searches

We also looked at what people were searching for. We found 398 search terms in the scrapers that accessed the twitter search endpoint, but only 45 of these terms were called in more than one scraper. Some of the more popular ones were “#ddj” (7 scrapers), “occupy” (3 scrapers), “eurovision” (3 scrapers) and, weirdly, an empty string (5 scrapers).

Even though each particular search term was only used a few times, we were able to classify the search terms into broad groups. We sampled from the scrapers that accessed the Twitter search endpoint and manually sorted them into categories that seemed reasonable. We took one sample to come up with mutually exclusive categories and another to estimate the number of scrapers in each category.

A bunch of scripts made searches for people or for occupy shenanigans. We estimate that these people- and occupy-focussed queries together account for between two- and four-fifths of the searches in total.

We also came up with some smaller categories that accounted for a few scrapers each – like global warming, developer and journalism events, towns and cities, and Indonesian politics (!?) – but really there doesn’t seem to be any major pattern beyond the people and occupy scripts.

Family Tree

Speaking of the basic_twitter_scraper, we thought it would also be cool to dig into the family history of a few of these scrapers. When you see a scraper you like on ScraperWiki, you can copy it, and that relationship is recorded in our database.

Lots of people copy the basic_twitter_scraper in this way, and then just change one line to make it search for a different term. With that in mind, we’ve been thinking that we could probably make some better tweet-downloading tool to replace this script, but we don’t really know what it would look like. Maybe the users who’ve already copied basic_twitter_scraper_2 would have some ideas…

After getting the scraper details and relationship data into the right format, we imported the whole lot into the open source network visualisation tool Gephi, to see how each scraper was connected to its peers.

By the way, we don’t really know what we did to make this network diagram, because we did it a couple of weeks ago, forgot what we did, didn’t write a script for it (Gephi is all point-and-click…) and haven’t managed to replicate our results. (Oops.) We noticed this because we repeated all of the analyses for this post with new data right before posting it and didn’t manage to come up with the sort of network diagram we had made a couple of weeks earlier. But the old one was prettier, so we used that :-)

It doesn’t take long to notice basic_twitter_scraper_2’s cult following in the graph. In total, 264 scrapers are part of its extended family, with 190 of those being descendants and 74 being various sorts of cousins – such as scrape10_twitter_scraper, which was a copy of basic_twitter_scraper_2’s grandparent, twitter_earthquake_history_scraper. (The whole family tree, in case you’re wondering, started with twitterhistory-scraper, written by Pedro Markun in March 2011.)

With the owners of all these basic_twitter_scraper(_2)’s identified, we dropped a few of them an email to find out what they’re using the data for and how we could make it easier for them to gather in the future.

It turns out that Anna Powell-Smith wrote the basic_twitter_scraper at a journalism conference and Nicola Hughes reused it for loads of ScraperWiki workshops and demonstrations as basic_twitter_scraper_2. But even that doesn’t fully explain the cult following because people still keep copying it. If you’re one of those very users, make sure to send us a reply – we’d love to hear from you!!

Explore

We’ve posted our code for this analysis on Github, along with a table of information about the 594 Twitter scrapers that aren’t in vaults (out of 601 total Twitter scrapers), in case you’re as puzzled as we are by our users’ interest in Twitter data.

Now here’s a video of a cat playing a keyboard.

Fine set of graphs at the Office of National Statistics
https://blog.scraperwiki.com/2012/03/fine-set-of-graphs-at-the-office-of-national-statistics/
Thu, 22 Mar 2012 11:47:01 +0000

It’s difficult to keep up. I’ve just noticed a set of interesting interactive graphs over at the Office of National Statistics (UK).

If the world is about people, then the most fundamental dataset of all must be: Where are the people? And: What stage of life are they living through?

A Population Pyramid is a straightforward way to visualize the data, like so:

This image is sufficient for determining what needs to be supplied (e.g. more children means more schools and toy-shops), but it doesn’t explain why.

The “why?” and “what’s going on?” questions are much more interesting, but are pretty much guesswork because they refer to layers in the data that you cannot see. For example, the number of people in East Devon of a particular age is the sum of those who have moved into the area at various times, minus those who have moved away (temporarily or permanently), plus those who were already there and have grown older but not yet died. For any bulge, you don’t know which layer it belongs to.

In this 2015 population pyramid there are bulges at 28, 50 and a pronounced spike at 68, as well as dips at 14 and 38. In terms of birth years, these correspond to 1987, 1965 and 1947 (spike), and dips at 2001 and 1977.

You can pretend they correspond to recessions, economic boom times and second wave feminism, but the 1947 post-war spike when a mass of men-folk were demobilized from the military is a pretty clean signal.

What makes this data presentation especially lovely is that it is localized, so you can see the population pyramid per city:

Cambridge, as everyone knows, is a university town, which explains the persistent spike at age 20.

And, while it looks like there is gender equality for 20-year-old university students, there is a pretty hefty male lump up to the age of 30 — possibly corresponding to folks doing higher degrees. Is this because fewer men are leaving town at the appropriate age to become productive members of society, or is there an influx of foreign grad students from places where there is less gender equality? The data set of student origins and enrollments would give you the story.

As to the pyramid on the right hand side, I have no idea what is going on in Camden to account for that bulge in 30 year olds. What is obvious, though, is that the bulge in infants must be related. In fact, almost all the children between the ages of 0 and 16 years will have corresponding parents higher up the same pyramid. Also, there is likely to be a pairwise cross-gender correspondence between individuals of the same generation living together.

These internal links, external data connections, sub-cohorts, and the new questions raised the more you look at it, mean that it is impossible to create a single all-purpose visualization application that could serve all of them. We can wonder whether an interface which worked via javascript-generated SQL calls (rather than flash and server-side queries) would have enabled someone with the right skills to roll their own queries and, for example, immediately find out which city and age group has the greatest gender disparity, and whether all spikes at the 20-year-old age bracket can be accounted for by universities.

For more, see An overview of ONS’s population statistics.

As it is, someone is going to have to download/scrape, parse and load at least one year of source data into a data hub of their choice in order to query this (we’ve started on 2010’s figures here on ScraperWiki – take a look). Once that’s done, you’d be able to sort the cities by the greatest ratio between number of 20 year olds and number of 16 year olds, because that’s a good signal of student influx.
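As a sketch of that sort of query, assuming the scraped figures end up in a table with one row per city and age (the table and column names here are invented), it needn’t be more than a self-join:

import sqlite3

# Hypothetical schema: one row per (city, age) with a population figure,
# as you might have after loading one year of scraped ONS data.
conn = sqlite3.connect("ons_population.sqlite")

query = """
    SELECT twenty.city,
           1.0 * twenty.population / sixteen.population AS student_influx
    FROM population_by_age twenty
    JOIN population_by_age sixteen ON twenty.city = sixteen.city
    WHERE twenty.age = 20 AND sixteen.age = 16
    ORDER BY student_influx DESC
    LIMIT 10
"""

for city, ratio in conn.execute(query):
    print city, ratio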

I don’t have time to get onto the Population projection models, where it really gets interesting. There you have all the clever calculations based on guesstimates of migration, mortality and fertility.

What I would really like to see are these calculations done live and interactively, as well as combined with economic data. Is the state pension system going to go bankrupt because of the “baby boomers”? Who knows? I know someone who doesn’t know: someone whose opinion does not rely (even indirectly) on something approaching a dynamic data calculation. I mean, if the difference between solvency and bankruptcy is within the margin of error in the estimate of fertility rate, or 0.2% in the tax base, then that’s not what I’d call bankrupt. You can only find this out by tinkering with the inputs with an element of curiosity.

Privatized pensions ought to be put into the model as well, to give them the macro-economic context that no pension adviser I’ve ever known seems capable of understanding. I mean, it’s evident that the stock market (in which private pensions invest) does happen to yield a finite quantity of profit each year. Ergo it can support a finite number of pension plans. So a national policy which demands more such pension plans than this finite number is inevitably going to leave people hungry.

Always keep in mind the long term vision of data and governance. In the future it will all come together like transport planning, or the procurement of adequate rocket fuel to launch a satellite into orbit; a matter of measurements and predictable consequences. Then governance will be a science, like chemistry, or the prediction of earthquakes.

But don’t forget: we can’t do anything without first getting the raw data into a usable format. Dave McKee’s started on 2010’s data here … fancy helping out?

How to stop missing the good weekends
https://blog.scraperwiki.com/2012/01/how-to-stop-missing-the-good-weekends/
Fri, 20 Jan 2012 09:27:12 +0000

[Image: the BBC’s Michael Fish presenting the weather in the 80s, with a ScraperWiki tractor superimposed over Liverpool]

Far too often I get so stuck into the work week that I forget to monitor the weather for the weekend when I should be going off to play on my dive kayaks — an activity which is somewhat weather dependent.

Luckily, help is at hand in the form of the ScraperWiki email alert system.

As you may have noticed, when you do any work on ScraperWiki, you start to receive daily emails that go:

Dear Julian_Todd,

Welcome to your personal ScraperWiki email update.

Of the 320 scrapers you own, and 157 scrapers you have edited, we
have the following news since 2011-12-01T14:51:34:

Histparl MP list - https://scraperwiki.com/scrapers/histparl_mp_list :
  * ran 1 times producing 0 records from 2 pages
  * with 1 exceptions, (XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<!DOCTYP')

...Lots more of the same

This concludes your ScraperWiki email update till next time.

Please follow this link to change how often you get these emails,
or to unsubscribe: https://scraperwiki.com/profiles/edit/#alerts

The idea behind this is to attract your attention to matters you may be interested in — such as fixing those poor dear scrapers you have worked on in the past and are now neglecting.

As with all good features, this was implemented as a quick hack.

I thought: why design a whole email alert system, with special options for daily and weekly emails, when we already have a scraper scheduling system which can do just that?

With the addition of a single flag to designate a scraper as an emailer (plus a further 20 lines of code), a new fully fledged extensible feature was born.

Of course, this is not counting the code that is in the Wiki part of ScraperWiki.

The default code in your emailer looks roughly like so:

import scraperwiki
emaillibrary = scraperwiki.utils.swimport("general-emails-on-scrapers")
subjectline, headerlines, bodylines, footerlines = emaillibrary.EmailMessageParts("onlyexceptions")
if bodylines:
    print "\n".join([subjectline] + headerlines + bodylines + footerlines)

As you can see, it imports the 138 lines of Python from general-emails-on-scrapers, which I am not here to talk about right now.

Using ScraperWiki emails to watch the weather

Instead, what I want to explain is how I inserted my Good Weather Weekend Watcher by polling the weather forecast for Holyhead.

My extra code goes like this:

import datetime
import urllib
import lxml.html

weatherlines = [ ]
if datetime.date.today().weekday() == 2:  # Wednesday
    url = "http://www.metoffice.gov.uk/weather/uk/wl/holyhead_forecast_weather.html"
    html = urllib.urlopen(url).read()
    root = lxml.html.fromstring(html)
    rows = root.cssselect("div.tableWrapper table tr")
    for row in rows:
        #print lxml.html.tostring(row)
        metweatherline = row.text_content().strip()
        if metweatherline[:3] == "Sat":
            subjectline += " With added weather"
            weatherlines.append("*** Weather warning for the weekend:")
            weatherlines.append("   " + metweatherline)
            weatherlines.append("")

What this does is check if today is Wednesday (day of the week #2 in Python land), then it parses through the Met Office Weather Report table for my chosen location, and pulls out the row for Saturday.

Finally we have to handle producing the combined email message, the one which can contain either a set of broken scraper alerts, or the weather forecast, or both.

if bodylines or weatherlines:
    if not bodylines:
        headerlines, footerlines = [ ], [ ]   # kill off cruft surrounding no message
    print "\n".join([subjectline] + weatherlines + headerlines + bodylines + footerlines)

The current state of the result is:

*** Weather warning for the weekend:
  Mon 5Dec
  Day

  7 °C
  W
  33 mph
  47 mph
  Very Good

This was a very quick low-level implementation of the idea with no formatting and no filtering yet.

Email alerts can quickly become sophisticated and complex. Maybe I should only send a message out if the wind is below a certain speed. Should I monitor previous days’ weather to predict whether the sea will be calm? Or I could check the wave heights on the off-shore buoys? Perhaps my calendar should be consulted for prior engagements so I don’t get frustrated by being told I am missing out on a good weekend when I had promised to go to a wedding.
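A wind-speed filter, for instance, would only take a few more lines. Here is a sketch which assumes the mph figures can be picked out of the Met Office row with a regular expression, and that 15 mph is my (made-up) comfort limit:

import re

MAX_WIND_MPH = 15  # made-up comfort limit for taking the kayaks out

def weekend_looks_calm(metweatherline):
    # metweatherline is the forecast row scraped above, which contains wind
    # and gust speeds such as "33 mph" and "47 mph".
    speeds = [int(s) for s in re.findall(r"(\d+)\s*mph", metweatherline)]
    return bool(speeds) and max(speeds) <= MAX_WIND_MPH

# Only add the weather section if the weekend is actually worth getting excited about.
if weekend_looks_calm(metweatherline):
    weatherlines.append("*** Calm weekend ahead - get the kayaks ready:")
    weatherlines.append("   " + metweatherline)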

The possibilities are endless and so much more interesting than if we’d implemented this email alert feature in the traditional way, rather than taking advantage of the utterly unique platform that we happened to already have in ScraperWiki.

ScraperWiki scrapers: now 53% more useful!
https://blog.scraperwiki.com/2011/11/scraperwiki-scrapers-now-53-more-useful/
Wed, 16 Nov 2011 12:01:07 +0000

It’s Christmas come early at ScraperWiki HQ as we deliver—like elves popping boxes under the data digging Christmas tree—a bunch of great new improvements to the ScraperWiki site. We’ve been working on these for a while, so it’s great to finally let you all use them!

First up: a new look for your scrapers

The most obvious change will hit you as soon as you look at a scraper – the overview page now sports a svelte, functional new layout. The roster of changes is as long as Santa’s list, so I’ll just pick out a few…

The blue header at the top of the page is now way more informative. As well as the scraper’s title and creator, you can also see the language it’s written in, the domain it scrapes, the number of records in its datastore, and its privacy status. No more hunting around the page: everything you need is there in one place. Hurrah!

Further down, you’ll notice the history and discussion pages have now been merged into the main page, meaning you’ll spend less time flicking between tabs and more time editing or investigating the scraper.

Meanwhile, the page as a whole is a lot more organised. Everything to do with runs (the current status, the last run, the pages scraped, the schedule) is up in the top left. Everything to do with the datastore (including the data preview and download options) is just below that, and everything to do with the scraper’s relationship to other scrapers (like tags, forks, copies and views) is just below that. Neat.

Speaking of which, the data preview has had some serious attention. It’s now way more interactive: you can sort on any column, alter the number of rows displayed, and page through all of the data in all of the tables, with just a few clicks. And other features, like syntax-highlighted table schemas and a nifty drop-down for when you have too many tabs to fit on the page, should keep ScraperWiki power-users fast and efficient.

And those are just the headline changes. There have also been a load of great tweaks, like a View Source button so you never have to worry about breaking someone’s scraper when you’re just taking a look, and an easy Share button to get your scrapers on Facebook, Twitter and Google+. So go try out the new page, and as ever, we’d really love your feedback.

Never miss a comment again

As well as moving the scraper discussion (or ‘chat’, as it’s now called) onto the main page to make it more obvious, we’ve also enabled email notifications for comments. Now, when someone comments on your scraper, you’ll get a swish new email showing you who they are, what they said, and how to reply (thanks to ScraperWiki’s new engineer, David Jones, for his input on this!).

If, however, all this conversation is a little too much for you, Ebenezer, then you can disable comment notifications by unchecking the box in your Edit Profile page.

And while we were at it: Messages!

For a while, users have grumbled that it’s far too difficult to contact other users. And quite right too – we never anticipated that our developers would be such social creatures! So, we’ve added a “Send a Message” button to everyone’s profile (kudos to Chris Hannam for helping out!). The messages are sent as emails via feedback@scraperwiki.com, meaning the other user never sees your email address – just your name, your message and a link to your profile. And, as with comment notifications, if you want to disable sending and receiving of user messages, just uncheck the box in your Edit Profile page.

Solving Healthcare Problems with Open Source Software
https://blog.scraperwiki.com/2011/11/solving-healthcare-problems-with-open-source-software/
Fri, 11 Nov 2011 13:05:59 +0000

This year, EHealth Insider brought a new feature to their annual EHI Live exhibition: a healthcare skunkworks that gave visitors the chance to ask questions about how open source software can be used to solve healthcare problems.

ScraperWiki, of course, had to be one of the invited guests exhibiting at the skunkworks. So, as is our way, we ran an agile data mining sprint on the first day of the exhibition. The idea was to convene a small group of developers, give them coffee and an Internet connection, and see if they could create useful healthcare and NHS data sets by the end of the day. Attendees at the ScraperWiki exhibit could watch development progress on the scrapers in real time! It was thrilling!

Four developers participated in the sprint, from ScraperWiki and NHS Connecting for Health. By the end of the day, they had written multiple scrapers delivering data about:

* World Health Organisation outbreak alerts and responses
* Communicable and respiratory disease incidence data from the Royal College of GPs
* Health information standards from the NHS Information Standards Board
* Foodborne outbreaks in the US, from the Centers for Disease Control and Prevention
* Suppliers registered with the UK Government Procurement Service

One very lucky developer, Jacob Martin, from NHS Connecting for Health, won the coveted ScraperWiki mug for writing the most scrapers over the course of the day (*applause*).

But it’s not just about the scraping: it’s about how enlightening the ideals of ‘open’ can be in such a short period of time, given the will and the right equipment. As Shaun Hills, from NHS Connecting for Health, commented: “Interoperability and data exchange are important parts of healthcare IT. It was interesting and useful to see how technology like ScraperWiki can be used in this area. It was also good to brush up on my Python coding and still deliver something in a few hours.”

So watch out healthcare – you’re being ScraperWikied!

How to get along with an ASP webpage
https://blog.scraperwiki.com/2011/11/how-to-get-along-with-an-asp-webpage/
Wed, 09 Nov 2011 12:14:02 +0000

Fingal County Council of Ireland recently published a number of sets of Open Data, in nice clean CSV, XML and KML formats.

Unfortunately, the one set of Open Data that was difficult to obtain was the list of the sets of open data itself. That’s because the list was split across four separate pages.

The important thing to observe is that the Next >> link is no ordinary link. You can see something is wrong when you hover your cursor over it. Here’s what it looks like in the HTML source code:

<a id="lnkNext" href="javascript:__doPostBack('lnkNext','')">Next >></a>

What it does (instead of taking the browser to the next page) is execute the javascript function __doPostBack().

Now, this could take so long to untangle by stepping through the javascript code that it would be a hopeless waste of time – but for the fact that this is code generated by Microsoft, and there are literally millions of webpages that work in exactly the same way.

This __doPostBack() javascript function is always on the page and it’s always the same, if you look at the HTML source.

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
     theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>

So what it’s doing is putting the two arguments from the function call (in this example ‘lnkNext’ and an empty string) into two hidden inputs of the form with id “form1”, and then submitting the form back to the server as a POST request.

Let’s try to look at the form. Here is some Python code which ought to do it.

import mechanize
br = mechanize.Browser()
br.open("http://data.fingal.ie/ViewDataSets/")
br.select_form("form1")
print br.form

Unfortunately this doesn’t work, because the form has no name. Here is how it appears in the HTML:

<form method="post" action="" id="form1">
<div class="aspNetHidden">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKMjA4MT...an insanely long ascii string..." />
...the entire rest of the webpage...
</form>

The javascript selects it by its id which, unfortunately, mechanize doesn’t allow. Fortunately there is only one form in the whole page, so we can select it as the first form on the page:

import mechanize
br = mechanize.Browser()
br.open("http://data.fingal.ie/ViewDataSets/")
br.select_form(nr=0)
print br.form

What do we get?

 <POST http://data.fingal.ie/ViewDataSets/ application/x-www-form-urlencoded
<HiddenControl(__VIEWSTATE=/wEPDwUKMjA4...  and so on ) (readonly)>
<HiddenControl(__EVENTVALIDATION=/wEWVQK...  and so on ) (readonly)>
<TextControl(txtSearch=Search DataSets)>
<TextControl(txtSearch=Search DataSets)>
<SubmitControl(btnSearch=Search) (readonly)>
<SelectControl(ddlOrder=[*Title, Agency, Rating])>>

Oh dear. What has happened to the __EVENTTARGET and __EVENTARGUMENT which I am going to have to put values in when I am simulating the __doPostBack() function?

I don’t really know.

What I do know is that if you insert the following line:

br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

just before the line that says br.open() to include some headers that are recognized by the Microsoft server software, then you get them back:

<POST http://data.fingal.ie/ViewDataSets/ application/x-www-form-urlencoded
<HiddenControl(__EVENTTARGET=) (readonly)>
<HiddenControl(__EVENTARGUMENT=) (readonly)>
<HiddenControl(__LASTFOCUS=) (readonly)>
...

Right, so all we need to do to get to the next page is fill in their values and submit the form, like so:

br["__EVENTTARGET"] = "lnkNext"
br["__EVENTARGUMENT"] = ""
response = br.submit()
print response.read()

Whoops, that doesn’t quite work, because those two controls are readonly. Luckily there is a function in mechanize to make this problem go away, which looks like:

br.set_all_readonly(False)

So let’s put this all together, including pulling out those special values from the __doPostBack() javascript to make it more general and putting it into a loop.

How about it?

import mechanize
import re

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open("http://data.fingal.ie/ViewDataSets/")

for i in range(10):
    html = response.read()
    print "Page %d :" % i, html

    br.select_form(nr=0)
    print br.form
    br.set_all_readonly(False)
    mnext = re.search(r"""<a id="lnkNext" href="javascript:__doPostBack\('(.*?)','(.*?)'\)">Next >>""", html)
    if not mnext:
        break
    br["__EVENTTARGET"] = mnext.group(1)
    br["__EVENTARGUMENT"] = mnext.group(2)
    response = br.submit()

It still doesn’t quite work! This stops at two pages, but you know there are four.

What is the problem?

The problem is this SubmitControl in the list of controls in the form:

<SubmitControl(btnSearch=Search) (readonly)>

You think you are submitting the form, when in fact you are clicking on the Search button, which then takes you to a page you are not expecting that has no Next >> link on it.

If you disable that particular SubmitControl before submitting the form

br.find_control("btnSearch").disabled = True

then it works.
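For completeness, here is the loop with that fix folded in; a sketch along the lines above rather than a guaranteed-working scraper, since the page may well have changed since this was written:

import mechanize
import re

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open("http://data.fingal.ie/ViewDataSets/")

for i in range(10):
    html = response.read()
    print "Page %d" % i

    # ...parse the list of datasets out of html here...

    mnext = re.search(r"""<a id="lnkNext" href="javascript:__doPostBack\('(.*?)','(.*?)'\)">Next >>""", html)
    if not mnext:
        break

    br.select_form(nr=0)
    br.set_all_readonly(False)
    br.find_control("btnSearch").disabled = True   # don't accidentally click the Search button
    br["__EVENTTARGET"] = mnext.group(1)
    br["__EVENTARGUMENT"] = mnext.group(2)
    response = br.submit()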

From here on it’s plain sailing. All you need to do is parse the html, follow the normal links, and away you go!

In summary

1. You need to use mechanize because these links are operated by javascript and a form

2. Clicking the link involves copying the arguments from __doPostBack() into the __EVENTTARGET and __EVENTARGUMENT HiddenControls.

3. You must set readonly to False so you can even write to those values.

4. You must set the User-agent header or the server software doesn’t know what browser you are using and returns something that can’t possibly work.

5. You must disable all extraneous SubmitControls in the form before calling submit()

Some of these tricks have taken me a day to learn and resulted in me almost giving up for good. So I am passing on this knowledge in the hope that it can be used. There are other tricks of the trade I have picked up regarding ASP pages that there is no time to pass on here because the examples are even more involved and harder to get across.

What we need is an ASP pages working group among the ScraperWiki diggers who take on this type of work. Anyone who is faced with one of these jobs should be able to bring it to the team, and we’ll take a look at it as a group of experts with the relevant knowledge. I expect problems that would take someone who hasn’t done this before a week – or make them give up before they’ve even got started – to be disposed of within half an hour.

This is how we can produce results.

Meet all our users!!!!
https://blog.scraperwiki.com/2011/10/meet-all-our-users/
Fri, 21 Oct 2011 17:02:22 +0000

Just as I have been looking for you great guys, to highlight your work, passion and scraping oddities, so your time has come to explore the craven habits of the new-age data-digging programmers! You can now search for people on ScraperWiki!!!! Just use our regular search box.

So all you ScraperWikians out there, I suggest you fill in your profile, put up a picture of your beautiful faces (or cats, whatever floats your boat) and introduce yourself to the data-curious world. If you search for ‘scraperwiki‘ on Twitter you’ll also find conversations between developers linking to their scrapers. If you want the wider scraping community to get in touch, please include your Twitter handle or links to any of your other public accounts.

So let’s spread the ScraperWikiLovin’ but please no unsavoury solicitations. There are plenty of other services for that and if you are resorting to a data wrangling platform you’ve got serious issues.

ScraperWiki Tutorial Screencast for Non-Programmers
https://blog.scraperwiki.com/2011/08/scraperwiki-tutorial-screencast-for-non-programmers/
Mon, 15 Aug 2011 17:24:45 +0000

If you’ve been going through our first ambitious tutorial and taster session for non-coders then good for you! I hope you found it enlightening. For those of you yet to try it, here it is.

It is a step-by-step guide, so please give it a go rather than just following the answers – you’ll learn more from rummaging around our site. Also check out the introductory video at the start of the tutorial if you’re not familiar with ScraperWiki. And don’t look at the answers, which are in screencast form below, unless you have had a go!

Here’s the twitter scraper and datastore download. This is the first part of the tutorial where you fork (make a copy of) a basic Twitter scraper, run it for your chosen query, download the data and schedule it to run at a frequency to allow the data to be refreshed and accumulated:

The next one is a SQL Query View which looks at the data with a journalistic eye in ScraperWiki. This is the second part of the tutorial where you look into a datastore using the SQL language and find out which are the top 10 publications receiving complaints from the Press Complaints Commission and also who are the top 10 making the complaints:

And last we show you how to get a live league table view that updates with a scraper. This is the final part of the tutorial where you make a live league table of the above query that refreshes when the original scraper updates:

If you have any questions please feel free to contact me nicola[at]scraperwiki.com. For full training sessions or scraping projects like OpenCorporates or AlphaGov contact aine[at]scraperwiki.com
