ScraperWiki – Extract tables from PDFs and scrape the web

Hiding invisible text in Table Xtract Mon, 19 May 2014 08:09:32 +0000

As part of my London Underground visualisation project I wanted to get data out of a table on Wikipedia; you can see it below. It contains data on every London Underground station, including the name of the station, the opening date, which zone it is in, how many passengers travel through it, and so forth.


Such tables can be copy-pasted directly into Excel or Tableau, but the result is a mess, with extraneous lines that need manual editing before we can work with the data. Alternatively, we can use the ScraperWiki Table Xtract tool to get the data in rather cleaner form; you can see the result of doing this below. It looks pretty good: the Station name and Lines columns come out nicely, there is only one row per station, and there are no blank rows. But something weird is going on in the numeric and date fields: characters have been appended to the data we can see in the table.


It turns out these extra characters are a result of invisible text added to the tables to make the table sortable by those columns. This “invisible” text can be seen by inspecting the source of the HTML page. There are various ways of making text invisible on a web page, but Wikipedia seems to use just one in its sortable tables. Once I had identified the issue it was just a case of writing some code to hide the invisible text in the Table Xtract tool. To do this I modified the messytables library on which Table Xtract is built; you can see the modification here. The stringent code review requirements at ScraperWiki meant I had two goes at making the change!
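For the curious, the fix amounts to dropping any element whose inline style hides it before reading a cell’s text. Here is a minimal sketch of the idea in today’s Python, using only the standard library; it handles only the hiding technique described above (inline style="display:none"), and the class and function names are mine, not messytables’:

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text content, skipping anything inside a display:none element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # how many nested hidden elements we are inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Normalise the style attribute so "display: none" also matches
        style = dict(attrs).get("style", "").replace(" ", "")
        if self.hidden_depth or "display:none" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)

def visible_text(cell_html):
    parser = VisibleText()
    parser.feed(cell_html)
    return "".join(parser.parts).strip()

# A sortable-date cell of the kind described above: the leading sort key
# is invisible in the browser but shows up in a naive text extraction.
cell = '<td><span style="display:none">01863-12-10</span>10 December 1863</td>'
print(visible_text(cell))  # 10 December 1863
```

A naive `text_content()`-style extraction of that cell would yield “01863-12-1010 December 1863”, which is exactly the kind of junk that was appearing in the numeric and date columns.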

You can see the result in the screenshot below: the Opened, Mainline Opened and Usage columns are now free of extraneous text. This fix should apply across Wikipedia, and also to tables on other web pages which use the same method to make text invisible.


We’re keen to incrementally improve our tools, so if there’s a little fix to any of our tools that you want us to make then please let us know!

Scraping the protests with Goldsmiths Fri, 09 Dec 2011 12:25:51 +0000

[Image: Google Map of Occupy protests around the world]

Zarino here, writing from carriage A of the 10:07 London-to-Liverpool (the wonders of the Internet!). While our new First Engineer, drj, has been getting to grips with lots of the under-the-hood changes which’ll make ScraperWiki a lot faster and more stable in the very near future, I’ve been deploying ScraperWiki out on the frontline, with some brilliant Masters students at CAST, Goldsmiths.

I say brilliant because these guys and girls, with pretty much no scraping experience but bags of enthusiasm, managed (in just three hours) to pull together a seriously impressive map of Occupy protests around the world. Using data from no less than three individual Wikipedia articles, they parsed, cleaned, collated and geolocated almost 600 protests worldwide, and then visualised them over time using a ScraperWiki view. Click here to take a look.

Okay, I helped a bit. But still, it really drove home how perfect ScraperWiki is for diving into a sea of data, quickly pulling out what you need, and then using it to formulate bigger hypotheses, flag up possible stories, or gather constantly fresh intelligence about an untapped field. There was this great moment when the penny suddenly dropped and these journalists, activists and sociologists realised what they’d been missing all this time.


But the penny also dropped for me, when I saw how suited ScraperWiki is to a role in the classroom. The path to becoming a data science ninja is a long and steep one, and despite the amazing possibilities fresh, clean and accountable data holds for everybody from anthropologists to zoologists, getting that first foot on the ladder is a tricky task. ScraperWiki was never really built as a learning environment, but with so little else out there to guide learners, it fulfils the task surprisingly well. Students can watch their tutor editing and running a scraper in real time, right alongside their own work, right inside their web browser. They can take their own copy, hack it, and then merge the data back into a classroom pool. They can use it for assignments, and when the built-in documentation doesn’t answer their questions, there’s a whole community of other developers on there, and a whole library of living, working examples of everything from Cabinet Office tweeting to global shark activity. They can try out a new language (maybe even their first language) without worrying about local installations, plugins or permissions. And then they can share what they make with their classmates, tutors, and the rest of the world.

Guys like these, with tools like ScraperWiki behind them, are going to take the world by storm. I can’t wait to see what they cook up.

How to scrape and parse Wikipedia Wed, 07 Dec 2011 14:50:04 +0000

Today’s exercise is to create a list of the longest and deepest caves in the UK from Wikipedia. Wikipedia pages for geographical structures often contain Infoboxes (the panel on the right-hand side of the page).

The first job was for me to design a Template:Infobox_ukcave which was fit for purpose. Why ukcave? Well, if you’ve got a spare hour you can check out the discussion considering its deletion, between the immovable object (American cavers who believe cave locations are secret) and the irresistible force (Wikipedian editors who believe that you can’t have two templates for the same thing, except when they are in different languages).

But let’s get on with some Wikipedia parsing. Here’s what doesn’t work:

import urllib
print urllib.urlopen("").read()

because it returns a rather ugly error, which at the moment is: “Our servers are currently experiencing a technical problem.”

What they would much rather you do is go through the Wikipedia API and get the raw source code in XML form without overloading their servers.

To get the text from a single page requires the following code:

import lxml.etree
import urllib

title = "Aquamole Pot"

params = { "format":"xml", "action":"query", "prop":"revisions", "rvprop":"timestamp|user|comment|content" }
params["titles"] = "API|%s" % urllib.quote(title.encode("utf8"))
qs = "&".join("%s=%s" % (k, v)  for k, v in params.items())
url = "" % qs
tree = lxml.etree.parse(urllib.urlopen(url))
revs = tree.xpath('//rev')

print "The Wikipedia text for", title, "is"
print revs[-1].text

Note how I am not using urllib.urlencode to convert params into a query string. This is because the standard function converts all the ‘|’ symbols into ‘%7C’, which the Wikipedia API doesn’t accept.
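As an aside, today’s Python 3 standard library can sidestep the manual join entirely: urllib.parse.urlencode accepts a safe parameter that is passed through to the underlying quoting function, so pipes can be left alone. A small illustration (the parameter values are just the ones from the script above):

```python
from urllib.parse import urlencode

params = {"format": "xml", "action": "query", "titles": "API|Aquamole Pot"}

# Default quoting escapes the pipe, which the API treats as a
# multi-title separator and so needs to receive literally:
print(urlencode(params))
# format=xml&action=query&titles=API%7CAquamole+Pot

# Passing safe='|' leaves the separator intact:
print(urlencode(params, safe="|"))
# format=xml&action=query&titles=API|Aquamole+Pot
```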

The result is:

{{Infobox ukcave
| name = Aquamole Pot
| photo =
| caption =
| location = [[West Kingsdale]], [[North Yorkshire]], England
| depth_metres = 113
| length_metres = 142
| coordinates =
| discovery = 1974
| geology = [[Limestone]]
| bcra_grade = 4b
| gridref = SD 698 784
| location_area = United Kingdom Yorkshire Dales
| location_lat = 54.19082
| location_lon = -2.50149
| number of entrances = 1
| access = Free
| survey = []
}}
'''Aquamole Pot''' is a cave on [[West Kingsdale]], [[North Yorkshire]],
England which was first discovered from the
bottom by cave diving through 550 feet of
sump from [[Rowten Pot]] in 1974....

This looks pretty structured. All ready for parsing. I’ve written a nice complicated recursive template parser that I use in wikipedia_utils, which makes it easy to extract all the templates from the page in the following way:

import scraperwiki
wikipedia_utils = scraperwiki.swimport("wikipedia_utils")

title = "Aquamole Pot"

val = wikipedia_utils.GetWikipediaPage(title)
res = wikipedia_utils.ParseTemplates(val["text"])
print res               # prints everything we have found in the text
infobox_ukcave = dict(res["templates"]).get("Infobox ukcave")
print infobox_ukcave    # prints just the ukcave infobox

This now produces the following Python data structure that is almost ready to push into our database — after we have converted the length and depths from strings into numbers:

{0: 'Infobox ukcave', 'number of entrances': '1',
 'location_lon': '-2.50149',
 'name': 'Aquamole Pot', 'location_area': 'United Kingdom Yorkshire Dales',
 'geology': '[[Limestone]]', 'gridref': 'SD 698 784', 'photo': '',
 'coordinates': '', 'location_lat': '54.19082', 'access': 'Free',
 'caption': '', 'survey': '[]',
 'location': '[[West Kingsdale]], [[North Yorkshire]], England',
 'depth_metres': '113', 'length_metres': '142', 'bcra_grade': '4b', 'discovery': '1974'}
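That conversion step is simple but worth guarding, since infobox fields are free text and sometimes blank. A sketch (field names taken from the output above; the helper name is mine):

```python
cave = {
    "name": "Aquamole Pot",
    "depth_metres": "113",
    "length_metres": "142",
    "discovery": "1974",
}

def to_number(value):
    """Convert an infobox string to a float, leaving blanks or junk as None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Convert just the fields we want to sort on, in place
for key in ("depth_metres", "length_metres"):
    cave[key] = to_number(cave.get(key))

print(cave["depth_metres"], cave["length_metres"])  # 113.0 142.0
```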

Right. Now to deal with the other end of the problem. Where do we get the list of pages with the data?

Wikipedia is, unfortunately, radically categorized, so Aquamole_Pot is inside Category:Caves_of_North_Yorkshire, which is in turn inside Category:Caves_of_Yorkshire, which is then inside Category:Caves_of_England, which is finally inside Category:Caves_of_the_United_Kingdom.
So, in order to get all of the caves in the UK, I have to iterate through all the subcategories and all the pages in each category and save them to my database.

Luckily, this can be done with:

lcavepages = wikipedia_utils.GetWikipediaCategoryRecurse("Caves_of_the_United_Kingdom")
scraperwiki.sqlite.save(["title"], lcavepages, "cavepages")
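The inner workings of GetWikipediaCategoryRecurse() aren’t shown in the post, but the job boils down to a cycle-guarded walk over category members. Here is a sketch of that walk running against a toy in-memory category map instead of live API calls; the function names and the stand-in graph are illustrative:

```python
def category_members(cat, graph):
    """Stand-in for an API call returning (subcategories, pages) of a category."""
    return graph.get(cat, ([], []))

def recurse_category(cat, graph, seen=None):
    """Collect all pages under a category, following subcategories,
    guarding against the cycles Wikipedia's category graph can contain."""
    if seen is None:
        seen = set()
    if cat in seen:
        return []
    seen.add(cat)
    subcats, pages = category_members(cat, graph)
    result = list(pages)
    for sub in subcats:
        result.extend(recurse_category(sub, graph, seen))
    return result

# Toy version of the category chain described above
graph = {
    "Caves_of_the_United_Kingdom": (["Caves_of_England"], []),
    "Caves_of_England": (["Caves_of_Yorkshire"], []),
    "Caves_of_Yorkshire": (["Caves_of_North_Yorkshire"], []),
    "Caves_of_North_Yorkshire": ([], ["Aquamole Pot", "Gaping Gill"]),
}
print(recurse_category("Caves_of_the_United_Kingdom", graph))
# ['Aquamole Pot', 'Gaping Gill']
```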

All of this adds up to my current scraper wikipedia_longest_caves, which extracts those infobox tables from caves in the UK and puts them into a form where I can sort them by length, creating this table based on the query SELECT name, location_area, length_metres, depth_metres, link FROM caveinfo ORDER BY length_metres DESC:

name                   location_area                   length_metres  depth_metres
Ease Gill Cave System  United Kingdom Yorkshire Dales  66000.0        137.0
Dan-yr-Ogof            Wales                           15500.0
Gaping Gill            United Kingdom Yorkshire Dales  11600.0        105.0
Swildon’s Hole         Somerset                        9144.0         167.0
Charterhouse Cave      Somerset                        4868.0         228.0

If I were being smart I could make the scraping adaptive, that is, only update the pages that have changed since the last scrape, using the data returned by GetWikipediaCategoryRecurse(), but the dataset is small enough at the moment.
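That adaptive approach would amount to comparing each page’s last-modified timestamp against the time of the previous scrape. A sketch, assuming each page record carries a ‘touched’ ISO timestamp of the kind the MediaWiki API can supply (the field and function names here are mine):

```python
from datetime import datetime, timezone

def pages_to_update(pages, last_scraped):
    """Keep only pages whose last revision is newer than our last scrape."""
    return [p for p in pages
            if datetime.fromisoformat(p["touched"]) > last_scraped]

pages = [
    {"title": "Aquamole Pot", "touched": "2011-12-01T10:00:00+00:00"},
    {"title": "Gaping Gill", "touched": "2011-10-05T09:00:00+00:00"},
]
last = datetime(2011, 11, 1, tzinfo=timezone.utc)

# Only the page edited since the last scrape needs re-fetching
print([p["title"] for p in pages_to_update(pages, last)])  # ['Aquamole Pot']
```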

So, why not use DBpedia?

I know what you’re saying: Surely the whole of DBpedia does exactly this, with their parser?

And that’s fine if you don’t mind your updates taking six months or more to come through, which prevents you from getting any timely feedback when adding new caves into Wikipedia, like Aquamole_Pot.

And it’s also fine if you don’t want to be stuck with the naïve semantic web notion that the boundaries between entities are a simple, straightforward and general concept, rather than what they really are: probably the one deep and fundamental question within any specific domain of knowledge.

I mean, what is the definition of a singular cave, really? Is it one hole in the ground, or is it the vast network of passages which link up into one connected system? How good do those connections have to be? Are they defined hydrologically by dye tracing, or is a connection defined as the passage of one human body getting itself from one set of passages to the next? In the extreme cases this can be done by cave diving through an atrocious sump which no one else is ever going to do again, or by digging and blasting through a loose boulder choke that collapses in days after one nutcase has crawled through. There can be no tangible physical definition. So we invent the rules for the definition. And break them.

So while theoretically all the caves on Leck Fell and Easgill have been connected into the Three Counties System, we’re probably going to agree to continue to list them as separate historic caves, as well as some sort of combined listing. And that’s why you’ll get further treating knowledge domains as special cases.
