As part of the my London Underground visualisation project I wanted to get data out of a table on Wikipedia, you can see it below. It contains data on every London Underground station including things like the name of the station, the opening date, which zone it is in, how many passengers travel through it and so forth.
Such tables can be copy-pasted directly into Excel or Tableau but the result is a mess with extraneous lines and so forth which needs manual editing for us to work with it. Alternatively we can use the ScraperWiki Table Xtract tool to get the data in rather cleaner form, we can see the result of doing this below. It looks pretty good, the Station name and Lines columns come out nicely and there is only one row per station, and no blank rows. But something weird is going on in the numeric and date fields, characters have been appended to the data we can see in the table.
It turns out these extra characters are a result of invisible text added to the tables to make the table sortable by those columns. This “invisible” text can be seen by inspecting the source of the HTML page. There are various ways of making text invisible on a web page but Wikipedia seems to just use one in it’s sortable tables. Once I had identified the issue it was just a case of writing some code to hide the invisible text in the Table Xtract tool. To do this I modified the messytables library on which Table Xtract is built, you can see the modification here. The stringent code review requirements at ScraperWiki meant I had two goes at making the change!
You can see the result in the screenshot below, the Opened, Mainline Opened and Usage columns now are free of extraneous text. This fix should apply across Wikipedia and also to tables on other web pages which use the same method to make text invisible.
We’re keen to incrementally improve our tools, so if there’s a little fix to any of our tools that you want us to make then please let us know!