A tool to help with your next job move Tue, 05 May 2015 07:32:49 +0000 A guest post from Jyl Djumalieva.

Jyl Djumalieva

During February and March this year I had a wonderful opportunity to share a workspace with the ScraperWiki team. As an aspiring data analyst, I found it very educational to see how real-life data science happens. After watching the ScraperWiki data scientists do some analytical heavy lifting, I was inspired to embark on an exploratory analytics project of my own. That is what I’d like to talk about in this blog post.

The world of work has fundamentally changed: employees no longer grow via vertical promotions. Instead they develop transferable skills and learn new skills to reinvent their careers from time to time.

As people go through career transitions, they increasingly need tools to help them make an informed choice about which opportunities to pursue. Imagine you were considering your career options: you would benefit from a high-level overview of what jobs are available and where. To prioritise your choices and guide personal development, you might also want to know the average salary for a particular job, as well as the skills and technology commonly used. In addition, perhaps you would appreciate insights into which jobs are related to your current position and could be your next move.

So, have tools that meet all of these requirements existed before? As far as I am aware: no. The good news is that there is a lot of job market information available. The challenge, however, is that actual job postings are not linked directly to job metadata (e.g. average salary, tools and technology used, key skills, and related occupations). This is why I decided to bring the two sources of information together; you can see the result, a merging of job postings with metadata, in the Tableau workbook here. For this purpose I primarily used Python, publicly available data and Tableau Public, not to mention the excellent guidance of the ScraperWiki data science team.

The source of the actual job postings was the Indeed job board. I chose this site because it aggregates job postings, including vacancies from companies’ internal applicant tracking systems in addition to those published on open job boards. The site also has an API, which allows you to search job postings using multiple criteria and retrieve the results in XML format. For this example, I used the API to collect data for all job postings in the state of New York that were available on March 10, 2015. The script used to accomplish this task is posted to this bitbucket repository.
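The collection script itself lives in the bitbucket repository; as a rough sketch of the approach, paging through Indeed's XML search API might look something like this (the endpoint and parameter names are recalled from the API documentation of the time, and `parse_results` is an illustrative helper, not code from the repository):

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "http://api.indeed.com/ads/apisearch"  # Indeed's XML search API

def parse_results(xml_bytes):
    """Turn the API's XML payload into a list of dicts, one per posting."""
    root = ET.fromstring(xml_bytes)
    return [{child.tag: child.text for child in result}
            for result in root.iter("result")]

def fetch_jobs(query, location, publisher_id, start=0, limit=25):
    """Fetch one page of postings matching the query."""
    params = {"publisher": publisher_id,  # your Indeed publisher key
              "q": query,                 # search terms, e.g. "teacher"
              "l": location,              # e.g. "New York, NY"
              "format": "xml",
              "start": start, "limit": limit}
    with urlopen(API_URL + "?" + urlencode(params)) as resp:
        return parse_results(resp.read())
```

Looping over `start` in steps of `limit` collects the full result set for a state-wide search.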

Teaching jobs from Indeed in New York State, February 2015

To put in place the second element of the project – job metadata – I gathered information from O*NET, the Occupational Information Network, a valuable online source of occupational information implemented under the sponsorship of the US Department of Labor/Employment and Training Administration. O*NET provides an API for accessing most aspects of each occupation, such as skills, knowledge, abilities, tasks and related jobs. It’s also possible to scrape data on average reported wages for each occupation directly from the O*NET website.

So, at this point, we have two types of information: actual job postings and job metadata. However, we need to find a way to link the two to turn this into a useful analytical tool. This proved to be the most challenging part of the project. Real-world job titles are not always easily assigned to a particular occupation. For instance, what do “Closing Crew” employees do, or what is the actual occupation of an “IT Invoice Analyst?”

The process for dealing with this challenge included exact and fuzzy matching of actual job titles against previously reported job titles, and assigning an occupation based on a keyword in the job title (e.g., “Registered Nurse”). The fuzzywuzzy Python library was a great resource for this task. I intend to continue improving the accuracy and efficiency of the title-matching algorithm.
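To illustrate the matching step, here is a minimal sketch using the standard library's `difflib` as a stand-in for fuzzywuzzy (the real project used fuzzywuzzy's `process.extractOne`; the title-to-occupation table below is invented for the example):

```python
from difflib import SequenceMatcher

# A tiny, made-up sample of reported job titles mapped to O*NET codes.
REPORTED_TITLES = {
    "Registered Nurse": "29-1141.00",
    "Software Developer": "15-1252.00",
    "Retail Salesperson": "41-2031.00",
}

def match_occupation(job_title, threshold=0.6):
    """Exact match first; otherwise pick the closest reported title.

    fuzzywuzzy's process.extractOne does the fuzzy step in the real
    project; SequenceMatcher is a stdlib approximation of the idea.
    """
    if job_title in REPORTED_TITLES:
        return REPORTED_TITLES[job_title]
    best_title, best_score = None, 0.0
    for title in REPORTED_TITLES:
        score = SequenceMatcher(None, job_title.lower(), title.lower()).ratio()
        if score > best_score:
            best_title, best_score = title, score
    return REPORTED_TITLES[best_title] if best_score >= threshold else None
```

Titles below the similarity threshold return `None`, which is where the keyword rules take over.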

After collecting, extracting and processing the data, the next step was to find a way to visualize it. Tableau Public turned out to be a suitable tool for this task and made it possible to “slice and dice” the data from a variety of perspectives. In the final workbook, the user can search the actual job postings for NY state as well as see the information on geographic location of the jobs, average annual and hourly wages, skills, tools and technology frequently used and related occupations.

Next move, tools and technologies

Next move, tools and technologies from the Tableau tool

I encourage everyone who wants to understand what’s going on in their job sector to check my Tableau workbook and bitbucket repo out. Happy job transitioning!

Inordinately fond of beetles… reloaded! Wed, 10 Sep 2014 12:46:22 +0000 Some time ago, in the era before I joined ScraperWiki, I had a play with the Science Museum’s object catalogue. You can see my previous blog post here. It was at a time when I was relatively inexperienced with the Python programming language and had no access to Tableau, the visualisation software. It’s a piece of work I like to talk about when meeting customers since it’s interesting and I don’t need to worry about commercial confidentiality.

The title comes from a quote by J.B.S. Haldane, who was asked what his studies in biology had told him about the Creator. His response was that, if He existed, then He was “inordinately fond of beetles”.

The Science Museum catalogue comprises three CSV files containing information on objects, media and events. I’m going to focus on the object catalogue since it’s the biggest by a large margin: 255,000 objects in a 137MB file. Each object has an ID number, which often encodes the year in which the object was added to the collection; a title; some description; often an “item name”, which describes the type of object; and sometimes information on the date made, the maker, measurements and whether it represents part or all of an object. Finally, the objects are labelled according to which collection they come from and which broad group in that collection; the catalogue contains objects from the Science Museum, National Railway Museum and National Media Museum collections.

The problem with most of these fields is that they don’t appear to come from a controlled vocabulary.

Dusting off my three-year-old code, I was pleased to discover that the SQL I had written to upload the CSV files into a database worked almost first time, bar a little character encoding. The Python code I’d used to clean the data, do some geocoding, analysis and visualisation was not in such a happy state. Or rather, having looked at it, I was not in such a happy state. I had paid no attention to PEP-8 (the Python style guide), used no source control, written no tests, and was clearly confused as to how to save a dictionary (I pickled it).
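For the record, a plain dictionary is usually better saved as JSON than pickled: the file is human-readable, language-neutral and diff-friendly. A minimal sketch:

```python
import json

# Made-up numbers, standing in for whatever the analysis produced.
item_counts = {"poster": 1200, "bottle": 950}

# JSON is readable and portable, unlike a pickle.
with open("item_counts.json", "w") as f:
    json.dump(item_counts, f, indent=2)

with open("item_counts.json") as f:
    restored = json.load(f)
```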

In the first iteration I eyeballed the data as a table and identified a whole bunch of stuff I thought I needed to tidy up. This time around I loaded everything into Tableau and visualised everything I could, typically as bar charts. This revealed that my previous clean-up efforts were probably not necessary, since the things I was tidying affected a relatively small number of items. I did need to repeat the geocoding, which I had used to clean up the place-of-manufacture field, which was encoded inconsistently. Using the Google API via a Python library, I could normalise the place names and get their locations as latitude–longitude pairs to plot on a map. I also made sure I kept a link back to the original place name description.
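The shape of that geocoding step can be sketched as follows. The `geocode` callable stands in for whichever Google-backed library call you use (geopy's `GoogleV3().geocode` is one option); the point of the sketch is the caching and the link kept back to the original description:

```python
def geocode_places(descriptions, geocode, cache=None):
    """Normalise free-text place descriptions to (name, lat, lon).

    `geocode` is any callable returning (canonical_name, lat, lon)
    or None; results are cached so each distinct string is looked
    up once, and the raw description is preserved in the output.
    """
    cache = {} if cache is None else cache
    rows = []
    for desc in descriptions:
        if desc not in cache:
            cache[desc] = geocode(desc)
        result = cache[desc]
        if result is not None:
            name, lat, lon = result
            rows.append({"original": desc, "place": name,
                         "lat": lat, "lon": lon})
    return rows
```

Injecting the geocoder as a parameter also makes the normalisation logic testable without hitting the API.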

The first time around I was excited to discover the Many Eyes implementation of bubble charts; this time I realise bubble charts are not so useful, as you can see below in these charts showing the number of items in each subgroup. In a sorted bar chart it is very obvious which subgroup is most common, and what the relative sizes of the subgroups are. I’ve coloured the bars by the major collection to which they belong: red is the Science Museum, green is the National Railway Museum and orange is the National Media Museum.


Less discerning members of ScraperWiki still liked the bubble charts.


We can see what’s in all these collections from the item name field. This is where we discover that the Science Museum is inordinately fond of bottles. The most common items in the collection are posters, mainly from the National Railway Museum, but after that there are bottles, specimen bottles, specimen jars, shops rounds (also bottles), bottle, drug jars, and albarellos (also bottles). This is no doubt because bottles are typically made of durable materials like glass and ceramics, they have been ubiquitous in many milieux, and they may contain many and various interesting things.


Finally I plotted the place made for objects in the collection. This works by grouping objects by location and then finding latitude and longitude for each group. I then plot a disk sized by the number of items originating at that location. I filtered out items whose place made was simply “England” or “London”, since these made enormous blobs that dominated the map.




You can see a live version of these visualisations, and more, on Tableau Public.

It’s become a pattern that my first action on uploading any data like this to Tableau is to make bar chart frequency plots for each column in the data; this could probably be automated.
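Automating that first look is simple enough; here is a standard-library sketch of the frequency tables behind those bar charts (with pandas you would use `value_counts()` per column instead):

```python
from collections import Counter

def column_frequencies(rows, top=10):
    """For each column in a list of dict records, return the most
    common values with their counts, i.e. the data behind a sorted
    bar chart frequency plot."""
    columns = {}
    for row in rows:
        for col, value in row.items():
            columns.setdefault(col, Counter())[value] += 1
    return {col: counter.most_common(top)
            for col, counter in columns.items()}
```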

In summary, the Science Museum is full of bottles and posters, and Tableau wins for initial visualisations of a large and complex dataset.

GeoJSON into ScraperWiki will go! Fri, 22 Aug 2014 08:04:20 +0000 Surely everyone likes things on maps?

Driven by this thought, we’ve produced a new tool for the ScraperWiki Platform: an importer for GeoJSON.

GeoJSON is a file format for encoding geographic information. It is based on JSON, which is popular for web-based APIs because it is lightweight, flexible and easy to parse in JavaScript – the language that powers the interactive web.
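To give a flavour of the format, here is a minimal GeoJSON FeatureCollection built in Python with nothing but the standard library; note that GeoJSON coordinates are in [longitude, latitude] order:

```python
import json

feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            # GeoJSON coordinates are [longitude, latitude]
            "geometry": {"type": "Point",
                         "coordinates": [2.3522, 48.8566]},
            "properties": {"name": "Paris street art (example)"},
        }
    ],
}

geojson_text = json.dumps(feature_collection)
```

A file of this shape is exactly what the importer consumes.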

We envisage the GeoJSON importer being handy to visualise geographic data, and export that data to software like Tableau using our OData connector.

Why should I import GeoJSON into the ScraperWiki Platform?

Importing any data to the ScraperWiki Platform allows you to visualise your data using tools like View in a Table, or Summarise this Data which is great for this GeoJSON of Parisian Street Art:


In addition you can use tools such as download to CSV or Excel, so it will act as a file converter.

An improved View on a Map tool

We’ve improved the View on a Map tool so you can visualise GeoJSON data right on the Platform. We found that if we tried to plot 10,000 points on a map it all got a bit slow and difficult to use, so we added point clustering. Now, if you have a map with lots of points, nearby points are clustered together under a symbol showing the number of points in the cluster. The colour of the symbol shows the density of points… a picture paints a thousand words, so see the results below for a map of Manchester’s grit bins:


Linking to Tableau using OData

Or you could use the OData connector to attach directly to Tableau; we did this with some data from GeoNet on earthquakes around New Zealand. We’ve provided instructions on how to do this in an earlier blog post. If you want to try an interactive version of the Tableau visualisation, it’s here.



What will you do with the GeoJSON tool?

Getting all the hash tags, user mentions… Tue, 03 Jun 2014 13:46:01 +0000 We’ve rolled out a change so you get more data when you use the Twitter search tool!

Multiple media entities

We’ve changed four columns. Each used to return just one item, chosen arbitrarily. Now they return all the items, separated by a space. The columns are:

  • hashtags now returns all of them with the hashes, e.g. #opendata #opendevelopment
  • user_mention has been renamed user_mentions, e.g. @tableau @tibco
  • media can now return multiple images and other things
  • url has been renamed urls and can return multiple links

We renamed two of the columns partly to reflect their new status, and partly because they now match the names in the Twitter API exactly.

What can you do with this new functionality?

We had a look at the numbers of media, hashtags, mentions and URLs for a collection of tweets on a popular hashtag (#kittens), using our favourite tool for this sort of work: Tableau. It requires a modicum of cunning to calculate the number of entries in a delimited list using Tableau functions. To count the numbers of entries in each field, we need to make a calculated field like this:

LEN([hashtags]) - LEN(REPLACE([hashtags],'#',''))

This is the calculation for hashtags, where I use # as a marker. You can do the same for mentions (using @ as the marker), and for URL and media use ‘http’ as a marker:

(float(LEN([urls]) - LEN(REPLACE([urls],'http','')))/4.0)

Hat-tip to Mark Jackson for that one.
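The same trick is easy to verify in Python: the number of occurrences of a marker is the length difference divided by the marker's length, which is exactly what the Tableau `LEN`/`REPLACE` pair computes:

```python
def count_marker(text, marker):
    """Count occurrences of a marker substring, the way the
    Tableau LEN/REPLACE calculated field does it."""
    return (len(text) - len(text.replace(marker, ""))) // len(marker)

hashtag_count = count_marker("#opendata #opendevelopment", "#")
url_count = count_marker("http://a.example http://b.example", "http")
```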

For URLs and media we see that most tweets contain only one item, although for URLs there are posts with up to six identical URLs, presumably in an attempt to get search engine benefits. The behaviour for mentions and hashtags is more interesting. Hashtags top out at a maximum of 19 in a single tweet, where every word has a hashtag.

The distribution is shown in the chart below. Each tweet is represented by a thin horizontal bar whose length depends on the number of hashtags; the bars are sorted by size, so the longest bar at the top represents the maximum number of hashtags.


For mentions we see that most tweets only mention one or two other users at most:


Thanks to Mauro Migliarini for suggesting this change.

Our new US stock market tool Wed, 14 May 2014 08:13:27 +0000 In a recent blog post, Ian talked about getting stock market data into Tableau using our Code in a Browser tool. We thought this was so useful that we’ve wrapped this up into an easy-to-use tool. Now you can get stock data by pressing a button and choosing the stocks you’re interested in, no code required!

All you have to do is enter some comma-separated stocks, for example: AAPL,FB,MSFT,TWTR and then press the Get Stocks button to collect all the data that’s available. Once you’ve set the tool running, the data continues to automatically update with the latest data daily. Just as with any other ScraperWiki dataset, you can view in a table, query with SQL or download the data as a spreadsheet for use elsewhere. With our new OData connector, you can also import the data directly into Tableau.

You can see Ian demonstrating the use of the US stock market tool, and using the OData tool to connect to Tableau in this YouTube video:

The London Underground: Should I walk it? Sun, 04 May 2014 18:09:43 +0000 With a second tube strike scheduled for Tuesday, I thought I should provide a useful little tool to help travellers cope! It is not obvious from the tube map, but London Underground stations can be surprisingly close together – well within walking distance.

Using this tool, you can select a tube station and the map will show you those stations which are within a mile and a half of it. 1.5 miles is my definition of a reasonable walking distance. If you don’t like it you can change it!

The tool is built using Tableau. The tricky part was allowing the selection of one station and measuring distances to all the others. Fortunately it’s a problem which has been solved, and documented, by Jonathan Drummey over on the Drawing with Numbers blog.

I used Euston as an origin station to demonstrate in the image below. I’ve been working at the Government Digital Service (GDS), sited opposite Holborn underground station, for the last couple of months. Euston is my mainline arrival station and I walk down the road to Holborn. Euston is coloured red in the map, and stations within a mile and a half are coloured orange. The label for Holborn does not appear by default but it’s the one between Chancery Lane and Tottenham Court Road. In the bottom right is a table which lists the walking distance to each station, Holborn appears just off the bottom and indicates a 17 minute walk – which is about right.

Should I walk it

The map can be controlled by moving to the top left, where controls should appear. Shift+left mouse button allows panning of the map view. A little glitch which I haven’t sorted out is that when you change the origin station, the table of stations does not re-sort automatically; the user must click on the distance label to re-sort. Any advice on how to make this happen automatically would be most welcome.

Distances and timings are approximate. I have the latitude and longitude for all the stations from my earlier London Underground project, which you can see here. I calculate the distances by taking the Euclidean distance between stations in angular units and multiplying by a factor which gives distances approximately the same as those in Google Maps. So it isn’t a true “as the crow flies” distance, but it is proportional to it. The walking times are calculated by assuming a walking speed of 3 miles an hour. If you put your cursor over a station you’ll see its name with the walking time and distance from your origin station.
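In code, the scheme described above amounts to something like the following (the scale factor here is an illustrative placeholder, not the value actually used, which was tuned against Google Maps distances):

```python
import math

# Hypothetical calibration: miles per degree of angular separation,
# chosen so results roughly agree with Google Maps distances.
MILES_PER_DEGREE = 48.0
WALKING_MPH = 3.0

def walking_estimate(lat1, lon1, lat2, lon2):
    """Scaled Euclidean distance in angular units, plus walking time
    in minutes at 3 mph."""
    angular = math.hypot(lat2 - lat1, lon2 - lon1)
    miles = angular * MILES_PER_DEGREE
    minutes = miles / WALKING_MPH * 60
    return miles, minutes
```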

A more sophisticated approach would be to extract more walking routes from Google Maps and use those to calculate distances and times. This would be rather more complicated to do and most likely not worth the effort, except perhaps if you are going south of the river.

Mine is not the only effort in this area, you can see a static map of walking distances here.

Try out the Tableau and QlikView connector Fri, 02 May 2014 08:02:52 +0000 In March, we launched an OData tool.

If you use Tableau or QlikView, it lets you easily get and refresh data from ScraperWiki.

Connect with OData

From today, the OData tool is now available on our new 30 day trial accounts.

Which means anyone can try it out for free!

Instructions here (particularly for Tableau).

Visualising the London Underground with Tableau Mon, 28 Apr 2014 13:57:57 +0000 I’ve always thought of the London Underground as a sort of teleportation system. You enter a portal in one place, and with relatively little effort appear at a portal in another place. Although in Star Trek our heroes entered a special room and stood well separated on platforms, rather than packing themselves into metal tubes.

I read Christian Wolmar’s book, The Subterranean Railway about the history of the London Underground a while ago. At the time I wished for a visualisation for the growth of the network since the text description was a bit confusing. Fast forward a few months, and I find myself repeatedly in London wondering at the horror of the rush hour underground. How do I avoid being forced into some sort of human compression experiment?

Both of these questions can be answered with a little judicious visualisation!

First up, the history question. It turns out that other obsessives have already made a table containing a list of the opening dates for the London Underground stations; you can find it here, on Wikipedia. These sortable tables are a little tricky to scrape: they can be copy-pasted into Excel, but random blank rows appear, and the data used to control the sorting of the columns confused our Table Xtract tool, until I fixed it – just to solve my little problem! You can see the number of stations opened in each year in the chart below. It all started in 1863; electric trains were introduced in the very final years of the 19th century, leading to a burst of activity. Then things went quiet after the Second World War, when the car came to dominate transport.


Originally I had this chart coloured by underground line, but this is rather misleading since the Wikipedia table gives the line a station is currently on rather than the one it was originally built for. For example, Stanmore station opened in 1932 as part of the Metropolitan line; it was transferred to the Bakerloo line in 1939 and then to the Jubilee line in 1979. You can see the years in which lines opened here on Wikipedia, where it becomes apparent that the name of an underground line is fluid.

So I have my station opening date data. How about station locations? Well, they too are available thanks to the work of folk at OpenStreetMap; you can find that data here. Latitude–longitude coordinates are all very well, but really we also need the connectivity, and what about Harry Beck’s iconic “circuit diagram” tube map? It turns out both of these issues can be addressed by digitising station locations from the modern version of Beck’s map. I have to admit this was a slightly laborious process; I used ImageJ to manually extract coordinates.

I’ve shown the underground map coloured by the age of stations below.

Age map2

Deep reds for the oldest stations, on the Metropolitan and District lines built in the second half of the 19th century. Pale blue for middle aged stations, the Central line heading out to Epping and West Ruislip. And finally the most recent stations on the Jubilee line towards Canary Wharf and North Greenwich are a darker blue.

Next up is traffic, or how many people use the underground. The Wikipedia page contains information on usage, in terms of millions of passengers per year in 2012, covering both entries and exits. I’ve shown this data below, with the traffic at each station indicated by the thickness of the line.


I rather like a “fat lines” presentation of the number of people using a station: the fatter the line at the station, the more people going in and out. Of course some stations have multiple lines, so get an unfair advantage. Correcting for this, it turns out Canary Wharf is the busiest station on the underground; thankfully it’s built for it. Though small above ground, beneath it is a massive, cathedral-like space.

More data is available as a result of a Freedom of Information request (here), which gives figures broken down by passenger action (boarding or alighting), underground line, direction of travel and time of day, in fairly coarse chunks of the day. I use this data in the chart below to measure the “commuteriness” of each station. To do this I take the ratio of people boarding trains in the 7am–10am time slot to those boarding in the 4pm–7pm slot. For locations with lots of commuters this will be a big number, because lots of people get on the train to go to work in the morning but not many get on in the evening, when everyone is getting off the train to go home.
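The “commuteriness” measure is just a ratio of two sums. A sketch over hypothetical per-station records (the field names below are invented for illustration, not the FOI dataset’s actual column names):

```python
def commuteriness(records):
    """Ratio of morning-peak boardings (7am-10am) to evening-peak
    boardings (4pm-7pm), per station. Big values = commuter stations."""
    am, pm = {}, {}
    for rec in records:
        if rec["action"] != "boarding":
            continue
        if rec["slot"] == "07:00-10:00":
            bucket = am
        elif rec["slot"] == "16:00-19:00":
            bucket = pm
        else:
            continue
        bucket[rec["station"]] = bucket.get(rec["station"], 0) + rec["count"]
    # Only stations with evening boardings get a well-defined ratio.
    return {s: am[s] / pm[s] for s in am if pm.get(s)}
```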


By this measure the top five locations for “commuteriness” are:

  1. Pinner
  2. Ruislip Manor
  3. Elm Park
  4. Upminster Bridge
  5. Burnt Oak

It was difficult not to get sidetracked during this project, someone used the Freedom of Information Act to get the depths of all of the underground stations, so obviously I had to include that data too! The deepest underground station is Hampstead, in part because the station itself is at the top of a steep hill.

I’ve made all of this data into a Tableau visualisation which you can play with here. The interactive version shows you details of the stations as your cursor floats over them, allows you to select individual lines, and lets you change the data overlaid on the map, including the depth and altitude data.

Yahoo!Finance to Tableau via ScraperWiki Thu, 17 Apr 2014 10:24:51 +0000 Our recently announced OData connector gives Tableau users access to a world of unstructured and semi-structured data.

In this post I’d like to demonstrate the power of a Python library, Pandas, and the Code in a Browser tool to get “live” stock market data from Yahoo!Finance into Tableau. Python is a well-established programming language with a rich ecosystem of software libraries which can provide access to a wide range of data.

This isn’t a route to doing high frequency trading, but it demonstrates the principle of using ScraperWiki as an adaptor to data on the web. Although Tableau supports a wide range of data connections, it can’t handle everything. As well as ready-made tools to collect data and serve it up in different formats, ScraperWiki allows users to write their own tools. The simplest method is to use the “Code in a browser” tool.

I wrote about the pandas library a few weeks ago; it’s designed to provide some of the statistical and data processing functionality of R to users of Python. It grew out of the work of a financial analyst, Wes McKinney, so naturally he added a little piece of functionality to pull in stock market data from Yahoo!Finance. The code required to do this is literally a single line.

To make data we collect using the pandas library available to all of ScraperWiki tools, like the OData connector or the View in a Table tool, we need to write the data into a local database.
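A sketch of the idea: at the time, `pandas.io.data.DataReader` was the advertised one-liner for fetching prices (it has since moved to the separate pandas-datareader package); to keep this example self-contained, a stand-in DataFrame takes its place here, and the data is written to a local SQLite database as the platform tools expect:

```python
import sqlite3
import pandas as pd

# In the real tool the single line below fetched live prices:
#   from pandas.io import data
#   prices = data.DataReader("AAPL", "yahoo")
# Stand-in frame with the same shape of OHLC data:
prices = pd.DataFrame(
    {"Open": [74.5, 75.1], "High": [75.0, 75.9],
     "Low": [74.1, 74.8], "Close": [74.9, 75.5]},
    index=pd.to_datetime(["2014-04-14", "2014-04-15"]),
)
prices.index.name = "Date"

# Write into the local database that the ScraperWiki tools read from.
conn = sqlite3.connect("scraperwiki.sqlite")
prices.to_sql("AAPL", conn, if_exists="replace")
conn.close()
```

Once the table is in the local database, the OData connector and the View in a Table tool pick it up automatically.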

You can see the code to get Yahoo!Finance data and make it available in the screenshot below, and you can get a copy directly from this GitHub gist.


Once you’ve entered the code, you can run it immediately or schedule it to run regularly.

In less than 10 lines of code we’ve added a new data source to Tableau!

The most complicated part of the process is getting the pandas library to recognise the dates properly. This is by no means a polished tool but it is fully functioning and can easily be modified to collect different stock data. Obvious extensions would be to collect a list of stocks, and to provide a user interface.

Once we have the data, we can access it over OData. I followed Andrew Watson’s instructions for making a “candlestick” plot (here). The resulting plot is shown below and can be found on Tableau Public.


On a desktop installation of Tableau you can refresh the data at the click of a button.

What data can you get in less than 10 lines of code?

Publish your data to Tableau with OData Fri, 07 Mar 2014 16:48:38 +0000 We know that lots of you use data from our astonishingly simple Twitter tools in visualisation tools like Tableau. While you can download your data as a spreadsheet, getting it into Tableau is a fiddly business (especially where date formatting is concerned). And when the data updates, you’d have to do the whole thing over again.

There must be a simpler way!

And so there is. Today we’re excited to announce our new “Connect with OData” tool: the hassle-free way to get ScraperWiki data into analysis tools like Tableau, QlikView and Excel Power Query.

To get a dataset into Tableau, click the “More tools…” button and select the “Connect with OData” tool. You’ll be presented with a list of URLs (one for each table in your dataset).

Copy the URL for the table of interest. Then nip over to Tableau, select “Data” > “Connect to Data” > “OData”, and paste in the URL. Simple as that.

The OData connection is fast and robust – so far we’ve tried it on datasets with up to a million rows, and after a few minutes, the whole lot was downloaded and ready to visualise in Tableau. The best bit is that dates and Null values come through just fine, with zero configuration.

The “Connect with OData” tool is available to all paying ScraperWiki users, as well as journalists on our free 20-dataset journalist plan.


If you’re a Tableau user, try it out, and let us know what you think. It’ll work with all versions of Tableau, including Tableau Public.
