ScraperWiki – Extract tables from PDFs and scrape the web
https://blog.scraperwiki.com

The Sensible Code Company is our new name
https://blog.scraperwiki.com/2016/08/the-sensible-code-company-is-our-new-name/
Tue, 09 Aug 2016

For a few years now, people have said “but you don’t just do scraping, and you’re not a wiki, why are you called that?”

We’re pleased to announce that we have finally renamed our company.

We’re now called The Sensible Code Company.

The Sensible Code Company logo

Or just Sensible Code, if you’re friends!

We design and sell products that turn messy information into valuable data.

As we announced a couple of weeks ago, the ScraperWiki product is now called QuickCode.

Our other main product, PDFTables, converts PDFs into spreadsheets. You can try it out for free.

We’re also working on a third product – more about that when the time is right.

You’ll see our company name change on social media, on websites and in email addresses over the next day or two.

It’s been great being ScraperWiki for the last 6 years. We’ve had an amazing time, and we hope you have too. We’re sticking to the same vision, to make it so that everyone can make full use of data easily.

We’re looking forward to working with you as The Sensible Code Company!

PDFTables.com: PHP, C# and VBA API examples
https://blog.scraperwiki.com/2015/07/pdftables-com-php-c-and-vba-api-examples/
Fri, 31 Jul 2015

Invoices, bank statements, feeds of public data… Painful though it can be, many business workflows need to be able to take data in from PDFs.

PDFTables.com has had a web API for a while. We’ve just added a few more language examples for C#, PHP and Visual Basic for Applications (VBA) coders.

You can find them on the API documentation page.
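To give a feel for the shape of such a call, here is a rough Python sketch. The endpoint and parameter names below are assumptions on my part — consult the API documentation page for the authoritative details.

```python
# Rough sketch of calling the PDFTables web API from Python.
# The endpoint and parameter names here are assumptions -- check the
# API documentation page for the real ones.
import urllib.parse

API_ENDPOINT = "https://pdftables.com/api"

def build_conversion_url(api_key, output_format="csv"):
    """Build the URL to POST a PDF to, requesting a given output format."""
    query = urllib.parse.urlencode({"key": api_key, "format": output_format})
    return API_ENDPOINT + "?" + query

# The PDF itself would then be sent as a multipart/form-data POST to this
# URL (e.g. with urllib.request or the requests library); the response
# body is the converted spreadsheet.
```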


Let us know if you build anything with it, or if you need an example for any other language!

Two jobs at ScraperWiki to ponder over the bank holiday
https://blog.scraperwiki.com/2014/05/two-jobs-at-scraperwiki-to-ponder-over-the-bank-holiday/
Fri, 23 May 2014

It’s busy busy at ScraperWiki at the moment.

We’re growing our award-winning data hub (where people scrape Twitter). We have lots of interesting consultancy work, for clients like the Cabinet Office (GDS), the United Nations (OCHA), and Autotrader.

So we’re hiring two people.

1) A Digital Marketer. This is an unusual opportunity to market a marketing data product! It’s SaaS, with an active acquisition funnel. Who do you know who likes content marketing and data and should take a look?

2) A Software Engineer. Again, unusual chance to work on an open source startupy product, full stack. And to do engineering for interesting clients. We use Python and Docker and lots of other good stuff.

As well as pondering if you can do the jobs, who do you know who’d like a change?

What has Europe ever done for us?
https://blog.scraperwiki.com/2012/11/what-has-europe-ever-done-for-us/
Fri, 30 Nov 2012

Hmm… 2.8 million euros, a ‘history recorder’, and the opportunity to have a full-on working relationship with VU University Amsterdam, Lexis Nexis, plus some brilliant bods in Trento and San Sebastián. (Happy Christmas!)

It’s official! We have become a European FP7 partner with VU University Amsterdam’s Faculty of Arts (Prof. Piek Vossen), Lexis Nexis, and research centres in Trento (Italy) and the Basque city of San Sebastián (Spain).

The ‘History Recorder’ project is ambitious.
It aims to create some wizardry software that “reads” daily streams of news and stores exactly what happened, where and when in the world, and who was involved. The software will use the same strategy as humans, building up a story and merging it with previously stored information.

We have very good reasons for wanting to be involved in this project, not least our responsibility for ensuring that the entire project is open source, and for creating a blueprint for all subsequent FP7 projects to be made public and transparent. We will also help to identify and gather data from interesting places and munge it through the ScraperWiki digger. And of course we have to engage with customers to get the project embedded commercially; to help with this we will host events with our partners in the participating cities.

Whilst all of the team at ScraperWiki will be involved in ensuring that we are successful, we are hiring someone to take a lead role in managing the project. A job posting will appear on our site very soon – watch this space!

Lexis Nexis Logo, Trento Coat of Arms and San Sebastien Coat of Arms

Roll on 2013 as the project kicks off in January!

The Humble CSV!
https://blog.scraperwiki.com/2012/09/the-humble-csv/
Thu, 27 Sep 2012

It must be rocket science! The CSV (comma separated values) file has been in use for 45 years, from before men walked on the moon, and it still remains the cheapest and most reliable way to move data from one computer system to another.

While hardware and software standards have moved forwards with the technology (from USB connectors to OpenGL cards), data has, with a few exceptions, remained in the Apollo space age.

That’s not to say there has been no progress. For example, there is Electronic Data Interchange to replace paper-based purchase orders with electronic equivalents, and the Legal Electronic Data Exchange Standard used by law firms to transmit bills and invoices to their corporate clients, while the banks have SWIFT to handle interbank messages.

There are high-profile initiatives like the ‘Linked Data’ movement, spearheaded by Sir Tim Berners-Lee and aimed at making data interoperability universal, but it is a hard nut to crack.

It must have been the promise of sales and the demands of frustrated customers that led Compaq, Digital Equipment, IBM, Intel, Microsoft, and Northern Telecom to replace an existing mixed connector system with a simpler architecture and thus create the USB (Universal Serial Bus) which became widely adopted within a mere two years of its introduction.  Where is the equivalent leadership for data interoperability going to come from?  Mr Ellison?  Mr Ballmer?

Without major software companies pushing for a standard it just ain’t going to happen and so the humble CSV file could be here for a very long time to come….

Hacking the National Health Service
https://blog.scraperwiki.com/2012/09/hacking-the-nhs/
Mon, 24 Sep 2012

In the age of easy-to-use consumer software – from Facebook to the iPhone – health workers find the software they get at work increasingly frustrating.

Talk to some! You’ll find stories of doctors crossing hospitals to reboot computers to get a vital piece of data. Stories of individuals keeping patient records on Excel where they can’t be handed over easily to the next shift.

If it’s vitally important that playing Farmville is easy and usable, how much more important is it that the software that keeps us healthy (and alive!) is easy and usable?

This weekend was the second NHS Hackday which is organised by Open Health Care UK, this time in Liverpool. Much as ScraperWiki often gets hacks and hackers together to learn from each other, NHS Hackday brings doctors (and other clinical practitioners) and geeks together.

There were 15 astonishing entries at the end of the weekend. I’m going to tell you about the two that won ScraperWiki prizes for supreme, awesome scraping.

The second ScraperWiki prize for scraping went to Conflict of Interest. It ambitiously scrapes academic papers on PubMed, and parses them to find those with a registered conflict of interest with particular drug companies. It’ll be interesting to see it properly launch – meanwhile, take a look at the code and help them out.

The first ScraperWiki prize for scraping went to ePortfolio hack. Medical trainees in the UK keep track of what they’ve learnt using the NHS ePortfolio. Unfortunately, it’s a closed system without an API – so it’s very hard for junior doctors to log in and add comments while on the move. They want good mobile interfaces, and they also want to extract their data so they can do more with it.

ePortfolio hack is the first stage of liberating this data. It’s a scraper where you need a doctor’s user name and password. It logs into the ePortfolio website for you, gathers up the data and produces it in a structured form. Hope the team carries on and makes a full read/write API and iPad app!
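The general shape of such a log-in-then-scrape script is simple. Here is a minimal Python sketch using only the standard library — the URL and form field names are hypothetical, and the real ePortfolio site will certainly differ:

```python
# Sketch of a login-then-scrape session. The login URL and the form
# field names ("username", "password") are hypothetical examples.
import http.cookiejar
import urllib.parse
import urllib.request

def make_session():
    """An opener that keeps cookies across requests, like a browser session."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def build_login_form(username, password):
    """Encode the login form body for a POST request."""
    return urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")

# Usage (not run here):
#   opener = make_session()
#   opener.open("https://example-eportfolio.invalid/login",
#               build_login_form(user, pw))
#   html = opener.open("https://example-eportfolio.invalid/records").read()
#   ...then parse `html` and save it in a structured form.
```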

Finally, an honorary mention goes to Clinical Optics Calculator, for best use of ScraperWiki. It’s an excellent but simple visual, interactive calculator for opticians.

Let’s hope this passion for usability, and open data, spreads out into health services across the world!

On-line directory tree webscraping
https://blog.scraperwiki.com/2012/09/on-line-directory-tree-webscraping/
Fri, 14 Sep 2012

As you surf around the internet — particularly in the old days — you may have seen web-pages like this:

[screenshot: a directory listing generated by an Apache SVN server]

or this:

[screenshot: a plain directory view generated by Apache UserDir]

The former image is generated by Apache SVN server, and the latter is the plain directory view generated for UserDir on Apache.

In both cases you have a very primitive page that allows you to surf up and down the directory tree of the resource (either the SVN repository or a directory file system) and select links to resources that correspond to particular files.

Now, a file system can be thought of as a simple key-value store for these resources, burdened by an awkward set of conventions for listing the keys, in which you keep being obstructed by the ‘/’ character.

My objective is to provide a module that makes it easy to iterate through these directory trees and produce a flat table with the following helpful entries:

abspath | fname | name | ext | svnrepo | rev | url
/Charterhouse/2010/PotOxbow/PocketTopoFiles/ | PocketTopoFiles/ | PocketTopoFile | / | CheddarCatchment | 19 | http://www.cave-registry.org.uk/svn/CheddarCatchment/Charterhouse/2010/PotOxbow/PocketTopoFiles/
/mmmmc/rawscans/MossdaleCaverns/PolishedFloor-drawnup1of3.jpg | PolishedFloor-drawnup1of3.jpg | PolishedFloor-drawnup1of3 | .jpg | Yorkshire | 2383 | http://cave-registry.org.uk/svn/Yorkshire/mmmmc/rawscans/MossdaleCaverns/PolishedFloor-drawnup1of3.jpg

Although there is clearly redundant data between the fields url, abspath, fname, name, ext, having them in there makes it much easier to build a useful front end.

The function code (which I won’t copy in here) is at https://scraperwiki.com/scrapers/apache_directory_tree_extractor/. This contains the functions ParseSVNRevPage(url) and ParseSVNRevPageTree(url), both of which return dicts of the form:

{'url': ..., 'rev': ..., 'dirname': ..., 'svnrepo': ...,
 'contents': [{'url': ..., 'abspath': ..., 'fname': ..., 'name': ..., 'ext': ...}]}

I haven’t written the code for parsing the Apache Directory view yet, but for now we have something we can use.
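As a sketch of what that parser might look like — assuming the listing is just a page of `<a href>` links, with `../` pointing at the parent directory, which is how these pages are typically laid out:

```python
# Sketch of parsing an Apache-style directory listing, assuming the page
# is simply a series of <a href> links (with "../" for the parent).
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href != "../":
                self.links.append(href)

def parse_directory_listing(html):
    """Split a listing into subdirectories (trailing '/') and files."""
    parser = LinkExtractor()
    parser.feed(html)
    dirs = [h for h in parser.links if h.endswith("/")]
    files = [h for h in parser.links if not h.endswith("/")]
    return dirs, files
```

Recursing into each entry of `dirs` and accumulating `files` would then produce the flat table described above.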

I scraped the UK Cave Data Registry with this scraper which simply applies the ParseSVNRevPageTree() function to each of the links and glues the output into a flat array before saving it:

lrdata = ParseSVNRevPageTree(href)
ldata = []
for cres in lrdata["contents"]:
    # copy the repository name and revision down onto each file row
    cres["svnrepo"], cres["rev"] = lrdata["svnrepo"], lrdata["rev"]
    ldata.append(cres)  # without this append, ldata would stay empty
scraperwiki.sqlite.save(["svnrepo", "rev", "abspath"], ldata)

Now that we have a large table of links, we can make the cave image file viewer based on the query:
select abspath, url, svnrepo from swdata where ext='.jpg' order by abspath limit 500

By clicking on a reference to a jpg resource on the left, you can preview what it looks like on the right.

If you want to know why the page is muddy, a video of the conditions in which the data was gathered is here.

Image files are usually the most immediately interesting part of any unknown file system dump. And they can be made more interesting by associating meta-data with them (given that there is no convention for including interesting information in the EXIF sections of their file formats). This meta-data might be floating around in other files dumped into the same repository — e.g. in the form of links to them from HTML pages that relate to picture captions.

But that is a future scraping project for another time.

Do all “analysts” use Excel?
https://blog.scraperwiki.com/2012/07/do-all-analysts-use-excel/
Tue, 31 Jul 2012

We were wondering how common spreadsheets are as a platform for data analysis. It’s not something I’ve really thought about in a while; I find it way easier to clean numbers with real programming languages. But we suspected that virtually everyone else used spreadsheets, and specifically Excel, so we did a couple of things to check that.


First, I looked at job postings for “analyst” jobs, specifically in companies that provide tools or analysis for social media. For each posting, I marked whether it required knowledge of Excel. They all did. And not only did they require knowledge of Excel, they demanded “Excellent computer skills, especially with Excel”, “Advanced Excel skills a must”, and so on. I generally felt that Excel was presented as the most important skill for each particular job.

Second, I posted on Facebook to ask “analyst” friends whether they use anything other than Excel.

Thomas Levine posted the Facebook status "I'm wondering how common Excel is. If you work as an 'analyst', could you tell me whether you do your analysis in anything other than Excel?", and two of his friends commented saying, quite strongly, that they use Excel a lot.

It seems that they don’t.


It seems that Excel is a lot more common than I’d realized. Moreover, it seems that “analyst” is basically synonymous with “person who uses Excel”.

Having concluded this, I personally am going to stop saying that I “analyze” data, because I don’t want people to think that I use Excel. But now I need another word to explain that I can do more advanced things.

Maybe that’s why people invented that nonsense role “data scientist”, which I apparently am. Actually, Dragon thought we should define “big data” as “data that you can’t analyze in Excel”.

For ScraperWiki as a whole, this data science gives us an idea of the technical expertise to expect of people with particular job roles. We’ve recognized that the current platform expects that people are comfortable programming, so we’re working on something simpler. We pondered making Excel plugins for social media analysis functions, but now we think that they would be far too complicated for the sort of people who could use them, so we’re thinking about ways of making the interface even simpler without being overly constrained.

Mapping deaths in the Italian prison system
https://blog.scraperwiki.com/2012/07/mapping-deaths-in-the-italian-prison-system/
Tue, 17 Jul 2012

This is a guest post by Jacopo Ottaviani, an Italian freelance journalist and developer. The story it tells was published in the Italian newspaper Il Fatto Quotidiano.

Currently in Italy, many prisoners die in jail every month. According to an independent dossier by the Italian non-profit association Ristretti Orizzonti (lit. Narrow Horizons), almost one thousand deaths were registered in the last ten years (2002–2012). Most of the detainees who died committed suicide (57%) or died of sickness (20%). A further significant share died in unclear circumstances (19%); the rest died of drug overdose (26 cases) or homicide (11 cases). This death rate underlines the inadequate conditions in which prisoners are forced to live.

The data journalism project I am introducing here maps deaths in prisons in terms of locations and causes. The whole project has been published by Il Fatto Quotidiano (lit. The Daily Fact, a young yet leading newspaper in Italy) and on the Guardian Datablog. An English version is also available.

The development process was based on a series of steps, each carried out with free-of-charge tools. ScraperWiki was one of those: two scrapers were coded, extracting 1276 rows of data.

Two sources of data were scraped and crossed: the dossier released by Ristretti Orizzonti and the data published on the Ministry of Justice website. The former lists the actual casualties, reporting names, dates, causes and prisons’ names; the latter provides the addresses of prisons in Italy.

An overview of the development follows.

  1. 1st scraper (sourcecode and data) – The first scraper processed the Excel file containing the list of deaths in prisons. Each record reported the dead detainee’s credentials (name, surname and age), date and cause of death, and the name of the prison where he died (but not the address, which was taken from the 2nd source).
    The Python XLS module made the scraping straightforward. The only problem was a couple of malformed records which did not contain proper Excel dates (i.e., the dates were in string format rather than “date format”). A manual fix was needed to address this; after that, the scraper ran smoothly.
  2. 2nd scraper (sourcecode and data) – Prisons’ addresses and contacts are published by the Italian Ministry of Justice. A PHP scraper crawled the whole list of prisons, parsing their addresses and contacts, which were spread over a set of consecutive pages; the data was stored in HTML tables. Some cells contained semi-structured text (e.g., sometimes telephone numbers started with “Tel.”, sometimes with “Telefono:”), but simple regular expressions could cope with that.
  3. Preprocessing – Once the data was crawled and scraped, both tables were refined with Google Refine. GREL string functions and clustering methods were used to improve data consistency. Common transforms were performed as well, such as trimming, unescaping HTML entities and collapsing consecutive whitespace.
  4. Join – The tables were then merged by matching prison names in the first table to city names in the second. This “unnatural join” was based on string inclusion. Some prisons had to be associated explicitly, since a few records in the Excel file did not contain the city name (this happened with prisons well known by unofficial names, such as “Rebibbia” – the common way to refer to the Prison of Rome).
  5. Geolocation – The final table’s records contained all the dead detainees’ data and the geographical addresses of the prisons where they died. Using Batchgeo, each casualty was geolocated on a map of Italy. Markers represent single casualties and are clustered into pie charts that aggregate the causes of death at different zoom levels.
  6. Error checking – Errors mainly fell into two categories. First, erroneously joined records: these happened because some cities have more than one prison, so string inclusion did not work. To fix them, explicit rules were created (e.g., “Rebibbia” => “Prison of Rome”). This problem would not arise if each prison (in both datasets) were marked with a unique name/ID, making a natural join straightforward. Second, geolocation mismatches due to Google Maps disambiguation problems, made hard by complex Italian place names. Such errors were fixed manually.
  7. Publishing (Italian, English) – Browsing the map, many stories emerged. Some of those were published along with the map.
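The string-inclusion join with explicit overrides can be sketched in a few lines of Python. The names and the override mapping below are illustrative examples, not the project’s actual data:

```python
# Sketch of the "unnatural join": match a prison's name to a city by
# string inclusion, with explicit overrides for prisons known by
# unofficial names. All names here are hypothetical examples.
OVERRIDES = {"Rebibbia": "Roma"}

def match_city(prison_name, cities):
    """Return the city a prison name matches, or None if no match."""
    # Explicit rules take precedence over string inclusion.
    for alias, city in OVERRIDES.items():
        if alias in prison_name:
            return city
    # Otherwise, look for a city name contained in the prison name.
    for city in cities:
        if city.lower() in prison_name.lower():
            return city
    return None
```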
Middle Names in the United States over Time
https://blog.scraperwiki.com/2012/06/middle-names-in-the-united-states-over-time/
Fri, 15 Jun 2012

I was wondering what proportion of people have middle names, so I asked the Census.

Recently you requested personal assistance from our on-line support
center. Below is a summary of your request and our response.

We will assume your issue has been resolved if we do not hear from you
within 48 hours.

Thank you for allowing us to be of service to you.

To access your question from our support site, click the following
link or paste it into your web browser.

What proportion of people have middle initials?

Discussion Thread
Response Via Email(CLMSO - EMM) - 03/14/2011 16:04
Thank you for using the US Census Bureau's Question and Answer Center.
Unfortunately, the subject you asked about is not one for which the
Census Bureau collects data. We are sorry we were not able to assist you.

Question Reference #110314-000041
 Category Level 1: People
 Category Level 2: Miscellaneous
     Date Created: 03/14/2011 15:29
     Last Updated: 03/14/2011 16:04
	   Status: Pending Closure

Since they didn’t know, I looked at students at a university. Cornell University email addresses can contain two or three letters, depending on whether the Cornellian has a middle name. I retrieved all of the email addresses of then-current Cornell University students from the Cornell Electronic Directory and came up with this plot.
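The underlying heuristic can be sketched in a few lines of Python. The netid format assumed here — two or three letters followed by digits, with three letters meaning a middle initial is present — is my reading of the addresses, so treat it as an assumption:

```python
# Sketch of the middle-name heuristic described above. The netid
# format (2-3 lowercase letters, then digits) is an assumption.
import re

NETID_RE = re.compile(r"^([a-z]+)\d+$")

def has_middle_name(netid):
    """True if the netid has three letters (first, middle, last initials),
    False if two (no middle name), None if the netid is unrecognised."""
    m = NETID_RE.match(netid)
    if not m or len(m.group(1)) not in (2, 3):
        return None
    return len(m.group(1)) == 3
```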

A plot of middle name prevalence by school among Cornell University students shows that 15824 students had middle names and 6649 not and that the proportion varies substantially by school, the graduate school having a particularly low rate of middle names and the agriculture school having a particularly high rate.

Middle name prevalence among Cornell University students

Based on discussions with some of the students in that census, I suspected that students underreport rather than overreport middle names and that the under-reporting is generally an accident.

A year later, I finally got around to testing that. I looked at the names of 85,822,194 dead Americans and came up with some more plots.

Plot of middle name prevalence as a function of time by state, showing a relatively sharp increase from 10% to 80% between 1880 and 1930, followed by a plateau until 1960, followed by a smaller jump to 95% by 1975.

Middle name prevalence as a function of time and state

The rate of middle names these days is about 90%, which is a lot more than the Cornell University student figures; this supports my suspicion that people under-report middle names rather than overreport them.

I was somewhat surprised that reported middle name prevalence varied so much over time but relatively little by state. I suspect that most of the increase over time is explained by improvements in office procedures, but I wonder what explains the slower increases around 1955 and 1990.


The death file provides a lot more data than I’ve shown you here, so check with me in a couple months to see what else I come up with.
