Products – ScraperWiki
https://blog.scraperwiki.com – Extract tables from PDFs and scrape the web

QuickCode is the new name for ScraperWiki (the product)
https://blog.scraperwiki.com/2016/07/quickcode-is-the-new-name-for-scraperwiki-the-product/
Thu, 14 Jul 2016

Our original browser coding product, ScraperWiki, is being reborn.

We’re pleased to announce it is now called QuickCode.

QuickCode front page

We’ve found that the most popular use for QuickCode is to increase coding skills in numerate staff, while solving operational data problems.

What does that mean? I’ll give two examples.

  1. The Department for Communities and Local Government run clubs for statisticians and economists to learn to code Python on QuickCode’s cloud version. They’re doing real projects straight away, such as creating an indicator for the availability of self-build land. Read more

  2. The Office for National Statistics save time and money using a special QuickCode on-premises environment, with custom libraries to get data from spreadsheets and convert it into the ONS’s internal database format. Their data managers are learning to code simple Python scripts for the first time. Read more

Why the name change? QuickCode isn’t just about scraping any more, and it hasn’t been a wiki for a long time. The new name reflects its broader use for easy data science using programming.

We’re proud to see ScraperWiki grow up into an enterprise product, helping organisations get data deep into their soul.

Does your organisation want to build up coding skills, and solve thorny data problems at the same time?

We’d love to hear from you.

Case study: Enrique Cocero getting political data from PDFs
https://blog.scraperwiki.com/2015/08/case-study-enrique-cocero-getting-political-data-from-pdfs/
Thu, 20 Aug 2015

Political strategy is international now.

Enrique Cocero works from Madrid for his consultancy 7-50 Electoral Math, using data to understand voters and candidates in election campaigns across the world.

He’s struggled with PDFs for a long time, and recently found PDF Tables via a Google search. He says:

I used to have nightmares – I’m sleeping better now!

Enrique got into political analysis while living in Boston in the US, volunteering on the Warren vs. Brown Senate election, which was the most expensive contest in Senate history.

Unfortunately, particularly in the US, lots of the raw data about politics comes on digital paper. For example, this PDF has 25 pages of detailed Missouri primary election results.

Missouri Primary results

Enrique uses the data to build models in the stats software R, which means he first needs structured tables to load. He uses PDF Tables to convert the PDF files into Excel. The output looks like this:

Missouri Primary in a spreadsheet

Enrique tried various other conversion tools, such as Adobe’s, but found the quality wasn’t high enough. In particular, cells were merged between columns, and data was misplaced. He had to spend too much time cleaning up the output, and often even that wouldn’t work.

Enrique’s models calculate, precinct by precinct, which voters to target. There are some people who will vote for you whatever happens, and others who never will. Where are the voters in between, along the chaotic edge? You need to learn as much as you can about the motives of those voters.

Because there is lots of open data, and a culture of data analysis in campaigns, the US is very active for 7-50 Electoral Math. Being in Spain, Enrique works a lot there too. There are fewer PDFs in Spain and more traditional web scraping, such as these Catalonia Parliamentary election results.

Catalunya Parliament results

Enrique often has to go to separate sites for each region, and then into separate pages for each year.

The system in Spain is very secretive. There’s not as much detailed data available, so instead Enrique has to reach conclusions by approximations, and make projections from the data there is.

Israel, in contrast, is a “paradise for elections” according to Enrique. With 120 seats in the Knesset, politicians have constantly shifting alliances. They jump jobs a lot, making it a fun place to analyse.

PDF Tables has lots of customers getting political data from PDFs. One day, the world will work out a popular data interchange method. For now we’re glad Enrique is at least sleeping a bit better!

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
The four kinds of data PDF
https://blog.scraperwiki.com/2015/08/the-four-kinds-of-data-pdf/
Tue, 11 Aug 2015

At ScraperWiki, we talk to lots of customers who need to convert PDFs to Excel.

Why are they doing it?

The industries are diverse – banking, insurance, retail, logistics, political campaigning, energy…

What separates them in data terms though, is each has one of four different kinds of workflow.

A. Large tables

These are PDFs which are printed from databases. One PDF may have as many as half a million values in it.

For example, this PDF of Ugandan election results has a single enormous table spread over 1168 pages. Every page has the same 22 columns.

Uganda election

The key thing here is that the data is uniform, and there’s lots of it in the same form.

Once extracted to Excel by PDF Tables, the data can be easily loaded into a database, BI software or a stats package and queried.
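As a rough sketch of that last step (the filename and column names here are invented for illustration), loading the converted workbook into SQLite with pandas might look like this:

    import pandas as pd
    import sqlite3

    # Read every sheet of the converted workbook and stack them into one table.
    sheets = pd.read_excel("uganda_results.xlsx", sheet_name=None)  # dict of DataFrames
    results = pd.concat(list(sheets.values()), ignore_index=True)

    # Store the combined table in SQLite so it can be queried with SQL.
    with sqlite3.connect("uganda_results.db") as conn:
        results.to_sql("results", conn, if_exists="replace", index=False)
        # Example query; the column names are assumptions.
        totals = pd.read_sql("SELECT district, SUM(votes) AS votes "
                             "FROM results GROUP BY district", conn)

    print(totals.head())

Because every page has the same columns, stacking the sheets is safe, which is exactly what makes this kind of PDF the easiest to work with.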

B. Pivoted tables

A typical spreadsheet isn’t a pure table of numbers. Often there are headings and subheadings and data spread between tabs. There are subtotals and totals. If you’ve ever used pivot tables, these spreadsheets are a bit like one.

For example, look at this citrus production PDF from the US Department of Agriculture.

Citrus production USDA

Even when you’ve got its content out into a spreadsheet, it still isn’t data. The varieties of orange are interleaved with the states as subheadings. Annual production is mixed up in the same table as monthly forecasts.

To a human, such digital paper is easy to understand. To an SQL database, it first needs “unpivoting”. That’s the kind of tricky work ScraperWiki’s DataBaker product is designed for.
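DataBaker has its own recipe format, but the general idea of unpivoting can be sketched with pandas.melt instead (this illustrates the concept rather than DataBaker itself, and the column names are invented):

    import pandas as pd

    # A small pivoted table: one row per state, one column per month.
    pivoted = pd.DataFrame({
        "state": ["Florida", "California"],
        "Oct": [1200, 950],
        "Nov": [1150, 990],
    })

    # "Unpivot" into tidy rows of (state, month, production),
    # which is the shape a SQL database or stats package expects.
    tidy = pd.melt(pivoted, id_vars=["state"],
                   var_name="month", value_name="production")
    print(tidy)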

C. Transactions

Here, each PDF represents one transaction. It might be an invoice, a derivative trade, a purchase order or a property transfer.

Most of the important data will only appear once in each PDF, and there will be lots of PDFs. For example, each PDF may have one sender, one recipient, one transaction number and one total cost. There might be multiple purchase line items, forming a subtable.

Have a look at this invoice – it’s from Wikipedia, so I can share it. Most transaction PDFs are very private!

Invoice by Allen Tyler for a torso course at LSU

The buyer, the seller, their addresses, the invoice number, the ship date, the total… In data terms those are all in one row of an invoice table. There’s also a subtable, with the individual items being invoiced for.

Often organisations end up with legacy systems, where EDI (electronic data interchange) never got implemented, and receive transactions as PDFs, usually over email or SFTP. PDF Tables helps unpick that, and automate the flow of data again.
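Once a batch of invoices has been converted, picking the header fields and the line-item subtable apart is usually a short script. Here is a rough sketch; the folder name and cell positions are assumptions, since real invoices vary:

    import glob
    import pandas as pd

    records = []
    for path in glob.glob("converted_invoices/*.xlsx"):
        sheet = pd.read_excel(path, header=None)
        # Header fields appear once per document; the positions below
        # are placeholders for wherever they land in your layout.
        records.append({
            "file": path,
            "invoice_number": sheet.iloc[0, 1],
            "total": sheet.iloc[-1, -1],
        })

    invoices = pd.DataFrame(records)
    print(invoices)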

D. Reports

The final kind of PDF mixes text and images with tables. Company reports are a typical example.

For example, this is page 221 of General Electric’s 2014 Annual Report.

General Electric financial report

Usually only a few individual tables are needed, so copy and paste would be ideal. Meanwhile, you can convert the whole report online with PDF Tables, which has an option to download an Excel file with a tab for each page (“multiple sheets”).

Conclusion

These four kinds of PDF are different in both:

  1. The software used to make them – A) databases, B) spreadsheets, C) “mail merge”, D) DTP.
  2. How the data needs processing after basic table extraction: A) is already data, B) needs “unpivoting”, C) needs fields picking out, D) needs tables picking out.

Which kinds of data PDF do you come across in your day to day work?

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
….and suddenly I could convert my bank statement from PDF to Excel…
https://blog.scraperwiki.com/2015/08/and-suddenly-i-could-convert-my-bank-statement-from-pdf-to-excel/
Wed, 05 Aug 2015

Do you ever:

  • Need an old bank statement, only to find the bank has archived it and wants to charge you to get it back?
  • Spot check to make sure there are no fraudulent transactions on your account?
  • Like to summarise all your big-ticket items for a period?
  • Need to summarise business expenses?

It’s been difficult for me to do any of these as bank transaction systems are Luddite.

Fifteen years after signing up for my smile internet bank account, I received a ground-breaking message.

“Your paperless statement is now available to view when you login to online banking”.

I logged in excited, expecting an incredible new interface.

Eureka – a PDF statement

No… it meant I could now download a PDF!

Don’t get me wrong – PDF is the “Portable Document Format” – so at least I can keep my own records, which is a step forward. But it’s just as clumsy to analyse a PDF as it is to trawl through the bank’s online system (see The Tyranny of the PDF to understand why).

We know a lot about the problems with PDFs at ScraperWiki, and we made PDFTables.com. I’m able to convert my PDF to Excel and get a list of transactions which I can analyse and store in some order. Yes – I have to do some post-processing, but I can automate this with a spreadsheet macro.

You can see in the example I have included that the alignment of the transactions is spot on, and I could even use our DataBaker product to take out the transaction descriptions and values and put them into another system.
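As an alternative to a spreadsheet macro, the same post-processing can be sketched in a few lines of pandas (assuming the converted statement is saved as statement.xlsx with Date, Description and Amount columns; those names are assumptions):

    import pandas as pd

    statement = pd.read_excel("statement.xlsx")

    # Keep only rows that parse as transactions, dropping stray headers and footers.
    statement["Date"] = pd.to_datetime(statement["Date"], errors="coerce")
    transactions = statement.dropna(subset=["Date", "Amount"])

    # Summarise the big-ticket items for the period.
    big_ticket = transactions[transactions["Amount"].abs() > 100]
    print(big_ticket.sort_values("Amount"))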

Although we’d love everything to be structured data all the way through, the number of PDFs on the web is still growing rapidly. Hooray for PDFTables.com!

Statement #173

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
PDFTables: All the tables in one page, CSV
https://blog.scraperwiki.com/2015/06/pdftables-all-the-tables-in-one-page-csv/
Tue, 30 Jun 2015

Lots of you have asked for it, and we’ve finally changed the Excel download format at PDFTables.com to put all the pages of your PDF into one worksheet. This is particularly useful if you have big tables that span multiple pages.

You can still have the old format: just choose “Excel (multiple sheets)” from the Download menu.

CSV format support

You’ll also spot there are other download formats. CSV is new. It’s particularly useful if you’re using the API to integrate PDF Tables with your application.

The XML format is what we used to call HTML. It has HTML-style tables in it, wrapped up as a valid XML file.

Let us know if you’ve any questions or feedback!

Elasticsearch and elasticity: building a search for government documents
https://blog.scraperwiki.com/2015/06/elasticsearch-and-elasticity-building-a-search-for-government-documents/
Mon, 22 Jun 2015

A photograph of clouds under a magnifying glass.

Examining Clouds” by Kate Ter Harr, licensed under CC BY 2.0.

Based in Paris, the OECD is the Organisation for Economic Co-operation and Development. As the name suggests, the OECD’s job is to develop and promote new social and economic policies.

One part of their work is researching how open countries trade. Their view is that fewer trade barriers benefit consumers, through lower prices, and companies, through cost-cutting. By tracking how countries vary, they hope to give legislators the means to see how they can develop policies or negotiate with other countries to open trade further.

This is a huge undertaking.

Trade policies operate not only across countries, but also by industry. Investigating current legislation requires a team of experts to carry out painstaking research and detective work.

Recently, they asked us for advice on how to make better use of the information available on government websites. A major problem they have is searching through large collections of documents to find relevant legislation. Even very short sections may be crucial in establishing a country’s policy on a particular aspect of trade.

Searching for documents

One question we considered is: what options do they have to search within documents?

  1. Use a web search engine. If you want to find documents available on the web, search engines are the first tool of choice. Unfortunately, search engines are black boxes: you input a term and get results back without any knowledge of how those results were produced. For instance, there’s no way of knowing what documents might have been considered in any particular search. Personalised search also governs the results you actually see. One normal-looking search of a government site gave us a suspiciously low number of results on both Google and Bing. Though later searches found far more documents, this is illustrative of the problems of search engines for exhaustive searching.
  2. Use a site’s own search feature. This is more likely to give us access to all the documents available. But every site has a different layout, and there’s no unified user interface for searching across multiple sites at once. For a one-off search of documents, having to manually visit and search across several sites isn’t onerous. Repeating this for a large number of searches soon becomes very tedious.
  3. Build our own custom search tool. To do this, we need to collect all the documents from sites and store those in a database that we run. This way we know what we’ve collected, and we can design and implement searches according to what the OECD need.

Elasticsearch

Enter Elasticsearch: a database designed for full text search and one which seemed to fit our requirements.

Getting the data

To see how Elasticsearch might help the OECD, we collected several thousand government documents from one website.

We needed to do very little in the way of processing. First, we extracted text from each web page using Python’s lxml. Along with the URL and the page title, we then created structured documents (JSON) suitable for storing in Elasticsearch.
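Roughly, that step can be sketched like this (simplified; the field names are just our choice for this demo):

    import json
    import lxml.html

    def page_to_document(url, html):
        """Turn one fetched web page into a JSON document for Elasticsearch."""
        tree = lxml.html.fromstring(html)
        title = tree.findtext(".//title") or ""
        # text_content() flattens the whole page to plain text, tags stripped.
        text = tree.text_content()
        return json.dumps({"url": url, "title": title.strip(), "text": text.strip()})

    print(page_to_document("http://example.gov/doc",
                           "<html><head><title>A law</title></head><body>Some text.</body></html>"))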

Running Elasticsearch and uploading documents

Running Elasticsearch is simple. Visit the release page, download the latest release and just start it running. One sensible thing to do out of the box is to change the default cluster name (the default is just elasticsearch). Making sure Elasticsearch is firewalled off from the internet is another sensible precaution.

When you have it running, you can simply send documents to it for storage using an HTTP client like curl (Elasticsearch listens on port 9200 by default):

    curl "http://localhost:9200/documents/document" -X POST -d @my_document.json

For the few thousand documents we had, this wasn’t sluggish at all, though it’s also possible to upload documents in bulk should this prove too slow.
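With the official Python client, bulk indexing is only a few lines. A sketch, using placeholder documents and the same index and type names as the curl example above:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()  # defaults to localhost:9200

    documents = [
        {"url": "http://example.gov/doc1", "title": "Doc 1", "text": "..."},
        {"url": "http://example.gov/doc2", "title": "Doc 2", "text": "..."},
    ]

    # One action per document; helpers.bulk batches them into _bulk requests.
    actions = ({"_index": "documents", "_type": "document", "_source": doc}
               for doc in documents)
    helpers.bulk(es, actions)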

Querying

Once we have documents stored, the next thing to do is query them!

Other than very basic queries, Elasticsearch queries are written in JSON, like the documents it stores, and there’s a wide variety of query types bundled into Elasticsearch.

Query JSON is not difficult to understand, but it can become tricky to read and write due to the Russian-doll structure it quickly adopts. In Python, the addict library is useful for writing queries out more directly without getting lost inside an avalanche of {curly brackets}.

As a demo, we implemented a simple phrase matching search using the should keyword.

This allows combination of multiple phrases, favouring documents containing more matches. If we use this to search for, e.g. "immigration quota"+"work permit", the results will contain one or both of these phrases. However, results with both phrases are deemed more relevant.
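Expressed as a Python dict and posted to the search endpoint, that query looks roughly like this (the text field name is our own, from the documents indexed above):

    import requests

    query = {
        "query": {
            "bool": {
                # "should" means: match at least one phrase, and rank
                # documents containing more of the phrases as more relevant.
                "should": [
                    {"match_phrase": {"text": "immigration quota"}},
                    {"match_phrase": {"text": "work permit"}},
                ]
            }
        }
    }

    response = requests.post("http://localhost:9200/documents/_search", json=query)
    for hit in response.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"]["url"])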

The Elasticsearch Tool

The Elasticsearch search tool

With our tool, researchers can enter a search, and very quickly get back a list of URLs, document titles and a snippet of a matching part of the text.

Search results

What we haven’t implemented is the possibility of automating queries, which could also save the OECD a lot of time. Just as document upload is automated, we could run periodic keyword searches on our data. This way, Elasticsearch could be scheduled to look out for phrases that we wish to track. From these results, we could generate a summary or report of the top matches, which may prompt an interested researcher to investigate.
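A scheduled version could be as simple as a small script run from cron, reusing the search above (the phrase list and report format here are placeholders):

    import requests

    WATCHED_PHRASES = ["immigration quota", "work permit"]  # placeholder list

    def top_matches(phrase, size=5):
        query = {"query": {"match_phrase": {"text": phrase}}, "size": size}
        hits = requests.post("http://localhost:9200/documents/_search",
                             json=query).json()["hits"]["hits"]
        return [(hit["_source"]["title"], hit["_source"]["url"]) for hit in hits]

    # Run daily and print (or email) a short report of top matches per phrase.
    for phrase in WATCHED_PHRASES:
        print(phrase)
        for title, url in top_matches(phrase):
            print("  ", title, url)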

Future directions

For (admittedly small scale) searching, we had no problems with a single instance of Elasticsearch. To improve performance on bigger data sets, Elasticsearch also has built-in support for clustering, which looks straightforward to get running.

Clustering also ensures there is no single point of failure. However, there are known issues: current versions of Elasticsearch can lose documents if nodes fail.

Provided Elasticsearch isn’t used as the only data store for documents, this is a less serious problem. It is possible to keep checking that all documents that should be in Elasticsearch are indeed there, and re-add them if not.

Elasticsearch is powerful, yet easy to get started with. For instance, its text analysis features support a large number of languages out of the box. This is important for the OECD who are looking at documents of international origin.

It’s definitely worth investigating if you’re working on a project that requires search. You may find that, having found Elasticsearch, you’re no longer searching for a solution.

Announcing PDFTables.com
https://blog.scraperwiki.com/2015/05/announcing-pdftables-com/
Mon, 18 May 2015

PDFs were invented at the same time as the web. As “digital paper”, they’re trustworthy and don’t change behind your back.

This has a downside – often the definitive source of published data is a PDF. It’s hard to get tens of thousands of numbers out and into a spreadsheet or database. Copying and pasting is too slow, and popular conversion tools munge columns together.

At ScraperWiki, we’ve been helping people get the data back out of PDFs for nearly 5 years.

In that time we’ve developed an Artificial Intelligence algorithm. Just like your eyes, it can see the spacing between columns, picking out the structure of a table from its shape.

It’s called PDFTables.com.

PDF Tables screenshot

This is the first self-service, web-based product designed for getting volumes of data from PDFs. It’s super fast to convert individual PDFs, and there’s a web API to automate more.
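As a rough sketch of what automating a conversion might look like with Python’s requests library (the endpoint, parameter names and format value below are assumptions for illustration; see the API documentation on PDFTables.com for the definitive details):

    import requests

    API_KEY = "your-api-key"  # from your PDFTables.com account

    with open("report.pdf", "rb") as pdf:
        response = requests.post(
            "https://pdftables.com/api",                  # assumed endpoint
            params={"key": API_KEY, "format": "xlsx"},    # assumed parameters
            files={"f": pdf},                             # assumed field name
        )
    response.raise_for_status()

    with open("report.xlsx", "wb") as out:
        out.write(response.content)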

You can use it a couple of times without signing up, and then get 50 pages more for free. We charge per page, so you only pay for what you need.

We’d love feedback – please contact us to let us know what you think.

Got a PDF you want to get data from?
Try our easy web interface over at PDFTables.com!
GeoJSON into ScraperWiki will go!
https://blog.scraperwiki.com/2014/08/geojson-into-scraperwiki-will-go/
Fri, 22 Aug 2014

Surely everyone likes things on maps?

Driven by this thought, we’ve produced a new tool for the ScraperWiki Platform: an importer for GeoJSON.

GeoJSON is a file format for encoding geographic information. It is based on JSON, which is popular for web-based APIs because it is lightweight, flexible and easy to parse in JavaScript – the language that powers the interactive web.
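For anyone who hasn’t met the format, a minimal GeoJSON file looks like this (built here with Python’s json module; the coordinates are just an example point in Paris):

    import json

    feature_collection = {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                # GeoJSON coordinates are [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [2.3522, 48.8566]},
                "properties": {"name": "Example street art location"},
            }
        ],
    }

    with open("example.geojson", "w") as f:
        json.dump(feature_collection, f, indent=2)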

We envisage the GeoJSON importer being handy to visualise geographic data, and export that data to software like Tableau using our OData connector.

Why should I import GeoJSON into the ScraperWiki Platform?

Importing any data to the ScraperWiki Platform allows you to visualise your data using tools like View in a Table or Summarise this Data, which is great for this GeoJSON of Parisian Street Art:

Parisian street art map

In addition you can use tools such as download to CSV or Excel, so it will act as a file converter.

An improved View on a Map tool

We’ve improved the View on a Map tool so you can visualise GeoJSON data right on the Platform. We found that if we tried to plot 10,000 points on a map it all got a bit slow and difficult to use, so we added point clustering. Now, if you have a map with lots of points on it, they are clustered together under a symbol showing the number of points. The colour of the symbol shows the density of points… a picture paints a thousand words, so see the results below for a map of Manchester’s grit bins:

Manchester grit bins map

Linking to Tableau using OData

Or you could use the OData connector to attach directly to Tableau; we did this with some data from GeoNet on earthquakes around New Zealand. We’ve provided instructions on how to do this in an earlier blog post. If you want to try an interactive version of the Tableau visualisation, it’s here.

 

Tableau visualisation

What will you do with the GeoJSON tool?

The story of getting Twitter data and its “missing middle”
https://blog.scraperwiki.com/2014/08/the-story-of-getting-twitter-data-and-its-missing-middle/
Thu, 21 Aug 2014

We’ve tried hard, but sadly we are not able to bring back our Twitter data tools.

Simply put, this is because Twitter have no route to market to sell low volume data for spreadsheet-style individual use.

It’s happened to similar services in the past, and even to blog post instructions.

There’s lots of confusion in the market about the exact rules, and why they happen. This blog post tries to explain them clearly!

How can you get Twitter data?

There are four broad ways.

1. Developers can use the API to get data for their own use. The rate limits are actually quite generous, much better than, say, LinkedIn’s. It’s an easy and powerful API to use.

There are two problems with this route. Firstly, it sets the expectation among developers that you can do whatever the API allows. You can’t: in practice you have to follow one of the routes below, or Twitter will shut down your application.

Secondly, it is unfair to non-programmers, who can’t get access to data which programmers easily can. More on that in the “why” section below.

2. Software companies can make an application which uses the developer API.

As soon as it gets serious, they should join the Twitter Certified Program to make sure Twitter approve of the app. Ultimately, only Twitter can say whether or not their T&Cs are being met.

These applications can’t allow general data analysis and coding by their users – they have to have specific canned dashboards and queries. This doesn’t meet ScraperWiki’s original vision of empowering people to understand and work with data how they like.

Datasift home page

3. Bulk Tweet data is available from Datasift and Gnip. This is called ‘the firehose’, and only includes Tweets (for other data you have to use the methods above).

Datasift is a fantastic product, which indexes the data for you and provides lots of other social media data. Gnip is now owned by Twitter, and is still in the process of blending into them – they’re based in Colorado, rather than San Francisco.

Both companies have to get the main part of Twitter to vet your exact use case. Your business has to be worth at least $3000 / month to them to make this worthwhile.

The actual cost of roughly 10 cents / 1000 Tweets is not too bad; lots of our customers could pay that. But few have the need to get 30 million Tweets a month! In lots of ways, this option is too powerful for most people.

4. Special programs. There are a few of these: for example, the Library of Congress archives all Tweets, and Twitter is running a pilot Twitter Data Grants program for academics.

These show that it is worth talking to and lobbying Twitter for new ways to get data.

Why do Twitter restrict data use?

The obvious, and I think incorrect, answer is “for commercial reasons”. These are the real reasons.

1. Protect privacy and stop malicious uses.

If you use the firehose via, say, Datasift, you have to delete your copy of a Tweet as soon as a user deletes it. Similar rules apply if, for example, somebody makes their account private, or deletes their account. This is really, really impressive – fantastic for user privacy. Part of the reason Twitter are so careful about vetting uses is to make sure this is followed.

Twitter also prevent uses which might harm their users in other ways. I don’t know any details, but I understand that they stop Governments gaining large volumes of Twitter data which could be used to do things like identify anonymous accounts by looking at patterns. I’m guessing this has come from Twitter’s rise to prominence during various ‘Twitter Revolutions‘, such as in Iran in 2009.

Twitter developer page

2. They’re a media company now.

Twitter has changed from its early days: it is now a media company rather than a messaging bus. For example, the front page of their developer site is about Twitter Cards and embedding Tweets, with no mention of the data features. This means their focus is on a good consumer experience, and advertising, not finding new routes to market for data.

3. They’re missing bits of the market.

No company can cover and do everything its ecosystem might want. In this case, we think Twitter are simply missing a chunk of the market, and could get more revenue from it.

While there are plenty of products letting you analyse Twitter data in specific ways, there is nothing if you want to use Excel, or other desktop tools like Tableau or Gephi.

For example, Tableau are partnered with Datasift, which from the outside might make it look like Tableau users are covered. Unfortunately, customers still have to have their use case vetted, and be prepared to spend at least $3000 / month. Also, the Tweets are metered rather than unlimited, making it awkward for junior staff to freely make use of the capability. It’s just too powerful and expensive for many use cases.

The kind of users in this “missing middle” don’t want to learn a new, limited data analysis interface. They want to use the simple desktop data analysis products they’re already familiar with. They also just want a file – they know how to keep track of files.

Conclusion

The ScraperWiki platform continues without Twitter data. You can code your own scraper in your browser, accurately extract tables from PDFs and much more.

We know a lot about Twitter data, and have contacts with lots of parts of the ecosystem. If you have a high value use of the data, our professional services division are happy to help.

Twitter tool update
https://blog.scraperwiki.com/2014/07/twitter-tool-update/
Fri, 11 Jul 2014

Last week, our Twitter API use was suspended.

We’re talking to various people at Twitter, DataSift and Gnip to try and resolve this.

Unfortunately, we still can’t tell or predict when or if we’ll be able to bring the service back. To avoid making false promises, we’ve removed the tools from our website for now.
