NewsReader – Hack 100,000 World Cup Articles

June 10, The Hub Westminster (@NewsReader) Ian Hopkinson has been telling you about our role in the NewsReader project. We’re making a thing that crunches large volumes of news articles. We’re combining natural language processing and semantic web technology. It’s an FP7 project so we’re working with a bunch of partners across Europe. We’re 18 […]

Book review: The Signal and the Noise by Nate Silver

Nate Silver first came to my attention during the 2008 presidential election in the US. He correctly predicted the outcome of the November results in 49 of 50 states, missing only on Indiana where Barack Obama won by just a single percentage point. This is part of a wider career in prediction: aside from a […]

Getting Twitter connections

Introducing the Get Twitter Friends tool: Our Twitter followers tool is one of our most popular: enter a Twitter username and it scrapes the followers of that account. We were often asked if it’s possible not only to get the users that follow a particular account, but the users that are followed by that account […]

Scraping Spreadsheets with XYPath

Spreadsheets are great. They’re ubiquitously available, beaten only by web pages and word processor documents. Like the word processor, they’re easy to use and give the user a blank page, but they divide the page up into cells to make sure that the columns and rows all line up. And unlike more complicated […]

NewsReader – one year on

ScraperWiki has been contributing to NewsReader, an EU FP7 project, for over a year now. In that time, we’ve discovered that all the TechCrunch articles would make a pile 4 metres high, and that’s just one relatively small site. The total volume of news published every day is enormous but the tools we use to process it […]

Face ReKognition

I’ve previously written about social media and the popularity of our Twitter Search and Followers tools. But how can we make Twitter data more useful to our customers? Analysing the profile pictures of Twitter accounts seemed like an interesting thing to do since they are often the faces of the account holder and a face […]

Book review: Hadoop in Action by Chuck Lam

Hadoop in Action by Chuck Lam provides a brief, fairly technical introduction to the Hadoop Big Data ecosystem. Hadoop is an open source implementation of the MapReduce framework originally developed by Google to process huge quantities of web search data. The name MapReduce refers to dividing up jobs amongst multiple processors (“Mapping”) and then recombining […]

Book review: Python for Data Analysis by Wes McKinney

As well as developing scrapers and a data platform, at ScraperWiki we also do data analysis. Sometimes this is just because we’re interested; other times it’s because clients don’t have the tools or the time to do the analysis they want themselves. Often the problem is with the size of the data. Excel is […]

The best data opens itself on UK Gov’s Performance Platform

This is the third in a series of posts about the UK Government’s Performance Platform, cross-posted on the OKFN blog as it is about open data. Part 1 introduced why the platform is exciting, and part 2 described how it worked inside. The best data opens itself. No need to make Freedom of Information requests to pry the information […]

ScraperWiki

Archive | Data Science