data scientist – ScraperWiki Extract tables from PDFs and scrape the web Tue, 09 Aug 2016 06:10:13 +0000

ScraperWiki needs you! Thu, 10 Oct 2013 10:40:47 +0000

Royal Liver Building, Liverpool

Are you excited by data?

Are you impatient to download and explore new datasets?

If you answered yes to either of these questions, then you might be interested in a new data scientist position that has arisen here at ScraperWiki.

Don’t worry if you’ve never described yourself as a data scientist; we’re looking for people who love exploring data, whatever their background.

You’ll be based at our offices in Liverpool, right next door to the Metropolitan Cathedral. Liverpool is a vibrant city replete with fine architecture, a rich history, excellent transport links, and a dynamic tech community.

For more details, and to apply, please visit our jobs page:

We’re hiring: the world’s best data scientists! Thu, 08 Nov 2012 17:23:23 +0000

If you’re a ScraperWiki coder with great communication skills and a passion for data, then you should probably bookmark our new Jobs page. We’ll be hiring for a few different roles over the coming months, and we’d love to hear from you!

Right now, we’re looking for two Data Scientists to help Dragon Dave get, clean and visualise data for our corporate customers and data requestors. Duties include communicating with non-technical customers, organising and producing complex scrapers, and, most important of all, consuming one or two of our Graze cartons every week (but hands off the pistachios, they’re Chris’s).

For more information, and details on how to apply, visit

‘Big Data’ in the Big Apple Thu, 29 Sep 2011 15:05:25 +0000

My colleague @frabcus captured the main theme of Strata New York #strataconf in his most recent blog post. This was our first official speaking engagement in the USA as a Knight News Challenge 2011 winner. Here is my twopence worth!

At first we were a little confused by the way the week-long conference was split into three consecutive mini-conferences with what looked like repetitive content. The reality was that the one-day Strata Jump Start was like an MBA for people trying to understand the meaning of ‘Big Data’. It gave a 50,000-foot view of what is going on and made us think about the legal issues, how big data will affect the demand for skills, and how the pace at which data is exploding will dramatically change the way businesses operate. Every CEO should attend, or watch the videos, and learn!

The following two days, called the Strata Summit, focused on what people need to think about strategically to get their businesses ready for the onslaught. In his welcome address, Edd Dumbill, program chair for O’Reilly, said “Computers should serve humans….we have been turned into filing clerks by computers….we spend our day sifting, sorting and filing information…something has gone upside down, fortunately the systems that we have created are also part of the solution…big data can help us…it may be the case that big data has to help us!”

Big Apple…and small products

To use the local lingo, we took a ‘deep dive’ into various aspects of the challenges. The sessions were well choreographed and curated. We particularly liked the session ‘Transparency and Strategic Leaking’ by Dr Michael Nelson (Leading Edge Forum – CSC), where he talked about how companies need to be pragmatic in an age when it is impossible to stop data leaking out of the door. Companies, he said, ‘are going to have to be transparent’ and ‘are going to have to have a transparency policy’. He referred to a recent article in the Economist, ‘The Leaking Corporation’, and its assertion that corporations that leak their own data ‘control the story’.

Courtesy of O’Reilly Media

Simon Wardley’s (Leading Edge Forum – CSC) ‘Situation Normal Everything Must Change’ segment made us laugh, especially the philosophical little quips that came from his encounter with a London taxi driver. He conducted it at lightning speed, and his explanation of ‘ecosystems’ and how big data offers a potential solution to the ‘Innovation Paradox’ was insightful. It was a heavy-duty session but worth it!

Courtesy of O’Reilly Media

There were tons of excellent sessions to peruse. We really enjoyed Cathy O’Neil’s ‘What kinds of people are needed for data management’, which talked about data scientists and how they can help corporations to discern ‘noise’ from signal.

Our very own Francis Irving was interviewed about how ScraperWiki relates to Big Data and Investigative Journalism.

Courtesy of O’Reilly Media

Unfortunately, we did not manage to see many of the technology exhibitors #fail. However, we did see some very sexy ideas, including a wonderful software start-up called – a platform for data prediction competitions. Its Chief Data Scientist, Jeremy Howard, gave us some great ideas on how to manage ‘labour markets’.

…Oh yes, and we checked out why it is called Strata…

We had to leave early to attend the Online News Association (#ONA) event in Boston, so we missed part III, the two-day Strata Conference itself, which is designed for people at the cutting edge of data: the data scientists and data activists! I just hope that we manage to get to Strata 2012 in Santa Clara next February.

In his closing address, ‘Towards a global brain’, Tim O’Reilly gave a list of 10 scary things that are leading into the perfect humanitarian storm, including…Climate Change, Financial Meltdown, Disease Control, Government inertia……so we came away thinking of a T-Shirt theme…Hmm we’re f**ked so let’s scrape!!!

Courtesy Hugh MacLeod

Four data trends to rule them all, the data scientist king to bind them Mon, 26 Sep 2011 16:07:16 +0000

My favourite soundbite from O’Reilly’s Strata data conference was a definition of big data. John Rauser, Amazon’s main data scientist, said to me that “data is big data when you can’t process it on one machine”. And naturally, small data is data that you can process on one machine.

What’s nice about this definition is that it makes it immediately clear that as time passes, big data is getting smaller. And of course, small data is getting bigger. This is linked to four interrelated data technology and business trends.

1. Super Moore’s law for data. Even without any specific new technology, what we can process on “one computer” would be getting larger anyway with Moore’s law applied to processors, RAM and disks.

But that’s not all that’s happening: we’re also in the middle of the commoditisation of what was once part of Google’s competitive advantage – distribution of work over clusters of bog-standard servers, using things like Hadoop.

Right now you still need to hire special engineers to do that, but it is only a matter of time before it is just a service you buy with your credit card: process any amount of data at any speed, with just a slider that any wealthy businessman’s data scientist can drag.

The feeling is: rollercoaster! We’re going faster than Moore’s law with data right now.

2. Business of big data. The result of the first trend is that every company now stores, and can process, as much data as only the tech giants once did. This is very significant strategically and tactically, in ways very specific to each industry. Given the right use of the data (see the next two trends), it changes everything.

We’ve seen this in book selling for years now because Amazon was ahead of the curve. But imagine both basic algorithms, and fancy ones like cunningly used to reproduce images from our visual cortex this week, applied in as yet untouched areas.

3. Collaboration of data. The above two trends do not need the Internet. Even if we were still all locked in isolated corporate data centres, through a freak historic accident preventing the invention of global packet switching, we would still be getting the transistor cheapness of Moore’s law, and we’d still be running clusters (“clouds”) of map/reduce servers with just local networking.

The Internet isn’t about raw CPU power. It’s about collaboration. Collaboration is changing how we work with documents, how we share news, how we keep in touch with our friends, how we build software. Why wouldn’t it change how we work with data? Of course, it already is.

There are several quite different ways it can go further. It can create marketplaces for the commercial exchange of data, more transparent than the existing siloed data resellers. It can create tools for socialising the analysis, visualisation, quality checking and gathering of data. It can allow governments, and corporations, to be radically transparent in opening data in all cases except their unique competitive advantage.

It’s remarkable how old the Internet is, and how badly we collaborate on working with data.

4. The data scientist is king. A term cooked up (so I’m told) a few years ago by the chief data scientists at LinkedIn and Facebook while in a bar deciding what to name their new teams, this is the latest, newly trending iteration of the job title for what used to be called a statistician or data analyst.

But it isn’t the same. They’re not just geeks buried away. Yes, a data scientist is a data geek. They love data, and interesting data, more than anything. They know how to program, but in scripting languages and SQL, not hard core software engineering. They understand statistics as if they were brought up by a prior probability.

But they also care about the business, and they know how to communicate. They can give presentations to senior management, making hard stats clear. Volunteer, non-profit data scientists have an unassailable passion for their mission.

Data scientists are the glue, linking data to decisions.  Cathy O’Neil gave a fantastic talk at Strata describing these beasts, useful if you are one but didn’t realise it had a buzzy new marketing term, or if you are getting into the business of data and need to hire some.

You can’t make full and accurate use of any of the other trends above without data scientists.
