(Machine) Learning about ScraperWiki’s Twitter followers
https://blog.scraperwiki.com/2013/12/machine-learning-about-scraperwikis-twitter-followers/ (Tue, 03 Dec 2013)

Machine learning is commonly used these days. Even if you haven’t directly used it personally, you’ve almost certainly encountered it. From checking your credit card purchases to prevent fraudulent transactions, through to sites like Amazon or IMDb telling you what things you might like, it’s a way of making sense of the large amounts of data that are increasingly accessible.

Supervised learning involves taking a set of data to which you have assigned labels and then training a classifier on that data. The classifier can then be applied to similar data where the labels (or classes) are unknown. Unsupervised learning is where we let machine learning cluster the data for us and hence identify classes automatically.

A frequently used demonstration is the automatic identification of different plant species. The measurements of parts of their flowers are the data and the species is equivalent to a label or class designation. It’s easy to see how these methods can be extended to the business world, for example:

  • Given certain things I know about a manufacturing process, how do I best configure my production line to minimise defects in my product?
  • Given certain things I know about one of my customers, how likely are they to take up an offer I want to make them?

You might have a list of potential customers who have signed up to a newsletter, providing you with some profile information and a set of outcomes: those customers from the list who have bought your product. You then train your classifier to identify likely spenders based on the profiles of those you know of already. When new people sign up to your mailing list, you can then evaluate them with your trained classifier to discover if they are likely customers and thus how much time you should lavish on them.
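As a rough sketch of that workflow in scikit-learn: the tiny profile dataset, the feature names and the new signups below are all hypothetical, but the fit-on-labelled-data, predict-on-unlabelled-data pattern is the essence of supervised learning.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Hypothetical profiles of newsletter signups whose outcome we already know;
    # 'bought' records whether each one went on to buy the product.
    known = pd.DataFrame({
        "company_size":  [1, 50, 200, 5, 1000, 3],
        "opened_emails": [2, 10, 7, 1, 12, 0],
        "bought":        [0, 1, 1, 0, 1, 0],
    })

    features = ["company_size", "opened_emails"]
    classifier = LogisticRegression()
    classifier.fit(known[features], known["bought"])   # train on labelled examples

    # New signups with no outcome yet: score them by predicted likelihood of buying.
    new_signups = pd.DataFrame({
        "company_size":  [10, 400],
        "opened_emails": [3, 9],
    })
    print(classifier.predict_proba(new_signups[features])[:, 1])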

Scraping our Twitter followers

As ScraperWiki is launching a fantastic new PDF scraping product that’s of interest to businesses, we wondered whether we could apply machine learning to find out which of our Twitter followers link to businesses in their account profiles.

First, we scraped the followers of ScraperWiki’s Twitter account. With the aid of a Python script, we used the expandurl API to convert the Twitter follower URLs from the shortened Twitter t.co form to their ultimate destination, and then we scraped a page from each site.
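The expandurl API did the unshortening for us; as an illustrative alternative, the same step can be done by simply following the HTTP redirects behind each t.co link, roughly like this (the URLs here are placeholders):

    import requests

    # Placeholder t.co links; in practice these come from the follower profiles.
    short_urls = ["https://t.co/example1", "https://t.co/example2"]

    pages = {}
    for short in short_urls:
        try:
            # Follow redirects to find the ultimate destination of the short link.
            final_url = requests.head(short, allow_redirects=True, timeout=10).url
            # Grab a page of content from the destination site for classification.
            pages[final_url] = requests.get(final_url, timeout=10).text
        except requests.RequestException:
            continue   # skip followers whose sites don't respond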

Building a classifier

For our classifier, we used the content we’d scraped from each follower’s website along with a classification of business or not business for that site.

We split the follower URLs into around a thousand sites to build the classifier with and around eight hundred sites to actually try the classifier on. Ian and I spent several hours classifying the websites linked to by ScraperWiki’s followers to see whether they appeared to be businesses or not. With these classifications collated and the sites of interest scraped, we could feed the processed HTML content into a scikit-learn classifier.
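The exact processing we applied to the HTML isn’t shown here, but the gist is to strip the markup down to plain text before turning it into features; a minimal sketch with BeautifulSoup (the example HTML is made up):

    from bs4 import BeautifulSoup

    html = "<html><body><h1>Acme Ltd</h1><p>We sell widgets to businesses.</p></body></html>"

    # Reduce the page to its visible text so the classifier sees words, not markup.
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    print(text)   # "Acme Ltd We sell widgets to businesses."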

We used a linear support vector classifier, which was simple to create. The more challenging part was deciding which features to extract from each website and how to store them in an appropriate feature matrix.
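We haven’t spelled out the features we settled on; assuming, for the sake of a sketch, that they are simply tf-idf weights over the words in each scraped page, the scikit-learn side looks roughly like this (the example texts and labels stand in for the ~1,000 hand-classified sites):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Stand-ins for the processed page text and our hand-made labels
    # (1 = business, 0 = not a business).
    train_texts = ["We sell enterprise analytics software",
                   "My personal photography diary"]
    train_labels = [1, 0]
    test_texts = ["Consultancy services for data-driven companies"]

    vectoriser = TfidfVectorizer(stop_words="english")
    X_train = vectoriser.fit_transform(train_texts)   # sparse feature matrix
    X_test = vectoriser.transform(test_texts)

    classifier = LinearSVC()
    classifier.fit(X_train, train_labels)

    print(classifier.predict(X_test))               # 1 => looks like a business
    scores = classifier.decision_function(X_test)   # handy for ranking sites later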

Automatically classifying followers

Classifying a thousand followers by hand was an arduous day of toil. By contrast, running the classifier was far more pleasant: once started, it happily went off and classified the remaining eight hundred accounts for us. The lift curve below shows how the classifier actually performed on the sites we tested it on (red curve), compared with the expected performance of simply working through the websites in a random order (blue line).

Lift curve showing how the business classifier performs.

Looking at the top 25% of the classifier’s predictions, we actually find the majority of the businesses. To confirm each one of these predictions by hand would take us just a quarter of the time it would take to look through the entire set, yet we’d find 70% of all of the businesses, a big time saving.
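The curve itself, and figures like the 70%-in-the-top-25% one above, can be read off by ranking the sites by the classifier’s decision score and tracking how many of the true businesses have turned up as we work down the list; a rough sketch with hypothetical scores and labels:

    import numpy as np

    # Hypothetical classifier scores and true business/not-business labels.
    scores = np.array([2.1, -0.3, 1.4, 0.2, -1.0, 0.8])
    true_labels = np.array([1, 0, 1, 0, 0, 1])

    order = np.argsort(-scores)              # best-scoring sites first
    found = np.cumsum(true_labels[order])    # businesses found so far
    recall = found / true_labels.sum()       # fraction of all businesses found
    worked = np.arange(1, len(scores) + 1) / len(scores)

    # How much of the list must we work through to find 70% of the businesses?
    print(worked[np.searchsorted(recall, 0.7)])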

Feeding the classifier’s predictions into the Contact Details Tool then gives us a convenient workflow for working out which of our followers are businesses and how to get in touch with them.

If you’d like to see how this type of classification could help your business, get in touch and let us know what you’re interested in discovering!

The long dark tea time of the computer programmer
https://blog.scraperwiki.com/2012/01/the-long-dark-tea-time-of-the-computer-programmer/ (Fri, 13 Jan 2012)

Ian McNaught-Davis presenting the BBC's 1983 TV series "Making the Most of the Micro"

The way in which Information Technology is taught in England is so dull and harmful it should be scrapped – that’s the view of the Education Secretary Michael Gove (‘A nation of digital illiterates’, BBC).

Many years ago there was a total corporate take-over of the computer software sector in the UK. Big money was to be made out of controlling the profits generated by software applications, which were protected from competition by incompatible, non-interoperable standards and the force of law. (An attempt by the UK government to establish a very modest requirement for open standards was successfully killed off last week.)

One of the most painful aspects of this take-over was the way in which the same corporations managed to deform the entire education system into serving their purposes. All things resembling actual computer programming were cleansed from the curriculum, which was instead packed with dire, tedious training modules for drilling students in how to use those self-same corporations’ big software suites.

Back in the 1980s, before this take-over, I learned to program on the BBC Micro, which was widespread throughout the UK at the time. There is good evidence that this was the reason there has been such a strong software industry in the UK over the last three decades.

Let’s just use the words of computer games pioneer Ian Livingstone from his February 2011 report:

“Given that the new online world is being transformed by creative technology companies like Facebook, Twitter, Google and video games companies, it seems incredible that there is an absence of computer programming in schools. The UK has gone backwards at a time when the requirement for computer science as a core skill is more essential than ever before. When Sir Clive Sinclair launched the ZX Spectrum in 1982, affordable computers were eagerly purchased for the homes of a creative nation. At the same time, the BBC Micro was adopted as the computer platform of choice for most schools and became the cornerstone of computing in British education in the 1980s. There was a thirst for creative computing both in the home and in schools creating a further demand at universities for courses in computer science. This certainly contributed to the rapid growth of the UK computer games industry.

“But instead of building on the BBC’s Computer Literacy Project in the 1980s, schools turned away from programming in favour of ICT. Whilst useful in teaching various proprietary office software packages, ICT fails to inspire children to study computer programming. It is certainly not much help for a career in games. In a world where technology affects everything in our daily lives, so few children are taught such an essential STEM skill as programming. Bored by ICT, young people do not see the potential of the digital creative industries. It is hardly surprising that the games industry keeps complaining about the lack of industry-ready computer programmers and digital artists.”

The official Government response was mostly public relations, mentioning developments that have nothing to do with the Government, such as Raspberry Pi, which, incidentally, appears to be an attempt by Livingstone’s generation to recreate the 1980s, when we once watched computers on TV (instead of watching TV on computers).

While this change in Government policy is absolutely vital, it is odd that the people lobbying for it haven’t branched outside their narrow fields of computer games and computer graphics – which are ultimately little more than a game of shifting pixels around on a VDU, using very mature technologies built on software applications developed by large corporations.

They’re missing the point: this is the era of smart energy monitors that need to be coupled to microcontrollers that do something with appliances over the internet in response to data. Or robotrading bank accounts that use live data feeds to monitor and trade your share portfolio while you’re watching Corrie. The focus should be on Arduinos and simple robotics to control the energy use in your house, or on creative programming to trump the hedge funders.

Recently I have been doing experiments through the interface of a home-built 3D printer, completely bypassing the usual UI application and driving it directly through the serial port.
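For the curious, driving a RepRap-style printer this way mostly amounts to opening the serial port and sending it G-code one line at a time. Here is a minimal sketch with pyserial; the port name, baud rate and moves are assumptions that depend on the particular printer and firmware:

    import time
    import serial  # pyserial

    # Port name and baud rate are guesses; adjust for your printer's firmware.
    printer = serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=2)
    time.sleep(2)   # many firmwares reset when the port is opened

    def send(gcode):
        """Send one G-code line and wait for the firmware's acknowledgement."""
        printer.write((gcode + "\n").encode("ascii"))
        return printer.readline().decode("ascii", errors="replace").strip()

    send("G28")                 # home all axes
    send("G1 X20 Y20 F3000")    # move the head, no vendor UI application involved
    printer.close()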

Here is my first result:

Come on guys. I know it takes years to master the ability to render the perfect series of CGI frames of a spoon stirring the brown liquid in a cup of tea. But get a computer-controlled robot to actually make me a cup of tea: that would be something!
