(Machine) Learning about ScraperWiki’s Twitter followers

ScraperWiki blog, Tue, 03 Dec 2013 10:27:12 +0000
https://blog.scraperwiki.com/2013/12/machine-learning-about-scraperwikis-twitter-followers/

Machine learning is commonly used these days. Even if you haven’t directly used it personally, you’ve almost certainly encountered it. From checking your credit card purchases to prevent fraudulent transactions, through to sites like Amazon or IMDB telling you what things you might like, it’s a way of making sense of the large amounts of data that are increasingly accessible.

Supervised learning involves taking a set of data to which you have assigned labels and then training a classifier based on this data. This classifier can then be applied to similar data where the labels (or classes) are unknown. Unsupervised learning is where we let machine learning cluster our data for us and hence identify classes automatically.

A frequently used demonstration is the automatic identification of different plant species. The measurements of parts of their flowers are the data and the species is equivalent to a label or class designation. It’s easy to see how these methods can be extended to the business world, for example:

  • Given certain things I know about a manufacturing process, how do I best configure my production line to minimise defects in my product?
  • Given certain things I know about one of my customers, how likely are they to take up an offer I want to make them?

You might have a list of potential customers who have signed up to a newsletter, providing you with some profile information and a set of outcomes: those customers from the list who have bought your product. You then train your classifier to identify likely spenders based on the profiles of those you know of already. When new people sign up to your mailing list, you can then evaluate them with your trained classifier to discover if they are likely customers and thus how much time you should lavish on them.

Scraping our Twitter followers

As ScraperWiki is launching a fantastic new PDF scraping product that’s of interest to businesses, we wondered whether we could apply machine learning to find out which of our Twitter followers link to businesses in their account profiles.

First, we scraped the followers of ScraperWiki’s Twitter account. With the aid of a Python script, we used the expandurl API to convert the Twitter follower URLs from the shortened Twitter t.co form to their ultimate destination, and then we scraped a page from each site.
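The interface of the expandurl API isn't shown here, so the sketch below swaps in a different technique: following HTTP redirects directly with Python's standard library. The helper name `expand_url`, the timeout and the User-Agent header are assumptions for illustration, not the script we actually ran.

```python
import urllib.request


def expand_url(short_url, timeout=10):
    """Follow HTTP redirects and return the final URL (hypothetical helper)."""
    # Some sites reject requests without a browser-like User-Agent (assumption).
    request = urllib.request.Request(
        short_url, headers={"User-Agent": "Mozilla/5.0"}
    )
    # urlopen follows 301/302 redirects automatically; geturl() reports
    # the URL of the response that was finally retrieved.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()
```

Calling `expand_url` on a shortened t.co link should return the address of the page it ultimately lands on, which can then be fetched and scraped.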

Building a classifier

For our classifier, we used content we’d scraped from the website along with a classification of business or not business for each site.

We split the follower URLs into around a thousand sites to build the classifier with and around eight hundred sites to actually try the classifier on. Ian and I spent several hours classifying websites linked to by ScraperWiki’s followers to see whether they appeared to be businesses or not. With these classifications collated and the sites of interest scraped, we could feed the processed content from the HTML into a scikit-learn classifier.
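As a rough illustration of the two preparation steps just described, here is a minimal sketch using only the standard library. It assumes "processed content" means the visible text of a page and that the split was random; the names `TextExtractor`, `html_to_text` and `split_sites` are hypothetical.

```python
import random
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML page to a single string of visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def split_sites(urls, n_train, seed=0):
    """Shuffle the URLs and return (training, held_out) lists."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

Fixing the seed keeps the split reproducible, so the hand-classified sites always line up with the same training set.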

We used a linear support vector classifier, which was simple to create. The more challenging part is deciding which features to retrieve from each website and storing them in an appropriate matrix of features.
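To give a flavour of what this looks like in scikit-learn, here is a small sketch. The choice of TF-IDF word weights as features is an assumption (the features we actually used aren't listed here), and the texts and labels are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for scraped page text and hand-assigned labels (invented).
train_texts = [
    "our company offers consulting services pricing and enterprise solutions",
    "contact our sales team for a product demo and pricing",
    "my personal blog about hiking photography and travel",
    "thoughts on music books and weekend cooking",
]
train_labels = ["business", "business", "not business", "not business"]

# TfidfVectorizer builds the sparse matrix of features (one row per site,
# one column per word); LinearSVC is the linear support vector classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(train_texts, train_labels)

new_texts = ["enterprise pricing for our data services"]
predictions = model.predict(new_texts)

# decision_function returns one signed score per site; positive values
# favour model.classes_[1], negative values favour model.classes_[0].
scores = model.decision_function(new_texts)
```

The signed scores are what make it possible to rank sites from most to least likely to be a business, rather than only getting a yes/no answer.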

Automatically classifying followers

Classifying one thousand followers by hand was an arduous day of toil. By contrast, running the classifier was far more pleasant. Once started, it happily went off and classified the remaining eight hundred accounts for us. The lift curve below shows how the classifier actually performed on the sites we tested it on (red curve) compared to the expected performance if we simply took a random sampling of sites (blue line) as we work through the websites.

Lift curve showing how the business classifier performs.

Looking at the top 25% of the classifier’s predictions, we find the majority of the businesses. Confirming each of these predictions by hand would take just a quarter of the time needed to look through the entire set, yet we’d find 70% of all of the businesses: a big time saving.
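A curve of this kind can be computed by ranking sites by classifier score and counting how many businesses have been found after inspecting each one. The sketch below assumes the plot shows the fraction of businesses found against the fraction of sites inspected, as the 25%/70% reading suggests; the function name and the sample scores are invented for illustration.

```python
def cumulative_gains(scores, is_business):
    """Return (fraction_inspected, fraction_found) points for the curve.

    scores: classifier scores, higher meaning "more likely a business".
    is_business: matching list of true labels (True/False).
    """
    ranked = sorted(zip(scores, is_business), key=lambda pair: pair[0], reverse=True)
    total = sum(1 for _, label in ranked if label)
    if total == 0:
        raise ValueError("need at least one business to plot the curve")
    points = [(0.0, 0.0)]
    found = 0
    for i, (_, label) in enumerate(ranked, start=1):
        found += bool(label)
        points.append((i / len(ranked), found / total))
    return points


# Invented scores and labels for eight sites, purely for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [True, True, False, True, False, False, False, False]
curve = cumulative_gains(scores, labels)
```

Random sampling corresponds to the diagonal where both fractions are equal, so the further the curve rises above that line, the more checking time the classifier saves.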

Feeding the classifier’s predictions into the Contact Details Tool is then a convenient workflow to help us figure out which of our followers are businesses and then how we could go and contact them.

If you’d like to chat to us to see how this type of classification could help your business, get in touch with us and let us know what you’re interested in discovering!
