Where do tweets come from? Mon, 16 Jun 2014 08:40:51 +0000

Geography of Twitter @replies by Eric Fisher, reproduced under a Creative Commons Attribution 2.0 Generic license.

In our Twitter search tool, we provide the location of tweets via the latitude and longitude data Twitter offers. Unfortunately, if you want to know where the user who created a particular tweet was, most Twitter users (including me) don’t enable this feature. What you usually find are rare sightings of latitude and longitude amongst mostly empty columns.

However, you can often get a good idea of a user’s location either from what they enter as location in their profile or from their time zone. We already get this information when you use the Twitter Friends tool, but not when searching for tweets. Now we’ve added it to our Twitter search too, so you can get an idea of where individual tweets were sent from.

This snippet of a search shows you what we now get and highlights the clear difference between the lonely lat, lng columns and the much busier user location and time zone:


Create a new Twitter search dataset and you should see this extra data too!

What’s Twitter time zone data good for? Thu, 05 Jun 2014 10:08:20 +0000

“Curioso elemento el tiempo” (“time, a curious element”) by leoplus, available under a Creative Commons Attribution-ShareAlike license.

The Twitter friends tool has just been improved to retrieve the time zone of users. This is actually more useful than it might first sound.

If you’ve looked at Twitter profiles before, you’ve probably noticed that users can, and sometimes do, enter anything they like as their location.

Looking at @ScraperWiki‘s followers, we can see from a small snippet of users that this can sometimes give us messy data:

...Denver. & Beyond
Hyper Island | Stockholm
Niteroi, Brazil
There's a wine blog too .....
London / Berkshire...

People may enter the same location in a number of ways, and may provide data that isn’t even a location.

Locations from time zones

If we look at users’ time zones, Twitter only allows users to pick from a certain number of well-defined time zones. (There are 141 in total; I’ve collated the entire set here.) The data this returns is much neater, and we’d expect that it typically reflects the user’s home location:

...Abu Dhabi

We find far fewer unique time zone entries than unique location entries for @ScraperWiki’s followers: there are 1586 different location entries, but just 106 time zones. If we wanted to discover which countries or regions our users are likely to be in, the time zone data would be far simpler to work with.

Furthermore, time zone data can give us insight into the location of Twitter users who don’t specify their location if they’ve selected a time zone.

For ScraperWiki’s followers, we found 670 of them had an empty location and around the same number had an empty time zone. But far fewer accounts (only 255) have both of these fields empty. So, in some cases, we can now make a good guess at the location of users for whom the tool previously provided no location data at all.
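A quick back-of-the-envelope check of those numbers in Python (670 and 255 are the figures quoted above; the rest is just illustrative arithmetic):

```python
# Followers with no profile location and followers missing both fields,
# as quoted in the post.
empty_location = 670   # followers with no profile location
empty_both = 255       # followers with neither location nor time zone

# Followers with no profile location but *with* a time zone set:
recoverable = empty_location - empty_both
print(recoverable)  # 415 followers whose location we can now guess
```

So roughly 415 followers gain an approximate location from their time zone alone.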

We’re always working to improve the Twitter tools! If you have ideas for features you’d like to see, let us know!

Getting all the hash tags, user mentions… Tue, 03 Jun 2014 13:46:01 +0000

We’ve rolled out a change so you get more data when you use the Twitter search tool!

Multiple media entities

We’ve changed four columns. Each used to return just one item, chosen arbitrarily. Now they return all the items, separated by a space. The columns are:

  • hashtags now returns all of them with the hashes, e.g. #opendata #opendevelopment
  • user_mention has been renamed user_mentions, e.g. @tableau @tibco
  • media can now return multiple images and other things
  • url has been renamed urls and can return multiple links

We renamed two of the columns partly to reflect their new status, and partly because they now match the names in the Twitter API exactly.

What can you do with this new functionality?

We had a look at the numbers of media, hashtags, mentions and URLs for a collection of tweets on a popular hashtag (#kittens), using our favourite tool for this sort of work: Tableau. It requires a modicum of cunning to calculate the number of entries in a delimited list using Tableau functions. To count the numbers of entries in each field, we need to make a calculated field like this:

LEN([hashtags]) - LEN(REPLACE([hashtags],'#',''))

This is the calculation for hashtags, where I use # as a marker. You can do the same for mentions (using @ as the marker); for the urls and media columns, use ‘http’ as the marker and divide by its length:

(float(LEN([urls]) - LEN(REPLACE([urls],'http','')))/4.0)

Hat-tip to Mark Jackson for that one.
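The same LEN/REPLACE trick can be cross-checked in a few lines of Python: deleting every marker and comparing lengths counts the entries in a space-separated field. (The example field values here are made up.)

```python
# Count entries in a delimited field the way the Tableau calculated field
# does: length difference after removing the marker, divided by the
# marker's length.
def count_by_marker(field, marker):
    return (len(field) - len(field.replace(marker, ""))) // len(marker)

hashtags = "#opendata #opendevelopment"
urls = "http://example.com http://example.org"  # hypothetical URLs

print(count_by_marker(hashtags, "#"))  # 2
print(count_by_marker(urls, "http"))   # 2

# Agrees with a straightforward split on spaces:
assert count_by_marker(hashtags, "#") == len(hashtags.split())
```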

For URLs and media we see that most tweets contain only one item, although for URLs there are posts with up to six identical URLs, presumably in an attempt to get search engine benefits. The behaviour for mentions and hashtags is more interesting. Hashtags top out at a maximum of 19 in a single tweet, where every word has a hashtag.

The distribution is shown in the chart below. Each tweet is represented by a thin horizontal bar whose length depends on the number of hashtags; the bars are sorted by size, so the longest bar at the top represents the maximum number of hashtags.


For mentions we see that most tweets only mention one or two other users at most:


Thanks to Mauro Migliarini for suggesting this change.

Favorite Tweets! Thu, 15 May 2014 08:27:50 +0000

How often was each Tweet favorited? Now you can tell, with a new column we’ve just added to our Twitter search tool – thanks to ScraperWiki user Alden Golab for suggesting this.

Favorite count

You can sort by that column to find the most liked Tweet on a subject. For example, ScraperWiki’s most favorited Tweet right now is about our London Underground visualization.

We don’t get large numbers of favourites or retweets on our account, but the BBC Breaking News account has far more activity. We collected the tweets they made in the last week, including the number of retweets and favourites for each tweet. This amounted to 92 tweets, two of which were retweets of other accounts for which we did not get favourite or retweet counts. We manually classified the topic of the tweets for the biggest stories of the week.

The chart below shows the tweets plotted by both number of favourites and number of retweets; a sample of the tweets are labelled with their text. The points are coloured by the manually determined topics, and the chart uses a log scale on both axes to spread out the individual tweets. In general the numbers of retweets and favourites increase together: tweets with more retweets have more favourites, but this is only a loose relationship. People are generally more likely to retweet something than they are to favourite it.

No other strong patterns jump out. We suspect that the Twitter “Favourite” functionality is not well named – it’s often used as a reminder to read something later. Psychologically, there is probably an issue with “favouriting” the report of a murder or some other atrocity.

You can see more detail of this analysis in this Tableau Public visualisation.

Retweet vs favourite scatter chart

Verified Twitter users Thu, 01 May 2014 06:10:34 +0000

We’ve added a “verified accounts” column to our Twitter friends tool – thanks to ScraperWiki user Delfin Paris for suggesting this.


The 1 means it is a Twitter verified account; 0 means it isn’t. You can sort by that column to find all the most notable accounts that are following someone. For example, ScraperWiki has 44 followers who are verified users. Here’s a few; as you can see, they’re mainly journalists!

ScraperWiki's verified followers

You can see how they compare to our other users in this chart, which plots the number of followers an account has along the horizontal axis and the number it follows on the vertical axis. The verified accounts are shown as orange dots. This plot shows that, on average, verified accounts have more followers than unverified ones. You don’t need to do anything to turn this on for new users that you’re scraping. If you’ve previously scraped a user, you’ll have to clear them and start again to add the verified column.

ScraperWiki and Enemy Images Mon, 14 Apr 2014 08:30:10 +0000

This is a guest blog post by Elizaveta Gaufman, a researcher and lecturer at the University of Tübingen.

The theme of my PhD dissertation is enemy images in contemporary Russia.

However, I am not only interested in governmental rhetoric, which is relatively easy to analyse, but also in the way enemy images circulate at the popular level. As I don’t have an IT background, I was constantly looking for easy-to-use tools that would help me analyse data from social networks, which for me represent a sort of petri dish of social experimentation. Before I came across ScraperWiki I had to use a bunch of separate tools: I had to scrape the data with one tool, then visualise it with another, and ultimately create word clouds of the most frequently used terms with a third. A separate problem was scraping visuals, which was not possible with my previous methodology.

In my dissertation I conceptualize enemy images as ‘an ensemble of negative conceptions that describes a particular group as threatening to the referent object’. As an enemy image is not so easy to operationalize, or to create a query for, I filtered a list of threats through opinion poll data and another tool that helped me establish which threats are debated in Russian mass media (both print and online). Then it was ScraperWiki’s turn.

ScraperWiki allows its users to analyse information in the cloud, including the number of re-tweets, frequency analysis of the tweet texts, language, and a number of other parameters, including visuals, which is extremely useful for my research. Another advantage is that I did not need to download any social network analysis programs, which mostly run on Windows anyway, while I have a Mac. In order to analyse Twitter data, I first input the identified threats into separate datasets that extract tweets from the Cyrillic segment of Twitter, because the threats are in Cyrillic as well (in the case of Pussy Riot, for example, I input its Cyrillic transliteration, ‘Пусси’).

ScraperWiki data hub

The maximum number of datasets allowed on the cheaper plan is 10, so I included 8 threats that scored on both parameters from my previous filtering (the West, fundamentalism, China, terrorism, inter-ethnic conflict, the US, Pussy Riot and homosexuality) and subsequently tested the remaining threats (Georgia, Estonia, foreign agents, Western investment, and Jews) to check which ones would yield more tweets and so be included in the dataset. Surprisingly enough, the two extra datasets that completed my data hub ended up being ‘foreign agents’ and ‘Jews’. To illustrate the features of ScraperWiki, I show two screenshots below that explain the data visualization.

Summary of Twitter data for query ‘yevrei’

The most important data for my dissertation on Screenshot 2 is the word frequency analysis, which shows what kind of words were used most in the tweets, in this case from the query ‘yevrei’ (Jews). ScraperWiki also offers an option to view the data in a table, so it is also possible to view the ‘raw’ data and close-read the relevant tweets, which in this dataset include the words ‘vinovaty’ (guilty), ‘den’gi’ (money), ‘zhidy’ (kikes) etc. But even without close reading, some of the frequently used words reveal ‘negative conceptions about a group of people’ that represent the cornerstone of an enemy image.

Visuals from query ‘yevrei’


The second most important data for my dissertation is contained in the ‘media’ section, where the collection of tweeted pictures is summarized. The visuals in this collection are sorted from most to least re-tweeted, and I can then analyse them according to my visual analysis methodology.

Even though ScraperWiki provides a lot of solutions to my data collection and analysis problems, my analysis is easier to carry out because of the language: Cyrillic provides a sort of cut-off principle for the tweets that would not be possible with English (unless ScraperWiki adds a feature that would allow geographic filtering of tweets). But in general it is a great solution for social scientists like me who do not have a lot of funding or substantial IT knowledge, but still want to perform solid research involving big data.


Getting Twitter connections Tue, 01 Apr 2014 16:00:53 +0000

Introducing the Get Twitter Friends tool

Chain Linkage by Max Klingensmith is licensed under CC BY-ND 2.0

Our Twitter followers tool is one of our most popular: enter a Twitter username and it scrapes the followers of that account.

We were often asked if it’s possible to get not only the users that follow a particular account, but also the users that are followed by that account. It’s a great idea, so we’ve developed a tool to do just that, which we’re testing before we roll it out to all users very soon.

The new “Get Twitter friends” tool is as simple as ever to use. The difference is that when you look at the results with the “View in a table” tool now, you’ll see two tables: twitter_followers and twitter_following. Together, these show you all of a user’s Twitter connections.

How close is @ScraperWiki to our followers?

Our Twitter followers

Our lovely Twitter followers 🙂

With this new tool, we can get an idea of how well connected any Twitter user is to their followers. For instance, how many of ScraperWiki’s followers does ScraperWiki follow back?

Using the “Download as a spreadsheet” tool, we can import the data into Excel. By using filters, we can discover how many users appear in both lists: those who follow the account and those followed by it. Alternatively, if you’re intrepid enough to use SQL, you can perform this query directly on ScraperWiki’s platform using the “Query with SQL” tool:

SELECT COUNT(*) FROM (
    SELECT id FROM twitter_followers
    INTERSECT
    SELECT id FROM twitter_following
);

This gives 774, over half of the total number of users that @ScraperWiki follows. We’re certainly interested in what many of our own followers are doing!
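The same follow-back check can be sketched end-to-end with Python’s standard sqlite3 module. Here the two tables the tool produces are rebuilt with toy ids, purely for illustration:

```python
import sqlite3

# Build miniature versions of the tool's two tables in memory,
# then intersect them to count mutual connections.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE twitter_followers (id INTEGER PRIMARY KEY);
    CREATE TABLE twitter_following (id INTEGER PRIMARY KEY);
    INSERT INTO twitter_followers VALUES (1), (2), (3), (4);
    INSERT INTO twitter_following VALUES (3), (4), (5);
""")

mutual = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT id FROM twitter_followers
        INTERSECT
        SELECT id FROM twitter_following
    )
""").fetchone()[0]
print(mutual)  # 2 -- users 3 and 4 follow back
```

On the real dataset you’d run the same SELECT against the downloaded SQLite database instead of the toy tables.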

Finding new users to follow

To get suggestions for other Twitter users to look out for, you could track who the users you’re particularly interested in are following.

For instance, Tableau make one of the data visualisation packages of choice here at ScraperWiki, and we’re fans of O’Reilly’s Strata conferences too. Which Twitter accounts do both of these accounts follow that ScraperWiki isn’t already following?

It wasn’t too tricky to answer this question using SQL queries on the SQLite databases that the Twitter tool outputs. (Again, you could download the data as spreadsheets and use Excel to do your analysis.)

It turns out that there were 86 accounts followed by both @tableau and @StrataConf, but not yet by @ScraperWiki. Almost all of these are working with data. There are individuals like Hadley Wickham, responsible for R’s ggplot2, and Doug Cutting, one of the creators of Hadoop. And there are businesses like Gnip and Teradata; all relevant suggestions.

Twitter accounts followed by @tableau and @StrataConf, but not by @ScraperWiki

Followed by @tableau and @StrataConf; not by us… yet!

It’s also possible to sort these results by follower count. This lets you see the most popular Twitter accounts, which probably belong to companies or people who are prominent in their field. At the same time, you might want to track accounts with relatively few followers: in our example, if both @tableau and @StrataConf are following them, then no doubt they’re doing something interesting.

Want to try it?

We’ve released it to all users now, so visit your ScraperWiki datahub, create a new dataset and add the tool!

Publish your data to Tableau with OData Fri, 07 Mar 2014 16:48:38 +0000

We know that lots of you use data from our astonishingly simple Twitter tools in visualisation tools like Tableau. While you can download your data as a spreadsheet, getting it into Tableau is a fiddly business (especially where date formatting is concerned). And when the data updates, you’d have to do the whole thing over again.

There must be a simpler way!

And so there is. Today we’re excited to announce our new “Connect with OData” tool: the hassle-free way to get ScraperWiki data into analysis tools like Tableau, QlikView and Excel Power Query.

To get a dataset into Tableau, click the “More tools…” button and select the “Connect with OData” tool. You’ll be presented with a list of URLs (one for each table in your dataset).

Copy the URL for the table of interest. Then nip over to Tableau, select “Data” > “Connect to Data” > “OData”, and paste in the URL. Simple as that.

The OData connection is fast and robust – so far we’ve tried it on datasets with up to a million rows, and after a few minutes, the whole lot was downloaded and ready to visualise in Tableau. The best bit is that dates and Null values come through just fine, with zero configuration.
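Tableau isn’t the only possible consumer: any client that can follow a paginated feed can use the same URLs. Here is a hedged sketch of what that looks like; the JSON shape assumed ({"value": [...], "@odata.nextLink": "..."}) is the OData v4 convention, and the actual feed format may differ:

```python
# Walk an OData feed page by page, yielding rows.
# `fetch` is any callable mapping a URL to a parsed JSON dict,
# e.g. lambda u: requests.get(u).json() in real use.
def iter_odata(url, fetch):
    while url:
        page = fetch(url)
        for row in page["value"]:
            yield row
        url = page.get("@odata.nextLink")

# Demonstration with canned pages standing in for HTTP responses:
pages = {
    "page1": {"value": [{"id": 1}, {"id": 2}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"id": 3}]},
}
rows = list(iter_odata("page1", pages.get))
print([row["id"] for row in rows])  # [1, 2, 3]
```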

The “Connect with OData” tool is available to all paying ScraperWiki users, as well as journalists on our free 20-dataset journalist plan.


If you’re a Tableau user, try it out, and let us know what you think. It’ll work with all versions of Tableau, including Tableau Public.

Face ReKognition Thu, 13 Feb 2014 09:51:27 +0000

I’ve previously written about social media and the popularity of our Twitter Search and Followers tools. But how can we make Twitter data more useful to our customers? Analysing the profile pictures of Twitter accounts seemed like an interesting thing to do, since they are often the face of the account holder, and a face can tell you a number of things about a person, such as their gender, age and race. This type of demographic information is useful for marketing and for understanding who your product appeals to. It could also be a way of tying together public social media accounts, since people like me use the same image across multiple accounts.

Compact digital cameras have offered face recognition for a while, and on my PC, Picasa churns through my photos identifying people in them. I’ve been doing image analysis for a long time, although never before on faces. My first effort at face recognition involved the OpenCV library. OpenCV provides a whole suite of image analysis functions which do far more than just detect faces. However, getting it installed and working with the Python bindings on a PC was a bit fiddly, the documentation was poor, and the built-in face analysis capabilities were limited.

Fast forward a few months, and I spotted that someone had cast the ReKognition API over the images that the British Library had recently released, a dataset I’ve been poking around at too. The ReKognition API takes an image URL and a list of characteristics in which you are interested. These include gender, race, age, emotion, whether or not you are wearing glasses or, oddly, whether you have your mouth open. Besides this summary information it returns a list of feature locations (i.e. locations in the image of eyes, mouth, nose and so forth). It’s straightforward to use.
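Turning one of those responses into a human-readable summary is a few lines of Python. The field names and thresholds below are assumptions for illustration, not the API’s exact schema:

```python
# Summarise one face from a ReKognition-style response dict.
# Confidence-style scores above 0.5 are treated as "yes" (an assumption).
def describe_face(face):
    glasses = "glasses on" if face["glasses"] > 0.5 else "no glasses on"
    mouth = "mouth open" if face["mouth_open"] > 0.5 else "mouth shut"
    return "%s, %s, %s, age %s, %s and %s" % (
        face["emotion"].capitalize(), face["race"], face["gender"],
        face["age"], glasses, mouth)

sample = {"emotion": "happy", "race": "white", "gender": "male",
          "age": 46, "glasses": 0.1, "mouth_open": 0.0}
print(describe_face(sample))
# Happy, white, male, age 46, no glasses on and mouth shut
```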

But who should be the first targets for my image analysis? Obviously, the ScraperWiki team! The pictures are quite small, but ReKognition identified me as a “Happy, white, male, age 46 with no glasses on and my mouth shut”. Age 46 is a bit harsh – I’m actually 39 in my profile picture. A second target came out as “Happy, Indian, male, age 24.7, with glasses on and mouth shut”. This was fairly accurate: Zarino was 25 when the photo was taken, he is male and has his glasses on, but he is not Indian. Two (male) members of the team have still not forgiven ReKognition for describing them as female, particularly the one described as a 14 year old.

Fun as it was, this doesn’t really count as an evaluation of the technology. I investigated further by feeding in photos of a whole load of famous people. The results are shown in the chart below. The horizontal axis is someone’s actual age; the vertical axis shows their age as predicted by ReKognition. If the predictions were correct, the points representing the celebrities would fall on the solid line. The dotted line shows a linear regression fit to the data. The equation of the line, y = 0.673x (I constrained it to pass through zero), tells us that age is consistently under-predicted by a third, or perhaps celebrities look younger than they really are! The R² parameter tells us how good the fit is: a value of 0.7591 is not too bad.
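For the curious, a through-origin fit and its R² are simple to compute by hand. The ages below are invented for illustration; the post’s real data gave m = 0.673 and R² = 0.7591:

```python
# Fit y = m*x through the origin by least squares, and compute R².
def fit_through_origin(xs, ys):
    m = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    ss_res = sum((y - m * x) ** 2 for x, y in zip(xs, ys))
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return m, 1 - ss_res / ss_tot

actual_ages = [25, 40, 55, 70]     # invented "real" ages
predicted_ages = [20, 28, 35, 45]  # invented "ReKognition" predictions
m, r2 = fit_through_origin(actual_ages, predicted_ages)
print(round(m, 2), round(r2, 2))  # 0.66 0.95
```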


I also tried out ReKognition on a couple of class photos taken at reunions, graduations and so forth. My thinking here was that I would get a cohort of people aged within a year of each other. These actually worked quite well: for older groups of people I got a standard deviation of only 5 years across a group of, typically, 10 people. A primary school class came out at 16+/-9 years, which wasn’t quite so good. I suspect the performance here is related to the fact that such group photos are taken relatively carefully, so the lighting and setup for each face in the photo is, by its nature, the same.

Looking across these experiments: ReKognition is pretty good at finding faces in photos, and at not finding faces where there are none (about 90% accurate). It’s fairly good with gender (getting it right about 80% of the time, typically struggling a bit with younger children), and it detects glasses pretty well. I don’t feel I tested it well on race. On age, results are variable: for the ScraperWiki set the R² value for linear regression between actual and detected ages is about 0.5, whilst for famous people it is about 0.75. In both cases it tends to under-estimate age and has never given an age above 55, despite being fed several more mature celebrities and grandparents. So on age it definitely tells you something, and under certain circumstances it can be quite accurate. Don’t forget the images we’re looking at are completely unconstrained; they’re not passport photos.

Finally, I applied face recognition to Twitter followers for the ScraperWiki account, and my personal account. The Summarise This Data tool on the ScraperWiki Platform provides a quick overview of the data added by face recognition.


It turns out that a little over 50% of the followers of both accounts have a picture of a human face as their profile picture. It’s clear the algorithm makes the odd error, mis-identifying things that are not human faces as faces (including the back of a London taxi cab). There’s also the odd sketch or cartoon of a face rather than a photo, and some accounts have pictures of famous people rather than the account holder. Roughly a third of the followers of either account are identified as wearing glasses, and three quarters of them look happy. Average ages in both cases were 30. The breakdown in terms of race is 70:13:11:7 White:Asian:Indian:Black. Finally, my followers are approximately 45% female, and those of ScraperWiki are about 30% female.

We’re now geared up to apply this to lists of Twitter followers – are you interested in learning more about your followers? Then send us an email and we’ll be in touch.

Getting sociable Fri, 24 Jan 2014 09:10:20 +0000

Image by Yoel Ben-Avraham

The Search for Tweets and Get Twitter followers tools are the most popular on our platform.

Why is this?

In part this is because we’re sociable creatures; platforms like Twitter get a lot of interaction time from a lot of people. A certain section of the population has a data packrat mentality: for them, ScraperWiki is an easy way to collect, keep and download Twitter data which they feel must surely be useful for something. But more than this, there is value in social data.

Why should you be interested?

This is data about your customers, or at least people who have made some small effort to interact with you. What do these people have in common? Where might you find more people like them who would be interested in your products? What can you offer them to make them give you money? You may have your own internal data to mine for insights, but otherwise the alternative for getting this type of data from customers is running market research investigations. These would likely provide richer data, but at a higher cost. Mining your social data will help answer these questions relatively economically.

What can we do?

At ScraperWiki we’ve just scratched the surface of what’s possible in terms of data collection and analysis from social platforms. Alongside the public Twitter tools we have a LinkedIn tool, and experimental tools for extracting data from Plurk, Flickr, Instagram and Facebook. Adding more tools is simply a matter of time and inclination rather than any great technical challenge. So far we’ve been exclusively using the free, public APIs for these services, and this works pretty well. The public API for Twitter only gives search results going back seven days, but if you’ve started a collector on ScraperWiki we’ll keep all the tweets in your search for as long as it’s running. Twitter’s API is also “rate limited”: it will only provide a fixed number of results in a given time period. This threshold is pretty high, though – in theory you can get 18,000 tweets every 15 minutes. Using our follower tool we’ve collected millions of followers for high profile accounts.
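For the curious, that 18,000 figure falls straight out of Twitter’s published limits at the time: 180 search requests per 15-minute window, each returning up to 100 tweets.

```python
# Derive the theoretical throughput of the search API from its rate limits.
requests_per_window = 180  # search requests allowed per window
tweets_per_request = 100   # maximum tweets returned per request
window_minutes = 15

tweets_per_window = requests_per_window * tweets_per_request
print(tweets_per_window)                           # 18000 tweets per window
print(tweets_per_window * (60 // window_minutes))  # 72000 tweets per hour, in theory
```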

If more data is required then bigger data feeds from DataSift and Gnip are possible, although these are pretty pricey – thousands of dollars per month. Access over and above the public APIs is only available for some services.

What is possible?

So that’s the data collection side of things. Once we have the data then what you can do with it is only limited by your imagination, and programming skills. For example, we’ve looked at the time course of the #InspiringWomen hashtag back here. Andy Cotgreave at Tableau shows how easy this type of analysis is to do with pre-packaged tools here. You could use this sort of analysis to track a product launch or a media campaign.

You can look at the characteristics of your followers, and incidentally discover some of Twitter’s following rules, as I showed here at the end of my review of R in Action. You can use machine learning to find out which of your followers are company accounts (see here). I’ve even written code to find out which followers have faces in their profile pictures. This sort of analysis tells you more about your followers, and hopefully your customers.

These are just examples. I’m interested in finding out about the people who have liked or commented on the ScraperWiki Facebook page. I’ve discovered that I find the responses from Facebook’s API easier to understand than Facebook’s web interface!

Finding out more?

If you want to learn more about exploring social media data for yourself, then Matthew A. Russell’s book is an excellent introduction; I’ve reviewed it here.

If you would like us to help you with your data, then contact me!
