Data Science London 12th June – a speaker speaks

Data Science London run an approximately monthly programme of evening events comprising short talks, beer and pizza. Last week I was invited to give a talk on Scraping and Parsing PDF using Python.

The venue for these events is the Westminster Hub in central London; on our way there we were diverted by the premiere of Man of Steel in Leicester Square.

The audience was large, friendly and very diverse. Most, if not all, of the audience were highly technical. There were men in suits and ties, and people with piercings, t-shirts and shorts. There were academics, web developers, economists and political science students.

There were four speakers on the evening:

Rosaria Silipo from Knime presented on using their platform to process social discourse data from Slashdot, analysing it for sentiment and for user roles in the community. The content from Slashdot acted as a substitute for content from a telecoms forum in which their commercial clients were interested.
I spoke on scraping and parsing PDF files, giving some details of the Python libraries we commonly use and illustrating with examples from my Royal Society membership list parsing and the verbatim records of the UN General Assembly and Security Council. I’ll write about this second project another time. The audience were very responsive (they laughed at my jokes) and there were some good questions at the end.
Third up, after a brief pause for me to fetch a beer and wind down, was Doug Cutting, the inventor of Hadoop, who now works for Cloudera. He spoke about adding search capabilities to Lucene and Hadoop. I suspect he may have been the reason for the packed house.
Finally, Ian Ozsvald from Mor Consulting spoke about brand name disambiguation for Twitter, i.e. knowing when someone is talking about Apple the brand or apple the fruit. There are tools for this type of problem, but they appear to have been trained on longer-form media and so do not perform well with short-form sources such as Twitter. Ian demonstrated an approach using the scikit-learn machine learning package for Python. This was a work in progress, for which he is looking for collaborators.

All in all a very enjoyable and interesting evening. I can heartily recommend Data Science London events if you get a chance to go.

Finally, a big thank you to Carlos for organising such a great event.

Full ‘New Zealand’ House!

ScraperWiki
