Book review: Graph Databases by Ian Robinson, Jim Webber and Emil Eifrem
https://blog.scraperwiki.com/2015/01/book-review-graph-databases-by-ian-robinson-jim-webber-and-emil-eifrem/
Sat, 03 Jan 2015

Regular readers will know I am on a bit of a graph binge at the moment. In computer science and mathematics, graphs are collections of nodes joined by edges; they have all sorts of applications, including the study of social networks and route finding. Having covered graph theory and visualisation, I now move on to graph databases. I started on this path with Seven Databases in Seven Weeks, which introduces the Neo4j graph database.

And so to Graph Databases by Ian Robinson, Jim Webber and Emil Eifrem, which, despite its general title, is really a book about Neo4j. This is no big deal, since Neo4j is the leading open source graph database.

This is not just random reading: we’re working on an EU project, NewsReader, which makes significant use of RDF – a type of graph-shaped data. We’re also working on a project for a customer which involves traversing a hierarchy of several thousand nodes. This leads to some rather convoluted joining operations when done in a SQL database; a graph database might be better suited to the problem.

The book starts with some definitions, identifying the types of graph database (property graph, hypergraph, RDF). Neo4j uses property graphs, where nodes and edges are distinct items and each can hold properties. In contrast, RDF graphs are expressed as triples, which encompass both edges and nodes. In hypergraphs, a single edge can connect any number of nodes. A second set of definitions covers the types of graph processing system: graph databases and graph analytical engines. Neo4j is designed to provide good performance for database-like queries, acting as a backing store for a web application rather than as an analytical engine carrying out offline calculations. There’s also an appendix comparing NoSQL databases which feels like it should be part of the introduction.
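To make the distinction concrete, here is a rough sketch in plain Python data structures (labels and prefixes are invented for illustration) of the same fact – “Alice knows Bob, since 2012” – as a property graph relationship and as RDF triples:

# Property graph: the relationship is an item in its own right
# and can carry properties directly.
property_graph_edge = {
    "start": "Alice", "type": "KNOWS", "end": "Bob",
    "properties": {"since": 2012},
}

# RDF: everything is a triple, so to hang a property off the
# relationship the statement has to be reified (given its own
# identifier). Prefixes here are purely illustrative.
rdf_triples = [
    ("ex:stmt1", "rdf:subject",   "ex:Alice"),
    ("ex:stmt1", "rdf:predicate", "ex:knows"),
    ("ex:stmt1", "rdf:object",    "ex:Bob"),
    ("ex:stmt1", "ex:since",      "2012"),
]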

A key feature of native graph databases, such as Neo4j, is “index-free adjacency”. The authors don’t seem to define this well early in the book, but later on, whilst discussing the internals of Neo4j, it is all made clear: nodes and edges are stored as fixed-length records holding references to the nodes to which they are connected. This means it’s very fast to visit a node and then iterate over all of its attached neighbours. The alternative, index-based lookups, may involve scanning a whole table to find all links to a particular node. It is in traversing networks that Neo4j shines in performance terms compared to SQL.
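A toy sketch in Python (nothing like Neo4j’s actual record format, just the shape of the idea) shows why this is fast: each node carries direct references to its own relationships, so finding neighbours is a walk over a short list rather than a scan of an index:

# Each node keeps direct pointers to its relationship records,
# so traversal never consults a global index.
class Node:
    def __init__(self, name):
        self.name = name
        self.rels = []          # direct references to relationships

class Rel:
    def __init__(self, start, end, rel_type):
        self.start, self.end, self.type = start, end, rel_type
        start.rels.append(self)
        end.rels.append(self)

def neighbours(node):
    # O(degree): walk the node's own relationship list
    return [r.end if r.start is node else r.start for r in node.rels]

alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
Rel(alice, bob, "KNOWS")
Rel(alice, carol, "KNOWS")
print([n.name for n in neighbours(alice)])   # ['Bob', 'Carol']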

As Robinson et al emphasise in motivating the use of graph databases: other types of NoSQL database, and SQL databases, are not built fundamentally around the idea of relationships between data, except in quite a constrained sense. For SQL databases there is an overhead to carrying out join queries, which are SQL’s way of introducing relationships. As I hinted earlier, storing hierarchies in SQL databases leads to some nasty-looking, slow queries. In practice SQL databases are denormalised for performance reasons to address these cases. Graph databases, on the other hand, are all about relationships.
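To illustrate the point about hierarchies, here is a hedged example (table, label and relationship names are invented): walking a reporting chain of unknown depth needs a recursive common table expression in SQL, whereas Cypher expresses it as a single variable-length pattern.

# SQL: a recursive CTE to find everyone below employee 1 in the hierarchy.
sql_query = """
WITH RECURSIVE subordinates AS (
    SELECT id, manager_id, name FROM employees WHERE id = 1
    UNION ALL
    SELECT e.id, e.manager_id, e.name
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT name FROM subordinates;
"""

# Cypher: the same traversal as one variable-length relationship match.
cypher_query = """
MATCH (boss:Employee {id: 1})<-[:REPORTS_TO*]-(e:Employee)
RETURN e.name
"""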

Schemas are an important concept in SQL databases; they are used to enforce constraints on a database, i.e. “this thing must be a string” or “this thing must be in this set”. Neo4j describes itself as “schema optional”: the schema functionality seems to have been introduced relatively recently and is not discussed in this book, although it is alluded to. As someone with a small background in SQL, the absence of schemas in NoSQL databases is always the cause of some anxiety and distress.
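For what it’s worth, “schema optional” in Neo4j amounts to opting in to constraints and indexes where you want them – a hedged sketch using Neo4j 2.x Cypher syntax (which post-dates this book), with invented labels and properties:

# By default any node can carry any properties; constraints and indexes
# are opt-in. Syntax as per Neo4j 2.x – check the current docs.
optional_schema = [
    # every Book node must have a unique isbn property
    "CREATE CONSTRAINT ON (b:Book) ASSERT b.isbn IS UNIQUE",
    # an index to speed up lookups by title (no constraint implied)
    "CREATE INDEX ON :Book(title)",
]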

A chapter on data modelling and the Cypher query language feels like the heart of the book. People say that Neo4j is “whiteboard friendly”: if you can draw a relationship structure on a whiteboard then you can implement it in Neo4j without going through the rigmarole of making a normalised schema that doesn’t look like what you’ve drawn. This seems fair up to a point: your whiteboard scribbles do tend to be guided to a degree by what your target system is, and you can still go wrong going from whiteboard to data model, even in Neo4j.

I imagine it is no accident that more recent query languages like Cypher and SPARQL look a bit like SQL. That said, Cypher relies on ASCII art: nodes are wrapped in round brackets, edges (relationships) in square brackets, and arrows --> indicate the direction of relationships:

MATCH (node1)-[rel:TYPE]->(node2)
RETURN rel.property

which is pretty un-SQL-like!

Graph Databases goes on to describe implementing an application using Neo4j. The example code in the book is in Java, but py2neo appears to be a relatively mature Python client. The situation here seems to be in flux, since searching the web brings up references to an older python-embedded library which is now deprecated. The book pre-dates Neo4j 2.0, which introduced some significant changes.
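As a taste of the Python side, here is a minimal py2neo sketch; it assumes a local Neo4j server and a recent py2neo, and since the py2neo API has changed considerably between releases it should be checked against the current documentation rather than taken as gospel:

from py2neo import Graph, Node, Relationship

# Connect to a local Neo4j instance (credentials are placeholders).
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Create two nodes and a relationship between them.
alice = Node("Person", name="Alice")
book = Node("Book", title="Graph Databases")
graph.create(Relationship(alice, "REVIEWED", book))

# Run a Cypher query and iterate over the results.
for record in graph.run("MATCH (p:Person)-[:REVIEWED]->(b:Book) "
                        "RETURN p.name AS reviewer, b.title AS title"):
    print(record["reviewer"], "reviewed", record["title"])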

The book finishes with some examples from the real world and some demonstrations of popular graph theory analysis. I liked the real-world examples of a social recommendation system, access control and parcel routing. The coverage of graph theory analysis was rather brief, and didn’t explicitly use Cypher, which would have made the presentation different from what you find in the usual graph theory textbooks.

Overall I have mixed feelings about this book: the introduction and overview sections are good, as is the part on Neo4j internals. It’s a rather slim volume, feels a bit disjointed and is not up to date with Neo4j 2.0, which has significant new functionality. Perhaps this is not the arena for a dead-tree publication – the Neo4j website has a comprehensive set of reference and tutorial material, and if you are happy with a purely electronic version then you can get Graph Databases for free (here).

Book review: Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson
https://blog.scraperwiki.com/2014/11/book-review-seven-databases-in-seven-weeks-by-eric-redmond-and-jim-r-wilson/
Wed, 12 Nov 2014

I came to databases a little late in life; as a physical scientist I didn’t have much call for them. Then a few years ago I discovered the wonders of relational databases and the power of SQL. The ScraperWiki platform strongly encourages you to save data to SQLite databases to integrate with its tools.

There is life beyond SQL databases, much of which has evolved in the last few years. I wanted to learn more, and a plea on Twitter quickly brought me a recommendation for Seven Databases in Seven Weeks by Eric Redmond and Jim R. Wilson.

The book covers the key classes of database, starting with relational databases in the form of PostgreSQL. It then goes on to look at six further databases in the so-called NoSQL family – all relatively new compared to venerable relational databases. The six other databases fall into several classes: Riak and Redis are key-value stores, CouchDB and MongoDB are document databases, HBase is a columnar database and Neo4j is a graph database.

Relational databases are characterised by storage schemas involving multiple interlinked tables containing rows and columns; this layout is designed to minimise the repetition of data and to provide maximum query-ability. Key-value stores only store a key and a value, in the manner of a dictionary, but the “value” may be of a complex type. A value can be returned very fast given a key – this is the core strength of the key-value stores. The document stores MongoDB and CouchDB store JSON “documents” rather than rows. These documents can store information in nested hierarchies which don’t all need to have the same structure; this allows maximum flexibility in the type of data to be stored, but at the cost of ease of querying.
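To get a feel for the key-value model, here is a small sketch using the redis-py client (it assumes a local Redis server and a reasonably recent redis-py):

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# The value can be a simple string...
r.set("greeting", "hello")
print(r.get("greeting"))            # b'hello'

# ...or a more complex type, such as a hash stored against a single key.
r.hset("user:42", mapping={"name": "Alice", "city": "Liverpool"})
print(r.hgetall("user:42"))         # {b'name': b'Alice', b'city': b'Liverpool'}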

HBase fits into the Hadoop ecosystem; the language used to describe it looks superficially like that used to describe tables in a relational database, but this is a bit misleading. HBase is designed to work with massive quantities of data but not necessarily to give the full querying flexibility of SQL. Neo4j is designed to store graph data – collections of nodes and edges – and comes with a query language particularly suited to querying (or walking) data so arranged. This seems very similar to triplestores and the SPARQL query language used in semantic web technologies.

Relational databases are designed to give you ACID (Atomicity, Consistency, Isolation, Durability); essentially you shouldn’t be able to introduce inconsistent changes to the database, and it should always give you the same answer to the same query. The NoSQL databases described here have a subtly different core goal. Most of them are designed to work on the web and are discussed in terms of CAP (Consistency, Availability, Partition tolerance); indeed several of them offer native REST interfaces over HTTP, which means they are very straightforward to integrate into web applications. CAP refers to the ability to return a consistent answer, from any instance of the database, in the face of network (or partition) problems, on the assumption that the database may be distributed across multiple locations on the web. The famous CAP theorem contends that you can have any two of Consistency, Availability and Partition tolerance at any one time, but not all three together.

NoSQL databases are variously designed to scale across many machines. Replicating the same database in multiple places provides greater capacity to serve requests, even in the face of network connectivity problems, whilst “sharding” fragments the data so that some items are stored on one server and some on another, allowing more data to be stored than a single machine could hold. Both are forms of horizontal scaling, as opposed to vertical scaling, which simply means running on a bigger machine.

I’m not a SQL expert by any means but it’s telling that I learnt a huge amount about PostgreSQL in the forty or so pages on the database. I think this is because the focus was not on the SQL query language but rather on the infrastructure that PostgreSQL provides. For example, it discusses triggers, rules, plugins and specialised indexing for text search. I assume this style of coverage applies to the other databases. This book is not about the nitty-gritty of querying particular database types but rather about the different database systems.
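As a flavour of that PostgreSQL-specific material, here is what specialised indexing for text search looks like, sketched as SQL strings with invented table and column names:

# A GIN index over a tsvector expression...
create_index = """
CREATE INDEX books_body_fts ON books
USING GIN (to_tsvector('english', body));
"""

# ...and a query that uses full-text matching rather than LIKE.
search = """
SELECT title FROM books
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'graph & database');
"""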

The NoSQL databases generally support MapReduce-style queries. This is a scheme most closely associated with Big Data and the Hadoop ecosystem, but in this instance it is more a framework for writing queries which may be executed across a cluster of computers.
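A toy version of the idea in plain Python (nothing to do with any particular database’s API): a map function emits key/value pairs, a reduce function folds the values for each key, and the database runs the equivalent pair of functions next to the data, possibly across many machines:

from collections import defaultdict

docs = [
    {"city": "Liverpool", "visits": 3},
    {"city": "New York", "visits": 5},
    {"city": "Liverpool", "visits": 2},
]

def map_fn(doc):
    yield doc["city"], doc["visits"]

def reduce_fn(values):
    return sum(values)

# Group the mapped key/value pairs by key, then reduce each group.
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        grouped[key].append(value)

totals = {key: reduce_fn(values) for key, values in grouped.items()}
print(totals)   # {'Liverpool': 5, 'New York': 5}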

I’m on a bit of a graph theory binge at the moment, so Neo4j was the most interesting to me.

As an older data scientist I have a certain fondness for things that have been around for a while, like FORTRAN and SQL databases, and I’ve looked with some disdain at these newfangled NoSQL things. To a degree this book has converted me, at least to the point where I look at ScraperWiki projects and think: “It might be better to use a * database for this piece of work”.

This is an excellent book which was pitched at just the right level for my purposes; I’ll be looking out for more Pragmatic Programmers books in future.

‘Big Data’ in the Big Apple
https://blog.scraperwiki.com/2011/09/big-data-in-the-big-apple/
Thu, 29 Sep 2011

My colleague @frabcus captured the main theme of Strata New York #strataconf in his most recent blog post. This was our first official speaking engagement in the USA as a Knight News Challenge 2011 winner. Here is my twopence worth!

At first we were a little confused at the way in which the week-long conference was split into three consecutive mini-conferences with what looked like repetitive content. The reality was that the one-day Strata Jump Start was like an MBA for people trying to understand the meaning of ‘Big Data’. It gave a 50,000-foot view of what is going on and made us think about the legal stuff, how it will impact the demand for skills, and how the pace with which data is exploding will dramatically change the way in which businesses operate – every CEO should attend or watch the videos and learn!

The following two days, called the Strata Summit, were focused on what people need to think about strategically to get business ready for the onslaught. In his welcome address, Edd Dumbill, program chair for O’Reilly, said “Computers should serve humans….we have been turned into filing clerks by computers….we spend our day sifting, sorting and filing information…something has gone upside down, fortunately the systems that we have created are also part of the solution…big data can help us…it may be the case that big data has to help us!”

Big Apple…and small products

To use the local lingo, we took a ‘deep dive’ into various aspects of the challenges. The sessions were well choreographed and curated. We particularly liked the session ‘Transparency and Strategic Leaking’ by Dr Michael Nelson (Leading Edge Forum – CSC), where he talked about how companies need to be pragmatic in an age when it is impossible to stop data leaking out of the door. Companies, he said, ‘are going to have to be transparent’ and ‘are going to have to have a transparency policy’. He referred to a recent article in the Economist, ‘The Leaking Corporation’, and its assertion that corporations that leak their own data ‘control the story’.

Courtesy of O’Reilly Media

Simon Wardley’s (Leading Edge Forum – CSC) ‘Situation Normal Everything Must Change’ segment made us laugh, especially the philosophical little quips that came from his encounter with a London taxi driver – he conducted it at lightning speed, and his explanation of ‘ecosystems’ and how big data offers a potential solution to the ‘Innovation Paradox’ was insightful. It was a heavy-duty session but worth it!

Courtesy of O’Reilly Media

There were tons of excellent sessions to peruse. We really enjoyed Cathy O’Neil’s ‘What kinds of people are needed for data management’, which talked about data scientists and how they can help corporations to discern ‘noise’ from signal.

Our very own Francis Irving was interviewed about how ScraperWiki relates to Big Data and Investigative Journalism.

Courtesy of O’Reilly Media

Unfortunately we did not manage to see many of the technology exhibitors #fail. However, we did see some very sexy ideas, including a wonderful software start-up called Kaggle.com – a platform for data prediction competitions – and its Chief Data Scientist, Jeremy Howard, gave us some great ideas on how to manage ‘labour markets’.

Oh yes, and we checked out why it is called Strata….

We had to leave early to attend the Online News Association (#ONA) event in Boston, so we missed part III, the two-day Strata Conference itself – it is designed for people at the cutting edge of data – the data scientists and data activists! I just hope that we manage to get to Strata 2012 in Santa Clara next February.

In his closing address, ‘Towards a global brain’, Tim O’Reilly gave a list of 10 scary things that are leading into the perfect humanitarian storm, including…Climate Change, Financial Meltdown, Disease Control, Government inertia……so we came away thinking of a T-shirt theme…Hmm, we’re f**ked so let’s scrape!!!

Courtesy Hugh MacLeod
