Graph Database in Javascript?

This crazy thought came to me after working on some stuff on graph data structure — Let’s implement a graph DB in javascript

(I always have thoughts on putting everything in javascript, while most of the cases someone did it.)

Why

Such db supporting most basic features, transaction, graph query has following purposes.

  • common data visualization

It is very likely you have some graph-based data visualization at front end (d3.js) and a graph database (neo4j) at the back. What if the graph db can output the graph representation that can easily bind to DOM with adapter and visualized ?

  • cache at client side

It is possible to interact with the graph front end, but for any new query you need to send to the back again

What if we can query a subgraph and store it in the js graph db as a cache and support basic query functions. This greatly reduce traffic, reaction time and Dom binding for data visualizations

  • in javascript

Javascript itself is sexy. Ok aside from being nerd about it — it is easy to code, with nice optimization and when you have everything in javascript (mobile,web,nodejs) you don’t want to have a dependency on backend in java or so. We are talking about the future – this is not targeting to replace neo4j

The prototype should have these characteristics

  • in memory

at the moment it is aimed that the graph model itself is represented using javascript objects itself, to facilitate the visualization and simpler coding.

It is possible to use Node.js or local storage to support persistence, but that is not in current scope as the design will be very different

  • open source

of course.

  • Basic

At the moment, it does not have to scale very nice, high availably, recovery which is the real complex part of database

We are talking about a In-memory, light-weight database in javascript that can run in both browser and nodejs

Most I know is from the concise & inspiring book Graph Database

The most crucial point about graph database in my mind is about index-free-adjanceny.

We need to focus on this fundamental problem. Is this doable in javascript?

Neo4j is the reference architecture for discussion. The Graph DB Internals chapter of the book is a highly recommended.

Just one thing neo4j very different is that neo4j is a file-based storage, while currently we discuss a in-memory database prototype. Still, most concepts apply.

Lessons from Neo4j

Firstly some words on index-free-adjanceny —

Graph traversal is cheap because every node has reference to node it is connected to, forward or backward.

In databases that we usually refer to (mostly RDBMS), we can use index to locate a record. (i.e. when look up by id)

This is cheap – but only relatively. When the record transaction id 1 say it is of customer id 100, you need another index lookup, which is o(n * logn), and you know the long painful stories on optimizing database to get the index right

Index-free-adjanency is about in every node you have the reference, i.e. location of the node that is connected to. i.e. O(1), without doing a search, which even indexed takes O(log n)

Neo4j also put this considerations in the file-based storage layer.

Node store file of neo4j is fixed length (9 byte), makes the lookup by id an O(1) operation. (e.g. with id 100 we know it is at 900bytes)

With this context, traversal in graph is fast – because going from node id=1 is to another node id=99 , we chase the “next id” in a node and just move to the known byte location of it – which likely in the same fixed-length

Although how far this concept goes I am not sure, -Transversal across files/machine? as

“User’s view on the graph and actual records on disk are dissimilar.”

It is also worth mentioning the mathematical problem of optimally partitioning graph across in distributed environment is NP-complete problem.

1st class Relationship

Also in Graph DB like neo4j, relationship itself is a “first-class citizen”

Each Node doesn’t simply store pointer of next node, every relationship is store and allow easy lookup. this facilitate going back and forth in the graph (relationship is directed)

Bear in mind a node can connect to multiple nodes. Pointers are also stored in relationship such that we can go to one side’s node ‘s next or previous relationship cheaply.

Wont go into too much details as most important is the idea. Slide 12 here shows the actually record “schema”

In javascript?

we need to consider the language itself.

Naively, it is possible to use object reference to get objects linked.

For example, with

Node Alice —loves—> Node Bob –loves–> Node Charlie

(Sadly in this world of cryptography characters, there is no one love each other)

Alice = {

 "nextRel": Love_1

}

Bob = {

 "nextRel": Love_2

}

Love_1 = {

 "first_node":Alice,

 "second_node":Bob,

 "relationType": "love"

}

However, this may not be sufficient for graph DB – we need low cost traversal.

If I need to go to next node of a, b, there are two steps here: First get the reference of Bob

Alice[“nextRel”][“second_node”] itself is two object lookup, which is not that cheap apparently.

So what is closer? Array?

One of the very characteristic about javascript as you may know, Array is not really an “array” in sense of other languages.

When we learn C, malloc()

Array are consecutive memory locations which makes it really efficient to lookup value and process together.

In Javascript, arrays are Objects. Arrays can be sparsed.

A equivalent data structure don’t really exists in javascript, but the good news is most javascript engine try to make javascript work like that.

We need to know how far this goes and make sure this optimization in place.

In engines like V8, hidden classes are created behind the scence such that property access don’t require a dictionary lookup.

Furthermore, Array Objects are special in the specification – 15.4

Index-based Arrays are optimized.

A for loop on an index-based array is much faster than looping properties in a random object.(http://jsperf.com/performance-of-array-vs-object/3)

Tips on v8 perf by Addy Osmani

There’s really only one major difference between objects and arrays in JavaScript, and that’s the arrays’ magic length property. If you’re keeping track of this property yourself, objects in V8 should be just as fast as arrays.

Array vs Objects is relevant?better traversal model?

So after all there is no difference and we can use a big nested array to mimic the fixed-length records in neo4j?

A Poc is done http://jsperf.com/javascript-graph/2

Testing in Chrome 30.0.1599.101 on OS X 10.7.4
Test Ops/sec
traverse random connected node by index
indexTranverse(large_random_nodes_offset,NODE_COUNT_LARGE);
17.19±1.20%79% slower
traverse random connected node by pointer
pointerTranverse(large_random_nodes, NODE_COUNT_LARGE);
22.95±41.98%80% slower
traverse ordered node by index
indexTranverse(large_ordered_nodes_offset,NODE_COUNT_LARGE);
62.43±83.32%fastest
traverse ordered node by reference
pointerTranverse(large_ordered_nodes, NODE_COUNT_LARGE);
103±26.84%fastest

To make result relevant, we are taking a large record – one million nodes.

Also we have cases where what each node connecting to is randomized.

The result is consistent that using pointer perform better in both ordered / random case in similar ratio, which benefit from reduced array lookup. And access in order really speed up – This implies the major benefit of optimizing is from localized memory access which help mechanisms like caching & reduce swapping bytes in memory.

Memory locality still matters

Thus, locality in array is still important consideration on where nodes should reside.

Some idea to investigate could be buckets of arrays of subgraph, periodic optimization to put connected nodes closer. Meanwhile, this should not limit the graph db to express itself that nodes being able to be freely connected to any node.

One place we can apply array directly is looping the relationships of a node.

just a start

I could be very wrong. This is just an idea to explore.

I don’t even have much experience in practical neo4j at the moment. so I will come back to this and see.

Discussions

There are quite a lot of related discussions / inspiring posts

http://stackoverflow.com/questions/614126/why-is-array-push-sometimes-faster-than-arrayn-value/614255#614255

http://stackoverflow.com/questions/7408734/building-an-object-graph-from-flat-object-array-in-javascript

http://mrale.ph/blog/2012/06/03/explaining-js-vms-in-js-inline-caches.html

So Mozilla’s Pancake is storing user browsing history & stuff in elastic search & neo4j. What if a Chrome extension or Firefox can do some of it directly. https://wiki.mozilla.org/PancakeInfrastructure#PancakeElasticSearchServer

Advertisements

10 thoughts on “Graph Database in Javascript?

  1. You’ve inspired me… well actually I stumbled on this looking for tips on the development of my person client-side graph database (not designed from Neo4j model, but from my personal requirements). I would be interested to get other’s thoughts on it as well. It’s only about a week into infancy at this point though.

    https://github.com/jmfarp2011/GraphDBjs

      • Awesome. It is pretty basic at this point, and like I said, dev’d for my personal project, but finding that there really doesn’t seem to be any other solutions out there doing this (at least that I found), I decided to open source it and maybe get some other people looking at it and possibly contributing to it as well. I would love any feedback or suggestions. I see too much value in the graph to settle with indexedDB or contant AJAX/JSON(P) for data.

    • I see you’ve since reworked this into Graph.js. I am thinking of an application where the DB would be rather static and fed as textual source code. That should be as human readable as possible!

      I suggest adding an optional element “types”, which would introduce validation on types to prevent typos. It might be also used to short-hand name/type, so this would be equivalent to your example:

      { types : [“person”, “place”, “thing”],
      entities : [
      { person: “Tom”, …},
      { person: “Bob”, …},
      { place: “Tom\’s house”, …},
      { thing: “Tom\’s motorcycle”, …}

      But using those long cryptic keys is an unintelligible maintenance nightmare! Your original idea with name lookup was much more appealing. This could again be short-handed: since every edge needs a source, you can put name/type or my above proposal directly. And if you again predefine rels, they might be used as short-hand keys for the target:

      rels : [“lives at”, “residence”, “painted”, ‘knows’],
      edges : [
      { person: “Tom”, “lives at”: { place: “Tom\’s house” }, since: 1998 },
      { person: “Bob”, painted: “Tom\’s house” }, // unique, no type needed
      { person: “Bob”, knows: “Tom”, since: “2012-07-15”},
      { place: “Tom\’s house”, residence: “Tom”}

      • The reason I switched to the sha1 crypto id was to reduce the odds of having a collision when compared to the previous method, but I can expose the _index method to be extended for custom index generation. The idea was that the entities would already have an ID supplied by the server, and only the ones created on the client would have the client-side sha1 ID, and only until the client-to-server communication was complete and the server had supplied the new ID. Graph.js is intended to be a framework that runs on both node.js and in a browser and communicates between the 2 via Web Sockets. The database server would notify all client databases when entities are added/changed/deleted so that all views provide the most current info to the user’s without page refreshes/Ajax calls. There is still a lot to do for implementation, but I feel what I have is a great start. For your application, it seems that GraphDBjs is a better choice.

      • Thanks for your detailed reply. But I’m not sure what to make of it. Are you saying that you will be maintaining the two forks separately?

        Key collision is a matter of DB design. The user needs to feed a unique key over one or few attributes. Hardwiring this as name and type is one possibility, not necessarily the best. I may get a chance to play with Neo4j soon, I’ll be wiser then 😉

        Actually I’m leaning towards generating the graph in Perl (way cooler than JS) so syntax is not such a big issue. However that would mean: calling a JS function to obtain the future key is not possible. Even if it allows JS’s more liberal-than-json syntax, it should still be a language independent exchange format!

        I’m not sure I ‘ll need a server, maybe only as an update pusher. Any changes to my DB are either in the real data and read-only for the user, or in viewing statistics local to each client.

    • Graph.js is a graph database; from what I can tell about levelgraph, it is a triplestore. Graph.js is also aimed at making the syntax more inline with something like jQuery; levelgraph doesn’t seem to be concerned with that, so there may be a bit steeper learning curve when getting started with levelgraph than Graph.js. Levelgraph seems a bit further polished than Graph.js, likely because they have a team of developers, and Graph.js has just me, always looking for other contributors though, if you’re interested.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s