Graph Database in Javascript?

This crazy thought came to me after working on some stuff on graph data structure — Let’s implement a graph DB in javascript

(I always have thoughts on putting everything in javascript, while most of the cases someone did it.)

Why

Such db supporting most basic features, transaction, graph query has following purposes.

  • common data visualization

It is very likely you have some graph-based data visualization at front end (d3.js) and a graph database (neo4j) at the back. What if the graph db can output the graph representation that can easily bind to DOM with adapter and visualized ?

  • cache at client side

It is possible to interact with the graph front end, but for any new query you need to send to the back again

What if we can query a subgraph and store it in the js graph db as a cache and support basic query functions. This greatly reduce traffic, reaction time and Dom binding for data visualizations

  • in javascript

Javascript itself is sexy. Ok aside from being nerd about it — it is easy to code, with nice optimization and when you have everything in javascript (mobile,web,nodejs) you don’t want to have a dependency on backend in java or so. We are talking about the future – this is not targeting to replace neo4j

The prototype should have these characteristics

  • in memory

at the moment it is aimed that the graph model itself is represented using javascript objects itself, to facilitate the visualization and simpler coding.

It is possible to use Node.js or local storage to support persistence, but that is not in current scope as the design will be very different

  • open source

of course.

  • Basic

At the moment, it does not have to scale very nice, high availably, recovery which is the real complex part of database

We are talking about a In-memory, light-weight database in javascript that can run in both browser and nodejs

Most I know is from the concise & inspiring book Graph Database

The most crucial point about graph database in my mind is about index-free-adjanceny.

We need to focus on this fundamental problem. Is this doable in javascript?

Neo4j is the reference architecture for discussion. The Graph DB Internals chapter of the book is a highly recommended.

Just one thing neo4j very different is that neo4j is a file-based storage, while currently we discuss a in-memory database prototype. Still, most concepts apply.

Lessons from Neo4j

Firstly some words on index-free-adjanceny —

Graph traversal is cheap because every node has reference to node it is connected to, forward or backward.

In databases that we usually refer to (mostly RDBMS), we can use index to locate a record. (i.e. when look up by id)

This is cheap – but only relatively. When the record transaction id 1 say it is of customer id 100, you need another index lookup, which is o(n * logn), and you know the long painful stories on optimizing database to get the index right

Index-free-adjanency is about in every node you have the reference, i.e. location of the node that is connected to. i.e. O(1), without doing a search, which even indexed takes O(log n)

Neo4j also put this considerations in the file-based storage layer.

Node store file of neo4j is fixed length (9 byte), makes the lookup by id an O(1) operation. (e.g. with id 100 we know it is at 900bytes)

With this context, traversal in graph is fast – because going from node id=1 is to another node id=99 , we chase the “next id” in a node and just move to the known byte location of it – which likely in the same fixed-length

Although how far this concept goes I am not sure, -Transversal across files/machine? as

“User’s view on the graph and actual records on disk are dissimilar.”

It is also worth mentioning the mathematical problem of optimally partitioning graph across in distributed environment is NP-complete problem.

1st class Relationship

Also in Graph DB like neo4j, relationship itself is a “first-class citizen”

Each Node doesn’t simply store pointer of next node, every relationship is store and allow easy lookup. this facilitate going back and forth in the graph (relationship is directed)

Bear in mind a node can connect to multiple nodes. Pointers are also stored in relationship such that we can go to one side’s node ‘s next or previous relationship cheaply.

Wont go into too much details as most important is the idea. Slide 12 here shows the actually record “schema”

In javascript?

we need to consider the language itself.

Naively, it is possible to use object reference to get objects linked.

For example, with

Node Alice —loves—> Node Bob –loves–> Node Charlie

(Sadly in this world of cryptography characters, there is no one love each other)

Alice = {

 "nextRel": Love_1

}

Bob = {

 "nextRel": Love_2

}

Love_1 = {

 "first_node":Alice,

 "second_node":Bob,

 "relationType": "love"

}

However, this may not be sufficient for graph DB – we need low cost traversal.

If I need to go to next node of a, b, there are two steps here: First get the reference of Bob

Alice[“nextRel”][“second_node”] itself is two object lookup, which is not that cheap apparently.

So what is closer? Array?

One of the very characteristic about javascript as you may know, Array is not really an “array” in sense of other languages.

When we learn C, malloc()

Array are consecutive memory locations which makes it really efficient to lookup value and process together.

In Javascript, arrays are Objects. Arrays can be sparsed.

A equivalent data structure don’t really exists in javascript, but the good news is most javascript engine try to make javascript work like that.

We need to know how far this goes and make sure this optimization in place.

In engines like V8, hidden classes are created behind the scence such that property access don’t require a dictionary lookup.

Furthermore, Array Objects are special in the specification – 15.4

Index-based Arrays are optimized.

A for loop on an index-based array is much faster than looping properties in a random object.(http://jsperf.com/performance-of-array-vs-object/3)

Tips on v8 perf by Addy Osmani

There’s really only one major difference between objects and arrays in JavaScript, and that’s the arrays’ magic length property. If you’re keeping track of this property yourself, objects in V8 should be just as fast as arrays.

Array vs Objects is relevant?better traversal model?

So after all there is no difference and we can use a big nested array to mimic the fixed-length records in neo4j?

A Poc is done http://jsperf.com/javascript-graph/2

Testing in Chrome 30.0.1599.101 on OS X 10.7.4
Test Ops/sec
traverse random connected node by index
indexTranverse(large_random_nodes_offset,NODE_COUNT_LARGE);
17.19±1.20%79% slower
traverse random connected node by pointer
pointerTranverse(large_random_nodes, NODE_COUNT_LARGE);
22.95±41.98%80% slower
traverse ordered node by index
indexTranverse(large_ordered_nodes_offset,NODE_COUNT_LARGE);
62.43±83.32%fastest
traverse ordered node by reference
pointerTranverse(large_ordered_nodes, NODE_COUNT_LARGE);
103±26.84%fastest

To make result relevant, we are taking a large record – one million nodes.

Also we have cases where what each node connecting to is randomized.

The result is consistent that using pointer perform better in both ordered / random case in similar ratio, which benefit from reduced array lookup. And access in order really speed up – This implies the major benefit of optimizing is from localized memory access which help mechanisms like caching & reduce swapping bytes in memory.

Memory locality still matters

Thus, locality in array is still important consideration on where nodes should reside.

Some idea to investigate could be buckets of arrays of subgraph, periodic optimization to put connected nodes closer. Meanwhile, this should not limit the graph db to express itself that nodes being able to be freely connected to any node.

One place we can apply array directly is looping the relationships of a node.

just a start

I could be very wrong. This is just an idea to explore.

I don’t even have much experience in practical neo4j at the moment. so I will come back to this and see.

Discussions

There are quite a lot of related discussions / inspiring posts

http://stackoverflow.com/questions/614126/why-is-array-push-sometimes-faster-than-arrayn-value/614255#614255

http://stackoverflow.com/questions/7408734/building-an-object-graph-from-flat-object-array-in-javascript

http://mrale.ph/blog/2012/06/03/explaining-js-vms-in-js-inline-caches.html

So Mozilla’s Pancake is storing user browsing history & stuff in elastic search & neo4j. What if a Chrome extension or Firefox can do some of it directly. https://wiki.mozilla.org/PancakeInfrastructure#PancakeElasticSearchServer