Graph Database in Javascript?

This crazy thought came to me after working on some stuff on graph data structure — Let’s implement a graph DB in javascript

(I always have thoughts on putting everything in javascript, while most of the cases someone did it.)

Why

Such db supporting most basic features, transaction, graph query has following purposes.

  • common data visualization

It is very likely you have some graph-based data visualization at front end (d3.js) and a graph database (neo4j) at the back. What if the graph db can output the graph representation that can easily bind to DOM with adapter and visualized ?

  • cache at client side

It is possible to interact with the graph front end, but for any new query you need to send to the back again

What if we can query a subgraph and store it in the js graph db as a cache and support basic query functions. This greatly reduce traffic, reaction time and Dom binding for data visualizations

  • in javascript

Javascript itself is sexy. Ok aside from being nerd about it — it is easy to code, with nice optimization and when you have everything in javascript (mobile,web,nodejs) you don’t want to have a dependency on backend in java or so. We are talking about the future – this is not targeting to replace neo4j

The prototype should have these characteristics

  • in memory

at the moment it is aimed that the graph model itself is represented using javascript objects itself, to facilitate the visualization and simpler coding.

It is possible to use Node.js or local storage to support persistence, but that is not in current scope as the design will be very different

  • open source

of course.

  • Basic

At the moment, it does not have to scale very nice, high availably, recovery which is the real complex part of database

We are talking about a In-memory, light-weight database in javascript that can run in both browser and nodejs

Most I know is from the concise & inspiring book Graph Database

The most crucial point about graph database in my mind is about index-free-adjanceny.

We need to focus on this fundamental problem. Is this doable in javascript?

Neo4j is the reference architecture for discussion. The Graph DB Internals chapter of the book is a highly recommended.

Just one thing neo4j very different is that neo4j is a file-based storage, while currently we discuss a in-memory database prototype. Still, most concepts apply.

Lessons from Neo4j

Firstly some words on index-free-adjanceny —

Graph traversal is cheap because every node has reference to node it is connected to, forward or backward.

In databases that we usually refer to (mostly RDBMS), we can use index to locate a record. (i.e. when look up by id)

This is cheap – but only relatively. When the record transaction id 1 say it is of customer id 100, you need another index lookup, which is o(n * logn), and you know the long painful stories on optimizing database to get the index right

Index-free-adjanency is about in every node you have the reference, i.e. location of the node that is connected to. i.e. O(1), without doing a search, which even indexed takes O(log n)

Neo4j also put this considerations in the file-based storage layer.

Node store file of neo4j is fixed length (9 byte), makes the lookup by id an O(1) operation. (e.g. with id 100 we know it is at 900bytes)

With this context, traversal in graph is fast – because going from node id=1 is to another node id=99 , we chase the “next id” in a node and just move to the known byte location of it – which likely in the same fixed-length

Although how far this concept goes I am not sure, -Transversal across files/machine? as

“User’s view on the graph and actual records on disk are dissimilar.”

It is also worth mentioning the mathematical problem of optimally partitioning graph across in distributed environment is NP-complete problem.

1st class Relationship

Also in Graph DB like neo4j, relationship itself is a “first-class citizen”

Each Node doesn’t simply store pointer of next node, every relationship is store and allow easy lookup. this facilitate going back and forth in the graph (relationship is directed)

Bear in mind a node can connect to multiple nodes. Pointers are also stored in relationship such that we can go to one side’s node ‘s next or previous relationship cheaply.

Wont go into too much details as most important is the idea. Slide 12 here shows the actually record “schema”

In javascript?

we need to consider the language itself.

Naively, it is possible to use object reference to get objects linked.

For example, with

Node Alice —loves—> Node Bob –loves–> Node Charlie

(Sadly in this world of cryptography characters, there is no one love each other)

Alice = {

 "nextRel": Love_1

}

Bob = {

 "nextRel": Love_2

}

Love_1 = {

 "first_node":Alice,

 "second_node":Bob,

 "relationType": "love"

}

However, this may not be sufficient for graph DB – we need low cost traversal.

If I need to go to next node of a, b, there are two steps here: First get the reference of Bob

Alice[“nextRel”][“second_node”] itself is two object lookup, which is not that cheap apparently.

So what is closer? Array?

One of the very characteristic about javascript as you may know, Array is not really an “array” in sense of other languages.

When we learn C, malloc()

Array are consecutive memory locations which makes it really efficient to lookup value and process together.

In Javascript, arrays are Objects. Arrays can be sparsed.

A equivalent data structure don’t really exists in javascript, but the good news is most javascript engine try to make javascript work like that.

We need to know how far this goes and make sure this optimization in place.

In engines like V8, hidden classes are created behind the scence such that property access don’t require a dictionary lookup.

Furthermore, Array Objects are special in the specification – 15.4

Index-based Arrays are optimized.

A for loop on an index-based array is much faster than looping properties in a random object.(http://jsperf.com/performance-of-array-vs-object/3)

Tips on v8 perf by Addy Osmani

There’s really only one major difference between objects and arrays in JavaScript, and that’s the arrays’ magic length property. If you’re keeping track of this property yourself, objects in V8 should be just as fast as arrays.

Array vs Objects is relevant?better traversal model?

So after all there is no difference and we can use a big nested array to mimic the fixed-length records in neo4j?

A Poc is done http://jsperf.com/javascript-graph/2

Testing in Chrome 30.0.1599.101 on OS X 10.7.4
Test Ops/sec
traverse random connected node by index
indexTranverse(large_random_nodes_offset,NODE_COUNT_LARGE);
17.19±1.20%79% slower
traverse random connected node by pointer
pointerTranverse(large_random_nodes, NODE_COUNT_LARGE);
22.95±41.98%80% slower
traverse ordered node by index
indexTranverse(large_ordered_nodes_offset,NODE_COUNT_LARGE);
62.43±83.32%fastest
traverse ordered node by reference
pointerTranverse(large_ordered_nodes, NODE_COUNT_LARGE);
103±26.84%fastest

To make result relevant, we are taking a large record – one million nodes.

Also we have cases where what each node connecting to is randomized.

The result is consistent that using pointer perform better in both ordered / random case in similar ratio, which benefit from reduced array lookup. And access in order really speed up – This implies the major benefit of optimizing is from localized memory access which help mechanisms like caching & reduce swapping bytes in memory.

Memory locality still matters

Thus, locality in array is still important consideration on where nodes should reside.

Some idea to investigate could be buckets of arrays of subgraph, periodic optimization to put connected nodes closer. Meanwhile, this should not limit the graph db to express itself that nodes being able to be freely connected to any node.

One place we can apply array directly is looping the relationships of a node.

just a start

I could be very wrong. This is just an idea to explore.

I don’t even have much experience in practical neo4j at the moment. so I will come back to this and see.

Discussions

There are quite a lot of related discussions / inspiring posts

http://stackoverflow.com/questions/614126/why-is-array-push-sometimes-faster-than-arrayn-value/614255#614255

http://stackoverflow.com/questions/7408734/building-an-object-graph-from-flat-object-array-in-javascript

http://mrale.ph/blog/2012/06/03/explaining-js-vms-in-js-inline-caches.html

So Mozilla’s Pancake is storing user browsing history & stuff in elastic search & neo4j. What if a Chrome extension or Firefox can do some of it directly. https://wiki.mozilla.org/PancakeInfrastructure#PancakeElasticSearchServer

Convert HTML to ENML for Evernote – a non-trivial process

As title implies, this post is for Evernote hackers.

This function is technically one of the most crucial part for making Cheeatz, an editor to Evernote your Code with Gists and Markdown.

so my use case is: convert the HTML generated by Gist’s javascript and Markdown into ENML and save in Evernote.

(convertinng gist’s javascript and markdown into HTML is another non-trivial process, which is not scope of this article)

There are nice javascript libraries to work with Evernote, namely the official sdk. For manipulating ENML, I recommen enmljs by berryboy. This is a simple & handy util.

So enml.js has useful and well-named methods – enml.PlainTextOfENMLenml.HTMLOfENML, etc

the only thing missing is ENMLOfHTML(), which I need.

enmlOfHtml

Github: enmlOfHtml

html to enml, <a> to an <a><p> to a <p>, Sounds easy. Only later I found this is indeed a non-trivial process. I hacked it anyway and put it as above.

usage:

var enmlOfHtmljs = require('enmlOfHtml');

var html = '<html><p>put html here</p></html>';

//ENML is valid ENML that you can send to evernote for creation

enmlOfHtmljs.ENMLOfHTML(html,function(err,ENML){

    console.log(ENML);

});

You can go straight to try it, but a understanding is highly recommend:

Before go in to details, we need to understand the process of saving a note in Evernote.

I won’t cover those auth,token, etc where you can read in their documentation, but focus on ENML.

ENML

ENML is based on a subset of XHTML. There are rules and schema to follow, permitted and prohibited element which can be read here

What need to be done to convert HTML into ENML in Evernote server

From the documentation,

  1. Convert the document into valid XML
  2. Discard all tags that are not accepted by the ENML DTD
  3. Convert tags to the proper ENML equivalent (e.g. BODY becomes EN-NOTE)
  4. Validate against the ENML DTD
  5. Validate href and src values to be valid URLs and protocols

XML

As in step 1, the basic thing is you need to write XML.

here I used xml-writer which enmljs used

Dom or Not?

Some library write xml using tree-like structure or with DOM-likeapi. From my experience there is performance punishment to emulate the dom at node side (e.g. with jsdom). I choose to write those HTML straight

I have been trying with libxmljs, but I dont see advantage using it at the moment for building XML. However I believe for parsing this one is nice.

Since this use purely regex, this part should work in both client side and server side.

Dont escape those HTML!

One Caveat is you need to writeRaw to write characters, otherwise HTML will be escaped

Clean up your HTML

Then step 2 & 3 is the tricky part. Doing it with regex alone could be painful, but luckily I found this

module node-resanitize

I modifiy the library to support options on what attributes to escape.

also remember to replace body with en-note

CSS!

This is one of the most non-trivial part which is logically:

  1. there is link style sheet in HTML (as in gist)
  2. ENML dont support link tag.
  3. Luckily, style attribute is supported in most tags.

inline it!

=>so you need to extract that style sheet (download if needed) and inline it as attribute.

luckily, there is a bigger audience for this problem. Another place posing similar requirements are what you use day to days, Email.

So there are some good libraries out there. Styliner is excellent.

Meanwhile, it used Q and result is returned inside the callback, and this make this enmlOfHtml put result into callback as well.

Note the 5th step – values in href and src must be valid URLs and protocols

This is what I missed and somehow created a bug.

At the time of writing, github changed their javascript to render one of the link without the gist domain –>actually a bug

so instead of href="https://gist.github.com/vincent....", there is href="/vincent..."

Then when user try to create Note in my site, it fails as when I call the create Note api there is an error

{ errorCode: 11, 

parameter: 'Error processing document: Invalid a href attribute:vincentlaucy/5548010/raw/29e88cc4f84422df5febadf93b10227f4c894c9b/gistfile1.js' } 

With try and error, to get Evernote accept your ENML, it must start with :// at least

Some Similar implementation is in Sanitize, where you can pass options on what to accept (e.g. ftp://, http:// etc), just it is client side.

These values should be either removed / replace with default / current domain to pass the validation.

I put a simple regex for that purpose.

Side-track: this is why you should always write “learning test” against external api

Make it better: Local Validation

I didnt mention step 4- validation

As metnioned in Evernote’s Docuemntation

Note: While it is possible to rely on the Evernote Cloud API to validate the ENML of your notes, we recommend downloading the DTD file (linked above) and use it to validate your note’s XML within your app. A few reasons this is a good idea:

  • Note validation will be much faster when performed locally.
  • Note validation can be performed offline.
  • The results of validating your notes locally will be the same as if you were to rely on the Evernote Cloud API to validate your ENML.

So Evernote is using DTD but not XSD, I googled a little bit on using node for DTD validation, however seems no javascript library available at the moment. Let me know if you found one.

Make it better: more

so I put a trivial implementation for this non-trivial process, but more worth to be done

  • test casessss
  • make this module support requirejs
  • it on client side
  • find/create a module that is good at both client side and server side HTML sanitize, with generic options

Hope you find it useful.

Happy to tell you this blog post produced using Cheeatz

Syntax-Specific Key Binding in sublime Text 2 – e.g. fix javascript reindent with jsFormat and Ctrl+Shift+F

Sublime Text 2 is great. period.

In short

to use Cmd-Shift-F to reindent javascript with jsFormat without affecting others, put this in the Key Bindings :

Story:

Again this supposingly simple question took me some time to figure out

My use case is: I found there is a bug in Sublime Text 2 for reindent javascript which

whenever there is a comment line, the indentation of next bracket follow the indent of the comment line instead of matching bracket

Wrong:
i.e.
          funciton(){
//something
};

JSFormat

So i installed the package JsFormat which is also closer to the my style

Then there is another problem — it is not overriding the reindent but create another command ‘jsFormat’ thus I cannot use my usual hotkey cmd + shift + f to reindent but need to use the default ctrl+alt+F. Even I change it to cmd+shift+f, other language (syntax) will be affected

The Settings - More -> Syntax-specific is for setting but not key bindings

Scope!

Finally figure out the scope concept from this SO Thread
so the scope for js is scope.js not scope.javascript

The correct way to know the scope is to check syntax’s .tmLanguage file for the key scopeName

seems the alternative -plugin ScopeHunter is not working for me

Originally I thought it is the bug, but it works now

Related SO: http://stackoverflow.com/questions/16520165/sublime-text-2-varying-macro-per-syntax

My version

Version:2.0.1, Build2217 OS: MacOSX 10.7.4

Browser automation testing in javascript- selenium, webdriverJs

I am adding automation testing the for my application and I have also been the online testing “champion” in my workplace (Java).

As usual, Concept first

This is stil a relative fresh field. There are quite a lot of concerns, I try to write one by one and be concise.

Why browser automation testing

I am not very expereinced in unit-testing or javascript. IMO, it may be a bit too expensive, especially when you have simple code and what you focus is the webpage being shown to users and things work there. It is expensive in a sense that comparing to JAVA, again IMO, it is harder to yield in automated refactoring and it is easily coupled with DOM.

There are nice behaviour testing framework like Jasmine. Some of them are also possible to run in browsers. But still that may not be the same as browser automation testing, which test against USER ACTIONS in browsers. You use the real page to test can care less about the fixtures.

Selenium

Nowadays selenium, from Google, is the de facto standard for that. I am used to it in Java and it is roboust, easy to write and scalable. Even run in headless Linux is possible . Alternatives: JsTestServer

Why browser automation testing with JS

I am building couchdb app (with node.js tools) and simpler, javascript is the language of web.

Selenium Javascript port (client/binding)

There are ports to use selenium in javascript. Almost all are based on the OfficialJsonWireProtocol

So the story is: your javascript test code run with the selenium webdriver client, which communicates with the selenium server via the JsonWireProtocol, which then issue requests to the connected browsers and carry out the actions.

Frameworks out there..

At first I was quite confused as they all name as webDriverJs. So some quick facts.

All are available as node component, and in NPM:

official selenium-webdriver-js – actively developing

Camme/WebdriverJs – last commit Sep2012

**Update: per author’s comment it is now active again :)**

WD.js – actively developing

Some footprint analysis in github: not very popular in wild

  • selenium-webdriver 38 code results
  • webdriverjs: 142 code results

Which to use?

Sometimes it is not the matter of which API to use, rather it is more on understanding the tradeoff and how you make them decouple. IMO things are still not very mature today and I am thus suggesting for a even simpler framework. Lets dive in more…

Browser Automation Testing Concerns

These CONCERNS should be DECOUPLED.

client API

Basically all the clients framework are async and used a queue-like implementation. So your script will finish but it actually just registered the actions for selenium. Then selenium will wait and call the callbacks at right time.

Comparing the official one and Camme/WebdriverJs, the former one is more sophisticated with concept of “Promise” and “Defer”, but this may also be harder to understand and write. Meanwhile the later has no explicit wait method and this is quite a large limitation in testing against ajax.

As you can see in the WebDriverJs wiki, concerns are problem of state, API too verbose, too many nested callback and how they are currently handled. I think the official client did a great job.

It is also for sure that if you use the official API, it is expected to be more updated with selenium..

Selector Style

There are many methods to select an element in selenium. Name/Id/Css Selectors.

The official webdriverJs’s basic API use the common selenium API driver.findElement(webdriver.By.name('q')); where the advantange is it is consistent even you come from Java.

On the other hand it may be quite strange to write in javascript, esp content assist is unlikely in place and the builder pattern makes you hard to debug & see available methods. There is also lack of documentation.

On this I prefer Camme/WebdriverJs which is more jQuery-like.

Also it may be to your surprise, it is possible for selenium to execute javascript. By this I mean you can have the script as a string and let it be executed by selenium server for you, no matter from Java/Python/etc, of course in javascript. This is quite handy in some situations, but some common use case like getting the value back into the place you call could be hard.

jQuery?

Then one might ask is that possible to use jQuery?

First we need to separate the questions into 2: 1. to use in test script 2. to run in browser

For 1, you can always make jquery available in Node.js by require('jquery'). The point is you cant use jQuery selector directly to write the test flow, as actual actions are done by selenium instead

For 2, it may be hard as you can imagine the dependency.

You may argue that jQuery is not necessary because selenium support css selectors already. but some methods are not, like .first() or .show(). I believe this can greatly save developer’s time.

I just tried to check if a element is present in the official webdriver, which is quite hard.

So the tradeoff:consistency amond languages vs native style. I will prefer and suggest to have a selenium client like jQuery as much as possible.

Testing – PageObject

It is imporant to separate test code and automation code.

This official reference to PageObject is extremely helpful.

I tried to write my own js page-object, but that may be harder as to duel with the scope and state.

this js-page-object project also seems interesting.

Testing – assert Style

Afer all this is a test case. Camme/WebdriverJs provides a test mode. This may be very convenient to keep scripts compact and with less dependencies.

Example

    client
    .testMode()
    .init()
    .url("https://github.com")
    .tests.cssPropertyEquals(".login a", "color", "#4183c4", "Color of .login a is #4183c4")
    .tests.titleEquals("Secure source code hosting and collaborative development - GitHub", "Title of the page is 'Secure source code hosting and collaborative development - GitHub'")
    .click(".pricing a")
    .tests.titleEquals("Plans & Pricing - GitHub", "Title of the page is 'Plans & Pricing - GitHub'")
    .tests.visible(".pagehead", true, ".pagehead is visible after click")
    .end();

For me, I used in hybrid with nodeunit.

For the general assert(actual expected) vs assert(expected, actual) war link1 , the former is what in Node and nodeunit. I am from Java but I think I will stick with Node’s approach from now on.

One can use the tearDown() method to make sure the browser is killed after test, even in case of exceptions.

Conclusion

I will try to write more and make it a “white paper”

— suggestion is lets use the official API and look forward to its API wrappers ang page-object implementatio. May make one myself if I have time and I really need it.