Convert HTML to ENML for Evernote – a non-trivial process

As the title implies, this post is for Evernote hackers.

This function is technically one of the most crucial parts of Cheeatz, an editor to Evernote your code with Gists and Markdown.

So my use case is: convert the HTML generated by Gist’s JavaScript and by Markdown into ENML, and save it in Evernote.

(Converting Gist’s JavaScript and Markdown into HTML is another non-trivial process, which is outside the scope of this article.)

There are nice JavaScript libraries for working with Evernote, namely the official SDK. For manipulating ENML, I recommend enmljs by berryboy. It is a simple and handy utility.

enml.js has useful and well-named methods: enml.PlainTextOfENML, enml.HTMLOfENML, etc.

The only thing missing is ENMLOfHTML(), which is the one I need.

enmlOfHtml

GitHub: enmlOfHtml

HTML to ENML: an <a> to an <a>, a <p> to a <p>. Sounds easy. Only later did I find out that this is indeed a non-trivial process. I hacked it together anyway and published it at the link above.

Usage:

var enmlOfHtmljs = require('enmlOfHtml');

var html = '<html><p>put html here</p></html>';

// ENML is valid ENML that you can send to Evernote to create a note
enmlOfHtmljs.ENMLOfHTML(html, function (err, ENML) {
    if (err) {
        // conversion failed, so there is nothing to send
        return console.error(err);
    }
    console.log(ENML);
});

You can go straight to trying it, but understanding how it works is highly recommended.

Before going into details, we need to understand the process of saving a note in Evernote.

I won’t cover auth, tokens, etc., which you can read about in their documentation. I will focus on ENML.

ENML

ENML is based on a subset of XHTML. There are rules and a schema to follow, with permitted and prohibited elements, all of which are listed in the Evernote developer documentation.

What needs to be done to convert HTML into ENML for the Evernote server

From the documentation,

  1. Convert the document into valid XML
  2. Discard all tags that are not accepted by the ENML DTD
  3. Convert tags to the proper ENML equivalent (e.g. BODY becomes EN-NOTE)
  4. Validate against the ENML DTD
  5. Validate href and src values to be valid URLs and protocols

XML

As step 1 says, the first thing you need to do is write XML.

Here I used xml-writer, which enmljs also uses.

DOM or not?

Some libraries write XML using a tree-like structure or a DOM-like API. In my experience there is a performance penalty for emulating the DOM on the Node side (e.g. with jsdom), so I chose to write the HTML out directly.

I have been trying libxmljs, but I don’t see an advantage in using it for building XML at the moment. However, I believe it is a nice choice for parsing.

Since this uses purely regex, this part should work on both the client side and the server side.

Don’t escape the HTML!

One caveat is that you need to use writeRaw to write the markup, otherwise the HTML will be escaped.
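To make the difference concrete, here is a minimal, dependency-free sketch of what escaping does to the note body. The helpers `escapeXml` and `wrapInEnml` are hypothetical names made up for this illustration; they are not part of xml-writer or enmlOfHtml. With xml-writer, the raw branch corresponds to writing through writeRaw.

```javascript
// Hypothetical helper: roughly what an escaping text writer does to its input.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap already-cleaned markup in the ENML envelope.
// raw === true keeps the markup as-is, raw === false escapes it first.
function wrapInEnml(bodyMarkup, raw) {
  var content = raw ? bodyMarkup : escapeXml(bodyMarkup);
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\n' +
    "<en-note>" + content + "</en-note>"
  );
}

var markup = "<p>hello</p>";
console.log(wrapInEnml(markup, false)); // the tags end up as visible text in the note
console.log(wrapInEnml(markup, true));  // the tags stay markup
```

The escaped version is still valid XML, which is why the mistake is easy to miss: the note is created fine, it just shows the tags as text.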

Clean up your HTML

Steps 2 and 3 are the tricky part. Doing it with regex alone could be painful, but luckily I found the module node-resanitize.

I modified the library to support options for which attributes to escape.

Also remember to replace body with en-note.
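As a rough idea of what attribute cleanup involves, here is a regex-only sketch. It is not the node-resanitize API; `stripProhibitedAttributes` is a hypothetical helper, and the attribute list is a partial one assumed from Evernote's documentation, so check it against the current DTD before relying on it.

```javascript
// Partial, assumed list of attributes that ENML rejects (id, class, on* handlers, ...).
var PROHIBITED_ATTR_RE =
  /\s+(?:id|class|accesskey|data|dynsrc|tabindex|on[a-z]+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi;

// Hypothetical helper: remove the attributes above from every opening tag.
function stripProhibitedAttributes(html) {
  // Only look inside opening tags so that text content is left alone.
  return html.replace(/<[a-z][^>]*>/gi, function (tag) {
    return tag.replace(PROHIBITED_ATTR_RE, "");
  });
}

console.log(
  stripProhibitedAttributes('<p id="a" class="b" style="color: red" onclick="go()">hi</p>')
);
```

Being regex based, it can be fooled by a ">" or by something like " class=x" inside another attribute's value, which is one reason a real sanitizer is the safer choice.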
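The body replacement can be sketched with a single regex. `bodyToEnNote` is a hypothetical helper written for this post, not a function exported by enmlOfHtml, and it assumes the document has at most one body element.

```javascript
// Hypothetical helper: keep only what is inside <body> and rename it to <en-note>.
// Attributes on <body> are dropped; everything outside it (html, head, ...) is discarded.
function bodyToEnNote(html) {
  var match = /<body\b[^>]*>([\s\S]*?)<\/body\s*>/i.exec(html);
  // No <body> found: assume the input is already a fragment.
  // Any <html>/<head> wrapper would still have to be removed separately.
  var inner = match ? match[1] : html;
  return "<en-note>" + inner + "</en-note>";
}

console.log(
  bodyToEnNote('<html><head><title>t</title></head><body class="x"><p>hi</p></body></html>')
);
```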

CSS!

This is one of the most non-trivial parts. The logic goes:

  1. There is a linked style sheet in the HTML (as in a gist).
  2. ENML doesn’t support the link tag.
  3. Luckily, the style attribute is supported on most tags.

Inline it!

So you need to extract that style sheet (downloading it if needed) and inline its rules as style attributes.

Luckily, there is a bigger audience for this problem. Another place with similar requirements is something you use day to day: email.
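To show what inlining means, here is a toy sketch that only understands bare tag selectors and single-class selectors. It ignores specificity, combinators, comments and @-rules, and it is not how Styliner works internally; `parseSimpleCss` and `inlineSimpleCss` are hypothetical names used only here.

```javascript
// Split "selector { declarations }" rules; nested blocks and comments are not supported.
function parseSimpleCss(css) {
  var rules = [];
  var ruleRe = /([^{}]+)\{([^{}]*)\}/g;
  var m;
  while ((m = ruleRe.exec(css)) !== null) {
    // Normalise whitespace and make sure the declarations end with ";".
    var style = m[2].trim().replace(/\s+/g, " ").replace(/;?$/, ";");
    rules.push({ selector: m[1].trim(), style: style });
  }
  return rules;
}

// Copy matching declarations into a style attribute on each opening tag.
function inlineSimpleCss(html, css) {
  var rules = parseSimpleCss(css);
  return html.replace(/<([a-z][a-z0-9]*)\b([^>]*)>/gi, function (tag, name, attrs) {
    var classMatch = /\bclass\s*=\s*"([^"]*)"/i.exec(attrs);
    var classes = classMatch ? classMatch[1].split(/\s+/) : [];
    var styles = rules.filter(function (rule) {
      return rule.selector.charAt(0) === "."
        ? classes.indexOf(rule.selector.slice(1)) !== -1
        : rule.selector.toLowerCase() === name.toLowerCase();
    }).map(function (rule) { return rule.style; });
    if (styles.length === 0) return tag;

    var sheet = styles.join(" ");
    var styleRe = /\bstyle\s*=\s*"([^"]*)"/i;
    if (styleRe.test(attrs)) {
      // Existing inline declarations go last so they still win.
      return "<" + name + attrs.replace(styleRe, function (whole, existing) {
        return 'style="' + sheet + " " + existing + '"';
      }) + ">";
    }
    var selfClosing = /\/\s*$/.test(attrs);
    var rest = attrs.replace(/\/?\s*$/, "");
    return "<" + name + rest + ' style="' + sheet + '"' + (selfClosing ? " />" : ">");
  });
}

console.log(
  inlineSimpleCss('<p class="k">x</p><span>y</span>', ".k { color: red } span { font-weight: bold; }")
);
```

Everything this sketch skips is exactly what makes the problem non-trivial, so a dedicated library is the better choice for real style sheets.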

So there are some good libraries out there. Styliner is excellent.

However, it uses Q and delivers its result asynchronously, which is why enmlOfHtml hands its result to a callback as well.
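The shape of that hand-off looks roughly like the sketch below. It uses a native Promise as a stand-in for Q, and both `inlineStylesAsync` and `enmlOfHtmlSketch` are hypothetical names, not the real Styliner or enmlOfHtml functions.

```javascript
// Hypothetical stand-in for a promise-returning inliner.
function inlineStylesAsync(html) {
  return Promise.resolve(html.replace("<p>", '<p style="margin: 0;">'));
}

// Node-style wrapper: the promise's outcome is passed on as callback(err, result).
function enmlOfHtmlSketch(html, callback) {
  inlineStylesAsync(html).then(
    function (inlined) {
      callback(null, "<en-note>" + inlined + "</en-note>");
    },
    function (err) {
      callback(err);
    }
  );
}

enmlOfHtmlSketch("<p>hi</p>", function (err, enml) {
  if (err) {
    return console.error(err);
  }
  console.log(enml);
});
```

Passing both handlers to a single then() means an exception thrown inside the caller's callback is not fed back into the same callback as an error.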

Note the 5th step – values in href and src must be valid URLs and protocols

This is what I missed, and it caused a bug.

At the time of writing, GitHub changed their JavaScript to render one of the links without the gist domain, which is actually a bug.

So instead of href="https://gist.github.com/vincent....", there is href="/vincent..."

Then, when a user tries to create a note on my site, it fails, because the call to the create-note API returns an error:

{ errorCode: 11, 

parameter: 'Error processing document: Invalid a href attribute:vincentlaucy/5548010/raw/29e88cc4f84422df5febadf93b10227f4c894c9b/gistfile1.js' } 

By trial and error I found that, for Evernote to accept your ENML, the value must at least contain a protocol followed by ://

A similar implementation exists in Sanitize, where you can pass options for which protocols to accept (e.g. ftp://, http://, etc.), except that it is client side only.

Values that fail this check should either be removed or be replaced with a default or the current domain in order to pass validation.

I added a simple regex for that purpose.

Side-track: this is why you should always write “learning tests” against an external API.
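A sketch of such a regex-based fix is shown below. It is not the exact regex from enmlOfHtml; `absolutizeUrls` and its `baseUrl` parameter are hypothetical, and it only handles double-quoted attribute values.

```javascript
// Hypothetical helper: make href/src values absolute, and drop the ones that cannot be fixed.
function absolutizeUrls(html, baseUrl) {
  return html.replace(/\s(href|src)\s*=\s*"([^"]*)"/gi, function (attr, name, value) {
    if (/^[a-z][a-z0-9+.-]*:\/\//i.test(value)) {
      return attr; // already has protocol://, keep as-is
    }
    if (value.slice(0, 2) === "//") {
      return " " + name + '="https:' + value + '"'; // protocol-relative URL
    }
    if (value.charAt(0) === "/") {
      return " " + name + '="' + baseUrl + value + '"'; // path on the source domain
    }
    return ""; // anything else (javascript:, bare paths, ...): remove the attribute
  });
}

console.log(
  absolutizeUrls('<a href="/vincentlaucy/5548010">gist</a>', "https://gist.github.com")
);
```

Removing the attribute is the conservative default here. Whether Evernote accepts other schemes is not something this sketch knows, so the accepted-protocol test may need widening.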

Make it better: Local Validation

I haven’t mentioned step 4 yet: validation.

As mentioned in Evernote’s documentation:

Note: While it is possible to rely on the Evernote Cloud API to validate the ENML of your notes, we recommend downloading the DTD file (linked above) and use it to validate your note’s XML within your app. A few reasons this is a good idea:

  • Note validation will be much faster when performed locally.
  • Note validation can be performed offline.
  • The results of validating your notes locally will be the same as if you were to rely on the Evernote Cloud API to validate your ENML.

So Evernote uses a DTD rather than an XSD. I googled a bit for DTD validation in Node, but there seems to be no JavaScript library available at the moment. Let me know if you find one.
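Until a real DTD validator turns up, a cheap local pre-check can at least catch the most obvious problems before calling the API. The sketch below is not DTD validation; `findProhibitedElements` is a hypothetical helper, and its element list is a partial one assumed from Evernote's documentation.

```javascript
// Partial, assumed list of elements that ENML prohibits; extend it from the DTD.
var PROHIBITED_ELEMENTS = [
  "script", "style", "link", "iframe", "form", "input",
  "object", "embed", "html", "head", "body", "meta"
];

// Hypothetical helper: return the prohibited element names found, in order of first appearance.
function findProhibitedElements(enml) {
  var found = [];
  var tagRe = /<\s*([a-z][a-z0-9-]*)/gi; // opening tags only
  var m;
  while ((m = tagRe.exec(enml)) !== null) {
    var name = m[1].toLowerCase();
    if (PROHIBITED_ELEMENTS.indexOf(name) !== -1 && found.indexOf(name) === -1) {
      found.push(name);
    }
  }
  return found;
}

console.log(
  findProhibitedElements('<en-note><script>x</script><p>ok</p><link rel="stylesheet"/></en-note>')
);
```

An empty result only means none of the listed elements were found. The note can still fail the real validation on the server.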

Make it better: more

So I wrote a trivial implementation of this non-trivial process, but there is more worth doing:

  • more test cases
  • make this module support requirejs
  • run it on the client side
  • find or create a module that is good at sanitizing HTML on both the client side and the server side, with generic options

Hope you find it useful.

Happy to tell you this blog post was produced using Cheeatz.