Wednesday, November 25, 2009

JavaScript module loading, the browser and CommonJS

JavaScript module syntax and loading seems like a hot topic at the moment, and here are some thoughts about how to construct module syntax and a loader, with the goal of trying to get to a more universal approach for it. This will be discussed in the context of CommonJS, but browser-based module loaders have existed for a while. All are constrained by the browser in some fashion as listed below. I have a preferred solution, also described in this post.

First, a look at module syntax.

CommonJS is an umbrella for a few different things, including a spec for a module syntax and a standard library of modules. This post is only concerned with its module syntax. A simple example of CommonJS syntax, defining an "increment" module in increment.js:

var add = require('math').add;
exports.increment = function(val) {
    return add(val, 1);
};

How could we build a module loader with this syntax? Here are a couple of options:

1) You parse the module before executing it, looking for require calls. You make sure to fetch those modules and work out the right dependency order in which to execute the modules.

2) You just run the module and when require() is hit, do a synchronous IO operation to load the required module.

For both approaches, use a sandbox or specific context to make sure things like "exports" are defined separately for each module.
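
As a rough sketch of the scanning step in option #1, the require calls can be pulled out of the module source before it runs. The findRequires helper below is hypothetical, and using a regular expression is a simplifying assumption: it only sees string-literal names and can be fooled by require calls inside comments or strings.

```javascript
// Hypothetical helper: collect the module names passed to require() as
// string literals. Computed names such as require(name) are not detected.
function findRequires(source) {
    var pattern = /\brequire\s*\(\s*(["'])([^"']+)\1\s*\)/g;
    var names = [];
    var match;
    while ((match = pattern.exec(source)) !== null) {
        // Record each dependency once, in the order first seen.
        if (names.indexOf(match[2]) === -1) {
            names.push(match[2]);
        }
    }
    return names;
}

// The increment.js example from above, held as a string.
var incrementSource =
    "var add = require('math').add;\n" +
    "exports.increment = function(val) {\n" +
    "    return add(val, 1);\n" +
    "};";

var deps = findRequires(incrementSource);
```

Here deps holds the single name "math". A loader would then fetch that module, scan it the same way, and work out an execution order from the results.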

Both of those options are easy to implement on the server side since you have more control over the IO layer, and can create separate contexts for each module. However, in the browser, things are different. Creating a separate context for each module is tricky. For IO, there are two realistic approaches, and each has its difficulties:
  • XMLHttpRequest (XHR)
  • script tags
XHR allows us to do either approach, #1 or #2. For #1, it can fetch the contents of the module, which we then parse to pull out the dependencies and sort out the right order to execute things. For #2, we could use sync XHR calls, blocking when each require call is seen. However, sync XHR calls in the browser really hurt performance.

This is actually what the default Dojo loader has done for a very long time, and I believe some pathways in Google's Closure library do the same thing. It is always recommended you do a custom build to combine all the modules you need into one file to cut out those XHR calls when you want to go to production.

So path #1 would make more sense with an XHR-based loader. However, for an XHR loader to work, it has to use eval() to bring the module into being. Some environments, like Adobe AIR, do not allow eval(), and it makes debugging hard to do across browsers. Firefox and WebKit have a convention to allow easier eval-based debugging, but it is still not what I consider to be in keeping with traditional script loading in a browser.

Instead of eval, after the XHR call finishes its parsing and module wrapping for context, you could try to create a script tag that has a body set to the modified module source, but this really hurts debugging: if there is an error, the error line number will be some weird line in a gigantic HTML file instead of the line number of the actual module.

Dojo has a djConfig.debugAtAllCosts option that will use sync XHR to pull down all the modules, parse them for dependencies, work out the right load order, then load each module via a dynamically added script src="" tag. However, IE and WebKit will evaluate dynamically added script tags out of DOM order -- they evaluate them in network receive order (which is nice for long-polling comet apps, but does not help module loading). So, each script tag has to be added one at a time, waiting for each one to finish before adding the next. Not so speedy.

XHR is also normally limited to just accessing the same host as the web page. This makes it hard to use CDNs to load content, and get performance benefits with that approach. There is now support for xdomain XHR in most recent browsers, but IE prefers to use a non-standard XDomainRequest object, making our module loader more complicated. And xdomain XHR just plain does not work in older browsers like IE6.

So, an XHR-based loader is not so great.

Script tags are nice because they keep with the known script pathway in browsers -- easy to debug, and we can get parallel loading. However, we cannot do approach #1 as described in the browser: our JavaScript in the browser cannot access the module contents before they are evaluated. And since dynamically added script src="" tags via head.appendChild() are not a synchronous operation, approach #2 will not work either.

So, really we need to do a variant of #1: pull out the dependencies needed by the module, then after those dependencies are loaded, execute the module. The way to do this in script: put a function wrapper around the module contents, and call a module loader function with a list of dependencies and the module function wrapper. Here is a syntax (call it Variant A) for defining a module "c" that has dependencies "a" and "b" with this approach:

loader(
    "c",
    ["a", "b"],
    function(a, b) {
        //The module definition of "c" in here.
        //return an object to define what "c" is.
        return {};
    }
);

or, another variant, call it Variant B:

loader({
    name: "c",
    dependencies: ["a", "b"],
    module: function(a, b) {
        //The module definition of "c" in here.
        //return an object to define what "c" is.
        return {};
    }
});

Ideally, we would not have to tell the loader that this structure defines module "c" (the first arg in Variant A and the name: property in Variant B) -- the loader could work this out. Unfortunately, since script tags can load asynchronously and at least IE can trigger script.onload events out of order when compared to when the script is actually evaluated, we need to keep the module name as part of the module definition. This also helps with custom builds, where you can combine a few of these module definition calls into one script.
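
To make the wrapper idea concrete, here is a minimal sketch of what the receiving side of Variant A could look like. It is not RunJS or Dojo code: the names defined, waiting and runReady are made up for illustration, and the sketch does no script loading at all. It only shows how a module function can be held back until every dependency it names has been defined, regardless of the order in which the definitions arrive.

```javascript
// Hypothetical minimal registry for Variant A. No network IO is done here.
var defined = {};   // module name -> value returned by its module function
var waiting = [];   // definitions whose dependencies are not all defined yet

function isDefined(name) {
    return Object.prototype.hasOwnProperty.call(defined, name);
}

function loader(name, dependencies, moduleFn) {
    waiting.push({ name: name, dependencies: dependencies, moduleFn: moduleFn });
    runReady();
}

// Keep executing waiting modules until no more of them can make progress.
function runReady() {
    var progressed = true;
    while (progressed) {
        progressed = false;
        for (var i = 0; i < waiting.length; i++) {
            var entry = waiting[i];
            if (entry.dependencies.every(isDefined)) {
                var args = entry.dependencies.map(function (dep) {
                    return defined[dep];
                });
                waiting.splice(i, 1);
                defined[entry.name] = entry.moduleFn.apply(null, args);
                progressed = true;
                break;
            }
        }
    }
}

// Definitions arrive out of order, as they might with async script tags.
loader("increment", ["math"], function (math) {
    return {
        increment: function (val) { return math.add(val, 1); }
    };
});
loader("math", [], function () {
    return {
        add: function (a, b) { return a + b; }
    };
});

var result = defined.increment.increment(41);
```

Because "increment" names itself and its dependencies up front, it simply waits in the queue until "math" shows up. A real loader would also need to request the scripts for missing dependencies and deal with circular dependencies, which this sketch ignores.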

This approach is actually what Dojo's xdomain loader has done for a very long time, but with more verbose syntax. However, it requires a custom build to convert modules into this structure. The other option is to use a server-side process to convert the modules on the fly, but I do not feel that is in keeping with the simplicity of normal browser development: just open a text editor, write some script, save, reload, no extra server config/process needed, besides maybe a vanilla web server.

So, I believe that modules should be coded by the developer in this module wrapper format. YUI 3 has taken this approach, and it is the approach I have taken for RunJS too. However, YUI 3 is limited to needing some module dependency metadata files to help it out. It also uses module names that do not map to the actual module's defined name/functions.

OK, back to CommonJS.

As it stands now, I believe the CommonJS format is not suitable for modules in the browser. There have been attempts to get it to work, but the attempts either use a sync XHR loader, or a "transform-on-the-fly" server process to convert the code to a module wrapper similar to Variant B.

I would rather see a module wrapper format that works with the browser natively, that can be hand-authored by developers and that will work with CommonJS modules. CommonJS started out as ServerJS. As ServerJS, the case could be made that supporting browsers may not be an aim of a ServerJS module format. However, with the name change to CommonJS, I believe supporting browsers as a first class citizen is important for CommonJS to get more traction.

So the trick is to come up with a module syntax that has a function wrapper, but is not too wordy with boilerplate. We need some boilerplate, since we need a function wrapper. I believe RunJS has the right approach. The boilerplate is very terse, basically Variant A mentioned above:

run(
    "c",
    ["a", "b"],
    function(a, b) {
        //The module definition of "c" in here.
        //return an object to define what "c" is.
        return {};
    }
);

I can see where there is some bikeshedding on the name "run". I think script() instead of run() is a viable alternative, and I may switch to that in the near future (and rename RunJS to ScriptJS).

I have attempted to engage the CommonJS community by putting up a proposal for an Alternate Module Format.

Progress has been slow, but to be expected: the CommonJS group is trying to do lots of other things like define a standard library and build out implementations. However, I am hopeful we can get something that works for the browser front end developers.

The ideal scenario is that some variant of the above syntax is just adopted as the only CommonJS module format. That would save a lot of conversion work, and I believe it makes things much simpler for CommonJS-compliant loaders. Right now, for CommonJS loaders there is a concept of a require() and require.async() and having to expose Promises for the async stuff. The above format neatly avoids the issue of whether the modules are loaded async or sync and avoids any need for Promises in the module loader. I think it is fine though for modules themselves to use Promises as part of individual module APIs, but at least the loader and module syntax stays simple.

I also do not believe a "module" variable needs to be defined for each module and an exports variable is avoided by returning an object from the module function wrapper.

I can appreciate that the CommonJS folks with modules already written may not like moving to the above syntax. I think it helps in the long run if we can just have one syntax, but in the meantime, I plan on doing the following:
  • Continue to engage the CommonJS community.
  • Build out RunJS, probably rename to ScriptJS in the near future, and use script() instead of run().
  • Write a converter that converts Dojo modules to the RunJS/ScriptJS module syntax. I have something basic working, here is an example of Dojo's themeTester.html using RunJS-formatted dojo/dijit/dojox modules. That example is not bulletproof yet (I used a built version of Dojo which removes some dependency info) and i18n modules have not been converted either. RunJS also has built-in support for i18n modules.
  • Convert Raindrop to use RunJS-formatted dojo and convert the Raindrop modules to that format.
  • Override run.load()/script.load() in server environments so it could be used in CommonJS server implementations.
  • Work on a converter for existing CommonJS modules.
  • Use RunJS/ScriptJS as the module syntax for Blade and/or Dojo 2.0 efforts.
If module syntax/loading is important to you, then please join the discussion list for CommonJS, so we can sort this out. It would be great to get consensus on JavaScript module syntax and loading, and I think CommonJS is the area to do that.

I am happy to adjust some of the syntax in RunJS/ScriptJS to match some consensus, but I strongly prefer a terse format. The existing ones I have seen for server-converted CommonJS modules are too verbose for me, particularly for the common case of defining a module with some dependencies.

Friday, November 20, 2009

Raindrop, CouchDB and data models

Raindrop uses CouchDB for data storage. We are starting to hit some tough issues with how data is stored and queried. This is my attempt to explain them. I am probably not the best person to talk about these things. Mark Hammond, Raindrop's back-end lead, is a better candidate for it. I am hoping that by trying to write it out myself, I can get a better understanding of the issues and trade-offs. Also note that this is my opinion/view, and may not be the view of my employer and work colleagues, etc...

First, what are our requirements for the data?
  • Extensible Data: we want people to write extensions that extend the data.
  • Rollback: we want it to be easy for people to try extensions, but this means some may not work out. We need to be able to roll back an extension by easily removing the data it created.
  • Efficient Querying: We need to be able to efficiently query this data for UI purposes. This includes possibly filtering the data that comes back.
  • Copies: Having copies of the data helps with two things:
    • Replication: beneficial when we think about a user having a Raindrop CouchDB on the client as well as the server.
    • Backup: for recovering data if something bad happens.
How Raindrop tries to meet these goals today

Extensible Data: each back-end data extension writes a new "schema" for the type of data it wants to emit. A schema for our purposes is just a type of JSON object. It has a "rd_schema_id" on it that tells us the "type" of the schema. For instance a schema object with rd_schema_id == "rd.msg.body" means that we expect it to have properties like "from", "to" and "body" on it. Details on how schemas relate to extensions:
  • An extension specifies what input schema it wants to consume, and the extension is free to emit no schemas (if the input schema does not match some criteria), or one or more schemas.
  • Each schema written by an extension is stamped with a property rd_schema_provider = "extension name".
  • All the message schemas are tied together via an rd_key value, a unique, per-message value. Schemas that have the same rd_key value all relate to the same message.
More info is on the Document Model page.

Rollback: Right now each schema is stored as a couch document. To roll back an extension, we just select all documents with rd_schema_provider = "extension name" that we want to remove, and remove them. As part of that action, we can re-run extensions that depended on that data to have them recalculate their values, or to just remove the schemas generated by those extensions.
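
A small sketch may help picture the document model and the rollback step. Only rd_key, rd_schema_id and rd_schema_provider come from the description above. The "rd.msg.seen" schema, the extension names, the string form of rd_key and all the values are assumptions made for illustration, and the functions work on a plain in-memory array rather than talking to CouchDB.

```javascript
// Hypothetical schema documents; every value here is made up.
var docs = [
    { rd_key: "msg-1", rd_schema_id: "rd.msg.body", rd_schema_provider: "ext.email",
      from: "alice@example.com", to: "bob@example.com", body: "Hello" },
    { rd_key: "msg-1", rd_schema_id: "rd.msg.seen", rd_schema_provider: "ext.seen",
      seen: false },
    { rd_key: "msg-2", rd_schema_id: "rd.msg.body", rd_schema_provider: "ext.email",
      from: "carol@example.com", to: "bob@example.com", body: "Hi" }
];

// All schemas that share an rd_key describe the same message.
function schemasForMessage(allDocs, rdKey) {
    return allDocs.filter(function (doc) {
        return doc.rd_key === rdKey;
    });
}

// Rolling back an extension: drop every document that extension provided.
function rollBack(allDocs, providerName) {
    return allDocs.filter(function (doc) {
        return doc.rd_schema_provider !== providerName;
    });
}

var msg1Schemas = schemasForMessage(docs, "msg-1");
var afterRollback = rollBack(docs, "ext.seen");
```

With this data, msg1Schemas holds the two documents for "msg-1", and afterRollback holds only the two documents written by the made-up "ext.email" extension. Re-running extensions that depended on the removed data is left out of the sketch.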

Having each schema as a separate document also helps with the way CouchDB stores data -- if you make a change to a document and save it back, then it appends the new document to the end of the storage. The previous version is still in storage, but can be removed via a compaction call.

If we store all the schemas for a message in one CouchDB document, then it results in more frequent writes of larger documents to storage, making compaction much more necessary.

Efficient Querying: Querying in CouchDB means writing Views. However, a view is like a query that is run as data is written, not when the UI may actually want to retrieve the information. The views can then be very efficient and fast when actually called.

However, the downside is that you must know the query (or have a pretty good idea of it) ahead of time. This is hard since we want extensible data. There may be some interesting things that need to be queried later, but adding a view after there are thousands of documents is painful: you need to wait for couch to run all the documents through the view when you create the view.

Our solution to this, started by Andrew Sutherland and refined by Mark, was to create what we call "the megaview". It essentially tries to emit every piece of interesting data in a document as a row in the view. Then, using the filtering capabilities of CouchDB when calling the view (which are cheap), we can select the documents we want to get.
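
The general shape of such a view can be sketched as a CouchDB map function that emits one row per property. This is not the actual Raindrop megaview: the key layout of [schema id, property name, value] and the choice of which properties to skip are assumptions, and emit() is stubbed out here so the sketch can run outside CouchDB.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
var rows = [];
function emit(key, value) {
    rows.push({ key: key, value: value });
}

// Hypothetical map function in the spirit of the megaview: one row for each
// data property on a schema document. The key layout is an assumption.
function megaviewMap(doc) {
    if (!doc.rd_schema_id) {
        return;
    }
    for (var prop in doc) {
        var isData = Object.prototype.hasOwnProperty.call(doc, prop) &&
            prop.charAt(0) !== "_" &&
            prop.indexOf("rd_") !== 0;
        if (isData) {
            emit([doc.rd_schema_id, prop, doc[prop]], { rd_key: doc.rd_key });
        }
    }
}

megaviewMap({ _id: "doc-1", rd_key: "msg-1", rd_schema_id: "rd.msg.seen", seen: false });
```

The single document above produces one row, keyed on the schema id, the property name "seen" and the value false. Selecting on a key like that when calling the view returns the rd_key of every matching message, but each call can only match one property of one schema.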

Copies: While we have not actively tested it, we planned on using CouchDB's built-in replication support. This was seen as particularly valuable for master-master use cases: when I have a Raindrop CouchDB on my laptop and one in the cloud.

Problems Today

It feels like the old saying, "Features, Quality or Time, pick two", except for us it is "Extensible, Rollback, Querying or Copies, pick three". What we have now is an extensible system with rollback and copies, but the querying is really cumbersome.

One of the problems with the megaview: no way to do joins. For instance, "give me all twitter messages that have not been seen by the user". Right now, knowledge of a message being from twitter is in a different schema document than the schema document that knows if it has been seen by the user. And the structure of the megaview means we can really only select one property at a time on a schema.

So it means doing multiple megaview calls and then doing the join in application code. We recently created a server-side API layer in python to do this. So the browser only makes one call to the server API and that API layer does multiple network calls to CouchDB to get the data, then does the join merging in memory.
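
The in-memory join amounts to intersecting the results of the separate calls on rd_key. Our API layer is written in Python, so the JavaScript below is only an illustration of the idea, with made-up rows standing in for two view responses.

```javascript
// Hypothetical results of two separate view calls. Each row carries the
// rd_key of the message it belongs to.
var twitterRows = [{ rd_key: "msg-1" }, { rd_key: "msg-2" }, { rd_key: "msg-3" }];
var unseenRows = [{ rd_key: "msg-2" }, { rd_key: "msg-3" }, { rd_key: "msg-4" }];

// Join in memory: keep the rd_key values that appear in both result sets.
function intersectByKey(left, right) {
    var inRight = {};
    right.forEach(function (row) {
        inRight[row.rd_key] = true;
    });
    return left.filter(function (row) {
        return inRight[row.rd_key] === true;
    }).map(function (row) {
        return row.rd_key;
    });
}

var unseenTwitter = intersectByKey(twitterRows, unseenRows);
```

Here unseenTwitter ends up holding "msg-2" and "msg-3". The cost is that both full result sets have to be fetched and held in memory before anything can be filtered.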

Possible solutions

Save all schemas for a message in one document and more CouchDB views
Saving all schemas for a message in one document makes it possible to then at least consult one document for both the "type=twitter, seen=false" sort of data, but we still cannot query that with the megaview. It most likely means using more CouchDB views to get at the data. But views are expensive to generate after data has been written. So this approach does not seem to scale for our extensible platform.

This approach means taking a bit more care on rollbacks, but it is possible. It also increases the size of data stored on disk via Couch's append-only model, and will require compaction. With our existing system, we could consider just never compacting.

This is actually the approach we are starting to take. Mark is looking at creating "summary documents" of the data, but the summary documents are based on the API entry points, and the kind of data the API wants to consume. These API entry points are very application-specific, so the summary document generation will likely operate like just another back-end extension. Mark has mentioned possibly just going to one document to store all schemas for a message too.

However, what we have not sorted out is an easier join model: "type=twitter and seen=false". What we really want is "type=twitter and seen=false, ordered by time with most recent first". Perhaps we can get away with a small set of CouchDB views that are very specific and that we can identify up-front. Searching on message type and being seen or unseen, ordered by time, seems like a fairly generic need for a messaging system.
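
One of those specific views might look like the sketch below, assuming a combined per-message document with made-up type, seen and timestamp properties. The map function emits a compound key, and the range query is simulated with a filter and sort so the example runs outside CouchDB.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
var rows = [];
function emit(key, value) {
    rows.push({ key: key, value: value });
}

// Hypothetical map function over a combined per-message document.
// The property names type, seen and timestamp are assumptions.
function byTypeSeenTime(doc) {
    if (doc.type && typeof doc.seen === "boolean" && typeof doc.timestamp === "number") {
        emit([doc.type, doc.seen, doc.timestamp], doc.rd_key);
    }
}

[
    { rd_key: "msg-1", type: "twitter", seen: false, timestamp: 100 },
    { rd_key: "msg-2", type: "email", seen: false, timestamp: 200 },
    { rd_key: "msg-3", type: "twitter", seen: false, timestamp: 300 },
    { rd_key: "msg-4", type: "twitter", seen: true, timestamp: 400 }
].forEach(byTypeSeenTime);

// Simulate a range query on the ["twitter", false, *] prefix, newest first.
var unseenTwitterNewestFirst = rows.filter(function (row) {
    return row.key[0] === "twitter" && row.key[1] === false;
}).sort(function (a, b) {
    return b.key[2] - a.key[2];
}).map(function (row) {
    return row.value;
});
```

With this data the result is "msg-3" followed by "msg-1". Against a real view, the same selection would be a single range request, something like startkey=["twitter", false, {}] and endkey=["twitter", false] with descending=true, since CouchDB keeps view rows sorted by key.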

However, it means that the system as a whole is less extensible. Other applications on the Raindrop platform need to either use our server API model of using the megaview then doing joins in their app API code (may not be so easy to learn/perform), or tell the user to take the hit waiting for their custom views to get up to date with all the old messages.

Something that could help: make CouchDB views less painful to create after the fact. Right now, creating a new view, then changing any document, means waiting for that view to index all the documents in the couch, and it seems to take a lot of resources for this to happen. I think we would be fine with something that started with the most recent documents first and worked backwards in time, using a bit more resources at first, but then tailing off and doing more of it in the background, while allowing the view to return data for things it has already seen.

Do not use CouchDB
It would be very hard for us to move away from CouchDB, and we would likely try to work with the CouchDB folks to make our system work best with couch and vice versa. It is helpful though to look at alternatives, and make sure we are not using a hammer for a screwdriver.

Schema-less storage is a requirement for our extensible platform. Something that handles ad-hoc queries better might be nice, since we basically are running ad-hoc queries with our API layer now, in that they have to do all the join work each time, for each request.

Dan Goldstein in the Raindrop chat mentioned MongoDB. Here is a comparison of MongoDB and CouchDB. Some things that might be useful:
  • Uses update-in-place, so the file system impact and the need for compaction are lower. Storing all of a message's schemas in one document is likely to work better as a result.
  • Queries are done at runtime. Some indexes are still helpful to set up ahead of time though.
  • Has a binary format for passing data around. One of the issues we have seen is the JSON encode/decode times as data passes around through couch and to our API layer. This may be improving though.
  • Uses language-specific drivers. While the simplicity of REST with CouchDB sounds nice, due to our data model, the megaview and now needing a server API layer means that querying the raw couch with REST calls is actually not that useful. The harder issue is trying to figure out the right queries to do and how to do the "joins" effectively in our API app code.
What we give up:
1) easy master-master replication. However, for me personally, this is not so important. In my mind, the primary use case for Raindrop is in the cloud, given that we want to support things like mobile devices and simplified systems like Chrome OS. In those cases it is not realistic to run a local couch server. So while we need backups, we probably are fine with master-slave. To support the sometimes-offline case, I think it is more likely that using HTML5 local storage is the path there. But again, that is just my opinion.

2) ad-hoc query cost may still be too high. It is nice to be able to pass back a JavaScript function to do the query work. However, it is not clear how expensive that really is. On the other hand, at least it is a formalized query language -- right now we are on the path to inventing our own with the server API with a "query language" made up of other API calls.

Persevere might be a possibility. Here is an older comparison with CouchDB. However, I have not looked in depth at it. I may ask Kris Zyp more about it and how it relates to the issues above. I have admired it from afar for a while. While it would be nice to get other features like built-in comet support, I am not sure it will address our fundamental issues any differently than say, MongoDB. It seems like an update-in-place model is used with queries run at runtime. But definitely worth more of a look.

Something else?

What did I miss? Bad formulation of the problem? Missing design solution with the tools we have now?