Sunday, March 14, 2010

RequireJS, kicking some AST

RequireJS has an optimization tool that can combine and minify your scripts. It uses Google's Closure Compiler to do the minification. Recently, but after the RequireJS 0.8.0 release, I ported over the CSS optimizations from the Dojo build system, so the optimization tool now inlines @import calls and remove comments from CSS files.

The script combining still has some rough edges though, and it was mainly due to me trying to use suboptimal regexp calls to find require() and require.def() calls in the files, so the dependencies for a script could be traced.

So I finally took the dive into Abstract Syntax Trees (ASTs) to do the work. What is an AST? An analogy that works for me: an AST is to JavaScript source as the DOM API is to HTML source. The AST has methods for walking through the nodes in the JS code structure, and you can get properties on a node.

Figuring out how to generate an AST from scratch can be a bit of work, but since I was already using Closure Compiler, I just used an AST it can generate.

Since the optimization tool for RequireJS is written in JavaScript, which makes calls into Java-land to do file access and minification calls, I wanted the same approach for working with the AST -- do my work in JavaScript, but call the Java methods for the AST walking and source transform.

My task was fairly simple -- I just wanted to find require() or require.def() calls that used strings for module names and dependencies, pull those calls out of the file, then just execute those calls to work out the dependencies.

The end result was this file:
http://github.com/jrburke/requirejs/blob/master/build/jslib/parse.js

The basic idea of the script:
//Set up shortcut to long Java package name,
//and create a Compiler instance.
var jscomp = Packages.com.google.javascript.jscomp,
compiler = new jscomp.Compiler(),

//The parse method returns an AST.
//astRoot is a kind of Node for the AST.
//Comments are not present as nodes in the AST.
astRoot = compiler.parse(jsSourceFile),
node = astRoot.getChildAtIndex(0);

//Use Node methods to get child nodes, and their types.
if (node.getChildAtIndex(1).getFirstChild().getType() === CALL) {
//Convert this call node and its children to JS source.
//This generated source does not have comments and
//may not be space-formatted exactly the same as the input
//source
var codeBuilder = new jscomp.Compiler.CodeBuilder();
compiler.toSource(codeBuilder, 1, node);

//Return the JavaScript source.
//Need to use String() to convert the Java String
//to a JavaScript String.
return String(codeBuilder.toString());
}
Thanks to the Closure Compiler team for doing the hard work and open sourcing the code. It looks like Closure Compiler deals with two AST formats -- one is perhaps an older one generated by Rhino, while the other one is a more custom one? It seems like I was getting back the Rhino-based Nodes for the methods I called.

I was tempted to try to go direct to just use Rhino for the AST, but decompiling the AST into source looked harder to do, and from what I recall, Rhino has a newer AST API in the trunk code. I believe the one in Closure Compiler is the older one? All that added up to me being wary of that path.

Most of the time spent was trying to figure out the Java invocations to get the code parsed, understand the tree structure, deal with Java-to-JavaScript translation issues and then figure out the Java invocations to convert a subtree back into source.

I am glad I finally stepped into working with a real AST. While some of the AST calls are a bit awkward (at least for me as a JavaScript person), it is a lot better than trying to use regexps for it. I still need to do more testing, but I feel more confident in the robustness of the solution now.

If you see how I can do it better, point me in the right direction!

4 comments:

Unknown said...

You're right about the two ASTs. Closure Compiler uses the latest version of Rhino for parsing (the "new AST"). An older version of the Rhino ASTs is manipulated by all the optimizations (the "old" AST), so there's a conversion step that converts the new ASTs to the old ASTs.

Generally, you won't see the new ASTs when developing in Closure Compiler; everything you'll be walking or manipulating are the old ASTs.

The Closure Compiler team did this for two reasons. First (and most practically), Closure Compiler was started years ago, so most of the code was written for an older version of the Rhino parser. When the team switched to the latest Rhino, they found two trees had subtle differences in how they represented the same code, so switching all the optimizations over to the new tree would be hard to do correctly. Converting the trees was the easiest way to move to the most recent Rhino parser.

More importantly, compilers generally have two separate representations of the program. The "front end" has all the syntactic checking, and uses a parse tree that's close to the grammar of the language. The "back end" of the compiler is doing all the optimizing, and generally requires a representation of the program that's closer to the actual computations being performed. The plan is for the "old" AST in Closure Compiler to become more of an intermediate representation, and by using something different than Rhino, means that the old AST can be changed to whatever the compiler developers need.

Check out "three address code" and "static single assignment" for details on how traditional compilers deal with the front end / back end representations of the program.

James Burke said...

Robert: thanks for the info, and the pointers to "three address code" and "static single assignment". Definitely new stuff to me!

Todd Cullen said...

How do you get the Closure compiler to invoke the javascript parse.js file inside requireJS? I've been searching for a way to access a javascript AST from within JS for a couple days. The V8 compiler is well written but only accessible for C++. Narcissus (JS parser) can run on Spidermonkey but that's a serious pain to build from source. This method looks ideal :) Thanks for writing about it.

James Burke said...

Todd Cullen: I use Rhino via Java to run parse.js, and I include Closure Compiler in the classpath. Here is an example.

That file runs a build.js file, which loads parse.js via Rhino's load().

If you do not want to use Java, you may want to take a look at UglifyJS that is supposed to give you an AST but can run in environments like Node (disclaimer, I have not tried it yet, but I plan to at some point).