A fast, custom-written parser for JavaScriptCore. Human creativity vs. the code generator.

Perhaps the first question to pop into somebody's mind is: why do we need a fast JavaScript parser? A browser does so many things, so why focus on JavaScript parsing, which probably consumes only a small amount of time? We used to think the same, but oprofile yielded some surprising results: the JavaScript parser was often responsible for 15-20% of the runtime of libWebKit. Further analysis showed that popular websites usually come with big JavaScript source files (about 200-400k). These JS files contain a lot of source code, much of it browser-specific. The expensive operation here is the syntax checking of the whole source, since the browser must reject the execution of invalid JS files.

The parsers represent the front-end part of both compiled and dynamic languages. The two most important parts of a parser are the syntax checker and the Abstract Syntax Tree (AST) generator. Let's start with a short overview of the tasks performed by the parser in WebKit. The AST format of WebKit is rather complex: it contains several attributes which can be calculated at parsing time, such as "right has assignment" or "catch has eval" (in "a + b * (c = d)", the "right has assignment" attribute is true for both the addition and the multiplication AST nodes). Furthermore, constant folding is performed by the parser as well. WebKit generates the AST for a function right before its first execution, which is an important optimization, since both performance (faster startup time) and memory consumption (no need to allocate memory for unused function bodies) benefit from it. However, the syntax checking is repeated on these nested function bodies before their first execution. The situation gets worse if the function is nested inside another nested function:

// program body: parsed 1x
function f() {
    // f() parsed 2x
    function g() {
        // g() parsed 3x
        function h() {
            // h() parsed 4x
        }
        h();
    }
    g();
}
f();

Although the syntax checking is repeated multiple times for a nested function body, the AST is generated only once. The custom-written parser can be configured to generate the AST eagerly for nested functions up to a given depth. If we set this depth argument to 2 in the previous example, the program body and f are parsed only once, and both g and h are parsed twice. In some situations this could be a useful feature.
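As an illustration, here is a minimal, self-contained sketch of how such a depth threshold could work in a recursive descent parser. This is not the WebKit code; the names (ParserConfig, astDepthLimit, parseFunctionBody) and the structure are assumptions made for the example.

#include <cstdio>

// Hypothetical sketch: AST generation is performed eagerly only for function
// bodies whose nesting depth does not exceed a configurable limit; deeper
// bodies are merely syntax checked, and their AST is built on first execution.
struct ParserConfig {
    int astDepthLimit;
};

struct Parser {
    ParserConfig config;
    int functionDepth = 0;

    // 'nestedBodies' stands in for the nested function definitions that a
    // real parser would discover while scanning the tokens of this body.
    void parseFunctionBody(int nestedBodies) {
        ++functionDepth;
        if (functionDepth <= config.astDepthLimit)
            std::printf("depth %d: syntax check + AST generation\n", functionDepth);
        else
            std::printf("depth %d: syntax check only, AST deferred\n", functionDepth);
        if (nestedBodies > 0)
            parseFunctionBody(nestedBodies - 1); // recurse into the nested function
        --functionDepth;
    }
};

int main() {
    Parser parser;
    parser.config.astDepthLimit = 2;
    // Mirrors the earlier example: program body -> f() -> g() -> h()
    parser.parseFunctionBody(3);
}

With the limit set to 2, the two outermost bodies get their AST on the first parse, while g and h are only syntax checked and will be parsed again before their first execution, just as described above.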

Now let's continue with an overview of the custom-written parser. The parser is basically a predictive parser, since it does not use any fall-back mechanism. Predictive parsers belong to the family of recursive descent parsers. Since it is a hand-written parser, we can employ speed optimization techniques which are not covered by the theory.
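To make the term concrete, here is a toy predictive recursive descent parser: the next token alone selects the production to apply, and if that choice turns out to be wrong there is no backtracking, it is simply a syntax error. The grammar and names are invented for the example and have nothing to do with the actual WebKit sources.

#include <cstdio>
#include <stdexcept>
#include <vector>

// Toy grammar: statement := 'var' identifier ';' | identifier ';'
enum class Token { Var, Identifier, Semicolon, End };

struct ToyParser {
    std::vector<Token> tokens;
    size_t pos = 0;

    Token peek() const { return tokens[pos]; }
    void consume(Token expected) {
        if (peek() != expected)
            throw std::runtime_error("syntax error");
        ++pos;
    }

    // Predictive: the lookahead token decides the rule, no fall-back.
    void parseStatement() {
        switch (peek()) {
        case Token::Var:
            consume(Token::Var);
            consume(Token::Identifier);
            consume(Token::Semicolon);
            std::puts("parsed a var statement");
            break;
        case Token::Identifier:
            consume(Token::Identifier);
            consume(Token::Semicolon);
            std::puts("parsed an expression statement");
            break;
        default:
            throw std::runtime_error("syntax error");
        }
    }
};

int main() {
    ToyParser p{{Token::Var, Token::Identifier, Token::Semicolon, Token::End}};
    p.parseStatement(); // corresponds to "var x;"
}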

Probably the most important technique is grouping. In practice, there are many language constructs which are parsed in a similar way but with small differences, like a "var" statement inside or outside of a "for" statement. A parser generator often duplicates a large amount of code, since it cannot see the intention behind a particular rule. The same holds for compilers: a clever developer can always write faster code, because no matter how the compiler optimizes the code, it cannot see the intentions of the developer.
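A sketch of what grouping means in practice, with invented names: a single hand-written routine parses the "var" declaration list in both contexts, and only the small difference at the end is parameterized, whereas a generated parser would typically emit two nearly identical rules.

#include <cstdio>

// Illustrative sketch of "grouping" (names are assumptions, not WebKit's):
// one routine handles the "var" declaration list both as a standalone
// statement and inside a "for (var ...;;)" header, so nothing is duplicated.
enum class VarContext { Statement, ForHeader };

void parseVarDeclarationList(VarContext context) {
    // ... shared parsing of "var x = expr, y = expr, ..." would go here ...
    if (context == VarContext::Statement) {
        // A standalone "var" statement is terminated by a semicolon
        // (or by automatic semicolon insertion).
        std::puts("var statement: expect ';'");
    } else {
        // Inside a for header, the list stops at ';' or 'in', which the
        // enclosing for-statement rule consumes.
        std::puts("for header: stop at ';' or 'in'");
    }
}

int main() {
    parseVarDeclarationList(VarContext::Statement);
    parseVarDeclarationList(VarContext::ForHeader);
}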

The custom-written parser has a simple operation model. It employs a stack to store the parsing data. The parsing data can contain tokens, AST nodes, anything that may depend on future tokens. Think about the following expression statement: "a + b;". The first three tokens are simply pushed onto the stack ("b" will be the topmost item), since the input could continue as an expression like "a + b * c;", so the parser cannot convert the expression to any node yet. The last token is a semicolon, which causes the topmost three items to be converted into an addition node with resolve (identifier) nodes as its left and right children. Since the semicolon also terminates the statement, an expression statement node is created which contains the addition node as its only child. The expression statement is added to the statement list of the current scope.

However, if the next token were an equality operator ( == ), which has a lower precedence than the addition operator, the addition node would still be created, but the expression statement node would not be (yet). The stack would then contain the addition node and the equality operator. The custom-written parser is greedy in the sense that it creates nodes as soon as possible.

There is another 'parsing status' variable, which is used to filter out tokens for faster syntax checking. For example, if this variable is set to UnaryMode, we only accept unary operators or primary expressions; a multiply operator ( * ) surely causes a syntax error in this case, so we don't need to perform any further checks. If someone is interested in how the parser works in more detail, just write some simple JS code, put a breakpoint into Grammar<...>::parse(), and use the next/step commands to follow the parsing.
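The following self-contained sketch mimics this stack model on the "a + b == c;" case. It is only an illustration of when nodes get created; the names and the string-based "nodes" are made up, and the real parser of course stores tokens and AST nodes rather than strings.

#include <cstdio>
#include <string>
#include <vector>

// Illustrative sketch (not WebKit's code) of the stack-based model described
// above: operands and operators are pushed onto a stack, and a node is created
// greedily as soon as an incoming operator or terminator shows that the
// topmost operator cannot bind any further tokens.
struct Item {
    std::string text; // "a", "+", "==", or an already reduced node like "(a + b)"
    int precedence;   // 0 for operands/nodes; higher binds tighter
};

static std::vector<Item> parseStack;

// Combine the topmost operand-operator-operand triple into a single node.
static void reduceTop() {
    Item right = parseStack.back(); parseStack.pop_back();
    Item op    = parseStack.back(); parseStack.pop_back();
    Item left  = parseStack.back(); parseStack.pop_back();
    parseStack.push_back({"(" + left.text + " " + op.text + " " + right.text + ")", 0});
    std::printf("created node %s\n", parseStack.back().text.c_str());
}

static void pushOperand(const std::string& name) {
    parseStack.push_back({name, 0});
}

// Reduce while the operator already on the stack binds at least as tightly
// as the incoming one, then push the incoming operator.
static void pushOperator(const std::string& op, int precedence) {
    while (parseStack.size() >= 3 && parseStack[parseStack.size() - 2].precedence >= precedence)
        reduceTop();
    parseStack.push_back({op, precedence});
}

// A terminator such as ';' reduces everything that is left and then wraps the
// result into an expression statement.
static void finishStatement() {
    while (parseStack.size() >= 3)
        reduceTop();
    std::printf("expression statement: %s\n", parseStack.back().text.c_str());
    parseStack.clear();
}

int main() {
    // "a + b == c;"  --  the addition node is created as soon as '==' arrives,
    // because '+' has higher precedence; the '==' node waits until the ';'.
    pushOperand("a");
    pushOperator("+", 2);
    pushOperand("b");
    pushOperator("==", 1); // triggers creation of (a + b)
    pushOperand("c");
    finishStatement();     // creates ((a + b) == c) and the statement node
}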

The results of SunSpider with --parse-only tests:

TEST                           COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:                   2.13x as fast     69.5ms +/- 1.3%   32.6ms +/- 1.8%     significant

=============================================================================

  jquery:                      2.14x as fast     10.7ms +/- 3.2%    5.0ms +/- 0.0%     significant
    1.3.2:                     2.14x as fast     10.7ms +/- 3.2%    5.0ms +/- 0.0%     significant

  mootools:                    2.17x as fast     11.3ms +/- 3.1%    5.2ms +/- 5.8%     significant
    1.2.2-core-nc:             2.17x as fast     11.3ms +/- 3.1%    5.2ms +/- 5.8%     significant

  prototype:                   2.05x as fast     12.7ms +/- 2.7%    6.2ms +/- 4.9%     significant
    1.6.0.3:                   2.05x as fast     12.7ms +/- 2.7%    6.2ms +/- 4.9%     significant

  concat:                      2.15x as fast     34.8ms +/- 0.9%   16.2ms +/- 1.9%     significant
    jquery-mootools-prototype: 2.15x as fast     34.8ms +/- 0.9%   16.2ms +/- 1.9%     significant

A list of the advantages of a custom-written parser:

    - Grouping. Actually, the parsing itself is less than one third of the code size; the other two thirds are related to AST generation. No code duplication is needed for the AST generation.
    - No need for shadow structures. In the case of the custom-written parser, the extra data (line info, attributes) required only by the AST generator is built together with the parsing data. This approach decreases memory consumption, since we don't need to allocate NodeInfo structures.
    - Easy to maintain. The parser has a simple design. The code itself is big, but if one doesn't try to read it like a book, it is easy to understand how it works.
    - Robust code. Syntax changes can be implemented fairly easily.
    - Control over the workings of the parser. Better resistance to slowdowns when something is changed. (Even simple rule changes can slow down a generated parser; see an example here.)
    - Fast. And this is what matters in the long run, especially for the customers.

The latest source code of the custom-written parser can be downloaded from the WebKit Bugzilla. The reviewers prefer a different approach to JavaScript parsing, so this work will not go mainline.

Anonymous (not verified) - 02/27/2010 - 14:11

Keep up the good work. A note: dynamic languages can be compiled (Python, for example), so "both compiled and dynamic languages" is not really correct.
