More JavaCC optimizations

24 Jan 2008

Paul Cager has been improving JavaCC again - this time he reduced the amount of object allocation done by a JavaCC-generated lexer.

This began with a nicely detailed bug report filed by s_fuhrm, which showed that a new StringBuffer was being created for every token the lexer matched, when a single StringBuffer could simply be reused and cleared after each match. The change that Paul implemented is especially nice in that it also eliminates an if statement (a null check), so that's a small extra performance boost.
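To make the idea concrete, here's a minimal sketch of the two approaches in plain Java. It is not the code JavaCC actually generates; the class and method names (BufferReuseSketch, lexAllocating, lexReusing) are made up for illustration, and the "lexer" just walks over an array of already-split lexemes.

```java
import java.util.ArrayList;
import java.util.List;

class BufferReuseSketch {
    /** Token images collected during a pass, plus the number of buffers allocated. */
    static class Result {
        final List<String> images = new ArrayList<>();
        int allocations = 0;
    }

    // Hypothetical "before" shape: a fresh buffer per token, created lazily.
    static Result lexAllocating(String[] lexemes) {
        Result result = new Result();
        for (String lexeme : lexemes) {
            StringBuffer image = null;          // nothing carried over from the last token
            image = append(image, lexeme, result);
            result.images.add(image.toString());
        }
        return result;
    }

    // Lazy creation forces a null check on every append.
    static StringBuffer append(StringBuffer image, String text, Result result) {
        if (image == null) {
            image = new StringBuffer();
            result.allocations++;
        }
        return image.append(text);
    }

    // Hypothetical "after" shape: one buffer, cleared in place between tokens.
    static Result lexReusing(String[] lexemes) {
        Result result = new Result();
        StringBuffer image = new StringBuffer();
        result.allocations++;
        for (String lexeme : lexemes) {
            image.setLength(0);                 // clear without allocating
            image.append(lexeme);               // no null check needed
            result.images.add(image.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lexemes = {"b12", "a", "b42"};
        System.out.println("allocating: " + lexAllocating(lexemes).allocations + " buffers");
        System.out.println("reusing:    " + lexReusing(lexemes).allocations + " buffers");
    }
}
```

Both passes produce the same token images; the only difference is how many StringBuffer objects get created along the way.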

The only gotcha is that if your lexical actions hold on to a reference to the image variable, you'll start getting different results. For example, suppose you had a lexical specification like this:

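TOKEN_MGR_DECLS : {
  // Holds a reference to the image of the most recent B token
  static StringBuffer lastB = new StringBuffer();
  static void p() {
    System.out.println("lastB is : " + lastB);
  }
}
// Ignore the spaces between tokens in the sample input
SKIP : { " " }
TOKEN : {
  <A : "a"> { p(); }
  | <B : "b" (["1"-"9"])* > { p(); lastB = image; }
}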

With JavaCC 4.0, the image buffer was never reused, so with input data of b12 a b42 you'd get output like this:

lastB is :
lastB is : b12
lastB is : b12

In other words, the image object that lastB references would stick around unchanged. With this change in place, image (like the Matrix) is reloaded for every token, so lastB sees whatever the lexer matched most recently, and you'll get this:

lastB is :
lastB is : a
lastB is : b42
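If the difference between the two listings isn't obvious, the following plain-Java simulation may help. It's only a sketch: ImageAliasingDemo and its run method are hypothetical names, and the loop stands in for the generated token manager by treating any lexeme starting with "b" as a B token.

```java
import java.util.ArrayList;
import java.util.List;

class ImageAliasingDemo {
    // Simulates the lexical actions p() and "lastB = image" from the grammar above.
    // reuseBuffer=false models one buffer per token; true models a single shared buffer.
    static List<String> run(String[] lexemes, boolean reuseBuffer) {
        List<String> printed = new ArrayList<>();
        StringBuffer lastB = new StringBuffer();
        StringBuffer image = new StringBuffer();
        for (String lexeme : lexemes) {
            if (reuseBuffer) {
                image.setLength(0);             // same object, cleared in place
            } else {
                image = new StringBuffer();     // fresh object per token
            }
            image.append(lexeme);
            printed.add("lastB is : " + lastB); // what p() would print
            if (lexeme.startsWith("b")) {
                lastB = image;                  // the B action keeps a reference, not a copy
            }
        }
        return printed;
    }

    public static void main(String[] args) {
        String[] lexemes = {"b12", "a", "b42"};
        System.out.println("One buffer per token:");
        run(lexemes, false).forEach(System.out::println);
        System.out.println("Single reused buffer:");
        run(lexemes, true).forEach(System.out::println);
    }
}
```

With a reused buffer, lastB and image are the same object after the first B token, so p() ends up printing the token currently being matched.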

One solution is to use matchedToken.image instead, which is already a String. Alternatively, you could call toString on the image reference to get a String copy of its current contents, which won't change when the buffer is reused. You can see an example of this on page 59 of Generating Parsers with JavaCC. Finally, if you want to give your grammar a whirl with this change, I've posted a new javacc.jar built from the latest code in CVS here. Enjoy!
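Here's a small sketch of why the toString approach works, again in plain Java rather than in a grammar file. ImageCopyDemo and aliasVersusCopy are hypothetical names; the buffer operations stand in for what the token manager would do between two tokens.

```java
class ImageCopyDemo {
    // Returns { what a kept reference sees, what a toString() copy sees }
    // after the shared buffer has moved on to the next token.
    static String[] aliasVersusCopy() {
        StringBuffer image = new StringBuffer("b12"); // shared buffer holding token B's text
        StringBuffer alias = image;                   // like "lastB = image"
        String copy = image.toString();               // independent snapshot of the text

        image.setLength(0);                           // lexer clears the buffer...
        image.append("a");                            // ...and fills it with the next token

        return new String[] { alias.toString(), copy };
    }

    public static void main(String[] args) {
        String[] seen = aliasVersusCopy();
        System.out.println("kept reference sees: " + seen[0]);
        System.out.println("toString copy sees:  " + seen[1]);
    }
}
```

In the grammar from earlier, that would presumably mean declaring lastB as a String and assigning image.toString() (or matchedToken.image) in the B action.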