Thursday, August 20, 2009

GWT and regular expressions

I am working on a transcription framework for the Open Siddur/Jewish Liturgy projects. The framework is being developed as a web application using Google Web Toolkit, which saves me from having to write cross-browser JavaScript. GWT compiles Java to optimized JavaScript and provides a widget framework for making nice web applications.

The project uses the eXist XML database to serve and query the data. Some of the data is pointed to by relative URIs, and I needed to resolve those URIs to request the new data.

GWT provides a subset of the JRE, which does not include java.net.URI. GWT does not provide a method to resolve relative URIs AFAIK, so I decided to implement java.net.URI myself. Checking up on it, I found that Java's URI doesn't implement the latest specification (RFC 3986, AFAIK). So I looked up RFC 3986 and found the nifty regular expression that breaks a URI reference down into its components. So now it was simple, right? Just use java.util.regex.Pattern. But of course GWT's subset of the JRE does not include Pattern. So I hunted down a partial Pattern implementation for GWT and found one by Robert Hansen over at http://java2s.com/. Robert's library wraps the JavaScript regular expression implementation.

So now comes the tricky part. It worked fine in Firefox, but as usual there was a problem with IE. The regular expression relies on "non-participating groups": a capturing group that takes no part in the match is supposed to come back as undefined, not as an empty string.

From http://tools.ietf.org/html/rfc3986#appendix-B:

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

      http://www.ics.uci.edu/pub/ietf/uri/#Related
    results in the following subexpression matches:
      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

Instead of undefined values, IE returns empty strings, making it impossible to differentiate between an empty value and a nonexistent value. It turns out regular expressions are altogether a mess across browsers, not in IE alone. Well, this made easy, standards-compliant parsing a whole lot harder.
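To make the distinction concrete, here is a minimal sketch of what a compliant engine is expected to return for the Appendix B expression. The helper name parseUri and the second test URI are made up for illustration; only the regular expression itself comes from the RFC.

```javascript
// Appendix B regular expression from RFC 3986, copied unchanged.
const URI_PATTERN = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// parseUri is a hypothetical helper name, not part of any library.
function parseUri(text) {
  // Every group is optional, so the expression matches any string.
  const m = URI_PATTERN.exec(text);
  return {
    scheme: m[2],
    authority: m[4],
    path: m[5],
    query: m[7],
    fragment: m[9],
  };
}

// No "?" at all: the query group does not participate in the match.
const noQuery = parseUri("http://www.ics.uci.edu/pub/ietf/uri/#Related");
// A "?" followed by nothing: the query group participates and is empty.
const emptyQuery = parseUri("http://www.ics.uci.edu/pub/ietf/uri/?#Related");

console.log(noQuery.query === undefined); // true in a spec-compliant engine
console.log(emptyQuery.query === "");     // true
```

An engine that reports both cases as an empty string loses exactly the information that reference resolution needs later on.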

To the rescue comes XRegExp, a JavaScript library that wraps, fixes and adds functionality to the major browsers' regular expression implementations. So I included the source file in my project, did some manual coding and testing to see if IE returns the correct values, and lo! It does!

So I ran the application again, but no go. It just didn't work. I spent hours compiling, debugging (the bug only shows up in IE, so I had to get IE8 for its debugger), recompiling, stepping through, reading through the XRegExp source code, and then GWT's generated source code (not the easiest to read). What XRegExp does to fix the browser's behavior is replace String's match() function with its own match() function. For some reason, XRegExp's match() function wasn't getting called at all. I looked at the prototype of String, and it did not point to XRegExp's match(). I finally realized that all of GWT's generated JavaScript is executed inside a child frame, which has its own String prototype. XRegExp only changes the String prototype of the top-level frame, so the compiled code kept calling the browser's unpatched match() and produced the same output as before.

So I asked for possible solutions to the problem on ##gwt, and was told that all functions that require access to the top-level window should use $wnd to access it. So I fiddled around for a while and managed to create a (deep) copy of the string with the correct prototype and get the match result.

var result = new $wnd.String(text).match(new $wnd.RegExp(regExp));
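The effect can be reproduced outside a browser. The sketch below is only an analogy: it assumes Node.js and uses a vm context as a stand-in for the top-level window, and a dummy patch as a stand-in for XRegExp. The names context and wnd are made up for illustration.

```javascript
const vm = require("vm");

// Stand-in for the top-level window: a separate context with its own built-ins.
const context = vm.createContext({});

// Stand-in for XRegExp's fix: patch match() on that context's String prototype only.
vm.runInContext(
  "String.prototype.match = function () { return ['patched']; };",
  context
);

// Stand-in for $wnd: the constructors that belong to the patched context.
const wnd = {
  String: vm.runInContext("String", context),
  RegExp: vm.runInContext("RegExp", context),
};

// This script plays the role of GWT's child frame: its own String is untouched.
const plain = "abc".match(/b/);

// Wrapping the text in the other context's String picks up the patched match().
const viaWnd = new wnd.String("abc").match(new wnd.RegExp("b"));

console.log(plain[0]);                                   // "b"
console.log(viaWnd[0]);                                  // "patched"
console.log(wnd.String.prototype === String.prototype);  // false
```

Each frame (here, each context) has its own copy of the built-in prototypes, which is why a patch applied in one is invisible in the other until the object is constructed through the patched side.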

It was then just a matter of implementing Reference Resolution (http://tools.ietf.org/html/rfc3986#section-5), which I mostly completed, finally being able to resolve relative URIs on the client.
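For reference, the resolution algorithm of RFC 3986 section 5.2 can be sketched roughly as follows. This is not the code from the project, just a minimal strict-mode illustration in plain JavaScript; all function names are made up, and it assumes a regular expression engine that returns undefined for non-participating groups.

```javascript
// Appendix B regular expression from RFC 3986.
const URI_PATTERN = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

function parse(text) {
  const m = URI_PATTERN.exec(text);
  return { scheme: m[2], authority: m[4], path: m[5], query: m[7], fragment: m[9] };
}

// Section 5.2.4: remove "." and ".." segments from a path.
function removeDotSegments(path) {
  let input = path;
  const output = [];
  while (input.length > 0) {
    if (input.startsWith("../")) input = input.slice(3);
    else if (input.startsWith("./")) input = input.slice(2);
    else if (input.startsWith("/./")) input = input.slice(2);
    else if (input === "/.") input = "/";
    else if (input.startsWith("/../")) { input = input.slice(3); output.pop(); }
    else if (input === "/..") { input = "/"; output.pop(); }
    else if (input === "." || input === "..") input = "";
    else {
      // Move the first segment (with its leading "/", if any) to the output.
      const next = input.indexOf("/", 1);
      const end = next === -1 ? input.length : next;
      output.push(input.slice(0, end));
      input = input.slice(end);
    }
  }
  return output.join("");
}

// Section 5.2.3: merge a relative path with the base path.
function merge(base, refPath) {
  if (base.authority !== undefined && base.path === "") return "/" + refPath;
  return base.path.slice(0, base.path.lastIndexOf("/") + 1) + refPath;
}

// Section 5.3: put the components back together.
function recompose(t) {
  let result = "";
  if (t.scheme !== undefined) result += t.scheme + ":";
  if (t.authority !== undefined) result += "//" + t.authority;
  result += t.path;
  if (t.query !== undefined) result += "?" + t.query;
  if (t.fragment !== undefined) result += "#" + t.fragment;
  return result;
}

// Section 5.2.2: transform a reference into a target URI (strict parsing).
function resolve(baseText, refText) {
  const base = parse(baseText);
  const ref = parse(refText);
  const t = { fragment: ref.fragment };
  if (ref.scheme !== undefined || ref.authority !== undefined) {
    t.scheme = ref.scheme !== undefined ? ref.scheme : base.scheme;
    t.authority = ref.authority;
    t.path = removeDotSegments(ref.path);
    t.query = ref.query;
  } else {
    t.scheme = base.scheme;
    t.authority = base.authority;
    if (ref.path === "") {
      t.path = base.path;
      // An undefined query inherits the base query; an empty one does not.
      t.query = ref.query !== undefined ? ref.query : base.query;
    } else {
      t.path = ref.path.startsWith("/")
        ? removeDotSegments(ref.path)
        : removeDotSegments(merge(base, ref.path));
      t.query = ref.query;
    }
  }
  return recompose(t);
}

console.log(resolve("http://a/b/c/d;p?q", "../g")); // http://a/b/g
console.log(resolve("http://a/b/c/d;p?q", "#s"));   // http://a/b/c/d;p?q#s
```

The two calls at the end are taken from the examples in section 5.4 of the RFC. Note how the query handling depends on telling an undefined query apart from an empty one, which is the distinction IE's regular expressions lost.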
