[Wikitext-l] Revisiting lexer lookahead

Fri Aug 27 15:41:16 UTC 2010

In my post about tokenizing wikitext, I had overlooked one important
detail in the lexer.  The combination of lookahead and context
sensitive token productions will not play well together.

The token productions executes actions.  For the sake of this
discussion, these actions can be divided into two categories: context
updating actions, and other actions.

If during lookahead, the same context updating actions are not
executed as they would be when actually consuming the input stream,
the lookahead will be based on an incorrect context, and hence, its
conclusion may be incorrect.  Meanwhile, other actions must not be
executed during a lookahead.  Furthermore, the context must be
restored to the point where the lookahead started after it has
completed.

The problem now is that writing lookahead rules that executes the
actions correctly will become very complex and prone to error.  Also,
invoking a new instance of the lexer to perform the lookahead seems
like a lot of overhead.  I will instead explore a more elegant
solution: speculative execution.

The implementation is very simple, at a decision point, save a "mark"
that contains a mark in the input stream, a copy of the context and a
mark in the output channels.  Speculatively execute one of the
decisions.  The execution will eventually reach either a success
condition or a failure condition.  On a success condition, the mark is
removed and the execution permanently accepted.  On a failure, the
input stream is rewinded, the context restored, the produced tokens
are taken back, and the alternative decision is taken.

Now, the antlr C runtime doesn't actually support taking back tokens
that have been produced.  It happens to be easy to add such a feature,
but I don't think that it is likely to be included in an official
antlr release.  So the drawback is that we will permanently be
depending on a customized runtime.

There are six different situations where speculative execution would
be initiated:

+------------------+---------------------+-------------+---------------------+
|Token             |Start condition      |Success      |Failure 
condition    |
|production        |                     |condition    
|                     |
+------------------+---------------------+-------------+---------------------+
|HEADING_BEGIN     |1-6 '=' characters at|A matching   |Otherwise EOL or 
EOF |
|                  |BOL                  |number of    
|                     |
|                  |                     |'='s 
followed|                     |
|                  |                     |by zero or   
|                     |
|                  |                     |more spaces  
|                     |
|                  |                     |and EOL      
|                     |
+------------------+---------------------+-------------+---------------------+
|INDENT            |1 or more spaces or  |EOL or EOF   |Block element 
token  |
|                  |tabs at BOL          |             
|                     |
|                  |                     |             
|                     |
|                  |                     |             
|                     |
+------------------+---------------------+-------------+---------------------+
|INTERNAL_LINK     |'[[Link|'            |']]'         |'[[' or 
EOF          |
|                  |                     |             
|                     |
+------------------+---------------------+-------------+---------------------+
|EXTERNAL_LINK     |'[http://example.com |']' not      |'[' (including 
start |
|                  |'                    |inside       |of INTERNAL_LINK 
and |
|                  |                     |INTERNAL_LINK|MEDIA_LINK), 
EOL,    |
|                  |                     |or 
MEDIA_LINK|EOF.                 |
|                  |                     |             
|                     |
|                  |                     |             
|                     |
+------------------+---------------------+-------------+---------------------+
|MEDIA_LINK        |'[[File:image.png|'  |']]'         |MEDIA_LINK 
start     |
|                  |                     |             |condition, 
EOF       |
|                  |                     |             
|                     |
|                  |                     |             
|                     |
+------------------+---------------------+-------------+---------------------+
|MEDIA_LINK_CAPTION|'|' <non-option>     |']]'         |another 
'|'          |
|                  |                     |             |<non-option>, 
EOF    |
|                  |                     |             
|                     |
|                  |                     |             
|                     |
+------------------+---------------------+-------------+---------------------+

Notice that 'EOF' is a condition that must be matched.  This is a
problem as it is impossible to match the empty string at EOF---the
runtime simply will not call the lexer rules when the input stream has
reached its end.  This means that an additional feature is required
from the runtime: the ability to install an 'end-of-file-action', with
the ability to rewind the input stream.  This is, fortunately, trivial
to implement.

The media link caption speculation is required if the current behavior
is to be reproduced exactly.  However, I don't think that the extreme
complexity of reproducing this behavior is justified, so I will write
a separate post on how to make a reasonable simplification of this
later.

One important question is whether speculative execution could
potentially lead to super linear worst case scanning times.  So, how
would the worst possibly crafted input look like?

HEADING_BEGIN and INDENT cannot be combined.  Also, it is in
all cases impossible for the start condition to occur inside of a
speculative production (assuming that medialinks are restricted
from nesting).  On top of that, external links are disabled inside
internal links.  So, at most three speculative executions will ever
be initiated at the same time. The worst case should look like this:

= [[File:image.png| [[Link|  [http://example.com very long line of text 
running to EOF ... [[

The execution would look like this:

try heading1, try media link, try internal link, (external link
disabled so don't try that here), fail internal link, reverse, try
external link, fail external link, reverse, fail medialink, reverse,
try internal link, fail internal link, reverse try external link, fail
external link, reverse, fail heading1, reverse, try medialink, try
internal link, fail internal link, reverse, try external link, fail
external link, reverse, link, fail media link, reverse, try internal
link, fail internal link, reverse, try external link, fail external
link, reverse, succeed.

So, some contents may be scanned 12 times, but this still means worst
case linear time.  This should be investigated more rigorously,
however.  It would also be possible to use a hash table with failed
speculations to avoid the redundant retries altogheter.  (I'm actually
storing the last failure point for any speculation, so the line in the
above example would only be scanned 4 times.  However, the 12 times
scanning can still be provoked by adding additional start conditions.)

It should be pointed out that the only speculation that is expected to
ever fail inside a well formed wiki page is the INDENT speculation.
Any other failure is either due to a typo or due to a malicious user
trying to DoS attack the server.  If 12 scans is the worst
possible scenario, such an attack will not be very efficient.

/Andreas