In my post about tokenizing wikitext, I overlooked one important detail in the lexer: lookahead and context-sensitive token productions do not play well together.
The token productions execute actions. For the sake of this discussion, these actions can be divided into two categories: context-updating actions and other actions.
If, during lookahead, the same context-updating actions are not executed as they would be when actually consuming the input stream, the lookahead will be based on an incorrect context, and its conclusion may therefore be wrong. The other actions, meanwhile, must not be executed at all during a lookahead. Furthermore, once the lookahead has completed, the context must be restored to the state it had at the point where the lookahead started.
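To make the distinction concrete, here is a rough sketch in C. The struct and the names are made up for this post; they are not taken from the real grammar or from the ANTLR C runtime:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical lexer context; the real lexer tracks more state. */
    typedef struct {
        bool in_internal_link;   /* set while scanning an internal link */
        int  heading_level;      /* number of '=' seen at BOL           */
    } lexer_context;

    /* A context-updating action: it must take effect during lookahead
     * as well, or later decisions will see the wrong context, and it
     * must be undone once the lookahead is over. */
    void enter_internal_link(lexer_context *ctx) {
        ctx->in_internal_link = true;
    }

    /* An "other" action: it has an externally visible effect (emitting
     * a token), so it must not run during a mere lookahead. */
    void emit_token(const char *type, const char *text) {
        printf("%s(%s)\n", type, text);
    }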
The problem is that writing lookahead rules that execute the actions correctly becomes very complex and error prone. Invoking a new instance of the lexer just to perform the lookahead, on the other hand, seems like a lot of overhead. I will instead explore a more elegant solution: speculative execution.
The implementation is very simple: at a decision point, save a "mark" that consists of a mark in the input stream, a copy of the context and a mark in the output channels. Then speculatively execute one of the alternatives. The execution will eventually reach either a success condition or a failure condition. On success, the mark is discarded and the execution is permanently accepted. On failure, the input stream is rewound, the context is restored, the produced tokens are taken back, and the alternative decision is taken.
Now, the ANTLR C runtime doesn't actually support taking back tokens that have already been produced. Such a feature happens to be easy to add, but I don't think it is likely to be included in an official ANTLR release. The drawback, then, is that we will be permanently dependent on a customized runtime.
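To be a bit more concrete about the mark, here is a minimal sketch, assuming a hand-rolled lexer state with a flat input buffer and a counted token buffer. None of these types or functions exist in the ANTLR C runtime; they only illustrate what has to be saved and restored:

    #include <stddef.h>

    typedef struct { int in_internal_link; int heading_level; } lexer_context;

    typedef struct {
        const char    *input;        /* the whole input buffer        */
        size_t         pos;          /* current position in the input */
        lexer_context  ctx;          /* context-sensitive state       */
        size_t         token_count;  /* number of tokens emitted      */
    } lexer_state;

    /* The "mark" taken at a decision point: enough to undo everything. */
    typedef struct {
        size_t        pos;          /* mark in the input stream      */
        lexer_context ctx;          /* copy of the context           */
        size_t        token_count;  /* mark in the output channel(s) */
    } speculation_mark;

    speculation_mark speculate_begin(const lexer_state *st) {
        speculation_mark m = { st->pos, st->ctx, st->token_count };
        return m;
    }

    /* Success condition reached: just drop the mark and keep going. */
    void speculate_accept(lexer_state *st, speculation_mark *m) {
        (void)st;
        (void)m;
    }

    /* Failure condition reached: rewind the input, restore the context
     * and take back every token produced since the mark.  Truncating
     * the token buffer is the part the stock runtime does not offer. */
    void speculate_fail(lexer_state *st, const speculation_mark *m) {
        st->pos         = m->pos;
        st->ctx         = m->ctx;
        st->token_count = m->token_count;
    }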
There are six different situations where speculative execution would be initiated:
+--------------------+----------------------+----------------------+----------------------+
| Token production   | Start condition      | Success condition    | Failure condition    |
+--------------------+----------------------+----------------------+----------------------+
| HEADING_BEGIN      | 1-6 '=' characters   | A matching number of | Otherwise EOL or EOF |
|                    | at BOL               | '='s followed by     |                      |
|                    |                      | zero or more spaces  |                      |
|                    |                      | and EOL              |                      |
+--------------------+----------------------+----------------------+----------------------+
| INDENT             | 1 or more spaces or  | EOL or EOF           | Block element token  |
|                    | tabs at BOL          |                      |                      |
+--------------------+----------------------+----------------------+----------------------+
| INTERNAL_LINK      | '[[Link|'            | ']]'                 | '[[' or EOF          |
+--------------------+----------------------+----------------------+----------------------+
| EXTERNAL_LINK      | '[http://example.com | ']' not inside       | '[' (including start |
|                    | '                    | INTERNAL_LINK or     | of INTERNAL_LINK and |
|                    |                      | MEDIA_LINK           | MEDIA_LINK), EOL,    |
|                    |                      |                      | EOF.                 |
+--------------------+----------------------+----------------------+----------------------+
| MEDIA_LINK         | '[[File:image.png|'  | ']]'                 | MEDIA_LINK start     |
|                    |                      |                      | condition, EOF       |
+--------------------+----------------------+----------------------+----------------------+
| MEDIA_LINK_CAPTION | '|' <non-option>     | ']]'                 | Another '|'          |
|                    |                      |                      | <non-option>, EOF    |
+--------------------+----------------------+----------------------+----------------------+
Notice that 'EOF' appears as a condition that must be matched. This is a problem, because it is impossible to match the empty string at EOF---the runtime simply will not call the lexer rules once the input stream has reached its end. This means that an additional feature is required from the runtime: the ability to install an 'end-of-file action' that is allowed to rewind the input stream. This is, fortunately, trivial to implement.
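Something along these lines, again with made-up names rather than real ANTLR C runtime calls; the real change would go into the runtime's token-source loop:

    #include <stddef.h>

    typedef struct lexer_state lexer_state;

    /* Hypothetical end-of-file action.  It is called once the input is
     * exhausted, before EOF is reported, and it may rewind the input
     * stream; it returns nonzero if it did, so scanning can resume. */
    typedef int (*eof_action)(lexer_state *st);

    struct lexer_state {
        size_t     pos, len;    /* position in, and length of, the input */
        eof_action on_eof;      /* installed by the grammar's actions    */
    };

    void run_lexer(lexer_state *st) {
        for (;;) {
            if (st->pos >= st->len) {
                /* The rules are never consulted here, so the hook is the
                 * only chance for a pending speculation to succeed or
                 * fail at EOF and possibly rewind. */
                if (st->on_eof && st->on_eof(st))
                    continue;       /* the action rewound: keep scanning */
                break;              /* genuine end of input              */
            }
            st->pos++;              /* stand-in for matching one token   */
        }
    }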
The media link caption speculation is required if the current behavior is to be reproduced exactly. However, I don't think that the extreme complexity of reproducing this behavior is justified, so I will write a separate post later on how to simplify it in a reasonable way.
One important question is whether speculative execution could lead to super-linear worst-case scanning times. So, what would the worst possible crafted input look like?
HEADING_BEGIN and INDENT cannot be combined. Also, in no case can a production's start condition occur inside its own speculation (assuming that media links are restricted from nesting). On top of that, external links are disabled inside internal links. So at most three speculative executions will ever be in progress at the same time. The worst case should look like this:
= [[File:image.png| [[Link| [http://example.com very long line of text running to EOF ... [[
The execution would look like this:
try heading1,
  try media link,
    try internal link (external link is disabled here, so it is not tried), fail internal link, reverse,
    try external link, fail external link, reverse,
  fail media link, reverse,
  try internal link, fail internal link, reverse,
  try external link, fail external link, reverse,
fail heading1, reverse,
try media link,
  try internal link, fail internal link, reverse,
  try external link, fail external link, reverse,
fail media link, reverse,
try internal link, fail internal link, reverse,
try external link, fail external link, reverse,
succeed.
So, some content may be scanned 12 times, but this still means worst-case linear time. This should be investigated more rigorously, however. It would also be possible to use a hash table of failed speculations to avoid the redundant retries altogether. (I am actually storing the last failure point for any speculation, so the line in the above example would only be scanned 4 times. The 12-times scanning can still be provoked by adding additional start conditions, however.)
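For reference, the last-failure-point bookkeeping can be as small as the following sketch, under the assumption that a retry is redundant exactly when the same kind of speculation is started again at the same input offset; a full hash table keyed on (kind, offset) would suppress all redundant retries, not just the most recent one:

    #include <stdbool.h>
    #include <stddef.h>

    /* The speculation kinds from the table above. */
    typedef enum {
        SPEC_HEADING, SPEC_INDENT, SPEC_INTERNAL_LINK,
        SPEC_EXTERNAL_LINK, SPEC_MEDIA_LINK, SPEC_MEDIA_LINK_CAPTION,
        SPEC_KIND_COUNT
    } spec_kind;

    /* Start offset of the most recently failed speculation of each
     * kind; (size_t)-1 means "no failure recorded yet". */
    static size_t last_failed_start[SPEC_KIND_COUNT] = {
        (size_t)-1, (size_t)-1, (size_t)-1,
        (size_t)-1, (size_t)-1, (size_t)-1
    };

    /* True if a speculation of this kind starting at `start` has
     * already failed, so there is no point in retrying it. */
    bool known_to_fail(spec_kind kind, size_t start) {
        return last_failed_start[kind] == start;
    }

    /* Called from the failure path to record where the attempt began. */
    void record_failure(spec_kind kind, size_t start) {
        last_failed_start[kind] = start;
    }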
It should be pointed out that the only speculation that is expected to ever fail on a well-formed wiki page is the INDENT speculation. Any other failure is due either to a typo or to a malicious user trying to mount a DoS attack against the server. If 12 scans is the worst possible scenario, such an attack will not be very efficient.
/Andreas
wikitext-l@lists.wikimedia.org