Gabriel Wicke, 16/03/2014 22:52:
In any case there needs to be some analysis on how much existing
wikitext would be affected by this. This can be done with a dump grepper
(we have one in the parsoid repo).
I didn't manage to use that
(<https://www.mediawiki.org/wiki/Talk:Parsoid/Setup#dumpGrepper.js>), but
I tried some grepping instead. Do you have something specific in mind? As
long as we terminate the list item when another list prefix appears at the
beginning of the next line (i.e. a new list item starts), disruption
should be minimal, I'd think?
From the looks of it, most such unclosed tags are <small> tags which
are applied to multiple items of a list. I don't know how legal/sane
that can be considered, but for "multiline tags" I think we can settle on
a stricter definition if one doesn't exist yet: mostly, I'd say,
blockquote, pre, span and div (pre is already handled, though buggily).[1]
As for the PHP parser, if we're lucky maybe it's enough to combine the
lines in question after
$textLines = StringUtils::explode( "\n", $text );
and before the "List generation" block? It might also be an occasion to
fix some of the bugs with that <pre> block as a byproduct.
<https://git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/includes%2Fparser%2FParser.php#L2368>
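To make the idea concrete, here is a minimal sketch in Python (not the PHP parser's actual code) of the pre-pass described above: after splitting the text into lines, a list item that opens one of the "multiline tags" without closing it absorbs the following lines until the tag is closed or a new list prefix starts. The names MULTILINE_TAGS, unclosed_tag and combine_list_lines are made up for illustration, and the tag list is only the stricter definition suggested earlier.

```python
import re

# Assumption: only these tags are allowed to span lines inside a list item.
MULTILINE_TAGS = ("blockquote", "pre", "span", "div")
LIST_PREFIX = re.compile(r"^[#*:;]+")
OPEN_TAG = re.compile(r"<(%s)(?=[ >])" % "|".join(MULTILINE_TAGS))


def unclosed_tag(line):
    """Return the first multiline tag opened but not closed on this line, else None."""
    for m in OPEN_TAG.finditer(line):
        if "</" + m.group(1) not in line[m.end():]:
            return m.group(1)
    return None


def combine_list_lines(text):
    """Merge continuation lines into the list item that left a tag open."""
    combined = []
    pending = None  # name of the tag we are still waiting to see closed
    for line in text.split("\n"):
        if pending is not None and not LIST_PREFIX.match(line):
            # Continuation of the previous list item: keep it in one element.
            combined[-1] += "\n" + line
            if "</" + pending in line:
                pending = None
            continue
        # Either nothing is pending, or a new list prefix ends the open item.
        combined.append(line)
        pending = unclosed_tag(line) if LIST_PREFIX.match(line) else None
    return combined
```

With this sketch, "* a <div>\nb\nc</div>" comes back as a single element, while "* a <div>\n* b" stays as two items because the second list prefix terminates the first item. Nesting and a tag reopened on a continuation line are deliberately ignored here.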
Nemo
[1] $ time bzgrep --perl-regexp -c \
'^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$' \
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602
real 7m40.029s
user 7m24.904s
sys 0m14.621s
vs.
$ time bzgrep --perl-regexp -c \
'^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$' \
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344
real 8m7.467s
user 7m52.158s
sys 0m14.813s
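For trying the pattern on a handful of sample lines before spending several minutes on a full dump, the same regex can be built in Python, roughly as below. This is only a sketch: build_pattern and count_unclosed are hypothetical helper names, and it counts matching lines the way grep -c does.

```python
import re


def build_pattern(tags):
    """Build the regex used above for an arbitrary list of tag names."""
    alternation = "|".join(re.escape(t) for t in tags)
    # List prefix, then an opening tag with no matching close tag after it
    # on the same line (the lookahead rejects any "</tag" up to end of line).
    return re.compile(r"^[#*:;]+.*<(" + alternation + r")( |>)((?!</\1).)*$")


def count_unclosed(lines, tags):
    """Count lines that are list items with an unclosed tag, like grep -c."""
    pattern = build_pattern(tags)
    return sum(1 for line in lines if pattern.search(line.rstrip("\n")))


sample = [
    "* foo <div>bar",         # list item, div left open
    "* foo <div>bar</div>",   # list item, div closed
    "foo <div>",              # not a list item
    "# x <small>y",           # small left open
    ": <span class=a>z",      # span left open
]
strict = count_unclosed(sample, ["blockquote", "span", "div", "pre"])
loose = count_unclosed(sample, ["blockquote", "span", "div", "pre", "small"])
```

The lines could presumably also come from the dump itself, e.g. by iterating over bz2.open(path, "rt", encoding="utf-8"), though that will likely be slower than bzgrep.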