Gabriel Wicke, 16/03/2014 22:52:
> In any case there needs to be some analysis on how much existing wikitext would be affected by this. This can be done with a dump grepper (we have one in the parsoid repo).
I didn't manage to get that working (https://www.mediawiki.org/wiki/Talk:Parsoid/Setup#dumpGrepper.js), but I tried some grepping instead. Do you have something specific in mind? As long as we terminate the list item when the next line begins with another list prefix (i.e. a new list item), disruption should be minimal, I'd think?
From the looks of it, most such unclosed tags are <small> tags applied to multiple items of a list. I don't know how legal/sane that can be considered, but for "multiline tags" I think we can settle on a stricter definition if one doesn't exist yet: mostly, I'd say, blockquote, pre, span and div (pre is already handled, though buggily).[1]
As for the PHP parser, if we're lucky, maybe it's enough to combine the lines in question after $textLines = StringUtils::explode( "\n", $text ); and before the "List generation" block? It might also be an occasion to fix some of the bugs with that <pre> block as a byproduct. https://git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/includes%2Fparser%2FParser.php#L2368
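
To make the idea concrete, here is a rough sketch of such a line-combining pass, written in Python rather than PHP and not taken from Parser.php. The function names, the choice to keep the newline inside the combined line, and the simple "</tag" substring check are all my own assumptions for illustration; it does not track nested tags.

```python
import re

# Assumed "multiline tag" set from the paragraph above; everything here is hypothetical.
LIST_PREFIX = re.compile(r'^[#*:;]+')
MULTILINE_TAGS = ("blockquote", "pre", "span", "div")
OPEN_TAG = re.compile(r'<(%s)[ >]' % "|".join(MULTILINE_TAGS))


def unclosed_tag(line):
    """Return the first multiline tag opened but not closed later on this line, else None."""
    for m in OPEN_TAG.finditer(line):
        if "</" + m.group(1) not in line[m.end():]:
            return m.group(1)
    return None


def combine_list_lines(text):
    """Merge continuation lines into a list item that leaves a multiline tag open.

    Merging stops after the line containing the closing tag, or just before
    a line that starts with a list prefix (a new list item).
    """
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        tag = unclosed_tag(line) if LIST_PREFIX.match(line) else None
        if tag is not None:
            parts = [line]
            while i < len(lines) and not LIST_PREFIX.match(lines[i]):
                parts.append(lines[i])
                i += 1
                if "</" + tag in parts[-1]:
                    break
            line = "\n".join(parts)
        out.append(line)
    return out


merged = combine_list_lines("* a <div>one\ntwo</div>\n* b\nplain")
split_by_new_item = combine_list_lines("* a <span>x\n* b</span>")
```

In the first call the div spans two lines and they end up as one logical line; in the second the new list item terminates the first one, so nothing is merged. One caveat: a tag that is never closed and is followed only by non-list lines would swallow the rest of the text, so a real implementation would need some limit there.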
Nemo
[1]
$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602

real 7m40.029s
user 7m24.904s
sys 0m14.621s

vs.

$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344

real 8m7.467s
user 7m52.158s
sys 0m14.813s
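
For anyone wanting to check what these two patterns do and don't catch without a dump at hand, here is a small Python sketch that applies the same regular expressions to a few made-up wikitext lines. The sample lines and the helper names are invented for illustration; the count mimics grep -c, i.e. the number of matching lines.

```python
import re

STRICT_TAGS = ["blockquote", "span", "div", "pre"]
BROAD_TAGS = STRICT_TAGS + ["center", "code", "del", "b", "em", "i", "u",
                            "font", "s", "small", "strike", "strong"]


def build_pattern(tags):
    # Same shape as the bzgrep pattern: a list prefix, then an opening tag
    # that is not followed by its closing tag before the end of the line.
    return re.compile(r'^[#*:;]+.*<(' + "|".join(tags) + r')( |>)((?!</\1).)*$')


def count_matching_lines(pattern, lines):
    """Count lines matching the pattern, like grep -c."""
    return sum(1 for line in lines if pattern.search(line))


# Invented sample lines, not taken from any dump.
samples = [
    "* item <span class=x>text",   # unclosed span in a list item
    "* item <span>text</span>",    # closed on the same line
    "plain <div>text",             # not a list item
    "# a <div>b</div> <div>c",     # last div left open
    "* <small>note",               # only in the broader tag set
]

strict_count = count_matching_lines(build_pattern(STRICT_TAGS), samples)
broad_count = count_matching_lines(build_pattern(BROAD_TAGS), samples)
```

Note that the pattern only looks at the last candidate opening tag on a line that has no matching close after it, so it is an approximation: it knows nothing about nesting or about tags closed on a later line.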