Lists in wikitext are a pain. Proof: the number of duplicates of https://bugzilla.wikimedia.org/1581. In 2007 Aryeh Gregor proposed an amendment that would IMHO solve many problems and meet no opposition in principle: "Have multiline tags not terminate the list item until the tag is terminated" (https://bugzilla.wikimedia.org/show_bug.cgi?id=9996#c5). Does anyone disagree? How hard would it be to implement? (Probably rather or very hard, but who knows.)
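To make the problem concrete (my own illustration, not taken from the bug report), consider wikitext such as:

* first item <small>text that
continues on the next line</small>
* second item

Currently the first item is terminated at the line break, leaving the <small> open inside it; under Aryeh's amendment the open tag would keep the first item going until </small> is reached.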
For more general context on how I was led to this problem, and to discuss alternative solutions to my specific issue, please refer to https://meta.wikimedia.org/wiki/Help_talk:List#List-agnostic_markup_insertions instead.
Nemo
On 03/16/2014 01:49 PM, Federico Leva (Nemo) wrote:
How hard is it to implement? (Probably rather or very hard, but who knows.)
In general this is pretty hard to implement cleanly without a DOM. In Parsoid we currently implement list handling on the token stream, but could probably move it to the DOM in the longer run. Doing the same in the PHP parser is harder, and might not be worth it.
In any case there needs to be some analysis on how much existing wikitext would be affected by this. This can be done with a dump grepper (we have one in the parsoid repo).
Gabriel
Gabriel Wicke, 16/03/2014 22:52:
In any case there needs to be some analysis on how much existing wikitext would be affected by this. This can be done with a dump grepper (we have one in the parsoid repo).
I didn't manage to use that (https://www.mediawiki.org/wiki/Talk:Parsoid/Setup#dumpGrepper.js), but I tried some grepping. Do you have something specific in mind? As long as we terminate the list item when the next line begins with another list prefix (i.e. a new list item), disruption should be minimal, I'd think?
From the looks of it, most such unclosed tags are <small> tags applied to multiple items of a list. I don't know how legal/sane that can be considered, but for "multiline tags" I think we can settle on a stricter definition if one doesn't exist yet: mostly I'd say blockquote, pre, span, div (and pre is already handled, though buggily).[1]
As for the PHP parser, if we're lucky maybe it's enough to combine the lines in question after $textLines = StringUtils::explode( "\n", $text ); and before the "List generation" block? It might also be an occasion to fix some of the bugs with that <pre> block as a byproduct. https://git.wikimedia.org/blob/mediawiki%2Fcore.git/HEAD/includes%2Fparser%2FParser.php#L2368
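A rough sketch of such a line-combining pass, in Python rather than PHP for brevity (the tag whitelist and all helper names are my assumptions, not anything in Parser.php):

```python
import re

# Whitelisted "multiline" tags per the stricter definition above;
# the exact list is this thread's suggestion, not anything MediaWiki defines.
MULTILINE_TAGS = ('blockquote', 'span', 'div', 'pre')

LIST_PREFIX = re.compile(r'^[#*:;]')

def find_unclosed(text):
    """Very rough check: for each whitelisted tag, compare the number of
    opening and closing occurrences (ignores nesting edge cases)."""
    unclosed = []
    for tag in MULTILINE_TAGS:
        opens = len(re.findall(r'<%s[ >]' % tag, text))
        closes = text.count('</%s>' % tag)
        if opens > closes:
            unclosed.append(tag)
    return unclosed

def combine_lines(lines):
    """Merge each list line with its continuation lines while a whitelisted
    tag is still open, but stop as soon as a new list item starts."""
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if LIST_PREFIX.match(line):
            while (find_unclosed(line) and i < len(lines)
                   and not LIST_PREFIX.match(lines[i])):
                line += ' ' + lines[i]
                i += 1
        out.append(line)
    return out
```

The early stop at a new list prefix implements the rule discussed above: a following list item always starts fresh, even if a tag was left open, so existing pages stay mostly unaffected.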
Nemo
[1] $ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602

real    7m40.029s
user    7m24.904s
sys     0m14.621s
vs.
$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344

real    8m7.467s
user    7m52.158s
sys     0m14.813s
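As a sanity check on what that PCRE flags, the same pattern can be exercised with Python's re module, which supports the negative lookahead and the \1 backreference used here:

```python
import re

# Same pattern as the bzgrep above: a list line containing an opening
# whitelisted tag that is not closed before the end of the line.
pattern = re.compile(r'^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$')

assert pattern.match('* some item <div class="x">left open')   # flagged
assert not pattern.match('* closed <div>text</div>')           # tag closed on the same line
assert not pattern.match('no list prefix <div>left open')      # not a list line
```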
Gabriel Wicke, 16/03/2014 22:52:
On 03/16/2014 01:49 PM, Federico Leva (Nemo) wrote:
How hard is it to implement? (Probably rather or very hard, but who knows.)
In general this is pretty hard to implement cleanly without a DOM. In Parsoid we currently implement list handling on the token stream, but could probably move it to the DOM in the longer run. Doing the same in the PHP parser is harder, and might not be worth it.
In any case there needs to be some analysis on how much existing wikitext would be affected by this. This can be done with a dump grepper (we have one in the parsoid repo).
I wasn't able to use that one, but I made some simple counts which I then forgot in a screen session and didn't post.
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602

real    7m40.029s
user    7m24.904s
sys     0m14.621s

nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344

real    8m7.467s
user    7m52.158s
sys     0m14.813s

nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2588

real    7m45.508s
user    7m29.936s
sys     0m14.581s

nemobis@dumps-2:~$ time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^#' ; time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^*'
1655833

real    6m10.518s
user    6m10.935s
sys     0m8.141s
154647451

real    6m13.748s
user    6m18.012s
sys     0m8.377s

nemobis@dumps-2:~$ time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^:' ; time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^;'
981035

real    6m2.725s
user    6m11.815s
sys     0m7.804s
148563

real    6m6.082s
user    6m14.855s
sys     0m8.157s
wikitext-l@lists.wikimedia.org