Gabriel Wicke, 16/03/2014 22:52:
> On 03/16/2014 01:49 PM, Federico Leva (Nemo) wrote:
>> How hard is it to implement? (Probably rather or very hard, but who knows.)
> In general this is pretty hard to implement cleanly without a DOM. In Parsoid we currently implement list handling on the token stream, but could probably move it to the DOM in the longer run. Doing the same in the PHP parser is harder, and might not be worth it.
> In any case there needs to be some analysis on how much existing wikitext would be affected by this. This can be done with a dump grepper (we have one in the parsoid repo).
I wasn't able to use that one, but I ran some simple counts, which I then forgot in a screen session and never posted.
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602
real 7m40.029s
user 7m24.904s
sys 0m14.621s
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344
real 8m7.467s
user 7m52.158s
sys 0m14.813s
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c '^[#*:;]+.*<(blockquote|span|div)( |>)((?!</\1).)*$' /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2588
real 7m45.508s
user 7m29.936s
sys 0m14.581s
nemobis@dumps-2:~$ time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^#' ; time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^*'
1655833
real 6m10.518s
user 6m10.935s
sys 0m8.141s
154647451
real 6m13.748s
user 6m18.012s
sys 0m8.377s
nemobis@dumps-2:~$ time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^:' ; time pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | grep -c -E '^;'
981035
real 6m2.725s
user 6m11.815s
sys 0m7.804s
148563
real 6m6.082s
user 6m14.855s
sys 0m8.157s
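
The four marker counts above each decompress the dump separately, at roughly six minutes per pass. As a minimal sketch (not part of the original session), the same tallies could probably be gathered in a single pass with awk. The function name count_list_markers is hypothetical. It compares the first character of each line literally, which also sidesteps the unescaped * in grep -E '^*': that pattern most likely matches every line, which would explain why 154647451 is so far above the other three counts.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and prints, for each wikitext
# list marker (# * : ;), the number of lines that start with it.
count_list_markers() {
    awk '
        { c = substr($0, 1, 1) }   # first character of the current line
        c == "#" { hash++ }
        c == "*" { star++ }        # literal comparison, no regex escaping needed
        c == ":" { colon++ }
        c == ";" { semi++ }
        END {
            # Unset counters print as 0 with %d.
            printf "#\t%d\n*\t%d\n:\t%d\n;\t%d\n", hash, star, colon, semi
        }
    '
}

# Assumed usage, with the same dump path as in the session above:
# pbzip2 -d -c /public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2 | count_list_markers
```

Like the grep commands, this only looks at the first character of each line in the raw dump, so it says nothing about nesting depth or about markers that follow the opening text tag on the first line of a page.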