Gabriel Wicke, 16/03/2014 22:52:
> On 03/16/2014 01:49 PM, Federico Leva (Nemo) wrote:
>> How hard is it to implement? (Probably rather or
>> very hard, but who knows.)
> In general this is pretty hard to implement cleanly without a DOM. In
> Parsoid we currently implement list handling on the token stream, but
> could probably move it to the DOM in the longer run. Doing the same in
> the PHP parser is harder, and might not be worth it.
> In any case there needs to be some analysis on how much existing
> wikitext would be affected by this. This can be done with a dump grepper
> (we have one in the parsoid repo).
I wasn't able to use that one, but I ran some simple counts on the
itwiki dump; I had left them in a screen session and forgot to post them.
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c
'^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$'
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2602
real 7m40.029s
user 7m24.904s
sys 0m14.621s
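For readers decoding the pattern: it flags list lines (starting with `#`, `*`, `:` or `;`) that open one of the listed tags and do not close it on the same line. The `((?!</\1).)*$` part only reaches the end of the line if no matching closing tag appears after the opening one. Below is a small sketch on made-up sample lines, assuming a grep built with PCRE support (the same `--perl-regexp` option used above). The helper names `sample` and `count_unclosed` are hypothetical.

```shell
# Pattern copied from the first bzgrep command above.
pattern='^[#*:;]+.*<(blockquote|span|div|pre)( |>)((?!</\1).)*$'

# Hypothetical sample lines, not taken from the dump.
sample() {
  printf '%s\n' \
    '* item with <div class="x"> left open' \
    '* item with <span>closed</span> inline' \
    '# numbered item <pre>' \
    'plain paragraph with <div> left open'
}

# Count matching lines; needs a grep with PCRE support (-P).
count_unclosed() {
  grep -c -P "$pattern"
}

# Only the first and third sample lines are list items with an unclosed tag.
sample | count_unclosed
```

The second line is skipped because its tag is closed on the same line, and the fourth because it is not a list item.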
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c
'^[#*:;]+.*<(blockquote|span|div|pre|center|code|del|b|em|i|u|font|s|small|strike|strong)( |>)((?!</\1).)*$'
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
7344
real 8m7.467s
user 7m52.158s
sys 0m14.813s
nemobis@dumps-2:~$ time bzgrep --perl-regexp -c
'^[#*:;]+.*<(blockquote|span|div)( |>)((?!</\1).)*$'
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
2588
real 7m45.508s
user 7m29.936s
sys 0m14.581s
nemobis@dumps-2:~$ time pbzip2 -d -c
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
| grep -c -E '^#' ; time pbzip2 -d -c
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
| grep -c -E '^*'
1655833
real 6m10.518s
user 6m10.935s
sys 0m8.141s
154647451
real 6m13.748s
user 6m18.012s
sys 0m8.377s
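A caveat on the second figure: 154647451 is nearly a hundred times the `#` count, which suggests that the unescaped `'^*'` did not match a literal asterisk. With GNU grep -E, a `*` directly after `^` is most likely treated as a repetition operator that can match nothing, so every line of the dump gets counted. A minimal sketch of a literal match, using a bracket expression and a few made-up sample lines (the helper names are hypothetical):

```shell
# Hypothetical sample lines standing in for the decompressed dump.
sample() {
  printf '%s\n' \
    '* first bullet' \
    '** nested bullet' \
    '# numbered item' \
    'plain text with a * in the middle' \
    ''
}

# [*] is a bracket expression, so the asterisk is always literal.
count_star_lines() {
  grep -c -E '^[*]'
}

# Two of the five sample lines start with an asterisk.
sample | count_star_lines
```

On the real dump the same pattern would replace `'^*'` in the pbzip2 pipeline above.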
nemobis@dumps-2:~$ time pbzip2 -d -c
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
| grep -c -E '^:' ; time pbzip2 -d -c
/public/dumps/public/itwiki/20140302/itwiki-20140302-pages-articles.xml.bz2
| grep -c -E '^;'
981035
real 6m2.725s
user 6m11.815s
sys 0m7.804s
148563
real 6m6.082s
user 6m14.855s
sys 0m8.157s
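Each of the four prefix counts above decompresses the dump again, at roughly six minutes per pass. A rough sketch of collecting all four in one pass with awk follows. The function name `count_list_prefixes` is made up for illustration, and it reads the decompressed text from standard input.

```shell
# Tally lines by their first character in a single pass over stdin.
count_list_prefixes() {
  awk '
    { c = substr($0, 1, 1) }
    c == "#" { hash++ }
    c == "*" { star++ }
    c == ":" { colon++ }
    c == ";" { semi++ }
    END { printf "# %d\n* %d\n: %d\n; %d\n", hash, star, colon, semi }
  '
}

# Tiny made-up input; on the real dump the function would be fed by
# "pbzip2 -d -c" on the same file as in the commands above.
printf '%s\n' '# a' '* b' '** c' ': d' '; e' 'text' | count_list_prefixes
```

Comparing the first character as a plain string avoids regex escaping for `*` altogether.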