Trevor,
Can we deconstruct the current parser's processing steps and build a set of rules that must be followed?
I think the commonly used structures are quite clearly defined, but the behaviour of these strange permutations is largely unspecified. The parser output for the case reported in the bug has already changed in the meantime.
I think we need to get a dump of English Wikipedia and start using a simple PEG parser to scan through it looking for patterns and figuring out how often certain things are used - if ever.
I just ran an en-wiki article dump through a zcat/tee/grep pipeline:
pattern             count      example
------------------------------------------------------------------
^                   548498738  (total number of lines)
^;                  681495
^;[^:]+:            153997     ; bla : blub
^[;:*#]+;[^:]+:     3817       *; bla : blub
^;;                 2332
^[:;*#]*;[^:]*::    41         most probably ;::
^[;:*#]*;[^:]+::    17         ;; bla :: blub
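Roughly, the one-pass fan-out looked something like this (from memory - the dump file name is a placeholder and the exact flags may have differed; it assumes bash for the process substitution):

    # count several patterns in a single pass over the dump;
    # each grep -c writes its line count to its own file,
    # while wc -l on the pass-through gives the total line count
    zcat enwiki-articles.gz | tee \
        >(grep -c  '^;'               > counts.semi  ) \
        >(grep -Ec '^;[^:]+:'         > counts.def   ) \
        >(grep -c  '^;;'              > counts.dsemi ) \
        >(grep -Ec '^[;:*#]*;[^:]+::' > counts.nested) \
      | wc -l > counts.total

tee plus process substitution means the dump is decompressed and read only once, however many patterns you count.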
Nested definition lists are not exactly common. Lines starting with ';;' often appear as comments in code listings; apart from that, the most common use seems to be indentation and emphasis. Any change to the produced structure that preserves indentation and bolding should thus avoid breaking pages.
Ward Cunningham had a setup that could do this sort of thing on a complete en-wiki dump in 10-15 minutes, and on a fraction of the dump (still tens of thousands of articles in size) in under a minute. We supposedly have access to him and his mad science laboratory - now would be a good time to get that going.
Will keep him in mind - we'll need to perform quite a few checks like these while tweaking the parser. A pipeline with two grep patterns and wc -l at the end ran in just under 6 minutes on my notebook, so it is actually quite doable. The JavaScript parser would take quite a bit longer though ;)
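For reference, the shape of that check (again with a placeholder dump file name) was along these lines:

    # first grep cheaply narrows to list-start lines,
    # the second applies the expensive pattern, wc -l counts survivors
    zcat enwiki-articles.gz \
      | grep '^[;:*#]' \
      | grep -E '^[;:*#]*;[^:]+::' \
      | wc -l

The cheap first grep throws away everything that is not a list line before the more expensive pattern runs.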
Cheers,
Gabriel