[Foundation-l] Why is the software out of reach of the community?

Tim Starling tstarling at wikimedia.org
Tue Jan 13 16:13:59 UTC 2009


Brian wrote:
> Thanks for your answers.
> 
> ParserFunctions are my specific example of how the current development
> process is very, very broken, and out of touch with the community.
> According to Jimbo's user page (his bolded): "*Any changes to the software
> must be gradual and reversible.* We need to make sure that any changes
> contribute positively to the community, as ultimately determined by
> everybody in Wikipedia, in full consultation with the community consensus."
> 
> I believe that the introduction of ParserFunctions to MediaWiki was not done
> with community consensus and has led to an extremely fast devolution in
> wiki syntax. Further, the usability of Wikipedia has declined at a rate
> proportional to the adoption of parser functions.

The evolution of templates, and then ParserFunctions, was driven by
community demand and widely encouraged. I was concerned about
the usability implications of ParserFunctions, but the community
demonstrated its intent to ignore any usability concerns by implementing
complex templates, very similar to the ones seen today, using the
parameter default mechanism alone. Resistance to this trend seemed very weak.

The decline of usability in the template namespace has been driven by
technically-minded editors who are proud of their ability to make use of
an arcane and cryptic syntax to produce ever more complex feats of text
processing. This is an editorial issue and I cannot accept responsibility
for it.

However, I am aware that I enabled this process, by implementing the few
simple features that they needed. I regret my role in it. That's one of
the reasons why I've been resisting the constant community pressure to
enable StringFunctions, which I believe will lead to compiler-like
functionality implemented in the template namespace. Instead, I've been
trying to steer development in the direction of a readable embedded
programming language.

If you want a wiki with infoboxes (and I suppose I do, since I wrote one of
them in the pre-template era using an Excel VBA macro), then we need some
form of template feature. The problem with present-day parser functions is
that they are terribly ugly, excessively punctuated, dense to the point of
unreadability, and offer very little in the way of commenting and
self-documentation.

I believe that the solution to this problem lies in borrowing concepts
from software engineering, such as variables, functions, minimally
parenthesized programming languages, libraries, objects, etc. I know that
many template programmers cannot program in a traditional programming
language, but I have a feeling they could if they wanted to. I certainly
find PHP programming much easier than template programming, after a few
years of familiarity with both.

I'm also aware that most (non-template) Wikipedia editors have no desire
to learn how to program, and do not believe that it should be necessary in
the course of editing articles. I think that with enough development time,
a suitable platform in MediaWiki could connect these two types of editors.
For example there could be an easy-to-use form-based template invocation
generator, with forms written by the same technically minded editors who
write arcane templates today. Citations could be inserted into articles by
invoking a popup box and entering text into clearly labelled form fields.


From another post:
> We do not even have a parser. I am sure you know that MediaWiki does not
> actually parse. It is 5000 lines worth of regexes, for the most part.

"Parser" is a convenient and short name for it.

I've reviewed all of the regexes, and I stand by the vast majority of
them. The PCRE regular expression module is a versatile text scanning
language, which is compiled to bytecode and executed in a VM, very much
like PHP. It just so happens that for most text processing tasks where
there is a choice between PHP or PCRE, PCRE is faster. In certain special
cases, it's possible to gain extra performance by using primitive text
scanning functions like strpos() which are implemented in C. Where this is
possible, I have done so. But if you want to, say, find the first match
from a list of strings in a single subject, searching from a given offset,
then the fastest way to do it in standard PHP is a regex with the /S modifier.
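
To make that concrete (this is a standalone sketch, not MediaWiki's actual
code, and findFirst is just a name invented for the example):

<?php
// Find the first occurrence of any literal string from a list in a single
// subject, starting the search at a given byte offset.
function findFirst( array $needles, $subject, $offset ) {
    $quoted = array();
    foreach ( $needles as $needle ) {
        $quoted[] = preg_quote( $needle, '/' );
    }
    // One alternation; /S asks PCRE to "study" the compiled pattern, which
    // pays off when the same pattern is matched against many subjects.
    $pattern = '/' . implode( '|', $quoted ) . '/S';
    if ( preg_match( $pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset ) ) {
        return array( $m[0][0], $m[0][1] ); // matched string and its offset
    }
    return false;
}

// For a single literal needle, the C-implemented strpos() is the better tool.
$subject = 'text [[link]] {{tpl}} <!-- comment -->';
var_dump( strpos( $subject, '{{', 5 ) );                           // int(14)
var_dump( findFirst( array( '{{', '[[', '<!--' ), $subject, 5 ) ); // '[[' at 5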

In two cases, I found the available algorithms accessible from standard
PHP to be inconveniently slow, so I wrote the FSS and wikidiff2 extensions
in C and C++ respectively.

Perhaps, like so many computer science graduates, you are enamored with
the taxonomy of formal grammars and the parsers that go with them. There
are a number of problems with these traditional solutions.

Firstly, there are theoretical problems. The concept of a regular grammar
is not versatile enough to describe languages such as XML, and not
descriptive enough to allow unambiguous parse tree production from a
language like wikitext. It's trivial to invent irregular grammars which
can nonetheless be processed in linear time. My aims for wikitext, namely
that it be easy for humans to write but fast to convert to HTML, do not
coincide well with the taxonomy of formal grammars.
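
To illustrate that last point with a toy example (it is not wikitext's real
grammar): the language of properly balanced "{{" / "}}" pairs is not
regular, yet one left-to-right pass with a single counter decides
membership in linear time.

<?php
// Toy membership test: is every "{{" closed by a matching "}}"?
function isBalanced( $text ) {
    $depth = 0;
    $len = strlen( $text );
    for ( $i = 0; $i < $len; $i++ ) {
        $pair = substr( $text, $i, 2 );
        if ( $pair === '{{' ) {
            $depth++;
            $i++; // skip the second brace
        } elseif ( $pair === '}}' ) {
            if ( --$depth < 0 ) {
                return false; // a closer with no matching opener
            }
            $i++;
        }
    }
    return $depth === 0;
}

var_dump( isBalanced( '{{foo|{{bar}}}}' ) ); // bool(true)
var_dump( isBalanced( '{{foo}} }}' ) );      // bool(false)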

Secondly, there are practical problems. Past projects attempting to parse
wikitext using flex/bison or similar schemes have failed to achieve the
performance of the present parser, which is surprising because I didn't
think I was setting the bar very high. You can bet that if I ever rewrote
it in C++ myself, it would be much faster. The PHP compiler developers are
currently migrating the language's lexer from flex to a regex-based scanner
generator called re2c, mostly for performance reasons.

Thirdly, there is the fact that certain phases of MediaWiki's parser are
already very similar to the textbook parsers and can be analysed in those
terms. The main difference is that our parser is better optimised. For
example, the preprocessor acts like a recursive descent parser, but with a
non-recursive frontend (using an internal stack), a caching phase, and a
parse tree expansion phase with special-case recursive-to-iterative
transformations to minimise stack depth.
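
As a much-simplified sketch of that stack-based technique (this is not the
real preprocessor, and findTemplateSpans is an invented name), nested
"{{ ... }}" pairs can be scanned iteratively, with the parse state kept on
an explicit stack rather than in recursive calls:

<?php
// Scan for nested "{{ ... }}" template spans without recursion, so deeply
// nested input cannot exhaust the PHP call stack.
function findTemplateSpans( $text ) {
    $stack = array(); // byte offsets of currently open '{{'
    $spans = array(); // completed templates: start, end, nesting depth
    $len   = strlen( $text );
    for ( $i = 0; $i < $len - 1; $i++ ) {
        $pair = substr( $text, $i, 2 );
        if ( $pair === '{{' ) {
            $stack[] = $i; // push: a template opens here
            $i++;
        } elseif ( $pair === '}}' && count( $stack ) ) {
            $start   = array_pop( $stack );
            $spans[] = array(
                'start' => $start,
                'end'   => $i + 2,
                'depth' => count( $stack ) + 1,
            );
            $i++;
        }
    }
    // Innermost templates come first, the order a bottom-up expansion wants.
    return $spans;
}

print_r( findTemplateSpans( '{{outer|{{inner}}}}' ) ); // inner span, then outer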

Yet another post:
> I don't believe a computer scientist would have a huge problem writing 
> a proper parser. Are any of the core developers computer scientists?

Frankly, as an ex-physicist, I don't find the field of computer science
particularly impressive, either in terms of academic rigour or practical
applications. I think my time would be best spent working as a software
engineer for a cause that I believe in, rather than going back to
university and studying another socially-disconnected field.

-- Tim Starling




