It has been proposed, informally, that wikitext be modified to prefer,
and then eventually require, new markers for bold and italic text
inline.
Two suggested, and very similar, approaches are currently on the table:
//italics//, **bold**, //**bold** italics//
and
/italics/, *bold*, /*bold* italics/
Could you each please post your personal favorite hobby-horse counter
case which you feel would make parsing these constructs difficult, so we
can all pick it apart?
I'll start:
Everyone says that *bold* (i.e., using the single-character versions in
general) would conflict with the use of asterisks for list marking.
Seeing how big a problem this would actually be entails finding out how
many bold markings occur at the beginning of hard paragraphs, since
list items *must* be at the beginning of a hard paragraph, and then
determining how hard it would be to distinguish them.
I see three cases:
*List item
Easy: only one asterisk, beginning of graf. Obviously list item.
*Bold sentence.*
Also easy: an asterisk at the beginning of the graf is matched by one
that's just before whitespace. This one's probably the hardest of the
three, though: you have to look ahead a fair piece to find the matching
bold-off to be sure.
*list item with a *bold* word
Similarly easy; the bold word tags are matched. This one would be
harder if list items were regularly very long; in my experience,
they're not.
No, make that four:
The only thing that makes this difficult, as far as I can see, is if
you want to permit turning off bold mid-word, like this:
But can you really call it *truth*iness?
I know we probably permit that now, but it does deprive us of the
"bold-off is an asterisk followed by a \W token" rule that makes other
things easy.
So again: is "turning bold and italics off between two alphanumeric
characters" a thing which actually *happens*, much?
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra(a)baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com '87 e24
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274
brion(a)svn.wikimedia.org wrote:
> Revision: 27514
> Author: brion
> Date: 2007-11-15 04:24:49 +0000 (Thu, 15 Nov 2007)
>
> Log Message:
> -----------
> Revert r27151 -- allows session fixation attacks.
> Just get a user to visit a URL with the user ID and token you like in the query string (say, in an <img> referenced in a page you convince them to go to or post for their review) and their login session will be replaced with the one you provided.
>
I don't see how this is bad: you can try and trick another user into
doing something *logged in as you*, so it will appear as if *you* did
it. Why not do it yourself if it's gonna be logged under your name
anyway? Besides, this login session substitution only works for the API:
the UI completely ignores the lg* stuff, and your cookies aren't
overwritten. Now if this provided a way for a non-sysop to trick a sysop
into deleting an article, I would acknowledge that that's a security
issue. I don't see the issue here, however: the delete, or whatever it
is you're trying to do, is gonna be logged with the attacker's name,
checked against the attacker's permissions, etc. Also, the session is
not really a session: the API doesn't spit out any cookies outside of
the login module (that I'm aware of, anyway).
I fail to see the security issue here.
Roan Kattouw (Catrope)
Quick feature suggestion: a {{CURRENTUSER}} magic word, similar to the
{{PAGENAME}} magic word, which returns the username of the currently
logged-in user, or the IP address of a logged-out user. This would be
very useful for general per-user customisation of the site, delivering
user-centric information, etc., through templates.
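For instance (hypothetical, since no such magic word exists today), a
template could say:

Welcome back, {{CURRENTUSER}}! See
[[Special:Contributions/{{CURRENTUSER}}|your contributions]].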
Hi!
We have created a new mailing list, wikitext-l(a)wikimedia.org, and for
now have assigned Steve Bennett to handle its administration.
Please move wikitext- and parser-related discussions there, as they have
lately become off-topic.
This list would be better off seeing actual working (and better/faster)
parser implementations than theoretical discussions of something nobody
is working on anyway.
Best regards,
--
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]
On 11/14/07, brion(a)svn.wikimedia.org <brion(a)svn.wikimedia.org> wrote:
> # Default signatures for all languages. Do not duplicate to other languages
> -'signature' => '[[$1|$2]] ([[$3|$4]])', # default signature for registered users
> -'signature-ip' => '[[$1|$2]] ([[$3|$4]])', # default signature for anonymous users
> +'signature' => '[[User:$1|$2]]', # default signature for registered users
> +'signature-anon' => '[[Special:Contributions/$1|$2]]', # default signature for anonymous users
These now need to be localized, I guess, unless we're okay with
English special page names showing up by default in Chinese text or
whatnot. Of course they work, but that's hardly ideal. Do we have
some magic word to get a special page name, so we could change those
to {{subst:ns:user}} and {{subst:specialpage:contributions}} or
something? (I assume subst: will work okay here, since I think it
works in custom signatures.)
Try this:
[[Foo|<pre>Magic link!</pre>]]
(for extra points try and predict what will happen first...)
This one is less exciting, but also odd:
[[Foo|Not what I <gallery>Image:foo.jpg</gallery> expected...]]
Steve
We've been discussing the problem that everything is valid in Wikitext,
meaning that even the wild code noted by Steve Bennett:
> Try this:
>
> [[Foo|<pre>Magic link!</pre>]]
>
> (for extra points try and predict what will happen first...)
>
> This one is less exciting, but also odd:
>
> [[Foo|Not what I <gallery>Image:foo.jpg</gallery> expected...]]
...comes out as SOMETHING. As he noted, however, it's highly unexpected.
People (me included) have suggested that we output error messages instead
of unexpected results, but I have changed my mind, as that is clearly
user-unfriendly.
What I think we should do instead is just alter the parser to output things
more intuitively. This move would directly break current usage, BUT ONLY IN
CASES OF *UNEXPECTED OUTPUT*. The breakage would only be on the kinds of
output no one intends ANYWAY.
The kind of thing I'm talking about is rendering:
;Life, The Universe and Everything:
:Forty Two
correctly (i.e. a definition title with a colon at the end, followed by the
definition, rather than being followed by an empty definition). There are a
large number of other tweaks we could make which, if made, would only
improve rendering: they would eliminate the cases where unexpected
results arise.
Because ultimately that's what we're talking about. We want to eliminate
unexpected results, and we've suggested swapping them out for error
messages. Why not just swap them out for expected results?
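To make the definition-list case concrete, here's a minimal sketch of the
tweak (Python, illustrative only - not the real parser, which also has
rules for colons inside links and the like):

def parse_definition_line(line):
    # line starts with ';', e.g. ";Life, The Universe and Everything:"
    body = line[1:]
    pos = body.find(':')
    if pos == -1 or pos == len(body) - 1:
        # No colon, or only a trailing one: treat the colon as literal
        # text instead of opening an empty definition.
        return '<dt>%s</dt>' % body
    return '<dt>%s</dt><dd>%s</dd>' % (body[:pos], body[pos + 1:])

print(parse_definition_line(';Life, The Universe and Everything:'))
# -> <dt>Life, The Universe and Everything:</dt>
# ...with a following ':Forty Two' line supplying the <dd>.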
P.S. In the case of something like [[Foo:Bar|Baz<pre>Foo</pre>Bar]], the
syntax should either be treated as so wrong that it is output literally,
or, highly preferable, the parser should work out that the user is very
likely attempting the equivalent of the following:
[[Foo:Bar|Baz]]<pre>[[Foo:Bar|Foo]]</pre>[[Foo:Bar|Bar]]
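A throwaway sketch of that splitting (Python; it emits wikitext links
rather than HTML just to mirror the example above, and only knows about
<pre>):

import re

PRE = re.compile(r'(<pre>.*?</pre>)', re.S)

def distribute_link(target, label):
    # Split the label on <pre>...</pre> chunks and wrap each piece in
    # its own link, so block content never lands inside a link.
    parts = []
    for chunk in PRE.split(label):
        if not chunk:
            continue
        if PRE.match(chunk):
            inner = chunk[len('<pre>'):-len('</pre>')]
            parts.append('<pre>[[%s|%s]]</pre>' % (target, inner))
        else:
            parts.append('[[%s|%s]]' % (target, chunk))
    return ''.join(parts)

print(distribute_link('Foo:Bar', 'Baz<pre>Foo</pre>Bar'))
# -> [[Foo:Bar|Baz]]<pre>[[Foo:Bar|Foo]]</pre>[[Foo:Bar|Bar]]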
Hello,
On a page such as http://en.wikipedia.org/wiki/Die_Hard, you can see a
movie template which contains a lot of useful information, such as the
director's name, the stars, etc.
Is this information stored just as wiki markup in the article
table, or is some more meaningful structure stored in the database?
Thanks.
Howard.
What's the best way to approach parsing a long string of formatted text:
1) Treat each occurrence of ''' or '' as an element to be translated into
<b>, <i>, </b>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single
element, to be transformed into <b>.......</b>.
I'm not even considering the much-discussed ambiguities of apostrophes.
Assuming simple, possibly well-formed but at least not pathological input,
which way is best?
A lot of our assumptions about how to parse come from parsing programming
languages, but I can't think of an analogous programming language feature:
''' doesn't nest, so it's not like an if-block, and its contents have to
be parsed, so it's not like a comment. At best it seems vaguely like an
inline compiler flag: a #define/#undef in C, or an OPTION BASE statement
in VB, all of which clearly change state and don't require block
terminators.
The downside of 1) is that it seems to tie us to HTML, relying on an
external entity (the browser) to make sense of the begin/end tokens we
spit out. It also requires keeping track of state...
The downside of 2) is that it seems difficult to fail gracefully if there is no
closing token or if overlapping bold/italics are found. At best, a section
of text might have to be parsed twice. At worst, it will be much more
pedantic than our current parser, and will ignore improper bold/italics
altogether.
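To make the comparison concrete, here's a toy sketch of both approaches
(Python; bold only, everything simplified, names mine):

import re

def render_toggle(text):
    # Approach 1: each ''' toggles state; emit <b> or </b> as we go.
    out, bold, i = [], False, 0
    while i < len(text):
        if text.startswith("'''", i):
            out.append('</b>' if bold else '<b>')
            bold = not bold
            i += 3
        else:
            out.append(text[i])
            i += 1
    if bold:
        out.append('</b>')  # dangling bold: close it at the end
    return ''.join(out)

def render_run(text):
    # Approach 2: match a whole '''...''' run as one element; an
    # unpaired ''' is simply left alone.
    return re.sub(r"'''(.*?)'''", r'<b>\1</b>', text, flags=re.S)

print(render_toggle("a '''b''' c"))  # -> a <b>b</b> c
print(render_run("a '''b c"))        # -> a '''b c (untouched)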
Suggestions?
Steve
First of all, I have to admit I have not read all 50 emails, but here are
my two cents.
Most importantly, I think we should stop storing wikitext. Storing
wikitext makes it hard to change the syntax, because any change would
break pretty much every existing page. Wikitext is an ambiguous way of
storing 'what is meant'; XML is a clear way of doing this. And since the
text is compressed anyway, whether we store wikitext or XML does not
make that big of a difference in size.
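To illustrate (the element names are entirely made up, not a proposal
for an actual schema), something like:

'''Bold''' text with a [[Foo|link]]

might be stored as:

<para><bold>Bold</bold> text with a <link target="Foo">link</link></para>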
However, XML makes parsing much easier. Yes, it will need two steps, but
when regenerating the page from the database, it's much easier (no ugly
regexps, just a simple SAX parser). Besides, as a pywikipedia developer,
I'd like to have XML output ;)
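Here's a minimal sketch of what the XML->HTML step could look like with
xml.sax, reusing the made-up element names from above:

import xml.sax

TAGS = {'para': 'p', 'bold': 'b', 'italic': 'i'}

class XmlToHtml(xml.sax.ContentHandler):
    # Turn the hypothetical storage XML into HTML, element by element.
    def __init__(self):
        super().__init__()
        self.out = []
    def startElement(self, name, attrs):
        if name == 'link':
            self.out.append('<a href="/wiki/%s">' % attrs.getValue('target'))
        elif name in TAGS:
            self.out.append('<%s>' % TAGS[name])
    def endElement(self, name):
        if name == 'link':
            self.out.append('</a>')
        elif name in TAGS:
            self.out.append('</%s>' % TAGS[name])
    def characters(self, content):
        self.out.append(content)

handler = XmlToHtml()
xml.sax.parseString(
    b'<para><bold>Bold</bold> text with a '
    b'<link target="Foo">link</link></para>', handler)
print(''.join(handler.out))
# -> <p><b>Bold</b> text with a <a href="/wiki/Foo">link</a></p>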
Changing the storage format to XML (and updating the wikitext format at
the same time) means we need four important things: an 'old
wikitext'->XML converter, an XML->'good wikitext' converter, a 'good
wikitext'->XML converter and an XML->HTML parser (s/converter/parser/ if
you care about the exact words). The 'good wikitext' and HTML parsers
should be fairly easy; the first one is just plain hard.
I have tried to build a parser using standard systems, but I gave up and
built a basic lexer + parser by hand. It is by no means complete, and I
have not worked on it for some time, as Ping Yeh has a more complete
implementation. He was busy refactoring it, but with little time there
was little progress. My code is available at
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/ - feel
free to take a look at it :)
To summarize: we should switch to storing a much more descriptive format,
so that changes in the wikitext format do not break anything: the
wikitext can just be generated from the XML, in whichever format you
want. That means it could serve (cleaned-up) MediaWiki wikitext,
WikiCreole or many other syntaxes - per user. (Although as far as I can
see, WikiCreole isn't available as a context-free grammar either...)
--valhallasw
P.S.
Some people, when confronted with a problem, think `I know, I'll use
regular expressions.' Now they have two problems. --jwz