It has been proposed, informally, that wikitext be modified to prefer,
and then eventually require, new markers for bold and italic text
inline.
Two suggested, and very similar, approaches are currently on the table:
//italics//, **bold**, //**bold** italics//
and
/italics/, *bold*, /*bold* italics/
Could you each please post your personal favorite hobby-horse counter
case which you feel would make parsing these constructs difficult, so we
can all pick it apart?
I'll start:
Everyone says that *bold* (i.e., using the single-character versions in
general) would conflict with the use of asterisks for list marking.
Seeing how big a problem this would actually be entails finding out how
many bold markings occur at the beginning of hard paragraphs, since
list items *must* be at the beginning of a hard paragraph, and then
determining how hard it would be to distinguish them.
I see three cases:
*List item
Easy: only one asterisk, beginning of graf. Obviously list item.
*Bold sentence.*
Also easy: an asterisk at the beginning of the graf is matched by one
that's just before whitespace. This one's probably the hardest of the
three, though: you have to look ahead a fair piece to find the matching
bold-off to be sure.
*list item with a *bold* word
Similarly easy; the bold word tags are matched. This one would be
harder if list items were regularly very long; in my experience,
they're not.
No, make that four:
The only thing that makes this difficult, as far as I can see, is if
you want to permit turning off bold mid-word, like this:
But can you really call it *truth*iness?
I know we probably permit that now, but it does deprive us of the
"bold-off is an asterisk followed by a \W token" rule that makes other
things easy.
So again: is "turning bold and italics off between two alphanumeric
characters" a thing which actually *happens*, much?
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra(a)baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com '87 e24
St Petersburg FL USA http://photo.imageinc.us +1 727 647 1274
brion(a)svn.wikimedia.org wrote:
> Revision: 27514
> Author: brion
> Date: 2007-11-15 04:24:49 +0000 (Thu, 15 Nov 2007)
>
> Log Message:
> -----------
> Revert r27151 -- allows session fixation attacks.
> Just get a user to visit a URL with the user ID and token you like in the query string (say, in an <img> referenced in a page you convince them to go to or post for their review) and their login session will be replaced with the one you provided.
>
I don't see how this is bad: you can try and trick another user into
doing something *logged in as you*, so it will appear as if *you* did
it. Why not do it yourself if it's gonna be logged under your name
anyway? Besides, this login session substitution only works for the API:
the UI completely ignores the lg* stuff, and your cookies aren't
overwritten. Now if this provided a way for a non-sysop to trick a sysop
into deleting an article, I would acknowledge that that's a security
issue. I don't see the issue here, however: the delete, or whatever it
is you're trying to do, is gonna be logged with the attacker's name,
checked against the attacker's permissions, etc. Also, the session is
not really a session: the API doesn't spit out any cookies outside of
the login module (that I'm aware of, anyway).
I fail to see the security issue here.
Roan Kattouw (Catrope)
Quick feature suggestion: a {{CURRENTUSER}} magic word, similar to the
{{PAGENAME}} magic word, which returns the username of the currently
logged-in user, or the IP address of a logged-out user. This would be
very useful for general per-user customisation of the site, delivering
user-centric information, etc., through templates.
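For instance (hypothetical, since no such magic word exists today), a
template could say:

Welcome back, {{CURRENTUSER}}! See
[[Special:Contributions/{{CURRENTUSER}}|your contributions]].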
Hi!
We have created a new mailing list, wikitext-l(a)wikimedia.org, and for
now have assigned Steve Bennett to handle its administration.
Please move wikitext- and parser-related discussions there, as they have
lately become off-topic.
This list would be better off seeing actual working (and better/faster)
parser implementations than theoretical discussions of something nobody
is working on anyway.
Best regards,
--
Domas Mituzas -- http://dammit.lt/ -- [[user:midom]]
On 11/14/07, brion(a)svn.wikimedia.org <brion(a)svn.wikimedia.org> wrote:
> # Default signatures for all languages. Do not duplicate to other languages
> -'signature' => '[[$1|$2]] ([[$3|$4]])', # default signature for registered users
> -'signature-ip' => '[[$1|$2]] ([[$3|$4]])', # default signature for anonymous users
> +'signature' => '[[User:$1|$2]]', # default signature for registered users
> +'signature-anon' => '[[Special:Contributions/$1|$2]]', # default signature for anonymous users
These now need to be localized, I guess, unless we're okay with
English special page names showing up by default in Chinese text or
whatnot. Of course they work, but that's hardly ideal. Do we have
some magic word to get a special page name, so we could change those
to {{subst:ns:user}} and {{subst:specialpage:contributions}} or
something? (I assume subst: will work okay here, since I think it
works in custom signatures.)
Try this:
[[Foo|<pre>Magic link!</pre>]]
(for extra points try and predict what will happen first...)
This one is less exciting, but also odd:
[[Foo|Not what I <gallery>Image:foo.jpg</gallery> expected...]]
Steve
We've been discussing the problem that everything is valid in Wikitext,
meaning that even the wild code noted by Steve Bennett:
> Try this:
>
> [[Foo|<pre>Magic link!</pre>]]
>
> (for extra points try and predict what will happen first...)
>
> This one is less exciting, but also odd:
>
> [[Foo|Not what I <gallery>Image:foo.jpg</gallery> expected...]]
...comes out as SOMETHING. As he noted, however, it's highly unexpected.
People (me included) have suggested that we output error messages instead
of unexpected results, but I have changed my mind, as that is clearly
user-unfriendly.
What I think we should do instead is just alter the parser to output things
more intuitively. This move would directly break current usage, BUT ONLY IN
CASES OF *UNEXPECTED OUTPUT*. The breakage would only be on the kinds of
output no one intends ANYWAY.
The kind of thing I'm talking about is rendering:
;Life, The Universe and Everything:
:Forty Two
correctly (i.e. a definition title with a colon at the end, followed by the
definition, rather than being followed by an empty definition). There are a
large number of other tweaks we could make which, if made, would only
improve rendering: they would eliminate the cases where unexpected
results arise.
Because ultimately that's what we're talking about. We want to eliminate
unexpected results, and we've suggested swapping them out for error
messages. Why not just swap them out for expected results?
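To make the definition-list case concrete, here's a minimal sketch of the
tweak (Python, illustrative only - not the real parser, which also has
rules for colons inside links and the like):

def parse_definition_line(line):
    # line starts with ';', e.g. ";Life, The Universe and Everything:"
    body = line[1:]
    pos = body.find(':')
    if pos == -1 or pos == len(body) - 1:
        # No colon, or only a trailing one: treat the colon as literal
        # text instead of opening an empty definition.
        return '<dt>%s</dt>' % body
    return '<dt>%s</dt><dd>%s</dd>' % (body[:pos], body[pos + 1:])

print(parse_definition_line(';Life, The Universe and Everything:'))
# -> <dt>Life, The Universe and Everything:</dt>
# ...with a following ':Forty Two' line supplying the <dd>.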
P.S. In the case of something like [[Foo:Bar|Baz<pre>Foo</pre>Bar]], the
syntax should either be treated as so wrong that it is output literally,
or, highly preferable, the parser should work out that the user is very
likely attempting the equivalent of the following:
[[Foo:Bar|Baz]]<pre>[[Foo:Bar|Foo]]</pre>[[Foo:Bar|Bar]]
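A throwaway sketch of that splitting (Python; it emits wikitext links
rather than HTML just to mirror the example above, and only knows about
<pre>):

import re

PRE = re.compile(r'(<pre>.*?</pre>)', re.S)

def distribute_link(target, label):
    # Split the label on <pre>...</pre> chunks and wrap each piece in
    # its own link, so block content never lands inside a link.
    parts = []
    for chunk in PRE.split(label):
        if not chunk:
            continue
        if PRE.match(chunk):
            inner = chunk[len('<pre>'):-len('</pre>')]
            parts.append('<pre>[[%s|%s]]</pre>' % (target, inner))
        else:
            parts.append('[[%s|%s]]' % (target, chunk))
    return ''.join(parts)

print(distribute_link('Foo:Bar', 'Baz<pre>Foo</pre>Bar'))
# -> [[Foo:Bar|Baz]]<pre>[[Foo:Bar|Foo]]</pre>[[Foo:Bar|Bar]]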
Hello,
On a page such as http://en.wikipedia.org/wiki/Die_Hard, you can see a
movie template which contains a lot of useful information, such as the
director's name, the stars, etc.
Is this information stored just as wiki markup in the article
table, or is some more meaningful structure stored in the database?
Thanks.
Howard.
What's the best way to approach parsing a long string of formatted text:
1) Treat each occurrence of ''' or '' as an element to be translated into
<b>, <i>, </b>, or </i>, using state ("context"?) to determine which
2) Have a rule that treats an entire run of '''........''' as a single
element, to be transformed into <b>.......</b>.
I'm not even considering the much-discussed ambiguities of apostrophes.
Assuming simple, possibly well-formed but at least not pathological input,
which way is best?
A lot of our assumptions about how to parse come from parsing programming
languages, but I can't think of an analogous programming language feature:
''' doesn't nest, so it's not like an if-block, and its contents have to
be parsed, so it's not like a comment. At best it seems vaguely like an
inline compiler flag: a #define/#undef in C, or an OPTION BASE statement
in VB, all of which clearly change state and don't require block
terminators.
The downside of 1) is that it seems to tie us to HTML, relying on an
external entity (the browser) to make sense of the begin/end tokens we
spit out. It also requires keeping track of state...
The downside of 2) is that it seems difficult to fail gracefully if there is no
closing token or if overlapping bold/italics are found. At best, a section
of text might have to be parsed twice. At worst, it will be much more
pedantic than our current parser, and will ignore improper bold/italics
altogether.
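To make the comparison concrete, here's a toy sketch of both approaches
(Python; bold only, everything simplified, names mine):

import re

def render_toggle(text):
    # Approach 1: each ''' toggles state; emit <b> or </b> as we go.
    out, bold, i = [], False, 0
    while i < len(text):
        if text.startswith("'''", i):
            out.append('</b>' if bold else '<b>')
            bold = not bold
            i += 3
        else:
            out.append(text[i])
            i += 1
    if bold:
        out.append('</b>')  # dangling bold: close it at the end
    return ''.join(out)

def render_run(text):
    # Approach 2: match a whole '''...''' run as one element; an
    # unpaired ''' is simply left alone.
    return re.sub(r"'''(.*?)'''", r'<b>\1</b>', text, flags=re.S)

print(render_toggle("a '''b''' c"))  # -> a <b>b</b> c
print(render_run("a '''b c"))        # -> a '''b c (untouched)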
Suggestions?
Steve
First of all, I have to admit I have not read all 50 emails, but here are
my two cents.
Most importantly, I think we should stop storing wikitext. Storing
wikitext makes it hard to change the syntax, because any change would
break pretty much every existing page. Wikitext is an ambiguous way of
storing 'what is meant'; XML is a clear way of doing this. And since the
text is compressed anyway, whether we store wikitext or XML does not
make that big of a difference in size.
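To illustrate (the element names are entirely made up, not a proposal
for an actual schema), something like:

'''Bold''' text with a [[Foo|link]]

might be stored as:

<para><bold>Bold</bold> text with a <link target="Foo">link</link></para>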
However, XML makes parsing much easier. Yes, it will need two steps, but
when regenerating the page from the database, it's much easier (no ugly
regexps, just a simple SAX parser). Besides, as a pywikipedia developer,
I'd like to have XML output ;)
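Here's a minimal sketch of what the XML->HTML step could look like with
xml.sax, reusing the made-up element names from above:

import xml.sax

TAGS = {'para': 'p', 'bold': 'b', 'italic': 'i'}

class XmlToHtml(xml.sax.ContentHandler):
    # Turn the hypothetical storage XML into HTML, element by element.
    def __init__(self):
        super().__init__()
        self.out = []
    def startElement(self, name, attrs):
        if name == 'link':
            self.out.append('<a href="/wiki/%s">' % attrs.getValue('target'))
        elif name in TAGS:
            self.out.append('<%s>' % TAGS[name])
    def endElement(self, name):
        if name == 'link':
            self.out.append('</a>')
        elif name in TAGS:
            self.out.append('</%s>' % TAGS[name])
    def characters(self, content):
        self.out.append(content)

handler = XmlToHtml()
xml.sax.parseString(
    b'<para><bold>Bold</bold> text with a '
    b'<link target="Foo">link</link></para>', handler)
print(''.join(handler.out))
# -> <p><b>Bold</b> text with a <a href="/wiki/Foo">link</a></p>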
Changing the storage format to XML (and updating the wikitext format at
the same time) means we need four important things: an 'old
wikitext'->XML converter, an XML->'good wikitext' converter, a 'good
wikitext'->XML converter and an XML->HTML parser (s/converter/parser/ if
you care about the exact words). The 'good wikitext' and HTML parsers
should be fairly easy; the first one is just plain hard.
I have tried to build a parser using standard systems, but I gave up and
built a basic lexer + parser by hand. It is by no means complete, and I
have not worked on it for some time, as Ping Yeh has a more complete
implementation. He was busy refactoring it, but with little time there
was little progress. My code is available at
http://svn.wikimedia.org/viewvc/pywikipedia/trunk/pywikiparser/ - feel
free to take a look at it :)
To summarize: we should switch to storing a much more descriptive format,
so that changes in the wikitext format do not break anything: the
wikitext can just be generated from the XML, in whichever format you
want. That means it could serve (cleaned-up) MediaWiki wikitext,
WikiCreole or many other syntaxes - per user. (Although as far as I can
see, WikiCreole isn't available as a context-free grammar either...)
--valhallasw
P.S.
Some people, when confronted with a problem, think `I know, I'll use
regular expressions.' Now they have two problems. --jwz