Hi all,
First of all, I'm sending a big thank you to everyone who contributes code to MediaWiki. You all make this social revolution possible.
Based on Jimmy Wales' Wikimania keynote and other sources, I know that much of the community recognizes WYSIWYG editing as an important feature for enabling greater participation. There's a comment from Brion, however, on the discussion page of the MediaWiki roadmap http://www.mediawiki.org/wiki/MediaWiki_roadmap that it isn't in the developers' immediate plans.
A page on Meta http://meta.wikimedia.org/wiki/WYSIWYG_editor lists several experiments and implementation ideas, but is anyone seriously tackling this problem? Should we take up a collection to hire a developer to work on this? Or maybe offer an X PRIZE-style bounty for a successful implementation? I'm in for $25!
Sincerely, Matthew Simoneau http://www.matthewsim.com/
Matthew Simoneau wrote:
Hi all,
First of all, I'm sending a big thank you to everyone who contributes code to MediaWiki. You all make this social revolution possible.
Based on Jimmy Wales' Wikimania keynote and other sources, I know that much of the community recognizes WYSIWYG editing as an important feature for enabling greater participation. There's a comment from Brion, however, on the discussion page of the MediaWiki roadmap http://www.mediawiki.org/wiki/MediaWiki_roadmap that it isn't in the developers' immediate plans.
A page on Meta http://meta.wikimedia.org/wiki/WYSIWYG_editor lists several experiments and implementation ideas, but is anyone seriously tackling this problem? Should we take up a collection to hire a developer to work on this? Or maybe offer an X PRIZE-style bounty for a successful implementation? I'm in for $25!
well, we recently started to integrate Yulup (http://www.yulup.org/), or rather the Neutron protocol (http://neutron.wyona.org/), which I think would make editing really simple and would also allow many other editors to be connected without much effort, but our efforts stalled because of our unfamiliarity with the MediaWiki code.
I would be very happy to restart, but would need some help from someone familiar with the MediaWiki code (instead of money ;-). I think for someone experienced it might take 1-2 days (also see http://neutron.wyona.org/#getting_started).
Cheers
Michael
Sincerely, Matthew Simoneau http://www.matthewsim.com/
Matthew Simoneau wrote:
A page on Meta http://meta.wikimedia.org/wiki/WYSIWYG_editor lists several experiments and implementation ideas, but is anyone seriously tackling this problem? Should we take up a collection to hire a developer to work on this? Or maybe offer an X PRIZE-style bounty for a successful implementation? I'm in for $25!
These are dead ends; don't waste your time or money on anything that tries to tack on an HTML editor widget and requires translating back and forth (an inherently lossy operation which will never work correctly over thousands of edits of an article's lifetime).
Something like WikiWizard is more interesting (http://jspwiki.org/wiki/WikiWizard; the site seems down at the moment).
A proper WYSIWYG system either needs to be built around the wiki markup directly -- in which case you need a proper definition of the markup and a standard, clean parser if you want to do anything more than the syntax highlighting and similar niceties that WikiWizard does -- or else the markup needs to be built around the editor -- such as dumping the wiki syntax entirely and using an HTML-based markup directly. (I believe a number of the 'intranet'-oriented wiki-like systems do this.)
We're not likely to dump our markup in the next few years. Nothing else is likely to happen without the markup being formalized.
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
Brion Vibber wrote:
We're not likely to dump our markup in the next few years. Nothing else is likely to happen without the markup being formalized.
I think it would make a lot of sense to standardize the wiki syntax and maybe also to introduce an XML representation of wikitext (though this might be more difficult than it seems).
It seems to me that someone who is considered an authority within the wiki world has to take the lead on this. It's silly, but that seems to be how people tick ...
Cheers
Michael
On Fri, 2007-02-09 at 11:59 -0800, Brion Vibber wrote:
These are dead ends; don't waste your time or money on anything that tries to tack on an HTML editor widget and requires translating back and forth (an inherently lossy operation which will never work correctly over thousands of edits of an article's lifetime).
This is distressing news. Does this mean that we're not doing Wikiwyg?
Are Wikia and SocialText still working on that project?
~Evan
________________________________________________________________________ Evan Prodromou evan@prodromou.name http://evan.prodromou.name/
Evan Prodromou wrote:
On Fri, 2007-02-09 at 11:59 -0800, Brion Vibber wrote:
These are dead ends; don't waste your time or money on anything that tries to tack on an HTML editor widget and requires translating back and forth (an inherently lossy operation which will never work correctly over thousands of edits of an article's lifetime).
This is distressing news. Does this mean that we're not doing Wikiwyg?
Are Wikia and SocialText still working on that project?
Last I heard they gave it up and are sponsoring another, apparently identical project. I don't expect it to fare any better.
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
On 11/02/07, Brion Vibber brion@pobox.com wrote:
Last I heard they gave it up and are sponsoring another, apparently identical project. I don't expect it to fare any better.
Such beautiful cynicism, so poetic, so carefree.
Probably, if anybody ever wants this kind of functionality done, we need to direct them to start helping us define the parser behaviour. I say this, but of course, defining the behaviour of the behemoth we have now is a task rather akin to removing all the sand from the Sahara.
Rob Church
Rob Church wrote:
On 11/02/07, Brion Vibber brion@pobox.com wrote:
Last I heard they gave it up and are sponsoring another, apparently identical project. I don't expect it to fare any better.
Such beautiful cynicism, so poetic, so carefree.
Probably, if anybody ever wants this kind of functionality done, we need to direct them to start helping us define the parser behaviour.
Well I've been advocating that since the second I heard of such projects. If they start doing it, let me know. ;)
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
On 2/11/07, Brion Vibber brion@pobox.com wrote:
Rob Church wrote:
Probably, if anybody ever wants this kind of functionality done, we need to direct them to start helping us define the parser behaviour.
Well I've been advocating that since the second I heard of such projects. If they start doing it, let me know. ;)
I think a lot of people have *started* doing it. It's *finishing* that's the tricky bit. :P
Simetrical wrote:
On 2/11/07, Brion Vibber brion@pobox.com wrote:
Rob Church wrote:
Probably, if anybody ever wants this kind of functionality done, we need to direct them to start helping us define the parser behaviour.
Well I've been advocating that since the second I heard of such projects. If they start doing it, let me know. ;)
I think a lot of people have *started* doing it. It's *finishing* that's the tricky bit. :P
Ok, if they *stop* doing it let me know! ;)
Seriously, though, we should all get together and make a concerted effort.
I would recommend making a start by looking at the XML structure used by Magnus' conversion tools; the question is whether we need to adapt that further to integrate properly, or whether we can just poke around with it as-is.
Having a clear data model is a definite must; making the parser for it clean and fast can follow. :)
- -- brion vibber (brion @ pobox.com / brion @ wikimedia.org)
On 2/11/07, Brion Vibber brion@pobox.com wrote:
Seriously, though, we should all get together and make a concerted effort.
I would recommend making a start by looking at the XML structure used by Magnus' conversion tools; if we need to adapt that further to integrate properly or if we can just poke around with that?
Having a clear data model is a definite must; making the parser for it clean and fast can follow. :)
I stopped working on the XML thing because of time issues ;-) and one particularly nasty problem. Currently, wiki2xml needs to replace templates while parsing. What it *should* do is optionally offer a way to just put in a template placeholder XML tag, consisting of the template name and passed values. However, as our wiki code basically allows templates everywhere, this will easily break the other XML that is generated.
However, it worked OK (last time I checked) with most of our wiki code except conditional/calculation template functions.
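To make the placeholder idea concrete, here is a rough Python sketch using an invented schema (this is not wiki2xml's actual format; the element and attribute names are assumptions for illustration):

    import xml.etree.ElementTree as ET

    # Hypothetical placeholder for an unexpanded template call: just
    # the template name and the passed values, kept as raw wikitext.
    def template_placeholder(name, args):
        tmpl = ET.Element("template", name=name)
        for key, value in args.items():
            arg = ET.SubElement(tmpl, "arg", name=key)
            arg.text = value  # may itself contain wikitext, even templates
        return tmpl

    placeholder = template_placeholder(
        "Infobox country", {"name": "France", "capital": "[[Paris]]"})
    print(ET.tostring(placeholder, encoding="unicode"))
    # <template name="Infobox country"><arg name="name">France</arg>
    # <arg name="capital">[[Paris]]</arg></template>

The tricky part described above is exactly the argument values: because templates can occur anywhere, a value may contain markup that cuts across the element structure of the surrounding XML.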
Magnus
Magnus Manske wrote:
However, it worked OK (last time I checked) with most of our wiki code except conditional/calculation template functions.
I'd recommend modeling those on XSLT's xsl:if and xsl:choose.
Simetrical wrote:
On 2/11/07, Brion Vibber brion@pobox.com wrote:
Rob Church wrote:
Probably, if anybody ever wants this kind of functionality done, we need to direct them to start helping us define the parser behaviour.
Well I've been advocating that since the second I heard of such projects. If they start doing it, let me know. ;)
I think a lot of people have *started* doing it. It's *finishing* that's the tricky bit. :P
As one of the many people who's done so, I agree. :) The problem is that ~80% of wikimarkup is pretty straightforward to parse using standard methods, another 10-15% can be done without huge difficulty using known-but-less-standard methods, and the remaining 5% doesn't fit well at all into any of the normal models of lexing/parsing. And I'm not even talking about the mess with template substitution, which is a whole different can of worms. Relatively recent improvements in tool infrastructure, like Bison adding GLR parsing, help somewhat, but the last few percent is still a tough nut to crack. It's annoying and tricky enough work that I think it will only get done if: 1) someone is paid to do it; or 2) someone can find a way to work it into a school project.
-Mark
Probably, if anybody ever wants this kind of functionality done, we need to direct them to start helping us define the parser behaviour.
Well I've been advocating that since the second I heard of such projects. If they start doing it, let me know. ;)
I think a lot of people have *started* doing it. It's *finishing* that's the tricky bit. :P
As one of the many people who's done so, I agree. :) The problem is that ~80% of wikimarkup is pretty straightforward to parse using standard methods, another 10-15% can be done without huge difficulty using known-but-less-standard methods, and the remaining 5% doesn't fit well at all into any of the normal models of lexing/parsing.
[...snip...]
-Mark
Can I suggest giving some examples that you encountered of the 10-15% hard category, and of the 5% very hard category?
I ask so that if anyone feels tempted to start on defining the behaviour, we can gently suggest doing the harder stuff *first* (with examples), thus hopefully preventing the situation where we have multiple unfinished 80%-done definitions, and no 100%-complete formal definitions.
All the best, Nick.
Nick Jenkins wrote:
As one of the many people who's done so, I agree. :) The problem is that ~80% of wikimarkup is pretty straightforward to parse using standard methods, another 10-15% can be done without huge difficulty using known-but-less-standard methods, and the remaining 5% doesn't fit well at all into any of the normal models of lexing/parsing.
[...snip...]
-Mark
Can I suggest giving some examples that you encountered of the 10-15% hard category, and of the 5% very hard category?
I ask so that if anyone feels tempted to start on defining the behaviour, we can gently suggest doing the harder stuff *first* (with examples), thus hopefully preventing the situation where we have multiple unfinished 80%-done definitions, and no 100%-complete formal definitions.
All the best, Nick.
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the second reads <b><i>hi</i> hello</b>. The problem is that the meaning of the first 5 quotes changes based on the order in which the bold and italic regions close - which is not determined while scanning left-to-right.
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>, properly handling overlapping formatting.
There are ways to deal with these... putting off the resolution until a later pass is the only way I know of that deals with the first one, and it's a bit touchy. Manageable, but touchy.
If I have time this summer, I'm going to look at formalizing a parser again... see if I can make a start on hammering out a somewhat more formal structure for handling some of the tougher cases, after I've tackled OLPC's CrossMark.
- Eric Astor
Eric Astor wrote:
There are ways to deal with these... putting off the resolution until a later pass is the only way I know of that deals with the first one, and it's a bit touchy. Manageable, but touchy.
Clarification: The cases I mentioned can't be handled by a context-free parser - but both of them are parseable if ''' is parsed as a TOGGLE-BOLD item, and '' is handled similarly as a TOGGLE-ITALICS. That lets you escape the context-sensitivity, and handle the rendering properly on the output end while scanning left-to-right.
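Here is a minimal sketch of that toggle idea in Python (illustrative only, and deliberately simpler than whatever MediaWiki actually does; the nesting order it picks when ''''' opens both regions is an arbitrary assumption):

    import re

    # ''''' toggles both, ''' toggles bold, '' toggles italics.
    # Overlapping regions are repaired at emit time by closing and
    # reopening tags, so the output is always well nested, though
    # the nesting order may differ from MediaWiki's actual choice.
    TOKENS = re.compile(r"'{5}|'{3}|'{2}|'|[^']+")

    def render(wikitext):
        stack, out = [], []  # stack of currently open tags

        def toggle(tag):
            if tag not in stack:  # opening
                stack.append(tag)
                out.append("<%s>" % tag)
            else:  # closing: repair overlap if tag is not on top
                reopen = []
                while stack[-1] != tag:
                    reopen.append(stack.pop())
                    out.append("</%s>" % reopen[-1])
                stack.pop()
                out.append("</%s>" % tag)
                for t in reversed(reopen):
                    stack.append(t)
                    out.append("<%s>" % t)

        for tok in TOKENS.findall(wikitext):
            if tok == "'''''":
                toggle("i"); toggle("b")
            elif tok == "'''":
                toggle("b")
            elif tok == "''":
                toggle("i")
            else:
                out.append(tok)
        for t in reversed(stack):  # close anything left open
            out.append("</%s>" % t)
        return "".join(out)

    print(render("'''''hello''' hi''"))
    # <i><b>hello</b> hi</i>
    print(render("'''hello ''hi''' there''"))
    # <b>hello <i>hi</i></b><i> there</i>

Both of the examples from earlier in the thread come out with the expected visual result while scanning strictly left to right.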
- Eric
On 2/11/07, Eric Astor eastor1@swarthmore.edu wrote:
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the second reads <b><i>hi</i> hello</b>. The problem is that the meaning of the first 5 quotes changes based on the order in which the bold and italic regions close - which is not determined while scanning left-to-right.
This is where we could redefine the behavior slightly. Have ''''' always be <b><i>. Then, if ''' occurs first, output </i></b><i>. On the other hand, from what you say next, I'm not sure that will help.
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>, properly handling overlapping formatting.
There are ways to deal with these... putting off the resolution until a later pass is the only way I know of that deals with the first one, and it's a bit touchy. Manageable, but touchy.
Well, we could just output invalid XML for both these cases and then fix it in the Sanitizer/Tidy pass, I guess. In some clearly defined manner, of course, perhaps stated informally in the grammar, or formally as a separate non-parsing algorithm.
Simetrical wrote:
This is where we could redefine the behavior slightly. Have ''''' always be <b><i>. Then, if ''' occurs first, output </i></b><i>.
Not necessary. Much easier if you just define ''''' as a separate token.
''''' a '' b ''' => <b><i>a</i>b</b>
''''' a ''' b '' => <i><b>a</b>b</i>
On 2/12/07, Eric Astor eastor1@swarthmore.edu wrote:
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Just to check: Is changing MediaWiki syntax absolutely out of the question? Using // and ** for italic and bold respectively would solve that problem, be more consistent and more intuitive, and probably not be excessively difficult to phase in, right?
Steve
On 2/11/07, Steve Bennett stevagewp@gmail.com wrote:
How much of this task is possible without looking at the code base? How much can be done merely by inspection? Would it be possible for someone (Brion, yourself?) to structure this apparently very large task in such a way that non-developers like me would be able to contribute bite-size pieces of work? Such a task might look something like "Define the grammar for the [[Category:...]] statement". Do we have anything like that?
Well, the ideal is to get as much of the grammar as possible formalized, not merely informally summarized. Formalization should be possible with some inspection and testing, but of course you have to have a good understanding of formal grammars (I don't, for one, at least not yet). There are a number of wiki pages scattered around Meta and MediaWiki.org containing attempts at formalizations, which of course anyone can contribute to.
On 2/11/07, Steve Bennett stevagewp@gmail.com wrote:
Just to check: Is changing MediaWiki syntax absolutely out of the question?
Major changes are probably out of the question for the foreseeable future, just because of the annoyance of transition, with months and months of people using the wrong syntax and screwing up articles. Tweaks to the behavior in corner cases are, of course, fine.
Steve Bennett wrote:
Just to check: Is changing MediaWiki syntax absolutely out of the question? Using // and ** for italic and bold respectively would solve that problem, be more consistent and more intuitive, and probably not be excessively difficult to phase in, right?
Many would say that any such discussion is a dead end. However, I think that is a bit narrow-minded. A syntax change on Wikipedia might not be very likely, but you can of course change MediaWiki for use on your own wiki website.
Changing to // and ** doesn't necessarily make the wiki syntax any more BNF parsable than today, does it? You can still write ***** and what's that supposed to mean? If clarity is wanted, the best would probably be <i> and <b>.
The current wiki syntax cannot be described in simple BNF, but it is not impossible to parse. The MediaWiki engine successfully converts it to HTML and, in the reverse direction, users who intend to accomplish a result in bold and italics are able to convert this intention into wiki syntax.
Thus, it is also possible to write a program that converts the current Wikipedia dump into using <i> and <b> rather than apostrophes, and then back again to traditional wiki syntax. Since <i> and <b> are already supported, you could make it a policy on your own wiki whether apostrophes should be deprecated. To enforce such a policy, every stored article can be converted. Such a policy is not very likely on Wikipedia, though.
What you can do is to run some experiments on the existing dump. How many cases are there where ''''' is hard to resolve? Did anybody count?
It is of course possible to write articles with unbalanced apostrophes. If I write '''hey'' it will render as '<i>hey</i>, and that's also how a conversion program should leave it. How many such user mistakes are there in the current dump? Perhaps somebody is already running a robot to find and fix such errors?
Lars Aronsson wrote:
Steve Bennett wrote:
Just to check: Is changing MediaWiki syntax absolutely out of the question? Using // and ** for italic and bold respectively would solve that problem, be more consistent and more intuitive, and probably not be excessively difficult to phase in, right?
Many would say that any such discussion is a dead end. However, I think that is a bit narrow-minded. A syntax change on Wikipedia might not be very likely, but you can of course change MediaWiki for use on your own wiki website.
Changing to // and ** doesn't necessarily make the wiki syntax any more BNF parsable than today, does it? You can still write ***** and what's that supposed to mean? If clarity is wanted, the best would probably be <i> and <b>.
Actually, // and ** are at least as clear, and are most definitely parsable by a fixed-lookahead context-free grammar - even an unaugmented LL(k) grammar could probably handle it. <i> and <b> are unambiguous, but ugly and language-dependent. MediaWiki's current behavior "fixes" many of the issues with its ambiguous bold/italics representation with a little ad-hoc DWIM-type behavior. It works, but cannot be represented by a CFG and is difficult to extend.
Oh - and ***** would most likely be disambiguated to <b>*</b>. Easy to handle in a CFG with lookahead, and almost certainly what the user meant.
The current wiki syntax cannot be described in simple BNF, but it is not impossible to parse. The MediaWiki engine successfully converts it to HTML and, in the reverse direction, users who intend to accomplish a result in bold and italics are able to convert this intention into wiki syntax.
Slight disagreement on terms: The current syntax is convertible to HTML - it is not parseable. At the least, it is not currently parsed... No internal representation is generated, and the system just makes something on the close order of 50 regex passes per page to convert it into HTML.
Thus, it is also possible to write a program that converts the current Wikipedia dump into using <i> and <b> rather than apostrophes, and then back again to traditional wiki syntax. Since <i> and <b> are already supported, you could make it a policy on your own wiki whether apostrophes should be deprecated. To enforce such a policy, every stored article can be converted. Such a policy is not very likely on Wikipedia, though.
Possible - though difficult. I'd actually welcome someone creating a program to convert MediaWiki's syntax from apostrophes to <i> and <b>, as that could technically provide a more formal specification of how the MediaWiki parser handles apostrophes. However, at the moment, the only such program we have is MediaWiki itself.
- Eric
On 2/12/07, Lars Aronsson lars@aronsson.se wrote:
The current wiki syntax cannot be described in simple BNF, but it is not impossible to parse.
No, just very difficult, with lots of corner cases. Which makes it effectively impossible to guarantee that any tool other than MediaWiki itself (which is correct by definition) can parse the wikitext correctly in all cases. And perhaps more importantly, parse time for MediaWiki itself is, as I recall, somewhere on the order of 800ms. That's totally unacceptable for real-time use like WYSIWYG.
If a lossless intermediate representation such as XML could be developed, in theory, that would be ideal. Then the XML could be served for WYSIWYG or other clients and used for rendering, while the wikitext could be served to those who want to use it. The difficulty (if not impossibility) is in making it lossless: you have to be able to convert back and forth without changing anything.
*That* is probably the most interesting question right now from the perspective of stuff like WYSIWYG and third-party use. Formally parsing the current syntax is hopeless. But if we develop a mapping such that the entire enwiki database can be roundtripped, as a test case, *that* will allow all sorts of stuff to work. Once we have an intermediate XML representation, that could probably be turned directly into a rendered page (even including all skin elements) with just XSL, after template substitution. And that could probably be done in real time in most modern languages, including JavaScript. At least I hope.
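As a sketch of what that round-trip test case could look like (hypothetical code; to_xml and from_xml stand in for a converter that does not exist yet):

    # Hypothetical roundtrip check: a mapping is lossless for a dump
    # if converting every revision to XML and back reproduces the
    # original wikitext byte for byte.
    def roundtrips(wikitext, to_xml, from_xml):
        return from_xml(to_xml(wikitext)) == wikitext

    def count_failures(dump_texts, to_xml, from_xml):
        return sum(1 for text in dump_texts
                   if not roundtrips(text, to_xml, from_xml))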
On Mon, 2007-02-12 at 11:17 -0500, Simetrical wrote:
No, just very difficult, with lots of corner cases. Which makes it effectively impossible to guarantee that any tool other than MediaWiki itself (which is correct by definition) can parse the wikitext correctly in all cases. And perhaps more importantly, parse time for MediaWiki itself is, as I recall, somewhere on the order of 800ms. That's totally unacceptable for real-time use like WYSIWYG.
So, it's probably worth noting that the Wikiwyg library (http://www.wikiwyg.net/ ) doesn't actually parse Wikitext.
What happens is that at the user's request (clicking "edit this page", for example), the HTML of the page (output by MW itself) is put into "editable mode" on the client side. This uses some fancy features built into the two main browsers to provide WYSIWYG editing of HTML content.
If the user chooses to switch over to "wikitext mode", or when they choose to save, then a process walks the DOM tree of the HTML content and outputs wikitext. This is either shown in the "wikitext mode" window or submitted back to the server. (NB: this is a very tricky process that works fine for simple markup, sections, tables, etc., and is extremely difficult for things like magic words, transclusion, category listings, etc. It can be done, but we'll probably need to make some changes to our parser output to e.g. "tag" a <span> of text as coming from a particular magic word.)
If the user switches from wikitext mode to WYSIWYG edit mode (or preview mode, or even "raw HTML editing" mode), the library uses Ajax to submit the wikitext it has back to a special page on the server to be rendered, and then shows the results of that rendering.
If you think about it, it's really a clever system. MediaWiki is excellent at converting wikitext to HTML, and you really shouldn't have any other software do that job. And the client nowadays is really good at parsing HTML into a DOM tree, so it should take on the job of the HTML -> wikitext conversion. The actual conversions only happen when the user has asked for a change of mode or for a save, so the delay isn't unexpected or frustrating for the user.
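For a feel of the DOM-walking direction, here is a hedged Python sketch (Wikiwyg itself is JavaScript; the toy node model and the tiny tag table below are assumptions for illustration, and they cover none of the hard cases mentioned above):

    # Walk a DOM-like tree and emit wikitext for a handful of tags.
    # Nodes are either plain strings (text) or (tag, children) pairs.
    WIKI = {"b": ("'''", "'''"), "i": ("''", "''"),
            "h2": ("\n== ", " ==\n"), "li": ("\n* ", "")}

    def to_wikitext(node):
        if isinstance(node, str):  # text node
            return node
        tag, children = node
        inner = "".join(to_wikitext(c) for c in children)
        if tag == "a":  # simplification: every link is internal
            return "[[%s]]" % inner
        open_tag, close_tag = WIKI.get(tag, ("", ""))
        return open_tag + inner + close_tag

    tree = ("p", ["MediaWiki is ", ("b", ["excellent"]),
                  " at converting ", ("i", ["wikitext"]), " to HTML."])
    print(to_wikitext(tree))
    # MediaWiki is '''excellent''' at converting ''wikitext'' to HTML.

The hard part is everything this sketch ignores: output that came from magic words, transclusion, and anything else with no one-to-one wikitext equivalent.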
~Evan
________________________________________________________________________ Evan Prodromou evan@prodromou.name http://evan.prodromou.name/
On Mon, Feb 12, 2007 at 10:39:02AM +0100, Lars Aronsson wrote:
Changing to // and ** doesn't necessarily make the wiki syntax any more BNF parsable than today, does it? You can still write ***** and what's that supposed to mean? If clarity is wanted, the best would probably be <i> and <b>.
We *don't*, per se, want clarity.
We want easy-to-typeness, as much as possible without impairing clarity.
In my humble perception.
Cheers, -- jr 'hobby horse, yes' a
What you can do is to run some experiments on the existing dump. How many cases are there where ''''' is hard to resolve? Did anybody count?
It is of course possible to write articles with unbalanced apostrophes. If I write '''hey'' it will render as '<i>hey</i>, and that's also how a conversion program should leave it. How many such user mistakes are there in the current dump?
I can't answer that question for a current dump, but I can answer it for a dump of EN that's about 15 months old (this was done as part of http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Wiki_Syntax ).
The formats and figures are shown below, and I've added examples to show a single line that would cause it to be logged (assuming the rest of the wikitext in that article is well-formed). Basically they're all about _balance_ - if you open a bit of paired syntax, you should close it. Some syntaxes must be closed on the same line (e.g. ''' ), and some must be closed in the same article (e.g. {| ).
Note however that these figures are from only a few months after a previous run (I think - it *was* a while ago), so I'm guesstimating the figures now would be between 2 and 4 times higher - because it's been so long since the last run, because there are probably more contributions now, and because "wikitext format errors introduced" is probably directly proportional to the number of contributions.
----------------------------------------------
mysql> select format, count(*) as count from malformed_page group by format order by count desc;
+-------------------------+-------+
| format                  | count |
+-------------------------+-------+
| ''                      |  7161 | example: this is a ''test
| '''                     |  1248 | example: this is a '''test
| '' and '''              |  1155 | example: this is a '''test''
| ''' and ''              |  1091 | example: this is a ''test'''
| ]                       |   587 | example: this is a] test
| [ and ]]                |   507 | example: this is a [test]]
| ]]                      |   417 | example: this is a test]]
| [[                      |   413 | example: this is a [[test
| [                       |   372 | example: this is [a test
| [[ and ]                |   347 | example: this is a [[test]
| {|                      |   261 | example: {| (and never close it)
| |}                      |   238 | example: |} (and never open it)
| -->                     |    67 | example: <!-- blah --> -->
| <div>                   |    60 | example: <div> <div> blah </div>
| {{                      |    46 | example: {{ {{delete}}
| <!--                    |    43 | example: <!-- <!-- blah -->
| </div>                  |    39 | example: <div> blah </div> </div>
| }}                      |    34 | example: {{delete}} }}
| ]] and [[               |    24 | example: this is a ]]test[[
| == and ===              |    20 | example: ==heading===
| ] and [                 |    14 | example: this is a ]test[
| [[image:                |    11 | example: [[image: [[image:test.gif]]
| === and ==              |     8 | example: ===heading==
| '' and [[               |     5 | example: this ''is a [[test
| [ and ''                |     5 | example: this [is a ''test
| '' and ]]               |     5 | example: this ''is a]] test
| <code>                  |     5 | example: <code> <code> for i=1 </code>
| </pre>                  |     4 | example: <pre> for i=1 </pre> </pre>
| </nowiki>               |     4 | example: <nowiki> for i=1 </nowiki> </nowiki>
| '' and ]                |     3 | etc ....
| ]] and '''              |     3 |
| ]] and ''               |     3 |
| ] and ''                |     2 |
| </math>                 |     2 |
| [[ and ''               |     2 |
| '' and [[ and ]         |     2 |
| </code>                 |     2 |
| === and ====            |     2 |
| '' and ''' and ]        |     2 |
| [ and '''               |     1 |
| ]] and '' and '''       |     1 |
| [[ and ] and [          |     1 |
| [ and ]] and ''         |     1 |
| ''' and [[              |     1 |
| <math>                  |     1 |
| ==== and ===            |     1 |
| ]] and [[ and ''        |     1 |
| ] and [ and ''          |     1 |
| ''' and '' and ]]       |     1 |
| ''' and '' and [[ and [ |     1 |
| ''' and ]]              |     1 |
| ]] and ] and ''' and '' |     1 |
| [ and '' and '''        |     1 |
+-------------------------+-------+
53 rows in set (0.57 sec)

mysql>
----------------------------------------------
Note: ''''' was treated as ''' + '' (rather than as a separate category), so it will be mixed in with the above figures for ''' and ''.
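To illustrate the kind of balance check involved, here is a hypothetical Python sketch (not the actual WikiProject Wiki Syntax tooling; it covers only a few of the paired syntaxes, and follows the note above in counting ''''' as ''' + ''):

    import re

    def line_imbalances(line):
        found = []
        bold = italic = 0
        for q in re.findall(r"'{2,}", line):
            if len(q) >= 5:    # ''''' counts as ''' + ''
                bold += 1; italic += 1
            elif len(q) >= 3:
                bold += 1
            else:
                italic += 1
        if bold % 2:
            found.append("'''")
        if italic % 2:
            found.append("''")
        if line.count("[[") != line.count("]]"):
            found.append("[[ / ]]")
        return found

    print(line_imbalances("this is a '''test''"))
    # ["'''", "''"]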
Perhaps somebody is already running a robot to find and fix such errors?
Not that I'm aware of - humans were better at it anyway, because some of the above were false positives (e.g. some math formulas), and the "'' and '''" and "''' and ''" tests in particular had lots of false positives. If a robot went around blindly fixing these automatically, it'd be banned for vandalism. However, some automated approach could be good, as it's an ongoing problem with no real closure (like sorting the mail), so people eventually have enough of doing it (as I did) and move on to other stuff.
All the best, Nick.
On Mon, Feb 12, 2007 at 03:24:23PM +1100, Steve Bennett wrote:
On 2/12/07, Eric Astor eastor1@swarthmore.edu wrote:
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Just to check: Is changing MediaWiki syntax absolutely out of the question? Using // and ** for italic and bold respectively would solve that problem, be more consistent and more intuitive, and probably not be excessively difficult to phase in, right?
With reference to my similar note elsewhere, let us *please* attempt to utilize reflexes everyone[1] already has, rather than generating even more new markup tags?
Cheers, -- jra [1] Everyone who already has reflexes at all; let's not penalize the smart people *again*.
Nick Jenkins wrote:
Can I suggest giving some examples that you encountered of the 10-15% hard category, and of the 5% very hard category?
To really give you the creeps,
1<b>2''3</b>4''5
is rendered as
1<b>2<i>3</i>4</b>5
which is just wrong, and IMHO /should/ be rendered as
1<b>2<i>3</i></b><i>4</i>5
But don't worry, my wiki2xml tool doesn't do it correctly either, though it does generate valid XML without tidyhtml...
Magnus
Hoi, it is worth realising that this is also a problem that prevents Neapolitan from being written as is standard practice: the Neapolitan language uses '' in its orthography. The MediaWiki practice of using ' for bold and italic thus breaks functionality in the first place. Given that it is also a problem in the parser itself, this is maybe a good reason to ditch this abomination. Thanks, GerardM
Gerard Meijssen wrote:
Hoi, it is worth realising that this is also a problem that prevents Neapolitan from being written as is standard practice: the Neapolitan language uses '' in its orthography. The MediaWiki practice of using ' for bold and italic thus breaks functionality in the first place. Given that it is also a problem in the parser itself, this is maybe a good reason to ditch this abomination. Thanks, GerardM
I was wondering... would it be possible for nap: users to enter "smart-quotes", namely ’’? The advantage of this character over ' is that you can represent it in plain text (no HTML code), even in Category tags and page titles. It also looks nicer if you have a good font. You could, of course, add ’ to a <charinsert> tag in [[MediaWiki:Edittools]] to make this character easier to insert.
Minh Nguyen schreef:
Gerard Meijssen wrote:
Hoi, it is worth realising that this is also a problem that prevents Neapolitan from being written as is standard practice: the Neapolitan language uses '' in its orthography. The MediaWiki practice of using ' for bold and italic thus breaks functionality in the first place. Given that it is also a problem in the parser itself, this is maybe a good reason to ditch this abomination. Thanks, GerardM
I was wondering... would it be possible for nap: users to enter "smart-quotes", namely ’’? The advantage of this character over ' is that you can represent it in plain text (no HTML code), even in Category tags and page titles. It also looks nicer if you have a good font. You could, of course, add ’ to a <charinsert> tag in [[MediaWiki:Edittools]] to make this character easier to insert.
Hoi, The problem is that people are, in practice, writing Neapolitan in this way. You do not ask people who write Dutch to use the "proper" character instead of the ij combination. When people type, they expect the characters to be there on the keyboard. Expecting everyone to use "smart-quotes" is expecting too much. It just will not happen. Thanks, GerardM
Gerard Meijssen wrote:
Minh Nguyen schreef:
I was wondering... would it be possible for nap: users to enter "smart-quotes", namely ’’? The advantage of this character over ' is that you can represent it in plain text (no HTML code), even in Category tags and page titles. It also looks nicer if you have a good font. You could, of course, add ’ to a <charinsert> tag in [[MediaWiki:Edittools]] to make this character easier to insert.
Hoi, The problem is that people are, in practice, writing Neapolitan in this way. You do not ask people who write Dutch to use the "proper" character instead of the ij combination. When people type, they expect the characters to be there on the keyboard. Expecting everyone to use "smart-quotes" is expecting too much. It just will not happen. Thanks, GerardM
What about mapping the ' key on nap: to ’? Would this conflict with anything? Bold and italic would be done via <b> and <i>, and since ’ is a different character it wouldn't break references from other wikis (e.g. interwikis).
Eric Astor wrote:
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the second reads <b><i>hi</i> hello</b>. The problem is that the meaning of the first 5 quotes changes based on the order in which the bold and italic regions close - which is not determined while scanning left-to-right.
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>, properly handling overlapping formatting.
There are ways to deal with these... putting off the resolution until a later pass is the only way I know of that deals with the first one, and it's a bit touchy. Manageable, but touchy.
I think the easiest method (and the nearest to keeping it a single pass) is to use a DOM. It guarantees valid XML output always, which I believe the MediaWiki parser doesn't always do.
It also makes it easy to go back and fix up the DOM tree if the parser has made an initial wrong choice. For example:
'''italics''
It might start out as <b>italics</b>, but on seeing '' it can be corrected to '<i>italics</i>.
Jared
Jared Williams wrote:
I think the easiest method (and the nearest to keeping it a single pass) is to use a DOM. It guarantees valid XML output always, which I believe the MediaWiki parser doesn't always do.
It also makes it easy to go back and fix up the DOM tree if the parser has made an initial wrong choice. For example:
'''italics''
It might start out as <b>italics</b>, but on seeing '' it can be corrected to '<i>italics</i>.
Jared
Keeping an abstract tree as an intermediate representation helps, but does not fix, this problem. Dealing with things like '''italics'' is non-trivial in any case: if we're going to retain this behavior, no context-free grammar (at least with fixed lookahead) can possibly suffice.
Whatever happens to handle this, it will have to be at a separate stage from the original parsing. What remains is a question of how many extra stages we will need.
- Eric
Eric Astor wrote:
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Another example:
'''hello ''hi''' there''
My flexbisonparse handles these correctly. I admit they weren't easy, but it definitely wasn't what stopped flexbisonparse from being successful.
On Sun, Feb 11, 2007 at 10:02:33PM -0500, Eric Astor wrote:
Just one example - probably of the 5% very hard category:
'''''hello''' hi'' vs. '''''hi'' hello'''
Rendered in HTML, the first reads <i><b>hello</b> hi</i>, and the second reads <b><i>hi</i> hello</b>. The problem is that the meaning of the first 5 quotes changes based on the order in which the bold and italic regions close - which is not determined while scanning left-to-right.
Another example:
'''hello ''hi''' there''
MediaWiki renders this as <b>hello <i>hi</i></b><i> there</i>, properly handling overlapping formatting.
There are ways to deal with these... putting off the resolution until a later pass is the only way I know of that deals with the first one, and it's a bit touchy. Manageable, but touchy.
I know no one ever likes this question, but I'm going to ask it again anyway:
Is that problem easier or harder to deal with than whatever problems you would have if you just redefined bold to *this* and italics to _that_?
Everyone keeps saying that causes horrible collisions, but *I* don't think that most of them are that difficult to disambig.
Cheers -- jra
On 2/12/07, Jay R. Ashworth jra@baylink.com wrote:
Is that problem easier or harder to deal with than whatever problems you would have if you just redefined bold to *this* and italics to _that_?
Everyone keeps saying that causes horrible collisions, but *I* don't think that most of them are that difficult to disambig.
* collides horribly with list syntax. Even worse than **.
On 2/12/07, Jay R. Ashworth jra@baylink.com wrote:
With reference to my similar note elsewhere, let us *please* attempt to utilize reflexes everyone[1] already has, rather than generating even more new markup tags?
Cheers, -- jra [1] Everyone who already has reflexes at all; let's not penalize the smart people *again*.
Only a vanishingly tiny percentage of our target audience (which is all of humanity) is familiar with Usenet-style emphasis conventions. I wouldn't assign them much weight.
On 12/02/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Only a vanishingly tiny percentage of our target audience (which is all of humanity) is familiar with Usenet-style emphasis conventions.
It's bloody criminal, isn't it?
Rob Church
On Mon, Feb 12, 2007 at 08:25:54PM +0000, Rob Church wrote:
On 12/02/07, Simetrical Simetrical+wikilist@gmail.com wrote:
Only a vanishingly tiny percentage of our target audience (which is all of humanity) is familiar with Usenet-style emphasis conventions.
It's bloody criminal, isn't it?
I know you're being at least 50% snarky there, Rob, but I a) think that it is, in fact, criminal, and b) still don't think that penalizing the people who do is a good idea, absent overwhelming other evidence.
And let's just note here that *Microsoft Word does this translation in realtime during input*, assuming you don't turn it off.
So, clearly, it's not *that* abstruse.
Cheers, -- jra
Why not add *this* and _this_ to the "Extended MediaWiki" markup, then just make vanilla MW syntax one of the many light-markups supported? So when you author an article, you can pick from MediaWiki, Extended MediaWiki, Markdown, Textile, APT, Usenet, Plain Text, XHTML, etc?
I'm only partly kidding. It makes sense for Wikipedia to have exactly one supported syntax, but for other wikis, having the flexibility to pick a markup would alleviate these kinds of disputes. Perhaps putting in enough hooks to allow extension devs to make their own parser grammars? (Not sure if there are enough hooks for this now btw).
More seriously though, I think adding any new magic to the MediaWiki parser is just fueling the fire surrounding the parsability of the syntax - which I am against. Also, regarding parsability, has anyone attempted to write Doxia[1] plugins for MW syntax? I'm considering giving it a whirl if there isn't any prior art.
[1] http://maven.apache.org/doxia/
On Mon, Feb 12, 2007 at 02:52:22PM -0600, Jim Wilson wrote:
Why not add *this* and _this_ to the "Extended MediaWiki" markup, then just make vanilla MW syntax one of the many light-markups supported? So when you author an article, you can pick from MediaWiki, Extended MediaWiki, Markdown, Textile, APT, Usenet, Plain Text, XHTML, etc?
I'm only partly kidding. It makes sense for Wikipedia to have exactly one supported syntax, but for other wikis, having the flexibility to pick a markup would alleviate these kinds of disputes. Perhaps putting in enough hooks to allow extension devs to make their own parser grammars? (Not sure if there are enough hooks for this now btw).
This would only work if MW transitioned its internal storage to something standardized and parseable, like XML... I *think*.
Cheers, -- jr 'not that there's anything wrong with that... :-)' a
Jay R. Ashworth wrote:
On Mon, Feb 12, 2007 at 02:52:22PM -0600, Jim Wilson wrote:
Why not add *this* and _this_ to the "Extended MediaWiki" markup, then just make vanilla MW syntax one of the many light-markups supported? So when you author an article, you can pick from MediaWiki, Extended MediaWiki, Markdown, Textile, APT, Usenet, Plain Text, XHTML, etc?
I'm only partly kidding. It makes sense for Wikipedia to have exactly one supported syntax, but for other wikis, having the flexibility to pick a markup would alleviate these kinds of disputes. Perhaps putting in enough hooks to allow extension devs to make their own parser grammars? (Not sure if there are enough hooks for this now btw).
This would only work if MW transitioned its internal storage to something standardized and parseable, like XML... I *think*.
Not necessary: we have a parser class. Make it independent of the rest of the code. Then you can make 'alternative parsers'. I volunteer for making the 'plain text' one ;)
On Mon, Feb 12, 2007 at 03:22:20PM -0500, Simetrical wrote:
On 2/12/07, Jay R. Ashworth jra@baylink.com wrote:
Is that problem easier or harder to deal with than whatever problems you would have if you just redefined bold to *this* and italics to _that_?
Everyone keeps saying that causes horrible collisions, but *I* don't think that most of them are that difficult to disambig.
* collides horribly with list syntax. Even worse than **.
It would require that lists be tagged with "* " rather than "*", but that should be enough to disambig it, no? And it's not as bad a collision as might be obvious, either, I don't think: List item markers are always at hard-BOL, and are never matched until hard-EOL.
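A rough Python sketch of that disambiguation (the exact rules here are assumptions, not a worked-out proposal): a list marker is asterisks plus a space anchored at the start of a line, while an emphasis span must open and close within the line:

    import re

    LIST_ITEM = re.compile(r"^\*+ ")          # "* " at hard-BOL only
    EMPHASIS = re.compile(r"\*([^*\n]+)\*")   # must close before EOL

    def classify(line):
        if LIST_ITEM.match(line):
            return "list item"
        if EMPHASIS.search(line):
            return "contains bold span"
        return "plain"

    print(classify("* a list item"))        # list item
    print(classify("this is *important*"))  # contains bold span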
On 2/12/07, Jay R. Ashworth jra@baylink.com wrote:
With reference to my similar note elsewhere, let us *please* attempt to utilize reflexes everyone[1] already has, rather than generating even more new markup tags?
Cheers, -- jra [1] Everyone who already has reflexes at all; let's not penalize the smart people *again*.
Only a vanishingly tiny percentage of our target audience (which is all of humanity) is familiar with Usenet-style emphasis conventions. I wouldn't assign them much weight.
I gather that you wouldn't. I think that view is short-sighted.
Cheers, -- jra
On 2/12/07, Rob Church robchur@gmail.com wrote:
Probably, if anybody ever wants this kind of functionality done, we need to direct them to start helping us define the parser behaviour. I say this, but of course, defining the behaviour of the behemoth we have now is a task rather akin to removing all the sand from the Sahara.
How much of this task is possible without looking at the code base? How much can be done merely by inspection? Would it be possible for someone (Brion, yourself?) to structure this apparently very large task in such a way that non-developers like me would be able to contribute bite-size pieces of work? Such a task might look something like "Define the grammar for the [[Category:...]] statement". Do we have anything like that?
Steve
On 2/11/07, Evan Prodromou evan@prodromou.name wrote:
This is distressing news. Does this mean that we're not doing Wikiwyg?
http://www.wikiwyg.net/
Are Wikia and SocialText still working on that project?
We are still working on it. We launched the first stage today, which is Wikiwyg editing for new pages. You can try it out at http://entertainment.wikia.com/index.php?title=Create_Opinion
We've not yet solved the problem of converting back and forth, so, for now, Wikiwyg is only there when you start a new page. When you save it, you get only wikitext.
Angela