Dajoo is a SourceForge project that aims to build a personal wiki-based platform to help people manage their knowledge. It was designed with a plugin system in its kernel, through which we can develop various tools for different purposes.
The markup it adopts is compatible with MediaWiki, and interoperability with MediaWiki is now under development. On my personal computer it can already access Wikipedia freely. I will release the next version with the interoperability features in September.
The project is now in the early stage, and you can download and try it at http://sourceforge.net/projects/dajoo/
Thanks.
[[zh:User:Mountain]]
On Thu, Aug 17, 2006 at 09:54:51AM +0800, mingli yuan wrote:
Dajoo is a SourceForge project that aims to build a personal wiki-based platform to help people manage their knowledge. It was designed with a plugin system in its kernel, through which we can develop various tools for different purposes.
The markup it adopts is compatible with MediaWiki, and interoperability with MediaWiki is now under development. On my personal computer it can already access Wikipedia freely. I will release the next version with the interoperability features in September.
You might want to take note, if you didn't realize it already, that there's a project underway to formalize the very wikitext grammar with which you're compatible. I know someone posted a pointer last week, but I don't have it handy; I suspect they'll chime in.
Cheers, -- jra
Jay Ashworth wrote:
...there's a project underway to formalize the very wikitext grammar with which you're compatible. I know someone posted a pointer last week, but I don't have it handy; I suspect they'll chime in.
("ding, dong. ding, dong.")
There's http://www.mediawiki.org/wiki/Markup_spec which is a prose description, with several more formal BNF descriptions hanging off of it as subpages (see http://www.mediawiki.org/wiki/Special:Search/Markup_spec/BNF/ for a list).
Jay R. Ashworth wrote:
there's a project underway to formalize the very wikitext grammar with which you're compatible.
There are efforts to produce something formal-like for some subset of the wikitext syntax. To my knowledge, no useful formal grammar can be produced for the complete syntax.
On 8/16/06, Ivan Krstić krstic@solarsail.hcs.harvard.edu wrote:
There are efforts to produce something formal-like for some subset of the wikitext syntax. To my knowledge, no useful formal grammar can be produced for the complete syntax.
Out of curiosity, why not? What bits of markup screw over the project?
Simetrical wrote:
Out of curiosity, why not? What bits of markup screw over the project?
A better question is which bits don't ;)
Eric?
-----Original Message-----
From: Ivan Krstic [mailto:krstic@solarsail.hcs.harvard.edu]
Sent: Wednesday, August 16, 2006 11:21 PM
To: Wikimedia developers
Cc: Eric Astor
Subject: Re: [Wikitech-l] Dajoo: a Java-based offline editor/viewer
Simetrical wrote:
Out of curiosity, why not? What bits of markup screw over the project?
A better question is which bits don't ;)
Eric?
*sighs* Well, that's my cue. My sincere apologies for when this starts to ramble and lose coherency - I'm a bit tired today, and I've been focusing mostly on other things since the Wikimania Hacking Days.
Let's see here. Please consider this an incomplete, unreliable list, meant solely as an indication of the basic problems encountered when attempting to formalize MediaWiki's wikitext... And I'm no expert on parsing, except in that I've spent a large part of the summer constructing parsers for essentially unparseable languages. Basic point, though, is that MediaWiki wikitext is INCREDIBLY context-sensitive.
Single case that shows something interesting: '''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get? <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
In other words, you've discovered that the current syntax supports improper nesting of markup, in a rather unique fashion. I don't know of any way to duplicate this in any significantly formal system, although I believe a multiple-pass parser *might* be capable of handling it. In fact, some sort of multiple-pass parser (the MediaWiki parser) obviously can.
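To make that concrete, here's a toy Python sketch - emphatically not MediaWiki's algorithm - of what a naive single-pass toggle does with the same input. Its output overlaps the tags, which is exactly the ill-formedness the real parser goes to some trouble to repair:

import re

def naive_quotes(text):
    # Toy toggle: flip <b> on every ''' and <i> on every '', in a single pass.
    out, bold, italic = [], False, False
    for tok in re.split(r"('''|'')", text):
        if tok == "'''":
            out.append('</b>' if bold else '<b>')
            bold = not bold
        elif tok == "''":
            out.append('</i>' if italic else '<i>')
            italic = not italic
        else:
            out.append(tok)
    return ''.join(out)

print(naive_quotes("'''hi''hello'''hi'''hello''hi'''"))
# -> <b>hi<i>hello</b>hi<b>hello</i>hi</b>   (the tags overlap: not well-formed)
# MediaWiki instead closes and re-opens tags at each boundary, giving the
# properly nested output quoted above.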
Also, templates need to be transcluded before most of the parsing can take place, since in the current system, the text may leave some syntactically-significant constructs incomplete, finishing them in the transclusion stage...
Beyond that... *pulls up his mostly-aborted attempt at a parser*
Indefinite lookahead is required in some places, particularly in headings and magic variables - and for almost any other multi-part markup, if we want to do what the current parser does (ignore incomplete markup, treating it as if it had been properly escaped). This even holds true for bold and italics, since you need indefinite lookahead to be able to tell whether the first three quotes in '''this'' should be parsed as ''', <i>', or <b>. The situation gets even worse when you try to allow for improper nesting.
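As a toy illustration of that lookahead (again, a simplification, not the real algorithm), the only way to decide what a leading ''' means is to scan the rest of the line for possible closers:

def classify_leading_triple(rest_of_line):
    # rest_of_line is everything after an initial '''.
    # Which closers appear later is what decides the meaning of those quotes.
    if "'''" in rest_of_line:
        return "<b>"                  # a later ''' can close it as bold
    if "''" in rest_of_line:
        return "literal ' then <i>"   # pair two quotes as italic, one left literal
    return "literal '''"              # nothing to pair with: plain text

print(classify_leading_triple("this''"))    # literal ' then <i>
print(classify_leading_triple("this'''"))   # <b>
print(classify_leading_triple("this"))      # literal '''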
Other places require fixed, but large, amounts of lookahead... freelinks require at least 9 characters, for example. Technically, I'll admit that a GLR parser (or a backtracking framework) could manage even the indefinite lookahead that I mentioned... but it's still problematic, since the grammar is left ambiguous in certain cases.
Oh, right - and we'd need to special-case every tag-style piece of markup, including every allowed HTML tag, since formal grammars generally can't reference previously-matched text. This also applies to the heading levels - we'd need separate ad-hoc constructs for each level of heading we wanted to support, duplicating a lot of the grammar between each one.
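For instance, the heading rules come out looking something like this (illustrative only; because a plain rule can't tie the output level to the number of = signs it matched, each level ends up with its own near-identical pattern):

import re

# Longest markers are tried first; each level repeats essentially the same rule.
HEADING_RULES = [
    (re.compile(r'^======(.+?)======\s*$'), 'h6'),
    (re.compile(r'^=====(.+?)=====\s*$'),   'h5'),
    (re.compile(r'^====(.+?)====\s*$'),     'h4'),
    (re.compile(r'^===(.+?)===\s*$'),       'h3'),
    (re.compile(r'^==(.+?)==\s*$'),         'h2'),
    (re.compile(r'^=(.+?)=\s*$'),           'h1'),
]

def match_heading(line):
    for pattern, tag in HEADING_RULES:
        m = pattern.match(line)
        if m:
            return tag, m.group(1).strip()
    return None

print(match_heading('== Heading =='))   # ('h2', 'Heading')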
There are other complications as well - again, this list should be considered both incomplete and, possibly, inconsistent with reality.
In summary, for most definitions of formal, it is impossible to write formal grammars for most significant subsets of current MediaWiki syntax. I had significant success with a regex-based grammar specification (using Martel), backed by a VERY general backend capable of back-tracking and other clever tricks (mxTextTools) - but the recursive structure is virtually impossible to handle in a regex-based framework.
- Eric Astor
P.S. As indicated above, I honestly feel that the difficulties aren't insurmountable - if you're willing to build an appropriate parsing framework, which will be semi-formal at best.
P.P.S. When possible, in my *copious* free time (</sarcasm>), I'm hoping to take another frontend to mxTextTools (SimpleParse, to be specific), modify it sufficiently to support all the necessary features, and then build something capable of parsing the current MediaWiki syntax (although I might have to drop support for improper nesting). I've no idea if or when this might happen, but I'm considering it a long-term goal if the current situation doesn't improve.
On Thu, Aug 17, 2006 at 12:23:57AM -0400, Simetrical wrote:
On 8/17/06, Eric Astor eastor1@swarthmore.edu wrote:
[snip]
XML, anyone? ;)
It wouldn't help. The problems are semantic, not syntactical.
I think. :-)
Cheers, -- jr 'IE: no, no, no, no, no!' a
In any case, Eric, have you thought of any ideas to simplify the syntax while maintaining reverse-compatibility (except perhaps in highly unlikely corner cases)? For instance, what if "==Text" translated to "<h2>Text</h2>", while "==Text==" still did as well? That would eliminate the need for unlimited lookahead for headings, reducing it to one-character lookahead for the first five levels and zero-character for the sixth. In fact, it would probably cause little breakage if the same were done with opening wikilinks, template calls, and so on. Any other thoughts?
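Roughly, the relaxed heading rule I have in mind would be something like this sketch (illustrative only):

def parse_heading(line):
    # The level is fixed by the run of leading = signs (at most six);
    # a matching run of trailing = signs is allowed but no longer required,
    # so there's nothing to scan ahead for.
    level = 0
    while level < 6 and level < len(line) and line[level] == '=':
        level += 1
    if level == 0:
        return None
    text = line[level:].rstrip().rstrip('=').strip()
    return '<h%d>%s</h%d>' % (level, text, level)

print(parse_heading('==Text'))         # <h2>Text</h2>
print(parse_heading('==Text=='))       # <h2>Text</h2>
print(parse_heading('not a heading'))  # None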
The context-sensitivity of apostrophes probably isn't avoidable, unfortunately.
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
It wouldn't help. The problems are semantic, not syntactical.
I think. :-)
Nope, we're talking 100% syntax. '''hi''hello'''hi'''hello''hi''' isn't ambiguous if it's stored as <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b> to begin with, and neither do you need unlimited lookahead to know that <h2> is a header tag and not a literal string. Every problem Eric is having would be eliminated if we switched to XML for internal storage, because all Eric is doing is trying to write a formal grammar — and a formal XML grammar is part of the official XML specification.
On Thu, Aug 17, 2006 at 01:56:29AM -0400, Simetrical wrote:
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
It wouldn't help. The problems are semantic, not syntactical.
I think. :-)
Nope, we're talking 100% syntax. '''hi''hello'''hi'''hello''hi''' isn't ambiguous if it's stored as <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b> to begin with, and
Well that's fine... but *getting from Eric's example to yours* is a problem of semantics: what did the user *mean*. Without disambiguating rules, there may be no way to tell.
Please make the following sentence gramatically correct, solely by adding punctuation:
John where Nancy had had had had had had had had had had had a better effect.
Semantics.
neither do you need unlimited lookahead to know that <h2> is a header tag and not a literal string. Every problem Eric is having would be eliminated if we switched to XML for internal storage, because all Eric is doing is trying to write a formal grammar — and a formal XML grammar is part of the official XML specification.
Fine, but a) those are not the only problems he's having, and b) No [[Flag Day]]s.
Cheers, -- jra
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Fine, but a) those are not the only problems he's having, and b) No [[Flag Day]]s.
Ok, that's the third time you've used the expression "flag day" on this list. After the second time, I looked it up and wasn't enlightened. What the hell do you mean? :)
Steve
On 8/17/06, Steve Bennett stevage@gmail.com wrote:
Ok, that's the third time you've used the expression "flag day" on this list. After the second time, I looked it up and wasn't enlightened. What the hell do you mean? :)
It's at the bottom of the English [[Flag Day]] article:
"Flag day is also a term used in discussing computer systems to denote a change which will require a complete restart or conversion of a sizable body of software or data. This usage of the term originates from an obscure such change in the [[Multics]] operating system's definition, which was scheduled for the US's Flag Day, June 14th, 1966. [http://www.catb.org/jargon/html/F/flag-day.html]"
On 8/17/06, Simetrical Simetrical+wikitech@gmail.com wrote:
It's at the bottom of the English [[Flag Day]] article:
My bad. I should make it a separate article - the two concepts are only coincidentally related.
Steve
On Thu, Aug 17, 2006 at 05:21:00PM +0200, Steve Bennett wrote:
On 8/17/06, Simetrical Simetrical+wikitech@gmail.com wrote:
It's at the bottom of the English [[Flag Day]] article:
My bad. I should make it a separate article - the two concepts are only coincidentally related.
I added it; you'll note my edit comment, querying whether it was time for a formal disambig page. I didn't put it up top because it seemed awfully cluttered up there already.
More details are in the linked Jargon file entry.
Cheers, -- jra
On Thu, Aug 17, 2006 at 09:46:23AM +0200, Steve Bennett wrote:
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Fine, but a) those are not the only problems he's having, and b) No [[Flag Day]]s.
Ok, that's the third time you've used the expression "flag day" on this list. After the second time, I looked it up and wasn't enlightened. What the hell do you mean? :)
For our purposes here, a Flag Day would be any change in the parser that would require a pass over all article bodies to change markup from an old to a new style, in whatever degree... as well as requiring changes in people's heads, and any code which otherwise deals with wikitext.
I don't think that a Flag Day for some exceedingly esoteric construction which needs to be cleaned up to make a formal parser possible is completely impossible, but it would have to be pretty negligible, pretty important, or both... it goes back to that circle I mentioned.
And clearly, brion thinks that the absolute (never) is called for, which means that what I think doesn't matter -- and he's better informed than I am.
Cheers, -- jra
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
I don't think that a Flag Day for some exceedingly esoteric construction which needs to be cleaned up to make a formal parser possible is completely impossible, but it would have to be pretty negligible, pretty important, or both... it goes back to that circle I mentioned.
So what if we had a "lossless" wikisyntax to XML converter? It seems like that wouldn't be an impossibility (given we're already parsing wikisyntax to _HTML_).
What are the reactions to e.g. converting the backend to use that XML storage, then enforcing it on the editor side, as well?
Obviously we'd have to be clever on the conversion (like making VERY sure it's a "lossless" switch, and finding a computationally feasible way to get it done - maybe update every article as it's touched?).
To my way of thinking, if we had an XML backend store and a reliable conversion path, then we could:
a) Provide wikisyntax editing to those who want it (by filtering through the converter)
b) Develop meaningful wysiwyg editing tools without having to first reimplement the wikisyntax parser in javascript and every other language we want to touch.
c) Allow direct access to the XML, making all kinds of researchers happy.
d) Incrementally roll out changes to bring things more in line with Semantic Web, again with conversion paths.
Engineering wise, a "lossless" path to me could be developed by developing these components:
1. Wikisyntax <-> WikiXML converters.
2. WikiXML -> HTML renderer.
Determining that it is working properly can be done by testing against the Wikipedia corpus. If we can go from WikiXML to Wikisyntax and back, byte-exact, we've achieved our goal. Maybe it's ok to relax that restriction (especially if we can determine in some other way the page is corrupt or invalid - or maybe we have a list of exceptions), but I think it's one that's both achievable and reasonable.
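Something like the following sketch is the kind of check I have in mind (wt_to_xml(), xml_to_wt() and iter_corpus() are placeholders for the hypothetical converters and for whatever walks a dump of page texts):

def roundtrips(wikitext):
    # True if a page survives Wikisyntax -> WikiXML -> Wikisyntax byte-exact.
    return xml_to_wt(wt_to_xml(wikitext)) == wikitext

def validate_corpus(iter_corpus):
    failures = []
    for title, wikitext in iter_corpus():
        if not roundtrips(wikitext):
            failures.append(title)   # candidates for fix-ups or an exception list
    return failures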
We may also want to do validation on the HTML render path; if we want to be really strict we can require that the conversion path gives identical output (perhaps sans whitespace?) to the current parser & renderer.
Once we have everything in XML, there are a number of good tools and standards to enable us to be Unicode compliant, to do various kinds of conversions and updates on the XML, and otherwise process our data, so we can evolve it forward to meet our needs.
In any case - if we find that having a lossless path would satisfy the constraints, then those who are interested can focus on writing a validation framework... and then they can go implement it. :)
On Thu, Aug 17, 2006 at 07:17:11PM -0700, Ben Garney wrote:
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
I don't think that a Flag Day for some exceedingly esoteric construction which needs to be cleaned up to make a formal parser possible is completely impossible, but it would have to be pretty negligible, pretty important, or both... it goes back to that circle I mentioned.
So what if we had a "lossless" wikisyntax to XML converter? It seems like that wouldn't be an impossibility (given we're already parsing wikisyntax to _HTML_).
What are the reactions to e.g. converting the backend to use that XML storage, then enforcing it on the editor side, as well?
I Am Not A Wikimedia Foundation Employee.
That said, there are two sorts of Flag Days: those which affect users and those which don't.
What you're suggesting here (and Simetrical has suggested before) would -- assuming that conversion is *really* lossless, which I don't know can be guaranteed at the moment -- only flag programs, not hundreds of thousands of heads.
That makes it just a tad likelier to ever happen.
Obviously we'd have to be clever on the conversion (like making VERY sure it's a "lossless" switch, and finding a computationally feasible way to get it done - maybe update every article as it's touched?).
Yeah; and which processor that XML<->WT conversion happened on would be critical for load reasons...
To my way of thinking, if we had an XML backend store and a reliable conversion path, then we could:
a) Provide wikisyntax editing to those who want it (by filtering through the converter)
b) Develop meaningful wysiwyg editing tools without having to first reimplement the wikisyntax parser in javascript and every other language we want to touch.
c) Allow direct access to the XML, making all kinds of researchers happy.
d) Incrementally roll out changes to bring things more in line with Semantic Web, again with conversion paths.
Engineering wise, a "lossless" path to me could be developed by developing these components:
- Wikisyntax <-> WikiXML converters.
- WikiXML -> HTML renderer.
Determining that it is working properly can be done by testing against the Wikipedia corpus. If we can go from WikiXML to Wikisyntax and back, byte-exact, we've achieved our goal. Maybe it's ok to relax that restriction (especially if we can determine in some other way the page is corrupt or invalid - or maybe we have a list of exceptions), but I think it's one that's both achievable and reasonable.
Hmmm...
We may also want to do validation on the HTML render path; if we want to be really strict we can require that the conversion path gives identical output (perhaps sans whitespace?) to the current parser & renderer.
I, personally, am a bit less concerned there: it's like advertising photography (where the printed ad in the magazine actually has to match the PMS color of the object) vs color pictures of the local parade in the newspaper (where you only care that the clown's face is 'pretty').
Once we have everything in XML, there are a number of good tools and standards to enable us to be Unicode compliant, to do various kinds of conversions and updates on the XML, and otherwise process our data, so we can evolve it forward to meet our needs.
In any case - if we find that having a lossless path would satisfy the constraints, then those who are interested can focus on writing a validation framework... and then they can go implement it. :)
Aw, *ick*.
Y'all are talking me into it.
Quick! Somebody talk me back out of it! :-)
Cheers, -- jr "isn't it a good thing it doesn't matter what I think?" a
Join the dark side, Jay. You know you want to. ;)
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Yeah; and which processor that XML<->WT conversion happened on would be critical for load reasons...
Well, I wouldn't expect WT -> XML to be any slower than our current parser (we could just modify it to output the XML instead of the HTML); actually, it would be quite a lot faster, no doubt, because all sorts of things like template substitutions wouldn't have to occur. So converting the whole corpus would probably be slow but manageable.
XML -> WT should be virtually instantaneous, as should XML -> HTML, because parse time would be one-pass at C speed (so probably a few milliseconds or less, as opposed to 800 ms for the current parser). There would of course be all sorts of weirdness in XML -> WT, so it might not be as fast as XML -> HTML, but the WMF gets a *hell* of a lot more cache misses on page views than on page edits, I would bet a substantial sum of money.
On Fri, Aug 18, 2006 at 02:26:29AM -0400, Simetrical wrote:
Join the dark side, Jay. You know you want to. ;)
Perhaps, but you ain't my daddy. :-)
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Yeah; and which processor that XML<->WT conversion happened on would be critical for load reasons...
Well, I wouldn't expect WT -> XML to be any slower than our current parser (we could just modify it to output the XML instead of the HTML); actually, it would be quite a lot faster, no doubt, because all sorts of things like template substitutions wouldn't have to occur. So converting the whole corpus would probably be slow but manageable.
You're suggesting to ship XML to the browser and let XSL do the translation actually *in* the browser?
That makes me a bit queasy, though perhaps it shouldn't.
XML -> WT should be virtually instantaneous, as should XML -> HTML, because parse time would be one-pass at C speed (so probably a few milliseconds or less, as opposed to 800 ms for the current parser).
Hmmm...
There would of course be all sorts of weirdness in XML -> WT, so it might not be as fast as XML -> HTML, but the WMF gets a *hell* of a lot more cache misses on page views than on page edits, I would bet a substantial sum of money.
Oh, I'm sure. Except on the top 5% current events-y pages.
Cheers, -- jra
On 8/18/06, Jay R. Ashworth jra@baylink.com wrote:
You're suggesting to ship XML to the browser and let XSL do the translation actually *in* the browser?
Actually, I wasn't. Interesting idea, but I don't think there's any point, since a wikiXML -> HTML parser probably wouldn't be a noticeable bottleneck and so we may as well maintain support for non-XSL-supporting clients (does your Blackberry support XSL? :) ). I hadn't actually thought of using XSL definitions at all, but in fact, that may be the obvious choice. I'll have to look at that some more . . . I never paid much attention to it before now.
There would of course be all sorts of weirdness in XML -> WT, so it might not be as fast as XML -> HTML, but the WMF gets a *hell* of a lot more cache misses on page views than on page edits, I would bet a substantial sum of money.
Oh, I'm sure. Except on the top 5% current events-y pages.
Those too. Those get tons of edits, but even more views.
On Fri, Aug 18, 2006 at 02:28:53PM -0400, Simetrical wrote:
On 8/18/06, Jay R. Ashworth jra@baylink.com wrote:
You're suggesting to ship XML to the browser and let XSL do the translation actually *in* the browser?
Actually, I wasn't. Interesting idea, but I don't think there's any point, since a wikiXML -> HTML parser probably wouldn't be a noticeable bottleneck and so we may as well maintain support for non-XSL-supporting clients (does your Blackberry support XSL? :) ).
It does not.
:-)
I hadn't actually thought of using XSL definitions at all, but in fact, that may be the obvious choice. I'll have to look at that some more . . . I never paid much attention to it before now.
Yeah; I gather you can do it server side as well...
There would of course be all sorts of weirdness in XML -> WT, so it might not be as fast as XML -> HTML, but the WMF gets a *hell* of a lot more cache misses on page views than on page edits, I would bet a substantial sum of money.
Oh, I'm sure. Except on the top 5% current events-y pages.
Those too. Those get tons of edits, but even more views.
Yes, but those views *hit* the cache.
Cheers, -- jra
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Well that's fine... but *getting from Eric's example to yours* is a problem of semantics: what did the user *mean*. Without disambiguating rules, there may be no way to tell.
There is in a WYSIWYG editor, because the user will make sure that what's displayed is what he wants. :)
Fine, but a) those are not the only problems he's having, and b) No [[Flag Day]]s.
a) They're the only problems he's having with constructing a formal grammar, which is all I was talking about, and b) why not, provided i) the current parser is kept as a legacy and modified to convert to the XML format rather than straight to HTML and ii) the XML can be converted into roughly the current wikitext on demand for those who'd prefer to work with it? The goal would still be accomplished, in that reusers would have a much easier time dealing with our data (yes, I know that's not my original stated goal).
But c) I can see we're likely never going to agree on even this watered-down version of my argument, and d) what we think probably doesn't matter unless one of us is willing to try writing up an implementation, so e) I think it's best to drop this line of discussion (again, and yes I know I was the one who re-brought it up).
On Thu, Aug 17, 2006 at 09:52:39AM -0400, Simetrical wrote:
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Well that's fine... but *getting from Eric's example to yours* is a problem of semantics: what did the user *mean*. Without disambiguating rules, there may be no way to tell.
There is in a WYSIWYG editor, because the user will make sure that what's displayed is what he wants. :)
Quite so. :-)
Fine, but a) those are not the only problems he's having, and b) No [[Flag Day]]s.
a) They're the only problems he's having with constructing a formal grammar, which is all I was talking about, and b) why not, provided i) the current parser is kept as a legacy and modified to convert to the XML format rather than straight to HTML and ii) the XML can be converted into roughly the current wikitext on demand for those who'd prefer to work with it? The goal would still be accomplished, in that reusers would have a much easier time dealing with our data (yes, I know that's not my original stated goal).
Well, if you could guarantee a perfect roundtrip from XML to WT, sure, I guess. But I don't know that WT is structurally clean enough to make that workable.
But c) I can see we're likely never going to agree on even this watered-down version of my argument, and d) what we think probably doesn't matter unless one of us is willing to try writing up an implementation, so e) I think it's best to drop this line of discussion (again, and yes I know I was the one who re-brought it up).
Hee. Yeah; probably.
Cheers, -- jra
On 8/17/06, Eric Astor eastor1@swarthmore.edu wrote:
Single case that shows something interesting: '''hi''hello'''hi'''hello''hi'''
Try running it through MediaWiki, and what do you get? <b>hi<i>hello</i></b><i>hi<b>hello</b></i><b>hi</b>
That's awesome :)
In other words, you've discovered that the current syntax supports improper nesting of markup, in a rather unique fashion. I don't know of any way to duplicate this in any significantly formal system, although I believe a multiple-pass parser *might* be capable of handling it. In fact, some sort of multiple-pass parser (the MediaWiki parser) obviously can.
Is this not the sort of "backwards compatibility" that we could safely do without? Does anyone intentionally use that kind of construct?
Also, templates need to be transcluded before most of the parsing can take place, since in the current system, the text may leave some syntactically-significant constructs incomplete, finishing them in the transclusion stage...
That's sort of a given, isn't it? What's the downside of doing transclusion first?
if it had been properly escaped). This even holds true for bold and italics, since you need indefinite lookahead to be able to tell whether the first three quotes in '''this'' should be parsed as ''', <i>', or <b>. The situation gets even worse when you try to allow for improper nesting.
Personally I find the rules for multiple apostrophes very strange and unpredictable - and hence worth changing. I was really surprised when I sat down one day to test what happens when you stack one, two, three...ten apostrophes. Not what I expected at all. No takers to replace ''' with // or something?
Other places require fixed, but large, amounts of lookahead... freelinks require at least 9 characters, for example. Technically, I'll admit that a
What's a freelink?
GLR parser (or a backtracking framework) could manage even the indefinite lookahead that I mentioned... but it's still problematic, since the grammar is left ambiguous in certain cases.
Oh, right - and we'd need to special-case every tag-style piece of markup, including every allowed HTML tag, since formal grammars generally can't reference previously-matched text. This also applies to the heading levels - we'd need separate ad-hoc constructs for each level of heading we wanted to support, duplicating a lot of the grammar between each one.
I don't understand, can you give an example?
P.S. As indicated above, I honestly feel that the difficulties aren't insurmountable - if you're willing to build an appropriate parsing framework, which will be semi-formal at best.
What would such a thing look like, formal BNF rules mixed in with text like "Actually if FOO is "boo" then special case Z is invoked..."?
Steve
On 8/17/06, Steve Bennett stevage@gmail.com wrote:
Is this not the sort of "backwards compatibility" that we could safely do without? Does anyone intentionally use that kind of construct?
Maybe, maybe not, but people expect that if there's a ''' open, then ''' will close it. They won't expect an intervening '' (or [[ or ] or anything else) to affect matters. Besides, what would you suggest they type, '''hi''''''''hello'''hi'''hello''''''''hi'''? God only knows what that would do.
That's sort of a given, isn't it? What's the downside of doing transclusion first?
Well, I don't think it's so much a downside as something that's impossible to work into a formal grammar. I'm guessing the issue is that templates mix syntax with semantics: the semantics of the template influence the output of the parse tree. So a pass to replace all the templates has to be done before you can even start talking about a formal grammar. But it's a given, yes, as you say.
What's a freelink?
A URL-like thing that was typed without any particular surrounding syntax (it gets autolinked). Similar lookahead would presumably be necessary for RFCs, ISBNs, and PMIDs (okay, that's enough to convince me to agree that they should be ditched :) ). In general, a lookahead of no more than one character is considered desirable.
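Something like this toy matcher (illustrative only, not the actual MediaWiki code; the real protocol list is configurable per site):

import re

PROTOCOLS = ('https', 'http', 'ftp', 'gopher')
FREELINK = re.compile(r'\b(?:' + '|'.join(PROTOCOLS) + r')://[^\s<>"]+')

def autolink(text):
    # "https://a" is already 9 characters: the matcher has to see a whole
    # protocol, the "://" and at least one more character before committing.
    return FREELINK.sub(lambda m: '<a href="%s">%s</a>' % (m.group(0), m.group(0)), text)

print(autolink('see http://example.org for details'))
# -> see <a href="http://example.org">http://example.org</a> for details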
Oh, right - and we'd need to special-case every tag-style piece of markup, including every allowed HTML tag, since formal grammars generally can't reference previously-matched text. This also applies to the heading levels - we'd need separate ad-hoc constructs for each level of heading we wanted to support, duplicating a lot of the grammar between each one.
I don't understand, can you give an example?
He *seems* to be saying that you'd have to make special rules for each allowed HTML tag, and presumably each allowed attribute and property thereof, and maybe even every combination of them (!). Would there be any advantage in leaving those out of the grammar and keeping Parser and Sanitizer separate as they are now?
On 8/17/06, Simetrical Simetrical+wikitech@gmail.com wrote:
On 8/17/06, Steve Bennett stevage@gmail.com wrote:
Is this not the sort of "backwards compatibility" that we could safely do without? Does anyone intentionally use that kind of construct?
Maybe, maybe not, but people expect that if there's a ''' open, then ''' will close it. They won't expect an intervening '' (or [[ or ] or anything else) to affect matters. Besides, what would you suggest they type, '''hi''''''''hello'''hi'''hello''''''''hi'''? God only knows what that would do.
I don't think anyone would feel confident predicting what happens in any of these cases. Mostly it comes down to "try it and see".
In the case of '''fooo''boooo'''.... well clearly something has gone wrong somewhere. However we choose to interpret it after that should be undefined.
Seriously though, whoever came up with ''..'' for italics and '''...''' for bold was, um, making life difficult!
Well, I don't think it's so much a downside as something that's impossible to work into a formal grammar. I'm guessing the issue is
Is it? Can't the formal grammar simply apply *after* all transclusions? Like in C you have the "preprocessor grammar" and then the actual grammar of the rest of the language (or am I dreaming)?
A URL-like thing that was typed without any particular surrounding syntax (it gets autolinked). Similar lookahead would presumably be necessary for RFCs, ISBNs, and PMIDs (okay, that's enough to convince me to agree that they should be ditched :) ). In general, a lookahead of no more than one character is considered desirable.
What can I say, I don't like these "freelinks". They just don't seem clean. Normal text which spontaneously turns into a link without any special punctuation or anything. Hmm.
He *seems* to be saying that you'd have to make special rules for each allowed HTML tag, and presumably each allowed attribute and property thereof, and maybe even every combination of them (!). Would there be any advantage in leaving those out of the grammar and keeping Parser and Sanitizer separate as they are now?
I don't get why we even allow HTML tags, other than convenience. It's not like the final output of the encyclopaedia is guaranteed to bear any resemblance to a web page...
For instance, why do we support <b>? We have '''... It's just not clean. (I dare someone to reply that ''' is semantic markup...heh.)
Steve
On Thu, Aug 17, 2006 at 05:12:22PM +0200, Steve Bennett wrote:
A URL-like thing that was typed without any particular surrounding syntax (it gets autolinked). Similar lookahead would presumably be necessary for RFCs, ISBNs, and PMIDs (okay, that's enough to convince me to agree that they should be ditched :) ). In general, a lookahead of no more than one character is considered desirable.
What can I say, I don't like these "freelinks". They just don't seem clean. Normal text which spontaneously turns into a link without any special punctuation or anything. Hmm.
Parsers don't have to be single pass... and ours isn't now.
Is it?
He *seems* to be saying that you'd have to make special rules for each allowed HTML tag, and presumably each allowed attribute and property thereof, and maybe even every combination of them (!). Would there be any advantage in leaving those out of the grammar and keeping Parser and Sanitizer separate as they are now?
I don't get why we even allow HTML tags, other than convenience. It's not like the final output of the encyclopaedia is guaranteed to bear any resemblance to a web page...
For instance, why do we support <b>? We have '''... It's just not clean. (I dare someone to reply that ''' is semantic markup...heh.)
It is; I believe it renders as <strong>, not <bold>.
Cheers, -- jra
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Parsers don't have to be single pass... and ours isn't now.
Is it?
Nope.
It is; I believe it renders as <strong>, not <bold>.
Nope. It used to render as <strong>, but it was used as <b>, so it was changed to output <b>. Dedication to semantic content includes the ability to recognize when content does *not* have semantic value, and use non-semantic tags. The "semantic web" is not people replacing <b> with <strong> and <s> with <del>, so that they aren't using deprecated markup; it's people using tags to add meaning to the content. If you aren't adding meaning, you shouldn't use semantic tags, and in practice, '' and ''' are not used in any way exclusively to denote emphasis, but rather to denote all sorts of things.
As a concrete example of this, people using screenreaders complained that all sorts of words would be incongruously emphasized (movie titles, loan words, you name it). Therefore, they were changed to the non-semantic <i> and <b>, which are typically not emphasized by screenreaders (although they may be by some).
On Thu, Aug 17, 2006 at 01:27:16PM -0400, Simetrical wrote:
On 8/17/06, Jay R. Ashworth jra@baylink.com wrote:
Parsers don't have to be single pass... and ours isn't now.
Is it?
Nope.
It is; I believe it renders as <strong>, not <bold>.
Nope. It used to render as <strong>, but it was used as <b>, so it was changed to output <b>. Dedication to semantic content includes the ability to recognize when content does *not* have semantic value, and use non-semantic tags. The "semantic web" is not people replacing <b> with <strong> and <s> with <del>, so that they aren't using deprecated markup; it's people using tags to add meaning to the content. If you aren't adding meaning, you shouldn't use semantic tags, and in practice, '' and ''' are not used in any way exclusively to denote emphasis, but rather to denote all sorts of things.
As a concrete example of this, people using screenreaders complained that all sorts of words would be incongruously emphasized (movie titles, loan words, you name it). Therefore, they were changed to the non-semantic <i> and <b>, which are typically not emphasized by screenreaders (although they may be by some).
Aha!
Got it. Excellent point.
Cheers, -- jra
On 8/17/06, Simetrical Simetrical+wikitech@gmail.com wrote:
content. If you aren't adding meaning, you shouldn't use semantic tags, and in practice, '' and ''' are not used in any way exclusively to denote emphasis, but rather to denote all sorts of things.
In particular, I don't think ''' is *ever* used to add emphasis. It's used to highlight key words, or occasionally to point out that a linked article has Lots Of Good Stuff, but due to those meanings, we tend to avoid using it for emphasis. Italics ('') are probably 50/50 emphasis or album titles....
Steve
Steve Bennett wrote:
In particular, I don't think ''' is *ever* used to add emphasis. It's used to highlight key words, or occasionally to point out that a linked article has Lots Of Good Stuff, but due to those meanings, we tend to avoid using it for emphasis. Italics ('') are probably 50/50 emphasis or album titles....
In typographic circles, the use of bold to indicate emphasis is strongly discouraged.
In terms of italics, they should be used for emphasis, though the <em> tags are more suited for the task. Although most major style guides (such as MLA) recommend the use of italics for book titles, it is technically improper to mark them with <i> since this precludes a simple CSS change to change the style from one format to another. Ideally, it would probably be <span class="book">.
Now, if any editor actually wants to type that all out...
On 8/17/06, Edward Z. Yang edwardzyang@thewritingpot.com wrote:
Although most major style guides (such as MLA) recommend the use of italics for book titles, it is technically improper to mark them with <i> since this precludes a simple CSS change to change the style from one format to another. Ideally, it would probably be <span class="book">.
Now, if any editor actually wants to type that all out...
More like <cite class="book">. Too bad IE doesn't support generated content; then we could have <cite class="article"> as well, with auto-quotes. :)
On 8/17/06, Edward Z. Yang edwardzyang@thewritingpot.com wrote:
In terms of italics, they should be used for emphasis, though the <em> tags are more suited for the task. Although most major style guides (such as MLA) recommend the use of italics for book titles, it is technically improper to mark them with <i> since this precludes a simple CSS change to change the style from one format to another. Ideally, it would probably be <span class="book">.
Now, if any editor actually wants to type that all out...
I guess the smallest we could do it would be {{t|Gone with the Wind}} - it's not inconceivable.
Steve
On 8/17/06, Steve Bennett stevage@gmail.com wrote:
I guess the smallest we could do it would be {{t|Gone with the Wind}}
- it's not inconceivable.
Heh, it's actually only 2 characters longer than the corresponding raw wikitext: ''Gone with the Wind''. But so much uglier.
Steve
Steve Bennett wrote:
On 8/17/06, Steve Bennett stevage@gmail.com wrote:
I guess the smallest we could do it would be {{t|Gone with the Wind}}
- it's not inconceivable.
Heh, it's actually only 2 characters longer than the corresponding raw wikitext: ''Gone with the Wind''. But so much uglier.
But think of the power that you can potentially bring:
{{u|Gone With the Wind}} - underline it
{{uc|Gone With the Wind}} - Uppercase it
{{r|Gone With the Wind}} - Reverse it
{{sup|Gone With the Wind}} - Super script it
{{r,sup|Gone With the Wind}} - Reverse and Superscript it
{{s|Gone WIth teh Wind}} - Sub-Title
and so on and so on.
I jest of course.....maybe.
r
Ron Hall wrote:
But think of the power that you can potentially bring {{u|Gone With the Wind}} - underline it {{uc|Gone With the Wind}} - Uppercase it {{r|Gone With the Wind}} - Reverse it {{sup|Gone With the Wind}} - Super script it {{r,sup|Gone With the Wind}} - Reverse and Superscript it {{s|Gone WIth teh Wind}} - Sub-Title and so on and so on. I jest of course.....maybe.
I do hope you're jesting. That would totally kill the whole point of switching to an uglier syntax.
Edward Z. Yang wrote:
Ron Hall wrote:
But think of the power that you can potentially bring {{u|Gone With the Wind}} - underline it {{uc|Gone With the Wind}} - Uppercase it {{r|Gone With the Wind}} - Reverse it {{sup|Gone With the Wind}} - Super script it {{r,sup|Gone With the Wind}} - Reverse and Superscript it {{s|Gone WIth teh Wind}} - Sub-Title and so on and so on. I jest of course.....maybe.
I do hope you're jesting. That would totally kill the whole point of switching to an uglier syntax.
But look at the power of the richness of the options. Literally could reduce everything to a single syntax with a bazillion options. Of course it might start looking like lisp (bad lisp for sure), but think of the shortened learning curve.
Yes I'm still jesting. I lurk in the darkness just waiting for odd things like this to indeed make a pseudo-point about how with only a slight exaggeration someone might get the idea that I "foolishly" float here. Yes it is ugly and even bad. I'm sure a better solution will be found.
r
On 8/17/06, Ron Hall ron.hall@mcgill.ca wrote:
But think of the power that you can potentially bring {{u|Gone With the Wind}} - underline it {{uc|Gone With the Wind}} - Uppercase it {{r|Gone With the Wind}} - Reverse it {{sup|Gone With the Wind}} - Super script it {{r,sup|Gone With the Wind}} - Reverse and Superscript it {{s|Gone WIth teh Wind}} - Sub-Title
Dunno about the others, {{sup|...}} behaves exactly as you describe, on en at least.
One possible benefit of such templates is that they become documented in a predictable place. Simply search for template:sup and you'll find what it does, and when to and not to use it. How would you find out what <sup> did?
Steve
Dunno about the others, {{sup|...}} behaves exactly as you describe, on en at least.
Really?!? I'm gonna hafta' dig deepa' wid this stuff :)
One possible benefit of such templates is that they become documented in a predictable place. Simply search for template:sup and you'll find what it does, and when to and not to use it. How would you find out what <sup> did?
Same way I learned just about every other tag/directive/class, etc. I read a book (or two or seven). I'm a great believer in reading then doing, though there is no experience quite like the one afforded by raw empiricism (warts and all). There is no better teacher than making a mistake and no better student than the one that admits the error and learns from it. As Piet Hein said (and I paraphrase) "The secret to success is to err and err and err, but less and less and less."
But that's just me.
r
Moin,
On Thursday 17 August 2006 20:51, Ron Hall wrote:
Steve Bennett wrote:
On 8/17/06, Steve Bennett stevage@gmail.com wrote:
I guess the smallest we could do it would be {{t|Gone with the Wind}}
- it's not inconceivable.
Heh, it's actually only 2 characters longer than the corresponding raw wikitext: ''Gone with the Wind''. But so much uglier.
But think of the power that you can potentially bring {{u|Gone With the Wind}} - underline it {{uc|Gone With the Wind}} - Uppercase it {{r|Gone With the Wind}} - Reverse it {{sup|Gone With the Wind}} - Super script it {{r,sup|Gone With the Wind}} - Reverse and Superscript it {{s|Gone WIth teh Wind}} - Sub-Title
POD is a I<fun> B<language>, C<widely> used, but L<hated|see hated> by some.
:)
Best wishes,
Tels
- -- Signed on Thu Aug 17 23:41:41 2006 with key 0x93B84C15. Visit my photo gallery at http://bloodgate.com/photos/ PGP key on http://bloodgate.com/tels.asc or per email.
"People who are rather more than six feet tall and nearly as broad across the shoulders often have uneventful journeys. People jump out at them from behind rocks then say things like, 'Oh. Sorry. I thought you were someone else.'" -- Terry Pratchett
On 17/08/06, Steve Bennett stevage@gmail.com wrote:
Other places require fixed, but large, amounts of lookahead... freelinks require at least 9 characters, for example. Technically, I'll admit that a
What's a freelink?
I think he's referring to the parser's behaviour of replacing anything that looks like a valid UR[IL] when the protocol is in a specific configuration array.
Rob Church
On 8/18/06, Rob Church robchur@gmail.com wrote:
On 17/08/06, Steve Bennett stevage@gmail.com wrote:
Other places require fixed, but large, amounts of lookahead... freelinks require at least 9 characters, for example. Technically, I'll admit that a
What's a freelink?
I think he's referring to the parser's behaviour of replacing anything that looks like a valid UR[IL] when the protocol is in a specific configuration array.
You mean, when the characters that immediately precede "://" are in some defined list like "ftp", "http" etc? Ok, from a quick test, I see that we recognise http, ftp, https, gopher, and mailto, but not vrml or unsurprisingly file.
And presumably the 9 characters is counting the case https://a
Btw, anyone want another funny corner case (as I gather these things are called?)
http:// <- not recognised as a "magic word"
http://. <- recognised as a "magic word", but only the "http://" is linked - the . isn't.
Heh. Are we going to put that in the formal grammar?
Steve
On 18/08/06, Steve Bennett stevage@gmail.com wrote:
You mean, when the characters that immediately precede "://" are in some defined list like "ftp", "http" etc? Ok, from a quick test, I see that we recognise http, ftp, https, gopher, and mailto, but not vrml or unsurprisingly file.
It's completely configurable by the site administrator. Incidentally, adding "file://" to the array doesn't guarantee success, since many modern browsers will point-blank refuse to render such links in their default configuration.
Rob Church
On Fri, Aug 18, 2006 at 05:41:49PM +0200, Steve Bennett wrote:
You mean, when the characters that immediately precede "://" are in some defined list like "ftp", "http" etc? Ok, from a quick test, I see that we recognise http, ftp, https, gopher, and mailto, but not vrml or unsurprisingly file.
And presumably the 9 characters is counting the case https://a
Btw, anyone want another funny corner case (as I gather these things are called?) http:// <- not recognised as a "magic word" http://. <- recognised as a "magic word", but only the "http://" is linked - the . isn't.
Heh. Are we going to put that in the formal grammar?
I don't think we would need to.
That's not actually "part of wikitext". It's a special case, implemented by the parser in a late pass to make users' lives easier, as several other things are which are not "really" part of wikitext.
Strictly speaking, parser functions aren't part of wikitext either, I don't think...
Cheers, -- jra
On 8/18/06, Steve Bennett stevage@gmail.com wrote:
Heh. Are we going to put that in the formal grammar?
Oh, it gets better. Ok, here's a test for everyone. Without actually testing these, predict how each of the following will be parsed:
*http://[[foo]].com
*http://www.[[foo]].com
*[[http://foo.com]]
*[[the site http://foo.com is cool]]
*[[http://foo.com is cool]]
*[[foo http://foo.com]]
*[[ http://foo.com]]
*[[http://foo.com ]]
The difference in the tooltips generated by the last two is particularly interesting!
How did you score?
Steve
On Fri, Aug 18, 2006 at 05:55:45PM +0200, Steve Bennett wrote:
On 8/18/06, Steve Bennett stevage@gmail.com wrote:
Heh. Are we going to put that in the formal grammar?
Oh, it gets better. Ok, here's a test for everyone. Without actually testing these, predict how each of the following will be parsed:
*http://[[foo]].com
*http://www.[[foo]].com
*[[http://foo.com]]
*[[the site http://foo.com is cool]]
*[[http://foo.com is cool]]
*[[foo http://foo.com]]
*[[ http://foo.com]]
*[[http://foo.com ]]
I very strongly suspect that no one who hasn't lived intimately with the parser code (that's, what, 4 or 5 people? :-) could predict what those things would do; they all seem implementation defined to me.
Or almost all...
They do illustrate why making a late pass to hotlink URLs might not be a safe approach, though.
Cheers, -- jra
On 18/08/06, Jay R. Ashworth jra@baylink.com wrote:
I very strongly suspect that no one who hasn't lived intimately with the parser code (that's, what, 4 or 5 people? :-) could predict what those things would do; they all seem implementation defined to me.
Being intimate with the parser leads to bad, bad things. Post-commit, it's like waking up the morning after to find your one-nighter's led to her stealing your cash and anything else you had on you.
Rob Church
On Fri, Aug 18, 2006 at 06:08:56PM +0100, Rob Church wrote:
On 18/08/06, Jay R. Ashworth jra@baylink.com wrote:
I very strongly suspect that no one who hasn't lived intimately with the parser code (that's, what, 4 or 5 people? :-) could predict what those things would do; they all seem implementation defined to me.
Being intimate with the parser leads to bad, bad things. Post-commit, it's like waking up the morning after to find your one-nighter's led to her stealing your cash and anything else you had on you.
Well, that's why we're on about this, right? Trying to replace Hooker 0.6 with Wife 1.0?
Cheers, -- jra
On Wed, Aug 16, 2006 at 11:15:34PM -0400, Ivan Krstić wrote:
Jay R. Ashworth wrote:
there's a project underway to formalize the very wikitext grammar with which you're compatible.
There are efforts to produce something formal-like for some subset of the wikitext syntax. To my knowledge, no useful formal grammar can be produced for the complete syntax.
I suspect this will be circular:
How complete a spec you can produce will depend on how much you're willing to change when you find out you can't specify it completely.
How much such a spec will be permitted to prescribe change will depend on how much change it needs to prescribe.
I don't know how useful it will be to have wikitext specified strictly, and I don't think we'll be able to tell until we see how far off we are, and what might need to be tweaked.
It wouldn't be *completely* necessary to have a flag day for all possible changes to wikitext...
Cheers, -- jr 'see also [[mode bit]]' a
Jay R. Ashworth wrote:
I don't know how useful it will be to have wikitext specified strictly, and I don't think we'll be able to tell until we see how far off we are, and what might need to be tweaked.
This was discussed at hacking days. Brion's pronouncement is that the current syntax will admit essentially no backwards-incompatible changes.
On 8/16/06, Jay R. Ashworth jra@baylink.com wrote:
I don't know how useful it will be to have wikitext specified strictly,
Well, obviously the idea is to feed it to yacc or something, so as much as possible would be good. But we might be stuck with a couple of extra passes in the PHP, or manually coded into the C parser, if it turns out to really not be usefully specifiable. (But hey, C++ and Perl can get away with grammars that aren't LALR(1), why can't we?)
Ivan Krstić wrote:
Jay R. Ashworth wrote:
there's a project underway to formalize the very wikitext grammar with which you're compatible.
There are efforts to produce something formal-like for some subset of the wikitext syntax. To my knowledge, no useful formal grammar can be produced for the complete syntax.
Just to mention this again...
[[Parsing expression grammar]]s are rather powerful tools for this sort of thing; they are extremely useful for situations where you have ad-hoc BNF grammars with ambiguity and/or added constraints, and in particular where the boundary between the lexical and syntactic layers is ill-defined, as in Wikitext, which often results in multiple possible parses or unlimited lookahead.
PEGs can be generated directly from BNF, with the application of tweaks to avoid left-recursion.
[[Packrat parser]]s implement PEGs in linear time. The basic idea is brute-force top-down matching with memoization to prevent a combinatorial explosion: this seemingly naive approach actually performs very well in practice, and the amount of memory consumed is surprisingly small, by the standards of modern computers. They can be compiled into native code using a parser-generator approach, and can usually be sped up/slimmed down quite considerably using a bit of hackery, where small but common special case parts of the grammar can be replaced by hand-written code using fast algorithms such as regexps.
There are a number of GPL'd implementations of packrat parsers already available in a variety of languages.
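To give a flavour of the technique, here is a deliberately tiny Python sketch (illustrative only, not one of those implementations): ordered choice plus a memo table keyed on (rule, position) is essentially all there is to it.

from functools import lru_cache

TEXT = "''italic'' and '''bold'''"

def delimited(pos, marker):
    # Match marker ... marker starting at pos; return the end position or None.
    if not TEXT.startswith(marker, pos):
        return None
    end = TEXT.find(marker, pos + len(marker))
    return None if end == -1 else end + len(marker)

@lru_cache(maxsize=None)            # the packrat memo table: (rule, pos) -> result
def parse(rule, pos):
    if rule == 'bold':
        return delimited(pos, "'''")
    if rule == 'italic':
        return delimited(pos, "''")
    if rule == 'inline':            # PEG ordered choice: bold / italic / any one character
        while pos < len(TEXT):
            nxt = parse('bold', pos) or parse('italic', pos)
            pos = nxt if nxt is not None else pos + 1
        return pos
    raise ValueError(rule)

print(parse('inline', 0) == len(TEXT))   # True: the whole string is consumed

In this toy the memo barely gets exercised, but on a grammar with real backtracking it is what keeps the parse linear.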
-- Neil
On Thu, Aug 17, 2006 at 08:53:34AM +0100, Neil Harris wrote:
Just to mention this again...
And I'm glad you did, Neil, because I don't recall having seen it before.
[[Packrat parser]]s implement PEGs in linear time. The basic idea is brute-force top-down matching with memoization to prevent a combinatorial explosion: this seemingly naive approach actually performs very well in practice, and the amount of memory consumed is surprisingly small, by the standards of modern computers. They can be compiled into native code using a parser-generator approach, and can usually be sped up/slimmed down quite considerably using a bit of hackery, where small but common special case parts of the grammar can be replaced by hand-written code using fast algorithms such as regexps.
There are a number of GPL'd implementations of packrat parsers already available in a variety of languages.
Sounds worth looking into. Do you have any experience working with them? Who's the Hot Guy on this topic?
Cheers, -- jra
On 8/17/06, mingli yuan mingli.yuan@gmail.com wrote:
The project is now in the early stage, and you can download and try it at http://sourceforge.net/projects/dajoo/
Hi
Very interesting project. Maybe we can work together on a Java-based offline editor and MediaWiki parser/renderer?
My main project page is here: http://sourceforge.net/projects/plog4u
I'm also trying to create a Javascript based version with the help of the Google Web Toolkit "Java to JavaScript" translator.
I created a client-side GWT based wikipedia syntax renderer and editor demo:
* http://bliki.info/wiki/
Current GWT based source code download:
* http://bliki.info/wiki/download/info.bliki.wiki.zip