Is wikitext an HTML shorthand language, or a real markup language?

Steve Bennett

10 Dec 2007

9:33 a.m.

I gather that when wikis were first invented, it was more or less assumed that everyone knew some HTML and the wikitext syntax language was simply a shorthand. However, is this still the case, or should it be considered a markup language in its own right?

Here's a simple example to demonstrate the difference:

----
:one
:two

:three


:four
----

If you consider wikitext to be a markup/formatting/display language, then you would expect there to be little or no gap between "one" and "two", a much bigger gap between "two" and "three", and twice as big again between "three" and "four".

That's not what happens. Instead, it's converted to this:

<dl> <dd>one</dd> <dd>two</dd> </dl> <dl> <dd>three</dd> </dl> <p><br /></p> <dl> <dd>four</dd> </dl>

The significant thing is that the only difference between one/two and two/three is that the latter is two separate "definition lists" rather than two list items in the same list. The visual difference is minute.

So, to properly use the : operator, you need to know how the : is converted into HTML, then how that HTML will render in most browsers.
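As a rough illustration of the conversion described above, here is a minimal Python sketch. It is not MediaWiki's parser; the function name `colon_lines_to_html` is made up, and the rule that each blank line beyond the first becomes an empty paragraph is an assumption inferred from the `<p><br /></p>` in the example output. It also assumes one blank line between "two" and "three" and two blank lines between "three" and "four".

```python
def colon_lines_to_html(wikitext: str) -> str:
    """Toy model of the ':' handling described above (not MediaWiki's parser)."""
    out = []
    in_list = False
    blank_run = 0  # consecutive blank lines seen since the last ':' line
    for line in wikitext.split("\n"):
        if line.startswith(":"):
            if not in_list:
                # Assumption: every blank line beyond the first becomes an
                # empty paragraph, matching the <p><br /></p> in the example.
                out.extend(["<p><br /></p>"] * max(blank_run - 1, 0))
                out.append("<dl>")
                in_list = True
            out.append(f"<dd>{line[1:].strip()}</dd>")
            blank_run = 0
        elif not line.strip():
            if in_list:
                out.append("</dl>")  # a blank line ends the current list
                in_list = False
            blank_run += 1
        else:
            raise ValueError(f"unsupported line in this sketch: {line!r}")
    if in_list:
        out.append("</dl>")
    return " ".join(out)


print(colon_lines_to_html(":one\n:two\n\n:three\n\n\n:four"))
```

Under those assumptions the sketch prints the same tag sequence as the HTML quoted above, which shows why the author has to think in terms of lists and paragraphs rather than in terms of vertical gaps.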

Is this really what we want? Don't we generally want the wikitext to render the way the user expects it to, rather than how HTML dictates it should render? Should we consider going as far as to convert the above into <span> tags with styles to indent a certain distance from the left, rather than abusing the <dl> tag this way?

Opinions and comments please!

Steve

Jared Williams

10 Dec

10:23 a.m.

...

-----Original Message----- From: wikitext-l-bounces@lists.wikimedia.org [mailto:wikitext-l-bounces@lists.wikimedia.org] On Behalf Of Steve Bennett Sent: 10 December 2007 01:33 To: Wikitext-l Subject: [Wikitext-l] Is wikitext an HTML shorthand language,or a real markup language?

I gather that when wikis were first invented, it was more or less assumed that everyone knew some HTML and the wikitext syntax language was simply a shorthand. However, is this still the case, or should it be considered a markup language in its own right?

Here's a simple example to demonstrate the difference:

:one
:two

:three


:four

If you consider wikitext to be a markup/formatting/display language, then you would expect there to be little or no gap between "one" and "two", a much bigger gap between "two" and "three", and twice as big again between "three" and "four".

That's not what happens. Instead, it's converted to this:

<dl> <dd>one</dd> <dd>two</dd> </dl> <dl> <dd>three</dd> </dl> <p><br /></p> <dl> <dd>four</dd> </dl>

The significant thing is that the only difference between one/two and two/three is that the latter is two separate "definition lists" rather than two list items in the same list. The visual difference is minute.

I'm not sure I understand the problem: IE7, Mozilla 2.0.0.11 and Opera 9.5 all render the HTML much as you describe in the paragraph after the wikimarkup.

...

So, to properly use the : operator, you need to know how the : is converted into HTML, then how that HTML will render in most browsers.

Is this really what we want? Don't we generally want the wikitext to render the way the user expects it to, rather than how HTML dictates it should render? Should we consider going as far as to convert the above into <span> tags with styles to indent a certain distance from the left, rather than abusing the <dl> tag this way?

Opinions and comments please!

Yeah, I think some of the use of indenting is a bit nutty. For instance

::::{|
| cell
|}

Would be far better to use CSS margin-left & -right to indent, imo.

Another abuse that I think is far worse than this one is..

;foo :bar

Is turned into something along the lines of...

Why is the last whitespace in a <dt> turned into a &nbsp;? Surely it can't be for aligning anything?

Jared

Steve Bennett

11 Dec

9 a.m.

On 12/10/07, Jared Williams jared.williams1@ntlworld.com wrote:

...

I'm not sure I understand the problem: IE7, Mozilla 2.0.0.11 and Opera 9.5 all render the HTML much as you describe in the paragraph after the wikimarkup.

For me, Firefox gives a gap of about 12 pixels between the first two, 16 pixels between the second two, and 43 pixels between the third two. The gap is bigger, but not usefully, saliently bigger.

...

Why is the last whitespace in a <dt> turned into a &nbsp;? Surely it can't be for aligning anything?

"Why" is a difficult question when discussing the current parser.

Steve

Thomas Dalton

10 Dec

9:43 p.m.

...

If you consider wikitext to be a markup/formatting/display language, then you would expect there to be little or no gap between "one" and "two", a much bigger gap between "two" and "three", and twice as big again between "three" and "four".

I disagree with that assertion. HTML is a markup language by any definition I know and it collapses all whitespace to a single space.

I would say that wikitext *is* a shorthand for HTML on the grounds that it is defined only by how it is rendered as HTML (i.e. by what parser.php does). Two carriage returns aren't defined as being the start of a new paragraph; they're defined as being the start of a new <p> block. That would change if we put together a proper grammar, etc.

Steve Bennett

11 Dec

9:03 a.m.

On 12/11/07, Thomas Dalton thomas.dalton@gmail.com wrote:

...

I would say that wikitext *is* a shorthand for HTML on the grounds that it is defined only by how it is rendered as HTML (i.e. by what parser.php does). Two carriage returns aren't defined as being the start of a new paragraph; they're defined as being the start of a new <p> block. That would change if we put together a proper grammar, etc.

OK, you're saying it is an HTML shorthand, but that would change with a proper grammar? Well, I'm putting together a proper grammar, so how should it change? It probably shouldn't change at all for the moment, so we can meet our goal of having a drop-in parser that's as close as possible to the current one.

I'm finding a constant struggle between the goal of matching the current behaviour, and the knowledge that some of the current behaviour happens purely by chance, and no one really designed it.

Steve

Thomas Dalton

12 Dec

5:15 a.m.

...

Ok, you're saying it is an HTML shorthand, but that would change with a proper grammar? Well, I'm putting together a proper grammar, so how should it change? It probably shouldn't change at all for the moment, so we can meet our goal of having a drop in parser that's as close as possible to the current one.

It's just a matter of semantics. At the moment, wikitext is defined in terms of HTML; when you finish your grammar, we can define it in terms of what it should look like instead. A language defined in terms of HTML is just a shorthand for HTML; a language defined in its own terms is a language in its own right.

...

I'm finding a constant struggle between the goal of matching the current behaviour, and the knowledge that some of the current behaviour happens purely by chance, and no one really designed it.

Yeah... I can see how that would be a struggle. I don't think there's much we can do about it, though...

Jim Wilson

6:21 a.m.

Is the new grammar going to allow hard coded HTML such as <div class="someClass">whatever</div>?

If so, then wikitext is bound to remain semantically just HTML shorthand, right? Since the only valid output mechanism is HTML.

Or, is the new grammar going to take HTML tags as input and turn them into part of the abstract syntax tree? I can't see how that would be avoided since the apostrophes in the following should be literal apostrophes:

<span>'''Something </span>'''

-- Jim R. Wilson (jimbojw)

Thomas Dalton

6:34 a.m.

On 11/12/2007, Jim Wilson wilson.jim.r@gmail.com wrote:

...

Is the new grammar going to allow hard coded HTML such as <div class="someClass">whatever</div>?

The current parser allows straight HTML, there's just a whitelist of allowed tags (for example, <b>).

...

If so, then wikitext is bound to remain semantically just HTML shorthand, right? Since the only valid output mechanism is HTML.

There's no reason wikitext can't be parsed into something else, PDF for example. The main reason it can't be done now is because we don't know what the wikitext means other than what HTML it turns into. Once we have a proper grammar, we can (at least in theory) parse it into anything we like.
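To make the point concrete, here is a small Python sketch of one tree rendered by two back ends. Everything in it is hypothetical: the node kinds `PARA`, `BOLD` and `TEXT`, the `Node` class, and the choice of LaTeX as a stand-in for "something other than HTML" are assumptions for illustration, not part of any existing wikitext grammar.

```python
import html
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # "PARA", "BOLD" or "TEXT" (hypothetical node names)
    text: str = ""
    children: list = field(default_factory=list)


def to_html(node: Node) -> str:
    """Render the tree as HTML."""
    inner = "".join(to_html(c) for c in node.children)
    if node.kind == "TEXT":
        return html.escape(node.text)
    if node.kind == "BOLD":
        return f"<b>{inner}</b>"
    if node.kind == "PARA":
        return f"<p>{inner}</p>"
    raise ValueError(f"unknown node kind: {node.kind}")


def to_latex(node: Node) -> str:
    """Render the same tree as LaTeX (no escaping of special characters here)."""
    inner = "".join(to_latex(c) for c in node.children)
    if node.kind == "TEXT":
        return node.text
    if node.kind == "BOLD":
        return "\\textbf{" + inner + "}"
    if node.kind == "PARA":
        return inner + "\n\n"
    raise ValueError(f"unknown node kind: {node.kind}")


tree = Node("PARA", children=[
    Node("TEXT", "plain and "),
    Node("BOLD", children=[Node("TEXT", "bold")]),
])
print(to_html(tree))
print(to_latex(tree))
```

The two renderers share nothing except the tree, which is only possible once the meaning of each node is defined independently of the HTML it currently turns into.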

Virgil Ierubino

14 Dec

1:04 a.m.

I was unaware that the

:indent

markup actually used DL/DD tags. That's an awful idea. It should either be <blockquote> or <div style="margin-left: 1.5em">. Changing that immediately would have no adverse effects.
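A minimal Python sketch of the `<div>` alternative, under stated assumptions: the function name `indent_line_to_div` is invented, and multiplying the 1.5em step by the number of leading colons is a guess at how deeper indents might be handled, since the proposal above only covers a single colon.

```python
import html


def indent_line_to_div(line: str, step_em: float = 1.5) -> str:
    """Render one wikitext line, turning leading ':' characters into a CSS margin."""
    depth = len(line) - len(line.lstrip(":"))  # number of leading colons
    text = html.escape(line[depth:].strip())
    if depth == 0:
        return f"<p>{text}</p>"
    # Assumption: each extra colon adds one more step of left margin.
    return f'<div style="margin-left: {depth * step_em:g}em">{text}</div>'


print(indent_line_to_div(":indent"))
print(indent_line_to_div("::::deeply indented"))
```

With this approach a line indented by four colons becomes one element with a larger margin instead of four nested definition lists.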

The definition list conversion should only happen when you begin with a semi-colon:

; definiendum : definiens

As for the thread's main question, I agree that wikitext should parse more intuitively - but this would break current pages. Currently, HTML is actually allowed inside wikitext - we can't escape it.

Steve Bennett

20 Dec

9:06 a.m.

On 12/14/07, Virgil Ierubino virgil.ierubino@gmail.com wrote:

...

I was unaware that the

:indent

markup actually used DL/DD tags. That's an awful idea. It should either be <blockquote> or <div style="margin-left: 1.5em">. Changing that immediately would have no adverse effects.

See http://bugzilla.wikipedia.org/show_bug.cgi?id=4521

...

The definition list conversion should only happen when you begin with a semi-colon:

; definiendum : definiens

But what about:

; definiendum : definiens : et definiens et cetera

...

As for the thread's main question, I agree that wikitext should parse more intuitively - but this would break current pages. Currently, HTML is actually allowed inside wikitext - we can't escape it.

A limited subset of HTML is allowed, unless the $wgAllowHTML (or whatever) option is enabled, in which case, <html> blocks with *anything* (<script> included) are allowed.

Using the existence of the raw HTML aspect of wikitext as an argument for wikitext being bound to HTML is circular, however. I'm guessing the sequence went something like this:

1. Start with raw, editable HTML pages
2. Add convenient syntax for bold, italics, headings etc
3. Restrict the range of HTML allowed to prevent script kiddies

There's no particular reason that stages 4 and 5 couldn't be:

4. Add more syntax to make everything that's possible with HTML be possible with native wikitext syntax
5. Prohibit all HTML unless some flag is enabled.

Then:

6. Adjust the interpretations of wikitext, and improve the grammar so that semantics are no longer specified in terms of HTML. (Yes, I'm looking at you, {|..|}.)

I think my question has been answered though: in the short to medium term, wikitext is very much bound to HTML and is defined in terms of a conversion to HTML, rather than its ultimate graphical rendering.

Steve

Steve Bennett

8:56 a.m.

On 12/12/07, Jim Wilson wilson.jim.r@gmail.com wrote:

...

Is the new grammar going to allow hard coded HTML such as <div class="someClass">whatever</div>?

The "new grammar" is, in theory, merely a codification of the "old grammar". So, yes. Any deviations from what is currently allowed are kept to a minimum, and usually only occur in unused syntax.

...

If so, then wikitext is bound to remain semantically just HTML shorthand, right? Since the only valid output mechanism is HTML.

Hmmm. That's true, but it would be easy to excise the raw HTML aspect if we wanted to get away from it being bound to HTML. Also, since the "hard coded HTML" acceptable is well defined, it would theoretically be possible for a parser to actually interpret that HTML and do something else with it. Like converting <b>bold</b> to an actual interpretation of bold.

...

Or, is the new grammar going to take HTML tags as input and turn them into part of the abstract syntax tree? I can't see how that would be

I think it's best if the AST is closely bound to the original code, warts and all. That means we can cache the tree, for example. In my current grammar, ''' converts to a B node in the AST, while <b> will convert to something else, like HTML_TAG or something.
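A rough Python sketch of that distinction. The node kinds `B` and `HTML_TAG` are the names mentioned above; the `TEXT` kind, the `Node` class, the token regex and `parse_inline` are all hypothetical, and the sketch handles only `'''` and bare tags without attributes.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str  # "B", "HTML_TAG", or "TEXT" (TEXT is an assumed name)
    value: str = ""
    children: list = field(default_factory=list)


TOKEN = re.compile(r"'''|</?[a-zA-Z]+>")


def parse_inline(src: str) -> list:
    """''' pairs become B nodes; literal HTML tags are kept as HTML_TAG leaves."""
    root, stack, pos = [], [], 0
    current = root
    for m in TOKEN.finditer(src):
        if m.start() > pos:
            current.append(Node("TEXT", src[pos:m.start()]))
        tok = m.group()
        if tok == "'''":
            if stack:  # closing an open bold run
                current = stack.pop()
            else:  # opening a bold run
                bold = Node("B")
                current.append(bold)
                stack.append(current)
                current = bold.children
        else:
            current.append(Node("HTML_TAG", tok))
        pos = m.end()
    if pos < len(src):
        current.append(Node("TEXT", src[pos:]))
    return root


for node in parse_inline("'''wiki bold''' and <b>html bold</b>"):
    print(node.kind, repr(node.value), [c.kind for c in node.children])
```

Because the tree records which syntax the author actually typed, it can be cached and later rendered or converted without guessing whether a bold run came from `'''` or from `<b>`.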

...

avoided since the apostrophes in the following should be literal apostrophes:

<span>'''Something </span>'''

"Should"? Currently it renders as bold.

Steve

Jared Williams

12 Dec

3:30 a.m.

...

Is this really what we want? Don't we generally want the wikitext to render the way the user expects it to, rather than how HTML dictates it should render? Should we consider going as far as to convert the above into <span> tags with styles to indent a certain distance from the left, rather than abusing the <dl> tag this way?

Opinions and comments please!

Just browsing bugzilla, and it seems it is classed as a minor bug: http://bugzilla.wikimedia.org/show_bug.cgi?id=4521

Jared

Steve Bennett

5:10 a.m.

On 12/12/07, Jared Williams jared.williams1@ntlworld.com wrote:

...

Just browsing bugzilla, and it seems it is classed as a minor bug: http://bugzilla.wikimedia.org/show_bug.cgi?id=4521

Cool. I'll try to keep : and ;/: separate in the grammar then.

Steve
