I'm not sure when it happened, but somewhere the main Wikipedia site started putting out HTML with:
<?xml version="1.0" encoding="utf-8"?> ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" ...>
None of the above is even close to the truth, and for the code to claim it is will cause many problems. The HTTP headers simply claim that the text is "HTML" and that the encoding is "ISO-8859-1", both of which are accurate and useful; this is the way UseMod has always worked and still does, so I don't know where the new stuff came from. This needs to be fixed.
"Meta" is too broken now for me to check.
lcrocker@nupedia.com writes:
I'm not sure when it happened, but somewhere the main Wikipedia site started putting out HTML with:
<?xml version="1.0" encoding="utf-8"?> ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" ...>
I believe this changed just a few days to a week before I added my bug report to http://www.wikipedia.com/wiki/Wikipedia_bugs on 17 November. It's rather obvious in Mozilla and Netscape 6, because the DOCTYPE triggers strict parsing mode and the tag soup method of nested <DL> elements for indents stops working.
On Tue, Jan 01, 2002 at 05:50:24PM +1300, Carey Evans wrote:
lcrocker@nupedia.com writes:
I'm not sure when it happened, but somewhere the main Wikipedia site started putting out HTML with:
<?xml version="1.0" encoding="utf-8"?> ... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" ...>
I believe this changed just a few days to a week before I added my bug report to http://www.wikipedia.com/wiki/Wikipedia_bugs on 17 November. It's rather obvious in Mozilla and Netscape 6, because the DOCTYPE triggers strict parsing mode and the tag soup method of nested <DL> elements for indents stops working.
Well, it says HTML 4 Transitional now, but it's still not right. There's an <hr> inside a <font> tag and the nested <dl> stuff still isn't considered valid. (Mozilla ignores it)
Taral wrote:
Well, it says HTML 4 Transitional now, but it's still not right. There's an <hr> inside a <font> tag and the nested <dl> stuff still isn't considered valid. (Mozilla ignores it)
"HTML 4 Transitional" is the standard that I use on my websites, and I think it could be a good idea for Wikipedia as well. You use W3C's validator to check your web pages, and when they pass, you receive instructions for how to include a logotype-link at the bottom of each page. See e.g. the yellow W3C logo at the bottom of these pages:
http://aronsson.se/ http://elektrosmog.nu/ http://www.lysator.liu.se/runeberg/ http://susning.nu/
The latter is a UseModWiki, just like Wikipedia, so it is indeed possible.
I have no _particular_ opinions. All I want is for our pages to work in all browsers. What should I do?
Obviously, there's no way for us to pass all the webpages through a validator (or, is there?) since end users might well write invalid or at least not-perfect html.
Lars Aronsson wrote:
Taral wrote:
Well, it says HTML 4 Transitional now, but it's still not right. There's an <hr> inside a <font> tag and the nested <dl> stuff still isn't considered valid. (Mozilla ignores it)
"HTML 4 Transitional" is the standard that I use on my websites, and I think it could be a good idea for Wikipedia as well. You use W3C's validator to check your web pages, and when they pass, you receive instructions for how to include a logotype-link at the bottom of each page. See e.g. the yellow W3C logo at the bottom of these pages:
http://aronsson.se/ http://elektrosmog.nu/ http://www.lysator.liu.se/runeberg/ http://susning.nu/
The latter is a UseModWiki, just like Wikipedia, so it is indeed possible.
--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik
Teknikringen 1e, SE-583 30 Linköping, Sweden
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/

[Wikipedia-l] To manage your subscription to this list, please go here:
http://www.nupedia.com/mailman/listinfo/wikipedia-l
Jimmy wrote:
Obviously, there's no way for us to pass all the webpages through a validator (or, is there?) since end users might well write invalid or at least not-perfect html.
When I first thought of this, I thought it would never work because you allow HTML markup inside the user-edited wiki text. Then I found out that this is wrong. Think again! Each user can run her favorite page through the validator and fix the HTML code in the user-edited wiki text until it passes. All that the system programmers have to do is to make sure the system-generated HTML code passes through the validator. Testing a few pages without user-edited HTML code should be sufficient for this. After this, your users will help you find the remaining bugs.
In the UseModWiki, this code adds a logotype-link to the W3C validator:
sub GetMinimumFooter {
    my $w3c = "<a href='http://validator.w3.org/check/referer'><img border=0\n"
        . "src='/img/valid-html40.png'\n"
        . "alt='Valid HTML 4.0!' align=top height=31 width=88></a>";
    if ($FooterNote ne '') {
        return T($FooterNote) . $w3c . $q->end_html;  # Allow local translations
    }
    return $w3c . $q->end_html;
}
You will want to download the valid-html40.png to your own site from http://www.w3.org/Icons/valid-html40.png
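For what it's worth, the system-generated skeleton can be given a rough local sanity check before anyone bothers the W3C validator. This is only a sketch in Python, not anything in UseModWiki, and the list of empty elements is an assumption:

```python
# Rough sanity check for system-generated HTML: report end tags that
# close out of order. Not a real validator -- just a nesting check.
from html.parser import HTMLParser

EMPTY = {"br", "hr", "img", "meta", "link", "input"}  # no end tag in HTML 4

class NestingCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in EMPTY:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append("unexpected </%s>" % tag)

checker = NestingCheck()
checker.feed('<body><font size="2"><hr></font></p></body>')
print(checker.errors)  # the stray </p> is flagged
```

A real DTD-based validation (what validator.w3.org does) checks far more than nesting, but this kind of check catches the obvious mismatched-tag bugs cheaply.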
I made a quick test for the German Wikipedia through the validator (http://validator.w3.org/check?uri=http%3A%2F%2Fde.wikipedia.com%2F&chars...). The validator assumed the content was HTML 2.0 (!), because of the insufficient DTD declaration at the top of the HTML code.
The beginning of the page:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML><HEAD><TITLE>Wikipedia: HomePage</TITLE>
</HEAD><BODY BGCOLOR="white">
Should look something like this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML><HEAD><TITLE>Wikipedia: HomePage</TITLE>
</HEAD><BODY bgcolor="white" alink="c00000" vlink="#000080" link="#0000c0" text="black">
--- Lars Aronsson lars@aronsson.se wrote:
Jimmy wrote:
Obviously, there's no way for us to pass all the webpages through a validator (or, is there?) since
end users might well write invalid or at least not-perfect html.
When I first thought of this, I thought it would never work because you allow HTML markup inside the user-edited wiki text. Then I found out that this is wrong. Think again! Each user can run her favorite page through the validator and fix the HTML code in the user-edited wiki text until it passes.
[snip] I can think of another possibility. Since we only allow a limited subset of HTML tags, and we only interpret something as an HTML tag if it's properly formed (right?), we are already on the way to producing valid HTML output. Basically, if the user types invalid HTML into the wiki page, the software should either automatically correct it on display, or display it raw. Thus e.g. "<nonexistenttag>" will be output to the browser as "&lt;nonexistenttag&gt;", "<LI>" will automatically have a closing "</LI>" added to it, and "<a href="http://www.foo.com" attributeimadeup="yes">" will have the "attributeimadeup" attribute dropped from it.
That way, the output of the script on display will always be valid HTML, even if the user types bad HTML in to begin with.
Of course, this will make the parser in the script more complicated...
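A minimal sketch of the whitelisting part of that idea, in Python. The allowed-tag table and the regex are illustrative assumptions, not what the wiki script actually does, and auto-closing unclosed elements like <LI> would need a fuller parser than this:

```python
# Whitelist-based tag filter: unknown tags are escaped and shown raw,
# known tags keep only whitelisted attributes.
import re
from html import escape

ALLOWED = {"b": set(), "i": set(), "li": set(), "a": {"href"}}

TAG_RE = re.compile(r'<(/?)(\w+)((?:\s+[\w-]+(?:="[^"]*")?)*)\s*>')

def sanitize(text):
    def fix(m):
        closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3)
        if name not in ALLOWED:
            return escape(m.group(0))  # show unknown tags raw
        kept = []
        for am in re.finditer(r'([\w-]+)="([^"]*)"', attrs):
            if am.group(1).lower() in ALLOWED[name]:
                kept.append(' %s="%s"' % (am.group(1), am.group(2)))
        return "<%s%s%s>" % (closing, name, "".join(kept))
    return TAG_RE.sub(fix, text)

print(sanitize('<nonexistenttag>'))
print(sanitize('<a href="http://www.foo.com" attributeimadeup="yes">ok</a>'))
```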
In fact, if we wanted to, we could even produce XHTML output to the browser, and yet take HTML in the page source.
Simon J Kissane
I have a meta keywords tag automatically generated from the links in the article body. Check the output of any article at http://wikipedia.sourceforge.net/fpw/wiki.phtml
Language will be done as well soon.
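Presumably something along these lines -- a sketch of deriving a keywords tag from the bracketed links in an article body. The link syntax, function name, and limit are assumptions, not Magnus's actual code:

```python
# Build a <meta name="keywords"> tag from the [[link]] targets in the
# wiki text, deduplicated and capped at a fixed number of keywords.
import re
from html import escape

def keywords_meta(wikitext, limit=10):
    # Matches [[Target]] and [[Target|display label]]
    links = re.findall(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]", wikitext)
    seen = []
    for target in links:
        target = target.strip()
        if target and target not in seen:
            seen.append(target)
    return '<meta name="keywords" content="%s">' % escape(", ".join(seen[:limit]))

print(keywords_meta("See [[HTML]] and [[World Wide Web|the web]] and [[HTML]]."))
```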
Magnus
-----Original Message-----
From: wikipedia-l-admin@nupedia.com [mailto:wikipedia-l-admin@nupedia.com] On Behalf Of Kurt Jansson
Sent: Sunday, January 13, 2002 8:22 PM
To: wikipedia-l@nupedia.com
Subject: Re: [Wikipedia-l] Re: Incorrect HTML encoding--something new?
Hello everyone!
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<HTML><HEAD><TITLE>Wikipedia: HomePage</TITLE> </HEAD><BODY BGCOLOR="white">
Silly question from me: Why don't we have meta tags like keywords, language, author <g>, etc.?
Bye, Kurt
Last I heard, we solicited a report from Manning Bartlett on metadata, how to keep track of it, why we should keep track of it, and so forth.
The suggestion is old, but basically, we shouldn't do it at all unless we can do it right (and, above all, unless we can keep it *simple for users*).
Automatically-generated metadata is a Good Thing, I think.
Larry
On Sun, 13 Jan 2002, Magnus Manske wrote:
I have a meta keywords tag automatically generated from the links in the article body. Check the output of any article at http://wikipedia.sourceforge.net/fpw/wiki.phtml
Language will be done as well soon.
Magnus
[snip]
I just ran the main page of the test site through the W3C HTML validator (checking for HTML 4.01), and it shows 4 errors:
- 3 of them I blame on the validator, for it won't parse <a href="wiki.phtml?foo=this&bar=that">, saying "unknown entity bar". Obviously, it is trying to parse the URL...
- The other error is a </p> that lacks a matching <p>.
If these are the problems we're facing with incorrect HTML, there's no need to be scared... :)
Magnus
-----Original Message-----
From: wikipedia-l-admin@nupedia.com [mailto:wikipedia-l-admin@nupedia.com] On Behalf Of Larry Sanger
Sent: Sunday, January 13, 2002 11:34 PM
To: wikipedia-l@nupedia.com
Subject: RE: [Wikipedia-l] Re: Incorrect HTML encoding--something new?
Last I heard, we solicited a report from Manning Bartlett on metadata, how to keep track of it, why we should keep track of it, and so forth.
The suggestion is old, but basically, we shouldn't do it at all unless we can do it right (and, above all, unless we can keep it *simple for users*).
Automatically-generated metadata is a Good Thing, I think.
Larry
On Sun, 13 Jan 2002, Magnus Manske wrote:
I have a meta keywords tag automatically generated from the links in the article body. Check the output of any article at http://wikipedia.sourceforge.net/fpw/wiki.phtml
Language will be done as well soon.
Magnus
[snip]
Magnus Manske wrote:
href="wiki.phtml?foo=this&bar=that"> , saying "unknown entity bar". Obviously, it is trying to parse the URL...
Actually, whenever you represent the & character in HTML 4.0, you should write &amp; even if it is in the middle of a CGI URL. This does not change the URL, only its representation in the HTML document.
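In other words (a Python illustration of the same point, assuming nothing about the wiki script itself):

```python
# The URL itself keeps a plain "&"; only its representation inside the
# HTML document is escaped to "&amp;". Browsers undo the escaping when
# they follow the link, so the CGI parameters are unchanged.
from html import escape

url = "wiki.phtml?foo=this&bar=that"
link = '<a href="%s">that page</a>' % escape(url)
print(link)  # <a href="wiki.phtml?foo=this&amp;bar=that">that page</a>
```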
If these are the problems we're facing with incorrect HTML, there's no need to be scared... :)
It is not very difficult to do HTML 4.0 correctly. The change to XHTML is a much bigger step (and unnecessary, if you ask me).
Lars Aronsson lars@aronsson.se writes:
Magnus Manske wrote:
href="wiki.phtml?foo=this&bar=that"> , saying "unknown entity bar". Obviously, it is trying to parse the URL...
Actually, whenever you represent the & character in HTML 4.0, you should write &amp; even if it is in the middle of a CGI URL. This does not change the URL, only its representation in the HTML document.
Perhaps just use ";" instead to separate parameters in the URL.
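That is, accept ";" alongside "&" on the server side, so the URLs written into HTML need no entity escaping at all. A sketch, not the actual UseModWiki parameter handling:

```python
# Treat ";" and "&" interchangeably as query-string separators.
def parse_query(qs):
    params = {}
    for part in qs.replace(";", "&").split("&"):
        if "=" in part:
            key, value = part.split("=", 1)
            params[key] = value
    return params

print(parse_query("foo=this;bar=that"))  # {'foo': 'this', 'bar': 'that'}
```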
It is not very difficult to do HTML 4.0 correctly. The change to XHTML is a much bigger step (and unnecessary, if you ask me).
XHTML 1.0 Basic shouldn't be that big a step. But its strict rules are a bear if we want to encourage users to write HTML markup (instead of Wiki markup). I don't know if we want to encourage that, though.
Robert Bihlmeyer wrote:
XHTML 1.0 Basic shouldn't be that big a step. But its strict rules are a bear if we want to encourage users to write HTML markup (instead of Wiki markup). I don't know if we want to encourage that, though.
My guess is that no, we don't want people writing HTML markup. We want to support a few tags, as we do now, because they're so commonly known and useful.
Above all, we don't want people to _have_ to do anything even the _least bit_ hard.
--Jimbo
A comment that MIGHT OR MIGHT NOT :-) be relevant to Lars' suggestion (I'll leave that for you to decide):
I'd generally be opposed to any extra step, particularly anything that requires any degree of technical know-how, that would make it more difficult or complicated for the ordinary (non-coding) Joe to save Wikipedia articles.
Larry
On Sun, 13 Jan 2002, Lars Aronsson wrote:
Jimmy wrote:
Obviously, there's no way for us to pass all the webpages through a validator (or, is there?) since end users might well write invalid or at least not-perfect html.
When I first thought of this, I thought it would never work because you allow HTML markup inside the user-edited wiki text. Then I found out that this is wrong. Think again! Each user can run her favorite page through the validator and fix the HTML code in the user-edited wiki text until it passes. [snip]
Lars Aronsson wrote:
Each user can run her favorite page through the validator and fix the HTML code in the user-edited wiki text until it passes.
I dunno. Seems pretty unwiki to me. One of the key principles of wikipedia is that anyone can edit without having to know much of anything. Requiring them to input semantically perfect HTML seems a bit much.
Of course, I have no opposition to having some easy way for people who care about such things (not me) to run any page through a validator _if they feel like it_.
The validator assumed the contents was HTML 2.0 (!), because of the insufficient DTD declaration at the top of the HTML code.
O.k., well, I'm in over my head a bit. My feeling is that we should be generating a very low level of HTML, so that all browsers can be sure to render it. So why do we want to force 4.0?
--Jimbo
Jimmy Wales jwales@bomis.com writes:
O.k., well, I'm in over my head a bit. My feeling is that we should be generating a very low level of HTML, so that all browsers can be sure to render it. So why do we want to force 4.0?
I agree. As long as there are no 4.0 features in the generated HTML, there's no need to specify it.
OTOH, I know of no old browser that would refuse to display a page just because it specifies a DTD that is new to it.
Jimmy Wales jwales@bomis.com writes:
My guess is that no, we don't want people writing HTML markup. We want to support a few tags, as we do now, because they're so commonly known and useful.
Above all, we don't want people to _have_ to do anything even the _least bit_ hard.
Agreed again. I even thought about rewikifying commonly known elements like <b> back to '''. Otherwise the following situation is possible:
Author 1 knows a bit of HTML, but less Wiki. She writes "<b>term</b>". Author 2 knows no HTML, but is fluent in Wiki. He is confused by "<b>". Also note that while Wiki markup is fully documented on our site, HTML is not.
For the time being rewikifying can be done by people, but the software could do it as well.
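If the software did it, the simplest cases are just substitutions on save. A sketch -- the patterns here are illustrative, not anything from the actual wiki script:

```python
# "Rewikifying": translate the best-known HTML elements back into wiki
# markup so pages stay consistent for non-HTML-literate editors.
import re

def rewikify(text):
    text = re.sub(r"(?is)<b>(.*?)</b>", r"'''\1'''", text)
    text = re.sub(r"(?is)<i>(.*?)</i>", r"''\1''", text)
    return text

print(rewikify("A <b>bold</b> claim."))  # A '''bold''' claim.
```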
The problem is not big, though; the only place where Wikipedians commonly use HTML is tables -- many think HTML tables are superior to the Wiki tables. That could, of course, be fixed in Magnus's script.