Re: [Wikitech-l] Wikitech-l Digest, Vol 65, Issue 34 - Wikitech-l

28 Dec 2008

SOS...SOS...SOS...HELP... Slovakia (Slovensko): thank you for the e-mail, but I don't know what it says because I don't speak your language. Slovak or Czech, please...
______________________________________________________________
...
From: wikitech-l-request@lists.wikimedia.org
To: wikitech-l@lists.wikimedia.org
Date: 28.12.2008 04:16
Subject: Wikitech-l Digest, Vol 65, Issue 34
Send Wikitech-l mailing list submissions to
wikitech-l@lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
or, via email, send a message with subject or body 'help' to
wikitech-l-request@lists.wikimedia.org
You can reach the person managing the list at
wikitech-l-owner@lists.wikimedia.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Wikitech-l digest..."
Today's Topics:
  1. Data center move in Amsterdam: expect some downtime (Mark Bergsma)
  2. Re: IBM DB2 patch for MediaWiki (Jesús Quiroga)
  3. Re: Anchors haven't id attribute (Danny B.)
  4. Re: Anchors haven't id attribute (Brion Vibber)
  5. Re: IBM DB2 patch for MediaWiki (Aryeh Gregor)
  6. Re: Anchors haven't id attribute (Aryeh Gregor)
  7. Re: Anchors haven't id attribute (Danny B.)
  8. Re: Anchors haven't id attribute (Aryeh Gregor)

Message: 1
Date: Fri, 26 Dec 2008 22:05:17 +0100
From: Mark Bergsma mark@wikimedia.org
Subject: [Wikitech-l] Data center move in Amsterdam: expect some
downtime
To: Wikimedia developers wikitech-l@lists.wikimedia.org, Wikimedia
Foundation Mailing List foundation-l@lists.wikimedia.org
Message-ID: 4955470D.10503@wikimedia.org
Content-Type: text/plain; charset=ISO-8859-1
In the upcoming days until new years we will be moving our servers and
other equipment in the Amsterdam data center location to a new data
center. Unfortunately this might result in some down time and hiccups of
certain web sites & services, although we will try to keep this to a
minimum.
On Sunday the 28th, between 09:00 and 11:00 UTC we will migrate our
network in Amsterdam to new equipment. All services located there will
be unreachable for a brief period. Traffic for the main wikis will be
rerouted to the Florida cluster however, and should remain unaffected.
In the days after we will be moving the servers themselves. Some
services, such as the mailing lists server, the subversion server and
the toolserver cluster, will be down for a number of hours while the
equipment is being moved. Traffic for the wikis should again remain
largely unaffected.
We hope to have the entire migration finished before we enter the last
few hours of 2008... and start 2009 with a clean sheet. Happy Holidays
everyone!
-- 
Mark Bergsma mark@wikimedia.org
System & Network Administrator, Wikimedia Foundation

Message: 2
Date: Sat, 27 Dec 2008 07:23:00 +0100
From: Jesús Quiroga jquiroga@pobox.com
Subject: Re: [Wikitech-l] IBM DB2 patch for MediaWiki
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Message-ID: 4955C9C4.9080509@pobox.com
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Hello.
After a few days of pondering the issues, I would like to explain what I 
suggested in my previous message, in more detail and (hopefully) more 
clearly.
What I'm about to say is pretty abstract, so it's difficult to convey 
the right meaning. Please forgive me if I say something you already 
know, or just nonsense :-)
Jesús Quiroga wrote:
...
I believe a better solution is to design a domain-specific language, an 
idea not very different from your first one.
This DSL would model the interaction between the application and the DB 
as it is now, and would be designed to evolve. That's it.
The problem I discuss is how to best access the data store from an 
application. I believe the right answer is different for each project, 
but it's not difficult to evaluate the alternatives, one by one, in a 
given context. I think it is worthwhile to do that in the context of 
MediaWiki.
I will refer to wiki modules and databases as if they were 'hosts' 
connected to a 'network', to highlight the role of languages in the 
operation of the system at runtime.
The first way to access the data store is the 'direct' one:
   [polyglot wiki] <--- mysDataL ---> [mysql]
   [polyglot wiki] <--- posDataL ---> [postgresql]
   [polyglot wiki] <--- db2DataL ---> [db2]
Here, the polyglot wiki module talks to every database using the proper 
languages. 'mysDataL' means 'the data language understood by MySQL', 
'posDataL' means 'the data language understood by PostgreSQL', etc.
The polyglot wiki promises to learn several languages and to speak them 
correctly forever, so, if a new database comes along or any of their 
data languages evolves, the polyglot wiki is forced to adapt at a 
potentially great cost. Besides, any change to the database schema can 
trigger lots of updates to the wiki code, and be very costly too.
The advantages of this way are well known: it is fast, it needs no 
design work, and it is easy to understand.
The drawbacks are apparently few, but devastating: verbose and complex 
code in multiple places in the wiki module, very costly to maintain, 
even more costly to evolve. All changes cost a lot, in time and effort.
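To make that maintenance cost concrete, here is a minimal Python sketch of the 'direct' way. The function name `recent_changes_query` and the table and column names are hypothetical, chosen only for illustration; the point is that each call site has to know every dialect.

```python
def recent_changes_query(backend: str, limit: int) -> str:
    """Build a 'newest N rows' query by hand for each backend (hypothetical schema)."""
    base = "SELECT rc_id, rc_title FROM recentchanges ORDER BY rc_id DESC"
    if backend in ("mysql", "postgresql"):
        return f"{base} LIMIT {limit}"
    if backend == "db2":
        # DB2 spells row limiting differently from MySQL and PostgreSQL.
        return f"{base} FETCH FIRST {limit} ROWS ONLY"
    raise ValueError(f"unsupported backend: {backend}")


print(recent_changes_query("mysql", 10))
print(recent_changes_query("db2", 10))
```

Every new backend, or any change in one dialect, would mean revisiting each function written this way.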
The second way to access the data store that is usually considered is 
the 'indirect' one:
   [wiki] <--- wikiDataL ---> [polyglot translator]
   [polyglot translator] <--- mysDataL ---> [mysql]
   [polyglot translator] <--- posDataL ---> [postgresql]
   [polyglot translator] <--- db2DataL ---> [db2]
Here, wikiDataL means 'some relational data definition and manipulation 
language suitable for use by the wiki'.
The polyglot translator promises to learn wikiDataL and the other 
dialects and to evolve with them, so it has all the problems the wiki 
had in the direct way, but now the cost is lower because a lot of 
complexity is 'hidden' inside the translator and can't reach the wiki. 
As a result, wiki code is not updated as much, and it's much cleaner and 
less verbose.
The advantages of this way are: wiki module code is simpler, cost of 
evolution is reduced.
The drawbacks are apparently many: it's slower, it needs design, it's 
harder to understand, it adds a new language (wikiDataL), and the 
translator can be very complex. However, the need to reduce the cost of 
change is usually so great that these inconveniences are minor in comparison.
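A minimal Python sketch of the 'indirect' way, under the assumption that wikiDataL is nothing more than "select these columns from this table, newest first, with a limit". The class names `Translator` and `Db2Translator` and the schema are hypothetical and are not taken from MediaWiki.

```python
class Translator:
    """Hypothetical 'polyglot translator' speaking a generic SQL dialect."""

    def limit_clause(self, limit: int) -> str:
        return f"LIMIT {limit}"

    def select(self, table: str, columns: list, order_by: str, limit: int) -> str:
        cols = ", ".join(columns)
        return (f"SELECT {cols} FROM {table} "
                f"ORDER BY {order_by} DESC {self.limit_clause(limit)}")


class Db2Translator(Translator):
    """Only the dialect-specific piece is overridden."""

    def limit_clause(self, limit: int) -> str:
        return f"FETCH FIRST {limit} ROWS ONLY"


def recent_changes(db: Translator, limit: int) -> str:
    # Wiki code speaks only in tables, columns and limits (the 'wikiDataL').
    return db.select("recentchanges", ["rc_id", "rc_title"], "rc_id", limit)


print(recent_changes(Translator(), 10))
print(recent_changes(Db2Translator(), 10))
```

The wiki-side function no longer branches on the backend; the dialect differences live inside the translator classes.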
Now the interesting bit begins. A third possible way to access the data 
store, the 'interpreted' one:
   [wiki] <--- wikiNeedL ---> [polyglot interpreter]
   [polyglot interpreter] <--- mysDataL ---> [mysql]
   [polyglot interpreter] <--- posDataL ---> [postgresql]
   [polyglot interpreter] <--- db2DataL ---> [db2]
Here, wikiNeedL means 'some language adequate for the wiki to express 
its data access needs and nothing else'.
wikiNeedL is the domain-specific language I wrote about in my previous 
message.
The differences between wikiDataL and wikiNeedL are mainly these:
  - wikiNeedL would contain just enough wiki concepts to express the 
wiki's needs, so it's effectively confined to that domain. wikiDataL 
belongs to the relational data model domain, which is quite different.
  - in general, wikiNeedL would have different semantics than the 
dialects understood by the databases, so the translation step becomes 
more like interpretation, rather than just syntactic transformations. 
wikiDataL usually has the same semantics as the dialects.
  - wikiNeedL would contain just enough concepts to satisfy current 
needs, and will be open to extension. wikiDataL aims to be 
general-purpose and to fulfill current and future needs.
The main reason to consider the 'interpreted' way is, of course, that it 
helps reduce even more the cost to achieve change.
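A rough Python sketch of what a wikiNeedL request might look like, assuming a single need ("the newest revisions of an article"). Everything here is hypothetical: the `LatestRevisions` and `Interpreter` names, the table and column names, and the `?` placeholder style. It is meant only to show that the request carries wiki concepts and that the interpreter alone decides how they map onto a backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatestRevisions:
    """Hypothetical wikiNeedL request: the newest revisions of one article."""
    article: str
    count: int


class Interpreter:
    """Hypothetical 'polyglot interpreter' for a single backend."""

    def __init__(self, backend: str):
        self.backend = backend

    def plan(self, need):
        """Return (sql, params) for a need; unknown needs are rejected."""
        if isinstance(need, LatestRevisions):
            if self.backend == "db2":
                limit = f"FETCH FIRST {need.count} ROWS ONLY"
            else:
                limit = f"LIMIT {need.count}"
            sql = ("SELECT rev_id FROM revision "
                   "JOIN page ON rev_page = page_id "
                   "WHERE page_title = ? ORDER BY rev_id DESC " + limit)
            return sql, (need.article,)
        raise NotImplementedError(type(need).__name__)


sql, params = Interpreter("postgresql").plan(LatestRevisions("Main_Page", 3))
print(sql, params)
```

Extending the language would mean adding a new request type and a new branch in the interpreter, without touching existing wiki code.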
So that's what I was talking about. I will say more about the 
differences between the indirect and the interpreted ways in a future 
message.
Thanks for your attention.

Message: 3
Date: Sat, 27 Dec 2008 13:05:53 +0100 (CET)
From: Danny B. Wikipedia.Danny.B@email.cz
Subject: Re: [Wikitech-l] Anchors haven't id attribute
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Message-ID: 18263.21683-30277-135341947-1230379553@email.cz
Content-Type: text/plain; charset="iso-8859-2"
...
------------ Original message ------------
From: Brion Vibber brion@wikimedia.org
Subject: Re: [Wikitech-l] Anchors haven't id attribute
Date: 26.12.2008 06:30:00

On 12/25/08 4:32 AM, Danny B. wrote:
...
I have reverted both revisions in r45021 and r45022 because it caused massive
invalidity of pages.
Given that we've been outputting these as "id" attributes for the last 
few years already (as output by Tidy), I have reverted your revert in 
r45044 pending further discussion.
-- brion
Well, the id was added _only_ to those tags where the name was transferable to id, that is, where it started with an ASCII letter. It was _never_ added to those that did not conform to this rule (the regexp mentioned in my previous post). This is easy to prove, either by running an older revision of MediaWiki or by testing in Tidy directly:
Take this code excerpt (and wrap it with minimal XHTML document stuff) and run it through Tidy:
<a name="X"></a><h2> <span class="mw-headline"> X </span></h2>
<a name="1X"></a><h2> <span class="mw-headline"> 1X </span></h2>
<a name=".C3.81X"></a><h2> <span class="mw-headline"> ÁX </span></h2>
<a name="-X"></a><h2> <span class="mw-headline"> -X </span></h2>
The result will be:
<a name="X" id="X"></a><h2><span class="mw-headline">X</span></h2>
<a name="1X"></a><h2><span class="mw-headline">1X</span></h2>
<a name=".C3.81X"></a><h2><span class="mw-headline">ÁX</span></h2>
<a name="-X"></a><h2><span class="mw-headline">-X</span></h2>
Now, let me repeat how the "id" is defined:
1: XHTML is a reformulation of HTML 4 as an XML 1.0 application.
2: That means it takes every single definition from HTML 4 and keeps it unless it is overridden in XHTML.
3: The id and name were defined in HTML 4 as /[A-Za-z][A-Za-z0-9:_.-]*/  [1] [2]
4: The name has been redefined as NMTOKEN  [2] [3]
5: The id has never been redefined and thus keeps the definition mentioned in point 3 above.
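The pattern from point 3 can be applied to the four anchor names from the Tidy example above. A small Python sketch follows; the helper name `id_for_anchor` is hypothetical, and the code only mimics the "copy name to id when it is a valid ID token" behaviour described here, it is not Tidy's actual implementation.

```python
import re

# Pattern quoted in point 3 for HTML 4 ID and NAME tokens.
HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def id_for_anchor(name: str):
    """Return the id to copy from `name`, or None if it is not a valid ID token."""
    return name if HTML4_ID.fullmatch(name) else None


for name in ["X", "1X", ".C3.81X", "-X"]:
    print(repr(name), "->", repr(id_for_anchor(name)))
```

Under this rule only "X" receives an id; the other three names start with a digit, a period or a hyphen and are left without one.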
This is how the id in XHTML has been handled ever since XHTML came out. I also think that if the handling of something as important as id were incorrect, it would have been fixed in the validator over so many years.
So currently, all wikis in non-Latin scripts are totally invalid according to the W3C validator, and major parts of wikis that use non-ASCII characters are invalid as well. It is therefore very hard to find other validity errors in the code when every other page produces worthless positives. :-(
One more thing at the end: I think the current rendering with controversial ids has brought more negatives (such as greatly reducing the ability to find the really invalid parts of the code) than positives. It was working correctly before, so what benefit has it actually brought? On the other hand, it has brought this controversy.
I take the point that I (and the majority of people around the world, the validator, Tidy and so many other tools) _may_ be wrong in the interpretation of the definition of id. But I guess that until the authoritative tools, such as the validator or Tidy, are fixed on this issue, so that it can be proved that we render the page correctly, we should not render it that way. As I mentioned above, it was working correctly before, so there is no urgency to force the new rendering, since it does not correct any mistake or malfunction.
[1] http://www.w3.org/TR/html401/types.html#type-name
[2] http://www.w3.org/TR/xhtml1/#C_8
[3] http://www.w3.org/TR/2000/WD-xml-2e-20000814#NT-Nmtoken
Kind regards
Danny B.

Message: 4
Date: Sat, 27 Dec 2008 12:14:33 -0800
From: Brion Vibber brion@wikimedia.org
Subject: Re: [Wikitech-l] Anchors haven't id attribute
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Message-ID: 49568CA9.6090104@wikimedia.org
Content-Type: text/plain; charset=ISO-8859-2; format=flowed
[snip]
Maybe we should just fix the normalization function the way we'd already 
planned to, so that it'll work right the way we'd already planned to?
-- brion

Message: 5
Date: Sat, 27 Dec 2008 18:25:10 -0500
From: "Aryeh Gregor" Simetrical+wikilist@gmail.com
Subject: Re: [Wikitech-l] IBM DB2 patch for MediaWiki
To: "Wikimedia developers" wikitech-l@lists.wikimedia.org
Message-ID:
7c2a12e20812271525g3055d1ffr855bc071028262b@mail.gmail.com
Content-Type: text/plain; charset=UTF-8
On Sat, Dec 27, 2008 at 1:23 AM, Jesús Quiroga jquiroga@pobox.com wrote:
...
The second way to access the data store that is usually considered is
the 'indirect' one:
   [wiki] <--- wikiDataL ---> [polyglot translator]
   [polyglot translator] <--- mysDataL ---> [mysql]
   [polyglot translator] <--- posDataL ---> [postgresql]
   [polyglot translator] <--- db2DataL ---> [db2]
Here, wikiDataL means 'some relational data definition and manipulation
language suitable for use by the wiki'.
This is what we currently use, and I don't think we're going to
seriously consider changing it without some very compelling arguments
being presented.  Incremental improvements to our current way of doing
things (cutting back on raw queries, moving MySQL-specific stuff from
Database to DatabaseMySql, defining more clearly what Database methods
mean and avoiding undefined behavior) seem entirely sufficient to
allow support for any number of additional database backends.
...
The differences between wikiDataL and wikiNeedL are mainly these:
  - wikiNeedL would contain just enough wiki concepts to express the
wiki's needs, so it's effectively confined to that domain. wikiDataL
belongs to the relational data model domain, which is quite different.
  - in general, wikiNeedL would have different semantics than the
dialects understood by the databases, so the translation step becomes
more like interpretation, rather than just syntactic transformations.
wikiDataL usually has the same semantics as the dialects.
  - wikiNeedL would contain just enough concepts to satisfy current
needs, and will be open to extension. wikiDataL aims to be
general-purpose and to fulfill current and future needs.
In practice, wikiNeedL would be drastically more complicated, if I
understand you correctly.  Its basic semantic units would be things
like articles, users, revisions, etc., instead of rows, columns, and
tables.  We *have* a wikiNeedL, in fact: it's called "calling the
appropriate Article method" or whatever.  Most code doesn't have to
manually do queries.  Further abstraction of the database queries
would be possible, but I question its usefulness.

Message: 6
Date: Sat, 27 Dec 2008 19:06:24 -0500
From: "Aryeh Gregor" Simetrical+wikilist@gmail.com
Subject: Re: [Wikitech-l] Anchors haven't id attribute
To: "Wikimedia developers" wikitech-l@lists.wikimedia.org
Message-ID:
7c2a12e20812271606u6b188edj22a6579803ccd43d@mail.gmail.com
Content-Type: text/plain; charset=UTF-8
On Sat, Dec 27, 2008 at 3:14 PM, Brion Vibber brion@wikimedia.org wrote:
...
[snip]
Maybe we should just fix the normalization function the way we'd already
planned to, so that it'll work right the way we'd already planned to?
Done in r45109.  I notice, by the way, that HTML5 allows any string
not containing whitespace for id's . . . yet another case where it
clearly wins the "don't gratuitously cause pain to developers"
contest.
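To illustrate the difference, here is a small Python sketch comparing the two rules as they are described in this thread: the HTML 4 pattern quoted earlier, and "any non-empty string without whitespace" for HTML5. The function names are hypothetical, and the set of whitespace characters is an assumption (space, tab, line feed, form feed, carriage return).

```python
import re

HTML4_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")
HTML5_SPACE = set(" \t\n\f\r")  # assumed set of HTML5 space characters


def valid_html4_id(value: str) -> bool:
    """True if `value` matches the HTML 4 ID token pattern."""
    return HTML4_ID.fullmatch(value) is not None


def valid_html5_id(value: str) -> bool:
    """True if `value` is non-empty and contains no space characters."""
    return value != "" and not any(ch in HTML5_SPACE for ch in value)


for candidate in ["X", "1X", ".C3.81X", "-X", "two words"]:
    print(repr(candidate), valid_html4_id(candidate), valid_html5_id(candidate))
```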

Message: 7
Date: Sun, 28 Dec 2008 03:02:26 +0100 (CET)
From: Danny B. Wikipedia.Danny.B@email.cz
Subject: Re: [Wikitech-l] Anchors haven't id attribute
To: Wikimedia developers wikitech-l@lists.wikimedia.org
Message-ID: 18278.21698-2886-1817746719-1230429746@email.cz
Content-Type: text/plain; charset="iso-8859-2"
...
------------ Original message ------------
From: Aryeh Gregor Simetrical+wikilist@gmail.com
Subject: Re: [Wikitech-l] Anchors haven't id attribute
Date: 28.12.2008 01:07:08

On Sat, Dec 27, 2008 at 3:14 PM, Brion Vibber brion@wikimedia.org wrote:
...
[snip]
Maybe we should just fix the normalization function the way we'd already
planned to, so that it'll work right the way we'd already planned to?
Done in r45109.  I notice, by the way, that HTML5 allows any string
not containing whitespace for id's . . . yet another case where it
clearly wins the "don't gratuitously cause pain to developers"
contest.
*sigh*
Why do we have to hunt for some other solution when we have a fully working, fully valid and fully intuitive one?
OK, let's make some summary about three versions we have:
Terms used:
  - old version: the version used for many years, until r44896
  - mid version: the r44896 way
  - new version: the r45109 way
The old version was used for many years. It was fully valid: ids were present only where they could be copied from name AND complied with the regexp mentioned in previous posts. This was done automatically by Tidy. It was fully intuitive: you just wrote [[#Foo]] and it linked to the section named Foo, or you added #Foo to the URL in the address bar and you got to the proper section as well. And it worked properly.
The mid version brought the "feature" that all name attributes were duplicated as ids. That caused massive invalidity of pages, especially on non-Latin and non-ASCII wikis. However, anchor creation remained intuitive.
The new version prepends x to all anchors to solve the problem introduced by the mid version, the massive invalidity of pages. So it solved one problem (which would not have needed solving had we kept the old version) but brought at least two other major ones:
The first major problem is that this change breaks millions of existing links to sections: links used on wiki pages, links on external sites, links in people's bookmarks, in emails, forum threads, etc. Well, OK, let's discount all the external stuff, since we have no influence on it, but we still have millions of links on our own wikis which have stopped working since r45109.
The other major problem is that from this point on the anchor links are no longer intuitive: we are now pushing people to constantly think about prepending x when creating anchor links. No more simple copy-pasting of the headline.
As a side effect, we are now adding unnecessary work for people on non-Latin wikis by making them switch to a Latin keyboard, or click on edittools or whatever, just to get the single "x" character into the edit box to create the anchor link.
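For reference, the kind of normalization being discussed can be sketched in Python as below. This is only an approximation under stated assumptions, not the code from r45109: the function name `escape_anchor` is hypothetical, spaces are assumed to become underscores, and the "x" is assumed to be prepended only when the dot-encoded id does not already start with an ASCII letter.

```python
import re
from urllib.parse import quote


def escape_anchor(heading: str) -> str:
    """Dot-encode a heading and force the result to start with a letter."""
    # Percent-encode the UTF-8 bytes, then turn '%' into '.', which is
    # how a name such as ".C3.81X" arises from the heading "ÁX".
    encoded = quote(heading.replace(" ", "_"), safe=":").replace("%", ".")
    if not re.match(r"[A-Za-z]", encoded):
        encoded = "x" + encoded  # assumed behaviour of the new version
    return encoded


for heading in ["X", "1X", "ÁX", "-X"]:
    print(repr(heading), "->", escape_anchor(heading))
```

With these assumptions, "X" stays "X", while "1X", "ÁX" and "-X" become "x1X", "x.C3.81X" and "x-X".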
So let me summarize in points:

First we did not have any problem at all.
Second we had one problem.
Third we "solved" the problem but created at least two new.

I am pretty scared what's coming next... :-/
One question for the end: what is the benefit of either the mid or the new version over the old one? What new functionality or feature does it bring, or which existing bug does it fix?
Kind regards
Danny B.

Message: 8
Date: Sat, 27 Dec 2008 22:15:24 -0500
From: "Aryeh Gregor" Simetrical+wikilist@gmail.com
Subject: Re: [Wikitech-l] Anchors haven't id attribute
To: "Wikimedia developers" wikitech-l@lists.wikimedia.org
Message-ID:
7c2a12e20812271915gf2bb722gd33f461fb180b946@mail.gmail.com
Content-Type: text/plain; charset=UTF-8
2008/12/27 Danny B. Wikipedia.Danny.B@email.cz:
...
*sigh*
Why do we have to hunt for some other solution when we have a fully working, fully valid and fully intuitive one?
Because:

1. Our previous behavior arguably violated the XHTML 1 specification
by allowing name attributes to begin with nonletters.  Please don't
ignore this argument because you think it's wrong.  I think you're
wrong on this issue too, but I don't just ignore your opinion when
discussing what the software that we *both* develop should do.  Note
"arguably" in the first sentence here -- your opinion counts as much
as mine.

2. It's not arguable at all that the XHTML 1 specification strongly
recommends that <a> elements with a name attribute also have an id
attribute.  In fact, section 4.10 states: "In order to ensure that
XHTML 1.0 documents are well-structured XML documents, XHTML 1.0
documents MUST use the id attribute when defining fragment identifiers
on the elements listed above [including <a>]."
I'm not saying these reasons outweigh the reasons against, but those
are the reasons it was done.  In particular, I don't think I've seen
an argument from you against (2).
...
The old version was used for many years. It was fully valid
Could you *please* stop pretending that a debate doesn't even exist
here?  It's obnoxious and uncivil, and you keep on doing it.
...
The first major problem is that this change breaks millions of existing links to sections: links used on wiki pages, links on external sites, links in people's bookmarks, in emails, forum threads, etc. Well, OK, let's discount all the external stuff, since we have no influence on it, but we still have millions of links on our own wikis which have stopped working since r45109.
First of all, all auto-generated internal links (in TOCs) will
automatically switch to the new format.  Second of all, it should be
one extra line of code to fix up all manually-created internal links
as well, so that the x is automatically added as part of the encoding
process.  (I didn't find where this needed to be done at a quick
glance.)  So we're only talking about external links here.
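A rough Python sketch of what such a fix-up could look like, assuming that internal link fragments are passed through the same encoding as heading ids. Both function names are hypothetical, and this is not the MediaWiki code path in question.

```python
import re
from urllib.parse import quote


def escape_anchor(fragment: str) -> str:
    """Assumed id encoding: dot-encode, then prepend 'x' if no leading letter."""
    encoded = quote(fragment.replace(" ", "_"), safe=":").replace("%", ".")
    return encoded if re.match(r"[A-Za-z]", encoded) else "x" + encoded


def link_href(target: str) -> str:
    """Turn an internal link target such as 'Page#Section' into an href."""
    page, sep, fragment = target.partition("#")
    href = page.replace(" ", "_")
    if sep:
        # The fragment gets exactly the same treatment as a heading id,
        # so a hand-written [[#1X]] still lands on the heading "1X".
        href += "#" + escape_anchor(fragment)
    return href


print(link_href("#1X"))
print(link_href("Help page#Foo"))
```

With this in place an editor would keep writing the plain heading text, and the "x" would be added during encoding.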
This is a one-time cost and I don't think it's a big problem -- at
worst, a few users will end up on the wrong part of the page.  It
should be pointed out that this will affect *all* section links on
non-Latin wikis (since they get encoded to begin with dots and then
need to start with a letter), but again, only as a one-time cost, and
only external links (links from external sites or links using external
link syntax), and it will still get viewers to almost the right place.
...
The other major problem is that from this point on the anchor links are no longer intuitive: we are now pushing people to constantly think about prepending x when creating anchor links. No more simple copy-pasting of the headline.
As a side effect, we are now adding unnecessary work for people on non-Latin wikis by making them switch to a Latin keyboard, or click on edittools or whatever, just to get the single "x" character into the edit box to create the anchor link.
Again, not an issue if internal links are fixed to work correctly.  I
didn't think about that aspect, but it should be very simple to fix
(I'd do it now except I'm going to bed).
It seems to me that there are only weak reasons in favor (following
recommended best practice with no practical effect) and only weak
reasons against (small one-time transition cost -- unless you're
correct that there will be longer-term costs, in which case please
clarify why you think this).  Normally I would say that standards
compliance by itself (as opposed to standards compliance that brings
concrete benefit) is worth small one-time costs, although not large
enough one-time costs and probably not even fairly small recurring
costs.  So as it stands, without further arguments, I'd still be
weakly in favor of keeping the current state of trunk, of course with
the fix for anchors on internal links.

End of Wikitech-l Digest, Vol 65, Issue 34