New subject: Use cases for Sites handling change (Re: Wikidata blockers weekly update)

11 Aug 2012

      Hi everyone,
I'm starting a separate thread, because this is an important topic and
I don't think it's well served as a subtopic of a "Wikidata blockers"
thread.
To recap, Jeroen submitted changeset 14295 in Gerrit
https://gerrit.wikimedia.org/r/#/c/14295/ with the following
summary:
...
This commit introduces a new table to hold site data and configuration,
objects to represent the table, site objects and lists of sites and
associated tests.
...
The sites code is a more generalized and less contrived version of the
interwiki code we currently have and is meant to replace it eventually.
This commit does not do away with the existing interwiki code in any way yet.
...
The reasons for this change where outlined and discussed on wikitech here:
http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html
Thanks Brian for summarizing an important point:
On Fri, Aug 10, 2012 at 6:33 AM, bawolff bawolff+wn@gmail.com wrote:
...
First and foremost, I'm a little confused as to what the actual use
cases here are. Could we get a short summary for those who aren't
entirely following how wikidata will work, why the current interwiki
situation is insufficient? I've read the I0a96e585 and
http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html,
but everything seems very vague "It doesn't work for our situation",
without any detailed explanation of what that situation is. At most
the messages kind of hint at wanting to be able to access the list of
interwiki types of the wikidata "server" from a wikidata "client" (and
keep them in sync, or at least have them replicated from
server->client). But there's no explanation given to why one needs to
do that (are we doing some form of interwiki transclusion and need to
render foreign interwiki links correctly? Want to be able to do global
whatlinkshere and need unique global ids for various wikis? Something
else?)
I've included the rest of Brian's mail below because I think his other
points are worth responding to as well, but included the above because
I wanted to reiterate his core set of questions.
I don't mean to jerk y'all around.  I'm pushing the Platform devs
(Tim, Aaron, Chad, and Sam in particular) to be responsive here, and
based on the conversations that I've had with them, they have these
questions too.
Rob
[1] http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/thread.html#60992
---------- Forwarded message ----------
From: bawolff bawolff+wn@gmail.com
Date: Fri, Aug 10, 2012 at 6:33 AM
Subject: [Wikitech-l] Wikidata blockers weekly update
To: wikitech-l wikitech-l@lists.wikimedia.org
...
Hey,
You mean site_config?
...
You're suggesting the interwiki system should look for a site by
site_local_key, when it finds one parse out the site_config, check if it's
disabled and if so ignore the fact it found a site with that local key?
Instead of just not having a site_local_key for that row in the first place?
No. Since the interwiki system is not specific to any type of site, this
approach would be making it needlessly hard. The site_link_inline field
determines if the site should be usable as interwiki link, as you can see
in the patchset:
-- If the site should be linkable inline as an "interwiki link" using
  -- [[site_local_key:pageTitle]].
  site_link_inline           bool                NOT NULL,
So queries would be _very_ simple.
...
So data duplication simply because one wiki needs a second local name
will mean that one url now has two different global ids this sounds
precisely like something that is going to get in the way of the whole
reason you wanted this rewrite.

It does not get in our way at all, and is completely disjunct from why we

want the rewrite

It's currently done like this
The changes we do need and are proposing to make will make such a rewrite

at a later point easier then it is now
...
Doing it this way frees us from creating any restrictions on whatever
source we get sites from that we shouldn't be placing on them.

We don't need this for Wikidata
It's a new feature that might or might not be nice to have that currently

does not exist

The changes we do need and are proposing to make will make such a rewrite

at a later point easier then it is now
...
So you might as well drop the 3 url related columns and just use the data
blob that you already have.
I don't see what this would gain us at all. It's just make things more
complicated.
...
The $1 pattern may not even work for some sites.

We don't need this for Wikidata
It's a new feature that might or might not be nice to have that currently

does not exist

The changes we do need and are proposing to make will make such a rewrite

at a later point easier then it is now
And in fact we are making this more flexible by having the type system. The
MediaWiki site type could for instance be able to form both "nice" urls and
index.php ones. Or a gerrit type could have the logic to distinguish
between the gerrit commit number and a sha1 hash.
Cheers
[Just to clarify, I'm doing inline replies to things various people
said, not just Jeroen]
First and foremost, I'm a little confused as to what the actual use
cases here are. Could we get a short summary for those who aren't
entirely following how wikidata will work, why the current interwiki
situation is insufficient? I've read the I0a96e585 and
http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html,
but everything seems very vague "It doesn't work for our situation",
without any detailed explanation of what that situation is. At most
the messages kind of hint at wanting to be able to access the list of
interwiki types of the wikidata "server" from a wikidata "client" (and
keep them in sync, or at least have them replicated from
server->client). But there's no explanation given to why one needs to
do that (are we doing some form of interwiki transclusion and need to
render foreign interwiki links correctly? Want to be able to do global
whatlinkshere and need unique global ids for various wikis? Something
else?)
...

Site definitions can exist that are not used as "interlanguage link" and

not used as "interwiki link"
And if we put one of those on a talk page, what would happen? Or if
foo was one such link, doing [[:foo:some page]]  (Current behaviour is
it becomes an interwiki).
Although to be fair, I do see how the current way we distinguish
between interwiki and interlang links is a bit hacky.
...
And in fact we are making this more flexible by having the type system. The
MediaWiki site type could for instance be able to form both "nice" urls and
index.php ones. Or a gerrit type could have the logic to distinguish
between the gerrit commit number and a sha1 hash.
I must admit I do like this this idea. In particular the current
situation where we treat the value of an interwiki link as a title
(aka spaces -> underscores etc) even for sites that do not use such
conventions, has always bothered me. Having interwikis that support
url re-writing based on the value does sound cool, but I certainly
wouldn't want said code in a db blob (and just using an integer
site_type identifier is quite far away from giving us that, but its
still a step in a positive direction), which raises the question of
where would such rewriting code go.
...
The issue I was trying to deal with was storage. Currently we 100% assume
that the interwiki list is a table and there will only ever be one of them.
Do we really assume that? Certainly that's the default config, but I
don't think that is the config used on WMF. As far as I'm aware,
Wikimedia uses a cdb database file (via $wgInterwikiCache), which
contains all the interwikis for all sites. From what I understand, it
supports doing various "scope" levels of interwikis, including per db,
per site (Wikipedia, Wiktionary, etc), or global interwikis that act
on all sites.
The feature is a bit wmf specific, but it does seem to support
different levels of interwiki lists.
Furthermore, I imagine (but don't know, so lets see how fast I get
corrected ;) that the cdb database was introduced not just as
convenience measure for easier administration of the interwiki tables,
but also for better performance.  If so, one should also take into
account any performance hit that may come with switching to the
proposed "sites" facility.
Cheers,
-bawolff
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l