Hi everyone,
I'm starting a separate thread, because this is an important topic and I don't think it's well served as a subtopic of a "Wikidata blockers" thread.
To recap, Jeroen submitted changeset 14295 in Gerrit https://gerrit.wikimedia.org/r/#/c/14295/ with the following summary:
This commit introduces a new table to hold site data and configuration, objects to represent the table, site objects and lists of sites and associated tests.
The sites code is a more generalized and less contrived version of the interwiki code we currently have and is meant to replace it eventually. This commit does not do away with the existing interwiki code in any way yet.
The reasons for this change were outlined and discussed on wikitech here: http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html
Thanks Brian for summarizing an important point:
On Fri, Aug 10, 2012 at 6:33 AM, bawolff bawolff+wn@gmail.com wrote:
First and foremost, I'm a little confused as to what the actual use cases here are. Could we get a short summary for those who aren't entirely following how wikidata will work, why the current interwiki situation is insufficient? I've read the I0a96e585 and http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html, but everything seems very vague "It doesn't work for our situation", without any detailed explanation of what that situation is. At most the messages kind of hint at wanting to be able to access the list of interwiki types of the wikidata "server" from a wikidata "client" (and keep them in sync, or at least have them replicated from server->client). But there's no explanation given to why one needs to do that (are we doing some form of interwiki transclusion and need to render foreign interwiki links correctly? Want to be able to do global whatlinkshere and need unique global ids for various wikis? Something else?)
I've included the rest of Brian's mail below because I think his other points are worth responding to as well, but included the above because I wanted to reiterate his core set of questions.
I don't mean to jerk y'all around. I'm pushing the Platform devs (Tim, Aaron, Chad, and Sam in particular) to be responsive here, and based on the conversations that I've had with them, they have these questions too.
Rob
[1] http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/thread.html#60992
---------- Forwarded message ----------
From: bawolff bawolff+wn@gmail.com
Date: Fri, Aug 10, 2012 at 6:33 AM
Subject: [Wikitech-l] Wikidata blockers weekly update
To: wikitech-l wikitech-l@lists.wikimedia.org
Hey,
You mean site_config?
You're suggesting the interwiki system should look up a site by site_local_key, and when it finds one, parse out the site_config, check whether it's disabled, and if so ignore the fact that it found a site with that local key? Instead of just not having a site_local_key for that row in the first place?
No. Since the interwiki system is not specific to any type of site, this approach would make it needlessly hard. The site_link_inline field determines if the site should be usable as an interwiki link, as you can see in the patchset:
    -- If the site should be linkable inline as an "interwiki link" using
    -- [[site_local_key:pageTitle]].
    site_link_inline bool NOT NULL,
So queries would be _very_ simple.
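For illustration, with site_link_inline in place a lookup could be as simple as the sketch below, written against MediaWiki's standard database wrapper. Only site_local_key and site_link_inline come from the patchset; the table name 'sites' and the helper function are assumptions for the example, not code from the changeset.

    /**
     * Illustrative sketch only, not patchset code: fetch a site row that may
     * be used as an inline "interwiki" link for a given local prefix.
     */
    function lookupInlineSite( $prefix ) {
        $dbr = wfGetDB( DB_SLAVE );
        return $dbr->selectRow(
            'sites',                              // table name assumed for this example
            '*',
            array(
                'site_local_key'   => $prefix,    // field from the patchset
                'site_link_inline' => 1,          // only sites linkable as [[prefix:Title]]
            ),
            __METHOD__
        );
    }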
So data duplication simply because one wiki needs a second local name will mean that one URL now has two different global IDs? This sounds precisely like something that is going to get in the way of the whole reason you wanted this rewrite.
- It does not get in our way at all, and is completely disjunct from why we want the rewrite
- It's currently done like this
- The changes we do need and are proposing to make will make such a rewrite at a later point easier than it is now
Doing it this way frees us from creating any restrictions on whatever source we get sites from that we shouldn't be placing on them.
- We don't need this for Wikidata
- It's a new feature that might or might not be nice to have that currently does not exist
- The changes we do need and are proposing to make will make such a rewrite at a later point easier than it is now
So you might as well drop the 3 URL-related columns and just use the data blob that you already have.
I don't see what this would gain us at all. It would just make things more complicated.
The $1 pattern may not even work for some sites.
- We don't need this for Wikidata
- It's a new feature that might or might not be nice to have that currently does not exist
- The changes we do need and are proposing to make will make such a rewrite at a later point easier than it is now
And in fact we are making this more flexible by having the type system. The MediaWiki site type could for instance be able to form both "nice" urls and index.php ones. Or a gerrit type could have the logic to distinguish between the gerrit commit number and a sha1 hash.
Cheers
[Just to clarify, I'm doing inline replies to things various people said, not just Jeroen]
First and foremost, I'm a little confused as to what the actual use cases here are. Could we get a short summary for those who aren't entirely following how wikidata will work, why the current interwiki situation is insufficient? I've read the I0a96e585 and http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html, but everything seems very vague "It doesn't work for our situation", without any detailed explanation of what that situation is. At most the messages kind of hint at wanting to be able to access the list of interwiki types of the wikidata "server" from a wikidata "client" (and keep them in sync, or at least have them replicated from server->client). But there's no explanation given to why one needs to do that (are we doing some form of interwiki transclusion and need to render foreign interwiki links correctly? Want to be able to do global whatlinkshere and need unique global ids for various wikis? Something else?)
- Site definitions can exist that are not used as "interlanguage link" and not used as "interwiki link"
And if we put one of those on a talk page, what would happen? Or if foo was one such site, what would doing [[:foo:some page]] do? (Current behaviour is that it becomes an interwiki link.)
Although to be fair, I do see how the current way we distinguish between interwiki and interlang links is a bit hacky.
And in fact we are making this more flexible by having the type system. The MediaWiki site type could for instance be able to form both "nice" urls and index.php ones. Or a gerrit type could have the logic to distinguish between the gerrit commit number and a sha1 hash.
I must admit I do like this idea. In particular, the current situation where we treat the value of an interwiki link as a title (aka spaces -> underscores etc.) even for sites that do not use such conventions has always bothered me. Having interwikis that support URL re-writing based on the value does sound cool, but I certainly wouldn't want said code in a db blob (and just using an integer site_type identifier is quite far away from giving us that, but it's still a step in a positive direction), which raises the question of where such rewriting code would go.
The issue I was trying to deal with was storage. Currently we 100% assume that the interwiki list is a table and there will only ever be one of them.
Do we really assume that? Certainly that's the default config, but I don't think that is the config used on WMF. As far as I'm aware, Wikimedia uses a cdb database file (via $wgInterwikiCache), which contains all the interwikis for all sites. From what I understand, it supports doing various "scope" levels of interwikis, including per db, per site (Wikipedia, Wiktionary, etc), or global interwikis that act on all sites.
The feature is a bit wmf specific, but it does seem to support different levels of interwiki lists.
Furthermore, I imagine (but don't know, so let's see how fast I get corrected ;) that the cdb database was introduced not just as a convenience measure for easier administration of the interwiki tables, but also for better performance. If so, one should also take into account any performance hit that may come with switching to the proposed "sites" facility.
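For context, reading such a CDB file is cheap; a rough sketch of what a scoped lookup could look like is below. The key layout shown ('dbname:prefix', '_site:prefix', '__global:prefix') and the file path are assumptions based on the description above, not a documented format.

    // Rough sketch, not actual MediaWiki code: resolve an interwiki prefix
    // from a CDB file with per-wiki, per-site and global scopes.
    $handle = dba_open( '/srv/interwiki.cdb', 'r', 'cdb' ); // needs PHP's dba extension with cdb support
    $value = false;
    foreach ( array( 'frwiki:en', '_wikipedia:en', '__global:en' ) as $key ) {
        $value = dba_fetch( $key, $handle );
        if ( $value !== false ) {
            break; // most specific scope wins
        }
    }
    dba_close( $handle );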
Cheers, -bawolff
Hi everyone,
2012/8/11 Rob Lanphier robla@wikimedia.org:
To recap, Jeroen submitted changeset 14295 in Gerrit https://gerrit.wikimedia.org/r/#/c/14295/ with the following summary:
This commit introduces a new table to hold site data and configuration, objects to represent the table, site objects and lists of sites and associated tests.
The sites code is a more generalized and less contrived version of the interwiki code we currently have and is meant to replace it eventually. This commit does not do away with the existing interwiki code in any way yet.
The reasons for this change were outlined and discussed on wikitech here: http://lists.wikimedia.org/pipermail/wikitech-l/2012-June/060992.html
Thanks Brian for summarizing an important point:
On Fri, Aug 10, 2012 at 6:33 AM, bawolff bawolff+wn@gmail.com wrote:
First and foremost, I'm a little confused as to what the actual use cases here are. Could we get a short summary for those who aren't entirely following how wikidata will work, why the current interwiki situation is insufficient?
The use case is the following: in order for Wikidata to be able to provide language links for the wikis using Wikidata, we need to use consistent global IDs when communicating about the involved wikis (i.e. if a "client wiki", i.e. a Wikipedia like fr.wp, asks Wikidata for the language links for an article X, the client and the repo need to know that e.g. "enwiki" refers to en.wp. Right now the table does not sport any such field -- the local prefix "en" might be differently defined on fr.wp and fr.wikinews, for example, and we obviously do not want to break that).
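To make the mismatch concrete, here is a small illustration with hypothetical data (not actual table contents): the same local prefix can point to different sites depending on which wiki you are on, whereas a global ID is unambiguous.

    // Hypothetical example data. Locally, "en" means different things:
    $localPrefixes = array(
        'frwiki'     => array( 'en' => 'http://en.wikipedia.org/wiki/$1' ),
        'frwikinews' => array( 'en' => 'http://en.wikinews.org/wiki/$1' ),
    );
    // Globally, each site gets exactly one identifier the repo and all clients agree on:
    $globalIds = array(
        'enwiki'     => 'http://en.wikipedia.org/wiki/$1',
        'enwikinews' => 'http://en.wikinews.org/wiki/$1',
    );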
We further made some configurations explicit that are as of now embedded in the code using the current interwiki table.
The change also facilitates synchronizing that data, but this is part of another changeset and of other code.
I am a bit confused here. As far as I can see everyone agrees that this changeset goes in the right direction. I also did not see contentions about how the changeset is working that have not been resolved yet. The reservations that are raised are that the changeset does not go *far enough*. Considering that we want to keep changesets small, and that this changeset keeps the old system in place and thus should not break anything, wouldn't that be a good first step?
If this is the case, why do we not move forward by taking this step and continue to discuss how to iterate further from there to an even better and more comprehensive solution?
Cheers, Denny
Hi Denny,
I think we may be talking past each other. Comments inline...
On Mon, Aug 13, 2012 at 9:47 AM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
I am a bit confused here. As far as I can see everyone agrees that this changeset goes in the right direction.
I don't think enough people actually understand the patch well enough to say that. The fear is that it's a step sideways, trading crufty but well-tested code for something larger, more confusing, and less stable.
I also did not see contentions about how the changeset is working that have not been resolved yet. The reservations that are raised are that the changeset does not go *far enough*. Considering that we want to keep changesets small, and that this changeset keeps the old system in place and thus should not break anything, wouldn't that be a good first step?
It depends. Every time someone asks for specifics ("where is this code used?", "what exactly is this needed for?"), they get very meta answers ("it's used in Wikidata").
If you want to expedite this review, give specific answers. Point to line numbers in files, and show how the code there would be far more complicated without this change. Point to specific functionality we can see in a running instance. Use this as an opportunity to educate everyone on Wikidata internals.
Thanks Rob
Hey,
Every time someone asks for specifics ("where is this code used?", "what exactly is this needed for?"), they get very meta answers ("it's used in Wikidata").
Can you be specific and point to the questions we've answered too vaguely? Then I'll try to answer them in more detail.
If you want to expedite this review, give specific answers. Point to line numbers in files, and show how the code there would be far more complicated without this change. Point to specific functionality we can see in a running instance. Use this as an opportunity to educate everyone on Wikidata internals.
We need the generalizations provided by this patch. Yes, that's not specific at all about why and where we need them; you'd need to know that to verify we're not doing stupid stuff in Wikidata. However, these generalizations make sense on their own, and can be judged entirely separately from Wikidata. Educating people on Wikidata internals really seems to be out of scope to me.
I don't think enough people actually understand the patch well enough to say that.
The code is well documented and I've been answering questions both on the list here and gerrit. If you want to understand the patch, look at it, and if you're still not clear on anything, ask about it. I don't see how we can do much more from our end - got any suggestions?
The fear is that it's a step sideways, trading crufty but well-tested code for something larger, more confusing, and less stable.
How do you figure this? My interpretation of the thread is similar to Denny's: we're basically all agreeing that this change improves on the current system in various ways, but some think it should also tackle some issues it's not currently dealing with.
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
On Mon, Aug 13, 2012 at 11:03 AM, Jeroen De Dauw jeroendedauw@gmail.com wrote:
Can you be specific and point to the questions we've answered too vaguely? Then I'll try to answer them in more detail.
Two places to start off with:
1. In response to Brian Wolff's email. Many interesting questions were redacted in Denny's response.
2. In response to Tim's July 18 comment here: https://gerrit.wikimedia.org/r/#/c/14295/
We need the generalizations provided by this patch. Yes, that's not specific at all about why and where we need them; you'd need to know that to verify we're not doing stupid stuff in Wikidata. However, these generalizations make sense on their own, and can be judged entirely separately from Wikidata.
Not really. Basically, what you're proposing is that these changes are necessary for Wikidata, that you don't have time to implement the full solution, and that's why we have to settle for a halfway solution instead of finishing the job.
I can understand not wanting the scope creep of "finishing the job", since there's not consensus on what that means. What Daniel suggested (which seems to also have the support of Chad and Aaron, at least) is that this is RfC material. If avoiding scope creep is the goal, then it becomes more important to understand exactly what Wikidata needs out of this patch, and that involves understanding the parts of Wikidata that use this.
Educating people on Wikidata internals really seems to be out of scope to me.
Given that the Wikidata code needs a full review by many of the same people that are asking about this particular change, doesn't that seem largely academic?
How do you figure this? My interpretation of the thread is similar to Denny's: we're basically all agreeing that this change improves on the current system in various ways, but some think it should also tackle some issues it's not currently dealing with.
My reading is that folks like Daniel and Chad are conceding that the current system needs to be improved, and that this change *might* be a step in the right direction, but is probably not far enough to be worth dealing with the problems of doing this halfway.
Rob
On Mon, 13 Aug 2012 17:56:49 -0700, Rob Lanphier robla@wikimedia.org wrote:
On Mon, Aug 13, 2012 at 11:03 AM, Jeroen De Dauw jeroendedauw@gmail.com wrote:
Can you be specific and point to the questions we've answered too vaguely? Then I'll try to answer them in more detail.
Two places to start off with:
1. In response to Brian Wolff's email. Many interesting questions were redacted in Denny's response.
2. In response to Tim's July 18 comment here: https://gerrit.wikimedia.org/r/#/c/14295/
We need the generalizations provided by this patch. Yes, that's not specific at all about why and where we need them; you'd need to know that to verify we're not doing stupid stuff in Wikidata. However, these generalizations make sense on their own, and can be judged entirely separately from Wikidata.
Not really. Basically, what you're proposing is that these changes are necessary for Wikidata, that you don't have time to implement the full solution, and that's why we have to settle for a halfway solution instead of finishing the job.
I can understand not wanting the scope creep of "finishing the job", since there's not consensus on what that means. What Daniel suggested (which seems to also have the support of Chad and Aaron, at least) is that this is RfC material. If avoiding scope creep is the goal, then it becomes more important to understand exactly what Wikidata needs out of this patch, and that involves understanding the parts of Wikidata that use this.
Educating people on Wikidata internals really seems to be out of scope to me.
Given that the Wikidata code needs a full review by many of the same people that are asking about this particular change, doesn't that seem largely academic?
How do you figure this? My interpretation of the thread is similar to Denny's: we're basically all agreeing that this change improves on the current system in various ways, but some think it should also tackle some issues it's not currently dealing with.
My reading is that folks like Daniel and Chad are conceding that the current system needs to be improved, and that this change *might* be a step in the right direction, but is probably not far enough to be worth dealing with the problems of doing this halfway.
Rob
I also feel that some of the changes that don't go far enough, or don't look like the ideal I would have used if I wrote this code, are in areas such as the database schema and potentially the overall API. These are areas which, if this is committed now, will require anyone who tries to finish the project to add migrations, etc., just to fix a schema that should have been done right from the start.

There is also a key question left undecided: will the sites table be a first-class edited table, or act like an index? Not deciding how we treat this table right now will make it practically impossible to change that perception later on, and if we do decide that it should be more of an index after people have started writing editing interfaces on top of the table, then we would practically have to rewrite it yet again.
Frankly some of the code sets off my rewrite nerves. And if I had the time/backing I'd collect all the requirements on an RfC page and write the new system myself.
On Mon, 13 Aug 2012 09:47:21 -0700, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Hi everyone,
The use case is the following: in order for Wikidata to be able to provide language links for the wikis using Wikidata, we need to use consistent global IDs when communicating about the involved wikis (i.e. if a "client wiki", i.e. a Wikipedia like fr.wp, asks Wikidata for the language links for an article X, the client and the repo need to know that e.g. "enwiki" refers to en.wp. Right now the table does not sport any such field -- the local prefix "en" might be differently defined on fr.wp and fr.wikinews, for example, and we obviously do not want to break that).
We further made some configurations explicit that are as of now embedded in the code using the current interwiki table.
The change also facilitates synchronizing that data, but this is part of another changeset and of other code.
Cheers, Denny
I actually have a side question in this area.
You mention using a global id to refer to sites for making links. And synchronization of the sites table.
So you're saying that this part of Wikidata only works within Wikimedia projects right?
Does Wikidata overall only function within Wikimedia projects? Or is there a different mechanism to deal with clients from external wikis?
Hey,
You mention using a global id to refer to sites for making links. And synchronization of the sites table.
So you're saying that this part of Wikidata only works within Wikimedia projects right?
Does Wikidata overall only function within Wikimedia projects? Or is there a different mechanism to deal with clients from external wikis?
The software we're writing is completely Wikimedia agnostic and the actual Wikidata project will obviously be usable outside of Wikimedia projects. We will allow for links to non-Wikimedia sites (although we have not agreed on how open this will be), and for non-Wikimedia sites to access all data stored within Wikidata (including our "equivalent links" using the sites table). Does that answer your question or am I missing something?
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
On Tue, 14 Aug 2012 07:32:07 -0700, Jeroen De Dauw jeroendedauw@gmail.com wrote:
Hey,
You mention using a global id to refer to sites for making links. And synchronization of the sites table.
So you're saying that this part of Wikidata only works within Wikimedia projects right?
Does Wikidata overall only function within Wikimedia projects? Or is there a different mechanism to deal with clients from external wikis?
The software we're writing is completely Wikimedia agnostic and the actual Wikidata project will obviously be usable outside of Wikimedia projects. We will allow for links to non-Wikimedia sites (although we have not agreed on how open this will be), and for non-Wikimedia sites to access all data stored within Wikidata (including our "equivalent links" using the sites table). Does that answer your question or am I missing something?
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
Ok, so the data is available to 3rd party wikis.
I was asking how you planned to handle sites in 3rd party wikis. Do you have a separate mechanism to handle links from 3rd party clients? Or are they supposed to sync their sites from Wikimedia's Wikidata?
Hey,
I was asking how you planned to handle sites in 3rd party wikis.
Do you have a separate mechanism to handle links from 3rd party clients? Or are they supposed to sync their sites from Wikimedia's Wikidata?
AFAIK we're providing full URLs in our export formats; I'm not sure what our current status on this is or what our exact plans are. We're not exporting site data ourselves (that's really not our job), but third parties can obtain it via the sites API (which has not been created yet, but would be very similar to the existing interwiki API). We _could_ include site data in our export formats as well, but that really is a different discussion altogether :)
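For comparison, the existing interwiki data is already exposed over the API via meta=siteinfo&siprop=interwikimap; a sites API could presumably be consumed in much the same way. A sketch of such a third-party client follows (the endpoint is chosen arbitrarily; MediaWiki's Http and FormatJson helpers are used for brevity).

    // Illustrative consumer sketch: fetch a wiki's interwiki map over the API.
    $json = Http::get(
        'http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=interwikimap&format=json'
    );
    if ( $json !== false ) {
        $data = FormatJson::decode( $json, true );
        foreach ( $data['query']['interwikimap'] as $entry ) {
            // each entry carries at least a prefix and a URL pattern
            echo $entry['prefix'] . ' => ' . $entry['url'] . "\n";
        }
    }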
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
Hi all,
thanks to Daniel (F.) for structuring the discussion. The discussion is currently ongoing here:
https://www.mediawiki.org/wiki/Requests_for_comment/New_sites_system
I hope that the requirements and use cases section is complete. If not, please tune in now. We will build on the use cases and their discussion there.
I also created a first draft for a schema, which was very quickly completely ripped apart and replaced by a much better one on the discussion page. There are also other discussions going on there. Please tune in if you are interested in the Sites table, in order to achieve consensus on the topic.
Furthermore, I want to address the unanswered questions Rob raised:
* Re Tim's July 18th comment and Rob's following comment: where is the calling code?
The code calling the sites table is in the Wikibase library, basically all the files starting with Site*. But since they are part of the patchset, you have probably seen them already. The Sites info is being used in:
* most importantly Wikibase/lib/includes/SiteLink.php, where the site link (e.g. the link from a Wikidata item to a Wikipedia article) is defined using the Sites data. The Sitelinks are the most prominent objects depending on the data, and are used basically everywhere on the repository. Wikibase/repo/includes/api/ApiSetSiteLink.php offers a good example of that.
* some utils in Wikibase/lib/includes/Utils.php
* further, a few places on the client, like LangLinkHandler and the hooks
* Questions by Bawolff that I redacted from my answer (because I was focusing on other stuff):
First and foremost, I'm a little confused as to what the actual use cases here are. Could we get a short summary for those who aren't entirely following how wikidata will work, why the current interwiki situation is insufficient?
Most of all, we need global identifiers for the different wikis. We could add a table which only contains a mapping of the local prefixes to global identifiers, but we think that the current interwiki table could use some love anyway, and thus we decided to restructure it as a whole. This has now led to the above mentioned RFC, but the original blocker is: for providing language links from a central source -- Wikidata -- we need to have global wiki identifiers.
- Site definitions can exist that are not used as "interlanguage link" and not used as "interwiki link"
And if we put one of those on a talk page, what would happen? Or if foo was one such site, what would doing [[:foo:some page]] do? (Current behaviour is that it becomes an interwiki link.)
I probably misunderstand. If currently something is not set up as an interlanguage link nor as an interwiki link, it will become a normal link, not an interwiki link (i.e. it will point to the local page foo:some page in the main namespace). Did you mean something else?
Although to be fair, I do see how the current way we distinguish between interwiki and interlang links is a bit hacky.
Agreed, the way it is currently done in core is a bit hacky.
And in fact we are making this more flexible by having the type system. The MediaWiki site type could for instance be able to form both "nice" urls and index.php ones. Or a gerrit type could have the logic to distinguish between the gerrit commit number and a sha1 hash.
I must admit I do like this idea. In particular, the current situation where we treat the value of an interwiki link as a title (aka spaces -> underscores etc.) even for sites that do not use such conventions has always bothered me. Having interwikis that support URL re-writing based on the value does sound cool, but I certainly wouldn't want said code in a db blob (and just using an integer site_type identifier is quite far away from giving us that, but it's still a step in a positive direction), which raises the question of where such rewriting code would go.
A handler class for each type of site, which would construct links to that type of site based on the data about this site.
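A minimal sketch of what such a handler could look like, using the gerrit example mentioned earlier in the thread (the class name, method and URL patterns are made up for illustration; the actual type system in the patchset may differ):

    // Illustrative only: a per-type handler that decides how to turn a link
    // value into a URL, here distinguishing a change number from a SHA-1 hash.
    class GerritSiteHandler {
        protected $baseUrl;

        public function __construct( $baseUrl ) {
            $this->baseUrl = $baseUrl; // e.g. "https://gerrit.wikimedia.org/r/"
        }

        public function getUrlFor( $value ) {
            if ( preg_match( '/^[0-9a-f]{40}$/i', $value ) ) {
                // Looks like a full SHA-1 hash: link to a commit search (URL pattern assumed).
                return $this->baseUrl . '#q,' . $value . ',n,z';
            }
            // Otherwise treat it as a change number.
            return $this->baseUrl . '#/c/' . intval( $value ) . '/';
        }
    }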
The issue I was trying to deal with was storage. Currently we 100% assume that the interwiki list is a table and there will only ever be one of them.
Do we really assume that? Certainly that's the default config, but I don't think that is the config used on WMF. As far as I'm aware, Wikimedia uses a cdb database file (via $wgInterwikiCache), which contains all the interwikis for all sites. From what I understand, it supports doing various "scope" levels of interwikis, including per db, per site (Wikipedia, Wiktionary, etc), or global interwikis that act on all sites.
We did not know about that database. Who can tell us more about it? This would be very interesting to get our synching code optimized.
It still wouldn't help us with the global identifiers, though, but it would be good to know more about it.
Cheers, Denny
-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. (Society for the Promotion of Free Knowledge). Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
2012/8/14 Daniel Friesen lists@nadir-seen-fire.com:
On Tue, 14 Aug 2012 07:32:07 -0700, Jeroen De Dauw jeroendedauw@gmail.com wrote:
Hey,
You mention using a global id to refer to sites for making links. And synchronization of the sites table.
So you're saying that this part of Wikidata only works within Wikimedia projects right?
Does Wikidata overall only function within Wikimedia projects? Or is there a different mechanism to deal with clients from external wikis?
The software we're writing is completely Wikimedia agnostic and the actual Wikidata project will obviously be usable outside of Wikimedia projects. We will allow for links to non-Wikimedia sites (although we have not agreed on how open this will be), and for non-Wikimedia sites to access all data stored within Wikidata (including our "equivalent links" using the sites table). Does that answer your question or am I missing something?
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
Ok, so the data is available to 3rd party wikis.
I was asking how you planned to handle sites in 3rd party wikis. Do you have a separate mechanism to handle links from 3rd party clients? Or are they supposed to sync their sites from Wikimedia's Wikidata?
-- ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]