~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
On 12-08-09 12:00 PM, Jeroen De Dauw wrote:
Hey,
Daniel, thanks for your input.
TL;DR at the bottom :)
The issue I was trying to deal with was storage. Currently we 100%
assume that the interwiki list is a table and there will only ever be one of them.
Yes, we are not changing this. Having a more flexible system might or might not be something we'd want in MediaWiki. We do not need it in Wikidata though. The changes we're making here do not seem to affect this issue at all, so you can just as well implement it later on.
In practice we don't want one interwiki map. In projects like
Wikimedia we actually usually want two or three.
.. And sometimes we also want a wiki-local interwiki list because some
communities want to add links to sites that other wikis don't.
This we are actually tackling, although in a different fashion than the one you propose. Rather than having many different lists of sites to maintain, we have split sites from their configuration. The list of sites is global and shared by all clients. Their configuration, however, is local. So if wiki A wants to use site X as an interwiki link with prefix foobar, wiki B wants to use it with prefix baz, and wiki C does not want to use it as an interwiki link at all, this is perfectly possible. This split, and the generalization our changes bring with it, adds a lot of flexibility compared to the current system and removes bad assumptions currently baked in.
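To make sure I follow the split you describe, here is a minimal sketch in Python/SQLite of what I understand it to be. All table and column names here (sites, site_config, sc_interwiki_prefix, etc.) are my own invention for illustration, not the actual Wikidata schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Global, shared list of sites (illustrative column names, not the real schema).
cur.execute("""CREATE TABLE sites (
    site_id INTEGER PRIMARY KEY,
    site_global_key TEXT UNIQUE NOT NULL
)""")

# Per-wiki local configuration: each client wiki has its own rows, so the
# same global site can carry a different prefix (or none) on each wiki.
cur.execute("""CREATE TABLE site_config (
    sc_wiki TEXT NOT NULL,
    sc_site INTEGER NOT NULL REFERENCES sites(site_id),
    sc_interwiki_prefix TEXT,  -- NULL: not usable as an interwiki link here
    PRIMARY KEY (sc_wiki, sc_site)
)""")

cur.execute("INSERT INTO sites VALUES (1, 'sitex')")
cur.executemany("INSERT INTO site_config VALUES (?, ?, ?)", [
    ("wiki_a", 1, "foobar"),  # wiki A links to site X as foobar:
    ("wiki_b", 1, "baz"),     # wiki B uses baz: for the same site
    ("wiki_c", 1, None),      # wiki C doesn't interwiki-link it at all
])

row = cur.execute(
    "SELECT sc_interwiki_prefix FROM site_config WHERE sc_wiki = 'wiki_b'"
).fetchone()
print(row[0])  # baz
```

If that matches your intent, then the global list stays consistent everywhere while each wiki's community keeps full control over how (and whether) a site is exposed locally.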
I think we're going to need to have some of this and the synchronization stuff in core. Right now the code has nothing but the one sites table: no repo code, so presumably the only implementation of that for a while will be Wikidata. And if parts of this table are supposed to be editable in some cases where there is no repo, but non-editable otherwise, then I don't see any way for an edit UI to tell the difference.
I'm also not sure how this synchronization, which sounds like it's one-way, will play with individual wikis wanting to add new interwiki links.
Also, anything in this area really needs to take our lack of a user interface into account. If we rewrite this then we absolutely must include a UI to view and edit this in core.
Again, this is not something we're touching at all, or want to touch, as we don't need it. Personally I think it'd be great to have such facilities, and it makes sense to add them after the backend has been fixed. I'd be happy to work with you on this (or leave it entirely up to you) once we've got the relevant rewrite work done.
By rewriting it we ditch every hack trying to make it easy to
control the interwiki list and only make the problem worse.
Our change will not drop any existing functionality. I will make sure there are tools/facilities at least as good as (and probably better than) the current ones.
I'm talking about things like the interwiki extensions and the scripts that turn wiki tables into interwiki lists. All of these things are written against the interwiki table. So by rewriting and using a new table we implicitly break all the working tricks and throw the user back into SQL.
I would like to understand what Wikidata needs out of
interwiki/sites and what it's going to do with the data
We need this for our "equivalent links", which consist of a global site id and a page. Right now we do not have consistent global ids; in fact we don't have global ids at all. We just have local ids that happen to be similar everywhere (one might not want this, but is pretty much forced into it right now), which must be language codes in order to be "languagelinks" or (better named) "equivalent links". Also, right now, all languagelinks are interwikilinks (wtf) - we want to be able to have "equivalent links" without them also being interwiki links!
I like the idea of table entries without actual interwikis. The idea of some interface listing user-selectable sites came to mind, and perhaps sites being added trivially, even automatically. Though if you plan to support this, I think you'll need to drop the NOT NULL from site_local_key.
Actually, another thought makes me think the schema should be a little different. site_local_key probably shouldn't be a column; it should probably be another table, something like site_local_key (slc_key, slc_site), which would map things like en:, Wikipedia:, etc... to a specific site. I can see wikis wanting to use multiple interwiki names for the same site. In fact I'm pretty sure this already happens with the existing interwiki table; we just create duplicate rows. But you want global ids, so I really don't think you want data duplication like that to happen.
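To be concrete about what I mean, here's a rough SQLite sketch of that one-to-many mapping. The slc_* column names come from my suggestion above; everything else is hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.execute("""CREATE TABLE sites (
    site_id INTEGER PRIMARY KEY,
    site_global_key TEXT UNIQUE NOT NULL
)""")

# Separate mapping table: several local interwiki prefixes can point at
# the same site row, so the site's data is never duplicated.
cur.execute("""CREATE TABLE site_local_key (
    slc_key TEXT PRIMARY KEY,  -- local prefix, e.g. 'en' or 'wikipedia'
    slc_site INTEGER NOT NULL REFERENCES sites(site_id)
)""")

cur.execute("INSERT INTO sites VALUES (1, 'enwiki')")
cur.executemany("INSERT INTO site_local_key VALUES (?, ?)", [
    ("en", 1),
    ("wikipedia", 1),  # second prefix, same site: still one row in sites
])

keys = [k for (k,) in cur.execute(
    "SELECT slc_key FROM site_local_key WHERE slc_site = 1 ORDER BY slc_key")]
print(keys)  # ['en', 'wikipedia']
```

With the current interwiki table the second prefix would mean a whole duplicate row; here it's just one extra mapping entry, and the global id stays unique.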
I'd also like to know if Wikidata plans to add any interface that
will add/remove sites
The backend will have an interface to do this, but we're not planning on any API modules or UIs. The backend will be written keeping in mind people will want those though, so it ought to be easy to add them later on.
So to wrap up: I don't think there is any conflict between our plans and what you want to do (if you disagree, please provide some pointers). You can make your changes later on, and you will have a much more solid base to work on than you do now.
I think I need to understand the plans you have for synchronization a bit more.
- Where does Wikidata get the sites?
- What synchronizes the data?
- What is the repo like, and what is it based on? Is this wikis syncing from another wiki's sites table, or does Wikidata have a real set of data the sites table gets built from?
- Is this one-way synchronization or multi-way?
Synchronization, treatment of the table (whether it's an index of something else or first-class data), and editing/UIs for editing are a set of things where any one of them can get in the way of the ability to do the others later if you don't think of them all up front.
Our old interwiki table was treated as first-class data, and it was simple data that was easy to create an edit interface for. As a result it's hard to do any synchronization, since we didn't plan for it. Likewise, if we design a sites table focused on synchronizing data while simultaneously treating the table as first-class data with some of it treated like an index, we can easily come up with something that gets in the way of the consistency needed for a UI.
One of our options might be to treat sites like an index of data built from other sources, just like pagelinks. Wikidata can act as a repo, the sites code can build from multiple sources with Wikidata being the first, and when a UI comes into play, the UI can create its own list of sites and that can be used as a source for the building of the sites table. ---- Heh, it probably doesn't help that this is making my abstract revision idea come up and make me want to have the UI depend on that.
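In pseudo-Python, the rebuild-from-sources idea would look something like the sketch below. The function and source names (rebuild_sites, wikidata_repo, local_ui_list) are made up for illustration; the point is just that the table becomes derived data, regenerated from an ordered list of providers:

```python
# Hypothetical rebuild: the sites table is a derived index, regenerated
# from an ordered list of sources (repo first, local UI list last).
def rebuild_sites(sources):
    sites = {}
    for source in sources:  # later sources may override earlier ones
        for site in source():
            sites[site["global_key"]] = site
    return sites

def wikidata_repo():
    # In reality this would pull from the Wikidata repo.
    return [{"global_key": "enwiki", "type": "mediawiki"}]

def local_ui_list():
    # A locally maintained, UI-editable list acting as another source.
    return [{"global_key": "examplewiki", "type": "unknown"}]

index = rebuild_sites([wikidata_repo, local_ui_list])
print(sorted(index))  # ['enwiki', 'examplewiki']
```

That way the UI never writes to the sites table directly; it edits one of the sources, and consistency falls out of the rebuild.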
Cheers
-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. --
Btw, if you really want to make this an abstract list of sites, dropping site_url and the other two related columns might be an idea. At first glance the url looks like something standard that every site would have. But once you throw something like MediaWiki into the mix, with short urls, long urls, and an API, the url really becomes type-specific data that should probably go in the blob. Especially when you start thinking about other custom types.
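For illustration, here's what I mean by pushing the urls into the type-specific blob, again sketched with SQLite and made-up column names (site_type, site_data) plus a JSON serialization I'm assuming purely for the example:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# No site_url column at all; anything type-specific lives in a serialized
# blob whose structure depends on site_type.
cur.execute("""CREATE TABLE sites (
    site_id INTEGER PRIMARY KEY,
    site_type TEXT NOT NULL,
    site_data TEXT NOT NULL  -- blob; JSON here, schema depends on site_type
)""")

# A MediaWiki-type site has several urls, none of which is "the" url.
mediawiki_data = {
    "paths": {
        "page": "https://en.wikipedia.org/wiki/$1",       # short url
        "index": "https://en.wikipedia.org/w/index.php",  # long url
        "api": "https://en.wikipedia.org/w/api.php",      # API endpoint
    }
}
cur.execute("INSERT INTO sites VALUES (1, 'mediawiki', ?)",
            (json.dumps(mediawiki_data),))

site_type, blob = cur.execute(
    "SELECT site_type, site_data FROM sites WHERE site_id = 1").fetchone()
data = json.loads(blob)
print(data["paths"]["api"])  # https://en.wikipedia.org/w/api.php
```

A custom site type would then just put whatever it needs in its own blob structure, with no pressure to pretend one url column fits everybody.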