Hi all, I'm not sure exactly where to raise this, so am asking here.
A researcher I have been in touch with has proposed starting a 2nd, research-oriented Wikimedia toolserver. He thinks his lab can pay for the hardware and would be willing to maintain it, if they could get help setting it up. He got this idea after a member of his research group tried (unsuccessfully so far -- no response) to get an account on the current toolserver; their Wikipedia-related research has been put on hold for a few months because of the delay. (It seems like there is a big backlog of account requests right now and only one person working on them?) This research group has done some interesting Wikipedia research to date and I expect they could do more with access to the right data.
Personally, I think a dedicated toolserver is a great idea for the research community, but I know very little about the technical issues involved and/or whether this has been proposed before. Please comment, and I can pass on replies and put the researcher in touch with the tech team if it seems like a good idea.
-- user: "first post on wikitech" phoebe
Greetings, We are setting up a research server at Concordia University (Canada) that is dedicated to Wikipedia. We would love to share the resources with anyone interested.
In case anyone needs help setting it up, we would love to help as well.
bilal
On Mon, Mar 9, 2009 at 8:07 PM, phoebe ayers phoebe.wiki@gmail.com wrote:
Hi all, I'm not sure exactly where to raise this, so am asking here.
A researcher I have been in touch with has proposed starting a 2nd, research-oriented Wikimedia toolserver. He thinks his lab can pay for the hardware and would be willing to maintain it, if they could get help setting it up. He got this idea after a member of his research group tried (unsuccessfully so far -- no response) to get an account on the current toolserver; their Wikipedia-related research has been put on hold for a few months because of the delay. (It seems like there is a big backlog of account requests right now and only one person working on them?) This research group has done some interesting Wikipedia research to date and I expect they could do more with access to the right data.
Personally, I think a dedicated toolserver is a great idea for the research community, but I know very little about the technical issues involved and/or whether this has been proposed before. Please comment, and I can pass on replies and put the researcher in touch with the tech team if it seems like a good idea.
-- user: "first post on wikitech" phoebe
--
- I use this address for lists; send personal messages to phoebe.ayers
<at> gmail.com *
Bilal Abdul Kader wrote:
Greetings, We are setting up a research server at Concordia University (Canada) that is dedicated to Wikipedia. We would love to share the resources with anyone interested.
In case anyone needs help setting it up, we would love to help as well.
bilal
There's a project for a biggish research cluster for Wikipedia data awaiting funding at Syracuse University. I forwarded your mail to one of the people involved. Perhaps you can join forces.
On Mon, Mar 9, 2009 at 8:07 PM, phoebe ayers phoebe.wiki@gmail.com wrote:
Hi all, I'm not sure exactly where to raise this, so am asking here.
A researcher I have been in touch with has proposed starting a 2nd, research-oriented Wikimedia toolserver. He thinks his lab can pay for the hardware and would be willing to maintain it, if they could get help setting it up. He got this idea after a member of his research group tried (unsuccessfully so far -- no response) to get an account on the current toolserver; their Wikipedia-related research has been put on hold for a few months because of the delay. (It seems like there is a big backlog of account requests right now and only one person working on them?) This research group has done some interesting Wikipedia research to date and I expect they could do more with access to the right data.
I apologize for the delay; perhaps you can send me some details in private, and I'll look at it. DaB doesn't have much time lately, and we had some major changes in infrastructure to take care of, which caused some delays.
Personally, I think a dedicated toolserver is a great idea for the research community, but I know very little about the technical issues involved and/or whether this has been proposed before. Please comment, and I can pass on replies and put the researcher in touch with the tech team if it seems like a good idea.
Whether it makes sense to run a separate cluster largely depends on what kind of data you need access to, and in what time frame. If you work mostly on secondary data like link tables, and you need the data in near-real time, use toolserver.org. That's what it's there for, and it's unlikely you could set up anything else that gets the same data with such low latency.
However, if you work mostly on full text, toolserver.org is not so useful - there's no direct access to full page text there, nor to search indexes. Having a dedicated cluster for research on textual content, perhaps providing content in various pre-processed forms, would be a very good idea. This is what the project I mentioned above aims at, and I'll be happy to support this effort officially, as Wikimedia Germany's tech guy.
-- daniel
I have a toolserver account and would be willing to run some queries for him. What kind of queries are you wanting to run?
Betacommand
On Mon, Mar 9, 2009 at 8:07 PM, phoebe ayers phoebe.wiki@gmail.com wrote:
Hi all, I'm not sure exactly where to raise this, so am asking here.
A researcher I have been in touch with has proposed starting a 2nd, research-oriented Wikimedia toolserver. He thinks his lab can pay for the hardware and would be willing to maintain it, if they could get help setting it up. He got this idea after a member of his research group tried (unsuccessfully so far -- no response) to get an account on the current toolserver; their Wikipedia-related research has been put on hold for a few months because of the delay. (It seems like there is a big backlog of account requests right now and only one person working on them?) This research group has done some interesting Wikipedia research to date and I expect they could do more with access to the right data.
Personally, I think a dedicated toolserver is a great idea for the research community, but I know very little about the technical issues involved and/or whether this has been proposed before. Please comment, and I can pass on replies and put the researcher in touch with the tech team if it seems like a good idea.
-- user: "first post on wikitech" phoebe
--
- I use this address for lists; send personal messages to phoebe.ayers
<at> gmail.com *
On Tue, Mar 10, 2009 at 11:07 AM, phoebe ayers phoebe.wiki@gmail.com wrote:
Personally, I think a dedicated toolserver is a great idea for the research community, but I know very little about the technical issues involved and/or whether this has been proposed before. Please comment, and I can pass on replies and put the researcher in touch with the tech team if it seems like a good idea.
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
On Mon, Mar 9, 2009 at 9:18 PM, Andrew Garrett andrew@werdn.us wrote:
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
. . . and this fact is also apparently a major reason for the slowness of new user review. New roots can't be added to the toolserver until the private data is moved off, so there are too few roots to add new users.
On Tue, Mar 10, 2009 at 12:33 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
. . . and this fact is also apparently a major reason for the slowness of new user review. New roots can't be added to the toolserver until the private data is moved off, so there are too few roots to add new users.
The bottleneck is in approval (by Wikimedia DE's representative Daniel), not in creating their accounts.
On Mon, Mar 9, 2009 at 9:54 PM, Andrew Garrett andrew@werdn.us wrote:
The bottleneck is in approval (by Wikimedia DE's representative Daniel), not in creating their accounts.
Oh. Why does a single specific person have to handle the approval of all toolserver account requests, then?
Aryeh Gregor:
Oh. Why does a single specific person have to handle the approval of all toolserver account requests, then?
because accounts have to be approved by WM-DE, and WM-DE has designated this person to approve accounts on their behalf.
- river.
On Mon, Mar 9, 2009 at 9:33 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
. . . and this fact is also apparently a major reason for the slowness of new user review. New roots can't be added to the toolserver until the private data is moved off, so there are too few roots to add new users.
Really? We just got a new root (Werdna) and normally regular roots do not handle new accounts anyway -- that job rests with the WMDE contact, currently DaB, doesn't it?
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
My understanding is that the toolserver(/s) are owned by the German chapter and not by Wikimedia directly, so why is private data being replicated onto them?
On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858snake@yahoo.com.au wrote:
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
My understanding is that the toolserver(/s) are owned by the German chapter and not by Wikimedia directly, so why is private data being replicated onto them?
Because it was chosen as the best technical solution. Is there a specific problem with private data being on the toolserver? If so, what?
You should be aware that toolserver roots are approved by the foundation before becoming roots.
On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett andrew@werdn.us wrote:
On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858snake@yahoo.com.au wrote:
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
My understanding is that the toolserver(/s) are owned by the German chapter and not by Wikimedia directly, so why is private data being replicated onto them?
Because it was chosen as the best technical solution. Is there a specific problem with private data being on the toolserver? If so, what?
I'd say the added worries about security and access approval are a "problem" partially bundled up with that, even if they can be worked around.
Logistically it would be nice to have a means of providing an exclusively public data replica for purposes such as research, though I can certainly see how that could get technically messy.
-Robert Rohde
Robert Rohde wrote:
On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett andrew@werdn.us wrote:
On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858snake@yahoo.com.au wrote:
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
My understanding is that the toolserver(/s) are owned by the German chapter and not by Wikimedia directly, so why is private data being replicated onto them?
Because it was chosen as the best technical solution. Is there a specific problem with private data being on the toolserver? If so, what?
I'd say the added worries about security and access approval are a "problem" partially bundled up with that, even if they can be worked around.
Logistically it would be nice to have a means of providing an exclusively public data replica for purposes such as research, though I can certainly see how that could get technically messy.
As far as I know, there is simply no efficient way to do this currently. MySQL's replication can be told to omit entire databases or tables, but not individual columns or even rows. That would be required, though. With the new revision-deletion feature, we have even more trouble.
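For readers who haven't looked at MySQL replication filtering, here is a minimal sketch of what that table-level granularity looks like in a slave's my.cnf; the database and table names are made up, and the point is only that the filters stop at whole databases or tables, with no column- or row-level option:

    # my.cnf on a replication slave -- hypothetical database/table names.
    [mysqld]
    # Whole databases can be excluded from replication...
    replicate-ignore-db = internalwiki
    replicate-ignore-db = officewiki
    # ...and so can whole tables (here, one table on every wiki database):
    replicate-wild-ignore-table = %.user_newtalk
    # But there is no directive that omits individual columns or rows,
    # e.g. only the rows where rev_deleted != 0.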
So, toolserver roots need to be trusted and approved by the foundation. However, account *approval* doesn't require root access. It doesn't require any access, technically. Account *creation* of course does, but that's not much of a problem (except currently, because of infrastructure changes due to new servers, but that will be fixed soon).
To avoid confusion: *two* Daniels can do approval: DaB and me. Neither of us has much time currently - DaB does it every now and then, and I don't do it at all, admittedly - I'm caught up in organizing the dev meeting and hardware orders besides doing my regular development jobs. I suppose we should streamline the process, yes. This would be a good topic for the developer meeting, maybe.
-- daniel
On Tue, Mar 10, 2009 at 12:29 AM, Andrew Garrett andrew@werdn.us wrote:
On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858snake@yahoo.com.au wrote:
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
My understanding is that the toolserver(/s) are owned by the German chapter and not by Wikimedia directly, so why is private data being replicated onto them?
Because it was chosen as the best technical solution. Is there a specific problem with private data being on the toolserver? If so, what?
You should be aware that toolserver roots are approved by the foundation before becoming roots.
You answer the questions in your first paragraph with your sentence in the second. Think Cathedral vs. Bazaar.
On Tue, Mar 10, 2009 at 4:27 AM, Daniel Kinzler daniel@brightbyte.de wrote: Robert Rohde wrote: On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett andrew@werdn.us wrote: Logistically it would be nice to have a means of providing an exclusively public data replica for purposes such as research, though I can certainly see how that could get technically messy.
As far as I know, there is simply no efficient way to do this currently.
How much information does the live feed provide? Every revision, or just a subset of revisions? How much would it cost the WMF to provide a single near-live stream of every revision?
Anthony wrote:
On Tue, Mar 10, 2009 at 12:29 AM, Andrew Garrett andrew@werdn.us wrote:
On Tue, Mar 10, 2009 at 3:21 PM, K. Peachey p858snake@yahoo.com.au wrote:
Currently all data, including private data, is replicated to the toolserver. We could not do this with a third-party server.
My understanding is that the toolserver(/s) are owned by the German chapter and not by Wikimedia directly, so why is private data being replicated onto them?
Because it was chosen as the best technical solution. Is there a specific problem with private data being on the toolserver? If so, what?
You should be aware that toolserver roots are approved by the foundation before becoming roots.
You answer the questions in your first paragraph with your sentence in the second. Think Cathedral vs. Bazaar.
On Tue, Mar 10, 2009 at 4:27 AM, Daniel Kinzler daniel@brightbyte.de wrote: Robert Rohde wrote: On Mon, Mar 9, 2009 at 9:29 PM, Andrew Garrett andrew@werdn.us wrote: Logistically it would be nice to have a means of providing an exclusively public data replica for purposes such as research, though I can certainly see how that could get technically messy.
As far as I know, there is simply no efficient way to do this currently.
How much information does the live feed provide? Every revision, or just a subset of revisions? How much would it cost the WMF to provide a single near-live stream of every revision?
A feed service for all revisions is available, see http://meta.wikimedia.org/wiki/Wikimedia_update_feed_service. Search engines like to use it (think: answers.com) and they are made to pay for it. Researchers should generally get it for free. Just ask Brion.
This doesn't provide notifications in the range of seconds (which might be needed for vandal-fighting tools), but should be quite sufficient to keep a text database up to date. For real-time notifications, the only decent method is the RC feed on IRC, but that's hard to parse and messages frequently get truncated.
Having better means for distributing notifications of changes is something I'm quite interested in. XMPP would be a very good choice, I think; I wrote about it a while ago here: http://brightbyte.de/page/RecentChanges_via_Jabber. I did not write about including full revision text or diffs in the notifications, but that's certainly possible. It may be a bit too heavy for a general purpose feed, but it would be feasible when using PubSub, I think. Anyway, getting this implemented would be nice. If anyone has time and/or money they could commit towards this, that would be excellent :)
-- daniel
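To give a feel for what consuming the current real-time option involves, here is a minimal, untested Python sketch that connects to the recent-changes IRC server and prints the raw channel messages. The host, channel and nick follow the usual irc.wikimedia.org conventions but should be treated as assumptions, and the colour-coded message format still needs exactly the fragile parsing described above.

    # Sketch only: read the recent-changes IRC feed and print raw messages.
    # Host/channel/nick are assumptions; parsing of the colour-coded
    # message text (which may be truncated) is left out entirely.
    import socket

    HOST = "irc.wikimedia.org"   # assumed feed host
    CHANNEL = "#en.wikipedia"    # assumed per-wiki channel name
    NICK = "rc-feed-sketch"

    def read_rc_feed():
        sock = socket.create_connection((HOST, 6667))
        f = sock.makefile("rw", encoding="utf-8", errors="replace", newline="")
        f.write("NICK %s\r\nUSER %s 0 * :rc feed sketch\r\n" % (NICK, NICK))
        f.flush()
        joined = False
        for line in f:
            line = line.rstrip("\r\n")
            if line.startswith("PING"):
                f.write("PONG" + line[4:] + "\r\n")
                f.flush()
            elif not joined and (" 376 " in line or " 422 " in line):
                # end of MOTD (or no MOTD) -> safe to join the channel
                f.write("JOIN %s\r\n" % CHANNEL)
                f.flush()
                joined = True
            elif "PRIVMSG %s :" % CHANNEL in line:
                # everything after the trailing ':' is the raw RC message,
                # with mIRC colour codes embedded
                print(line.split("PRIVMSG %s :" % CHANNEL, 1)[1])

    if __name__ == "__main__":
        read_rc_feed()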
phoebe ayers:
Personally, I think a dedicated toolserver is a great idea for the research community, but I know very little about the technical issues involved and/or whether this has been proposed before. Please comment, and I can pass on replies and put the researcher in touch with the tech team if it seems like a good idea.
i don't understand what "research-oriented" toolserver means. what will the research-toolserver provide that the current toolserver doesn't provide?
is the only issue the time it takes for accounts to be created? this is a WM-DE issue; the more people who complain to WM-DE about this, the more likely it is to be resolved. (so far, i've had zero communications from WM-DE about how the only people able to approve accounts are so busy with other things nowadays. on the other hand, i didn't ask them about it either; i suppose they don't bother monitoring the toolserver most of the time.)
we recently conducted a survey of toolserver users, and account approval (not creation) was generally felt to be quite slow. once i produce a report from the results of that survey, we might be able to get WM-DE to do something about it.
most of the issues with the current toolserver come down to money. we don't have enough money to afford redundant databases, so any failure is a major problem and creates inconvenience for users. we don't have enough money for a paid admin, so it often takes a long time for things to get done. we don't have enough money to upgrade hardware when we need it, so things are often slow until the money is available. i think the only non-money issue is that the Wikimedia Foundation won't allow us to add any more admins until they do some internal reorganisation of their databases, which we've been waiting for for several months now.
the more separate toolservers we have, the less efficiently the money is spent. sure, every chapter and university could have their own toolserver, but i don't see how that's a better situation than these people contributing to a single toolserver in order to fix the problems that prevent people from using it. i've lost count of how often i've heard "the toolserver sucks; let's start our own". what i don't understand is why no one says "the toolserver sucks; how can we make it better?". (there _has_ been some interest from other chapters recently about how to improve the toolserver; however, most chapters don't have a lot of money to spend. a single additional database server for the toolserver would cost at least EUR8'000.)
in the past, we had a lot of problems getting WM-DE to do anything for the toolserver (it seemed everyone there was busy with something else), but that's been better recently, so i think we're making some progress.
- river.
River Tarnell wrote:
i think the only non-money issue is that the Wikimedia Foundation won't allow us to add any more admins until they do some internal reorganisation of their databases, which we've been waiting for for several months now.
Is the MediaWiki table structure going to change? The RevisionDelete system is not friendly to partial replication, but doing things that way is precisely what [will] allow avoiding the row-copying from revision to archive of the 'old' deletion system.
Moreover, any more private method of sharing the tables (e.g. a trigger deleting the row when rev_deleted is set) would lose precisely the backup ability the toolserver is providing.
On Tue, Mar 10, 2009 at 7:54 PM, Platonides Platonides@gmail.com wrote:
Is the MediaWiki table structure going to change?
Yes, it changes on a regular basis.
Moreover, any more private method of sharing the tables (e.g. a trigger deleting the row when rev_deleted is set) would lose precisely the backup ability the toolserver is providing.
I don't think the toolserver is used for backups. At least I hope it's not, given its reliability (which is quite good, but "quite good" is scary for backups).
On 3/10/09 5:29 PM, Aryeh Gregor wrote:
On Tue, Mar 10, 2009 at 7:54 PM, PlatonidesPlatonides@gmail.com wrote:
Is mediawiki table structure going to change?
Yes, it changes on a regular basis.
Moreover, any more private method of sharing the tables (e.g. a trigger deleting the row when rev_deleted is set) would lose precisely the backup ability the toolserver is providing.
I don't think the toolserver is used for backups. At least I hope it's not, given its reliability (which is quite good, but "quite good" is scary for backups).
The existence of the replicas on toolserver is one of our backups. Obviously we want to improve our offsite backups to include complete offline snapshots as well. It's in progress. :)
-- brion
Aryeh Gregor:
I don't think the toolserver is used for backups.
it is, but only in the sense that it's our only off-site copy of the database. it was not created to act as a backup...
At least I hope it's not, given its reliability (which is quite good, but "quite good" is scary for backups).
... however, if we had enough money to support the toolserver properly, i think it would be perfectly reliable as a backup. that's something that might change this year.
- river.
On Wed, Mar 11, 2009 at 4:13 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
... however, if we had enough money to support the toolserver properly, i think it would be perfectly reliable as a backup.
It wouldn't protect against an accidental DELETE FROM page;, though, would it? Slaves of any kind aren't great backups unless they're deliberately lagged by a significant interval.
On Mar 11, 2009, at 6:14, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Wed, Mar 11, 2009 at 4:13 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
... however, if we had enough money to support the toolserver properly, i think it would be perfectly reliable as a backup.
It wouldn't protect against an accidental DELETE FROM page;, though, would it? Slaves of any kind aren't great backups unless they're deliberately lagged by a significant interval.
Quite so. :) Replication is fantastic against outright failure, but by itself doesn't help against data loss within the system, which gets replicated right along with it.
we're working on ensuring we've got regular snapshots as well, though this isn't up yet. Regular snapshots plus the replication binlogs provide for point-in-time restoration.
-- brion
On Wed, Mar 11, 2009 at 11:20 AM, Brion Vibber brion@wikimedia.org wrote:
Quite so. :) Replication is fantastic against outright failure, but by itself doesn't help against data loss within the system, which gets replicated right along with it.
we're working on ensuring we've got regular snapshots as well, though this isn't up yet. Regular snapshots plus the replication binlogs provide for point-in-time restoration.
Maybe you need to move the DB servers to ZFS on Solaris too. ;)
we're working on ensuring we've got regular snapshots as well, though this isn't up yet. Regular snapshots plus the replication binlogs provide for point-in-time restoration.
two clusters are on a regular snapshot schedule right now (the third one is waiting for some hardware maintenance). though, we'd better spot 'DELETE FROM page' within 4 hours or so - we keep snapshots for 4-8 hours.
Cheers,
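As a rough sketch of what "snapshots plus binlogs" makes possible, something like the following shell session would perform a point-in-time restore. All file names, paths and timestamps here are placeholders, and the real procedure depends on how the snapshots are actually taken.

    # Hypothetical point-in-time restore (paths and times are made up):
    # 1. load the most recent snapshot taken before the bad statement
    mysql < /backups/enwiki-snapshot-20090311-0400.sql
    # 2. replay the binary logs recorded since the snapshot, stopping just
    #    before the accidental "DELETE FROM page;" was issued
    mysqlbinlog --start-datetime="2009-03-11 04:00:00" \
                --stop-datetime="2009-03-11 06:59:59" \
                /var/log/mysql/binlog.000123 /var/log/mysql/binlog.000124 | mysql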
Platonides:
Is the MediaWiki table structure going to change? The RevisionDelete system is not friendly to partial replication, but doing things that way is precisely what [will] allow avoiding the row-copying from revision to archive of the 'old' deletion system.
sorry, i don't quite follow what you're saying here. we changed the view definitions on the toolserver when rev_deleted went into use to avoid exposing this information to the users. we don't use any sort of trigger.
BTW: there is no "partial replication" at the toolserver, although this is a common misconception. we replicate everything, then use views to expose the relevant data to users. this is why the foundation won't allow us to add any more admins; the internal / private wikis are also replicated to the toolserver, and visible to any admin.
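For anyone who has not seen that setup, here is a simplified, purely hypothetical example of the kind of view definition involved (the actual toolserver views are named and structured differently); it exposes the replicated revision table while blanking user and comment fields on rows flagged by rev_deleted (in MediaWiki's rev_deleted bitfield, 2 hides the comment and 4 hides the user):

    -- Hypothetical, simplified view; users query the view, not the raw table.
    CREATE VIEW revision_public AS
    SELECT rev_id,
           rev_page,
           rev_timestamp,
           rev_minor_edit,
           rev_len,
           -- blank out fields that RevisionDelete has flagged as hidden
           IF(rev_deleted & 4, NULL, rev_user)      AS rev_user,
           IF(rev_deleted & 4, NULL, rev_user_text) AS rev_user_text,
           IF(rev_deleted & 2, NULL, rev_comment)   AS rev_comment,
           rev_deleted
      FROM enwiki.revision;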
Moreover, any more private method of sharing the tables (e.g. a trigger deleting the row when rev_deleted is set) would lose precisely the backup ability the toolserver is providing.
i don't know what you mean by "more private", but the method we use has no effect at all on how useful the toolserver would be as a backup.
- river.
River Tarnell wrote:
Platonides:
Is the MediaWiki table structure going to change? The RevisionDelete system is not friendly to partial replication, but doing things that way is precisely what [will] allow avoiding the row-copying from revision to archive of the 'old' deletion system.
sorry, i don't quite follow what you're saying here. we changed the view definitions on the toolserver when rev_deleted went into use to avoid exposing this information to the users. we don't use any sort of trigger.
BTW: there is no "partial replication" at the toolserver, although this is a common misconception. we replicate everything, then use views to expose the relevant data to users. this is why the foundation won't allow us to add any more admins; the internal / private wikis are also replicated to the toolserver, and visible to any admin.
I know. That's precisely what I'm addressing. From your email, WMF is "reorganising their databases" so the toolserver can get more admins (less private data replicated/stored on the toolserver). Any such change to the schema would be pretty big, IMHO (and yet incomplete).
Moreover, any more private method of sharing the tables (e.g. a trigger deleting the row when rev_deleted is set) would lose precisely the backup ability the toolserver is providing.
i don't know what you mean by "more private", but the method we use has no effect at all on how useful the toolserver would be as a backup.
- river.
Changes so toolserver roots can't get /some/ information would.
On Wed, Mar 11, 2009 at 12:35 PM, Platonides Platonides@gmail.com wrote:
I know. That's precisely what I'm addressing. From your email, WMF is "reorganising their databases" so the toolserver can get more admins (less private data replicated/stored on the toolserver). Any such change to the schema would be pretty big, IMHO (and yet incomplete).
If I understand correctly, the only change being contemplated here is not replicating the databases that are entirely secret (databases of private wikis). Toolserver roots would still have access to things like the recentchanges table and hidden revisions on public wikis, and would presumably still have to sign NDAs or act as Foundation agents or whatever to access those.
I might be misunderstanding, though. If only entire databases need to be hidden, why can't the toolserver just be set up not to replicate those, given that MySQL supports that?
On 3/11/09 9:43 AM, Aryeh Gregor wrote:
On Wed, Mar 11, 2009 at 12:35 PM, Platonides Platonides@gmail.com wrote:
I know. That's precisely what I'm addressing. From your email, WMF is "reorganising their databases" so the toolserver can get more admins (less private data replicated/stored on the toolserver). Any such change to the schema would be pretty big, IMHO (and yet incomplete).
If I understand correctly, the only change being contemplated here is not replicating the databases that are entirely secret (databases of private wikis). Toolserver roots would still have access to things like the recentchanges table and hidden revisions on public wikis, and would presumably still have to sign NDAs or act as Foundation agents or whatever to access those.
I might be misunderstanding, though. If only entire databases need to be hidden, why can't the toolserver just be set up not to replicate those, given that MySQL supports that?
Could be done. We're also fine with new toolserver roots as long as we approve em too for now.
-- brion
Brion Vibber:
Could be done. We're also fine with new toolserver roots as long as we approve em too for now.
it would have been nice if the Toolserver was aware of this ;-)
- river.
On 3/12/09 7:14 AM, River Tarnell wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Brion Vibber:
Could be done. We're also fine with new toolserver roots as long as we approve em too for now.
it would have been nice if the Toolserver was aware of this ;-)
I was pretty sure this came up in an IRC chat a few months ago; my apologies if we didn't both realize it. :)
-- brion
Aryeh Gregor:
If I understand correctly, the only change being contemplated here is not replicating the databases that are entirely secret (databases of private wikis).
this is correct.
I might be misunderstanding, though. If only entire databases need to be hidden, why can't the toolserver just be set up not to replicate those, given that MySQL supports that?
because it would require a proxy server under WMF control that filtered out the evil tables and provided a "clean" replicated feed to the toolserver, which is a lot more effort (and more fragile) than just moving the bad data.
- river.
Hello everyone. We started the conversation with Phoebe about the possibility of a "research-oriented toolserver" that could be used by researchers who wish to explore novel gadgets or other tools for Wikipedia users. The toolserver could provide back-end support for these gadgets.
By the phrase "research-oriented toolserver" we are looking for similar services to what is available in the existing toolserver cluster. From what we've heard of the research infrastructures being developed at Syracuse and Concordia, they will be valuable for researchers who are in need of full text data access on a large scale. The research toolserver, by contrast, would be for tools that need "live" access to Wikipedia databases, but that would only access the full text on a small scale through the Wikipedia API.
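As a small illustration of the "full text on a small scale through the API" pattern described above, a tool could fetch the current wikitext of a single article roughly like the following Python sketch; the user-agent string and example title are placeholders, and there is no error handling.

    # Minimal sketch: fetch current wikitext of one page via the MediaWiki API.
    # Fine for small-scale, tool-style access; bulk text work belongs in dumps.
    import json
    import urllib.parse
    import urllib.request

    API = "https://en.wikipedia.org/w/api.php"

    def fetch_wikitext(title):
        params = urllib.parse.urlencode({
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "titles": title,
            "format": "json",
        })
        req = urllib.request.Request(API + "?" + params,
                                     headers={"User-Agent": "research-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]   # the classic API puts the text under "*"

    print(fetch_wikitext("Wikipedia:Toolserver")[:200])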
The major difference from our perspective is how applications for new accounts would be handled. Our idea is to be able to hand out accounts based around the likelihood of effective research, rather than on visibility within Wikipedia, or on the usefulness of the resulting tool to the larger Wikipedia community. The latter two cases are already handled well by the existing toolserver and its application process. Accounts on the research toolserver would be approved based on the quality of the research ideas, and the ability of the proposing team to carry out the research.
The research toolserver would need a more transparent decision-making process for approving accounts. The basis for decisions should be clear to applicants so they're able to write better applications, and denied applications should be returned with feedback about why the decision was made.
What do you think? Seem like a useful idea if we can find sufficient resources, and put together a management plan?
Morten Warncke-Wang, Research Assistant John Riedl, Professor GroupLens Research www.grouplens.org
The current toolserver user base is always willing to help. I for one am willing to review and run queries on the database if requested. I am also a very active Python programmer and can use that to assist. If you have requests, let me know. (Unless I'm losing it) there have been accounts that were created only to be used for database queries. But until then, feel free to email or contact me. I think that improving and expanding the current TS is the best option, as further duplication will result in lower performance.
Betacommand
On Wed, Mar 11, 2009 at 7:05 PM, Morten Warncke-Wang morten@cs.umn.eduwrote:
Hello everyone. We started the conversation with Phoebe about the possibility of a "research-oriented toolserver" that could be used by researchers who wish to explore novel gadgets or other tools for Wikipedia users. The toolserver could provide back-end support for these gadgets.
By the phrase "research-oriented toolserver" we are looking for similar services to what is available in the existing toolserver cluster. From what we've heard of the research infrastructures being developed at Syracuse and Concordia, they will be valuable for researchers who are in need of full text data access on a large scale. The research toolserver, by contrast, would be for tools that need "live" access to Wikipedia databases, but that would only access the full text on a small scale through the Wikipedia API.
The major difference from our perspective is how applications for new accounts would be handled. Our idea is to be able to hand out accounts based around the likelihood of effective research, rather than on visibility within Wikipedia, or on the usefulness of the resulting tool to the larger Wikipedia community. The latter two cases are already handled well by the existing toolserver and its application process. Accounts on the research toolserver would be approved based on the quality of the research ideas, and the ability of the proposing team to carry out the research.
The research toolserver would need a more transparent decision-making process for approving accounts. The basis for decisions should be clear to applicants so they're able to write better applications, and denied applications should be returned with feedback about why the decision was made.
What do you think? Seem like a useful idea if we can find sufficient resources, and put together a management plan?
Morten Warncke-Wang, Research Assistant John Riedl, Professor GroupLens Research www.grouplens.org
On Wed, Mar 11, 2009 at 7:05 PM, Morten Warncke-Wang morten@cs.umn.edu wrote:
The major difference from our perspective is how applications for new accounts would be handled. Our idea is to be able to hand out accounts based around the likelihood of effective research, rather than on visibility within Wikipedia, or on the usefulness of the resulting tool to the larger Wikipedia community. The latter two cases are already handled well by the existing toolserver and its application process. Accounts on the research toolserver would be approved based on the quality of the research ideas, and the ability of the proposing team to carry out the research.
As far as I know, the account approval process on the toolserver is fairly lax. As long as you have some credible Wikipedia-related reason to use the toolserver, whether tools or research, you should be able to get an account. Am I wrong? Have any researchers been rejected from the toolserver?
Aryeh Gregor wrote:
As far as I know, the account approval process on the toolserver is fairly lax. As long as you have some credible Wikipedia-related reason to use the toolserver, whether tools or research, you should be able to get an account. Am I wrong? Have any researchers been rejected from the toolserver?
The main reason for rejection is, as far as I can see, that the project would take too many resources to be feasible on the toolserver. Or it requires access to data we don't have, such as bulk access to page content. Or it needs access to private data, such as IP addresses. Or to access logs, which we don't have.
Research projects that can sanely be expected to actually work on the toolserver should always be approved, even if they have no immediate utility to wiki users.
-- daniel
The major difference from our perspective is how applications for new accounts would be handled. Our idea is to be able to hand out accounts based around the likelihood of effective research, rather than on visibility within Wikipedia, or on the usefulness of the resulting tool to the larger Wikipedia community. The latter two cases are already handled well by the existing toolserver and its application process. Accounts on the research toolserver would be approved based on the quality of the research ideas, and the ability of the proposing team to carry out the research.
The research toolserver would need a more transparent decision-making process for approving accounts. The basis for decisions should be clear to applicants so they're able to write better applications, and denied applications should be returned with feedback about why the decision was made.
What do you think? Seem like a useful idea if we can find sufficient resources, and put together a management plan?
If the only problem solved by setting up a dedicated research cluster is that of the account approval system, then by all means let's fix the system on the toolserver, and keep things together. Apart from the fact that full database replication to a third-party system is very unlikely to happen for legal reasons, it would be a waste of hardware and effort.
For a system with a very different focus, such as text crunching, a separate cluster seems worth considering, even though I'd of course prefer to have everything available to "our" users. But a second system with a spec very similar to ours (live replicated metadata) seems wasteful, even if replication were technically and legally feasible.
Let's try to fix the problems of the current toolserver, starting with the application process and continuing with a plan for how research projects could contribute to the hardware platform and infrastructure software.
As to the approval policy: research projects are usually approved if their resource requirements are not too steep. Utility to the Wikimedia user community is only one factor that is considered; it's not required for research projects. Making the process more transparent and giving feedback more swiftly is indeed something we should work on. In fact, I will try to set aside a fixed amount of working time for this.
-- daniel
Morten Warncke-Wang:
What do you think? Seem like a useful idea if we can find sufficient resources, and put together a management plan?
no, like Daniel said, this is a waste of time and effort. i originally assumed that a research toolserver would be different in some technical sense, which might make at least some sense (although i've argued against that elsewhere in this thread). however, i completely fail to understand your reasoning here.
is there some backstory i'm missing? did you apply for a Toolserver account and were rejected because you aren't a Wikipedia editor? does WM-DE have a history of doing this? (i'm certainly not aware of it, if so...)
if you want to improve the account approval process at the Toolserver, doesn't it make more sense to do that, rather than creating a completely new project to fix one small issue?
- river.
Hi all,
Judging by the replies we think we've failed to communicate clearly some of the ideas we wanted to put forward, and we'd like to take the opportunity to try to clear that up.
We did not want to narrow this down to be only about a third-party toolserver. Before we initiated contact we noticed the need for adding more resources to the existing cluster. Therefore we also had in mind the idea of augmenting the toolserver, rather than attempting to create a competitor for it. For instance, this could help allow the toolserver to also host applications requiring some amount of text crunching, which is currently not feasible as far as we can tell.
Additionally, we think there could perhaps be two paths to account creation, one for Wikipedians and one for researchers. The research path would be laid out with clearer documentation on the requirements projects would need to meet to fit the toolserver and on what the application should contain, which, combined with faster feedback, would make the process easier for researchers.
We hope that this clears up some central points in our ideas surrounding a "research oriented toolserver". Currently we are exploring several ideas and this particular one might not become more than a thought and a thread on a mailing list. Nonetheless perhaps there are thoughts here that can become more solid somewhere down the line.
Morten Warncke-Wang, Research Assistant John Riedl, Professor GroupLens Research www.grouplens.org
On Fri, Mar 13, 2009 at 5:42 PM, Morten Warncke-Wang morten@cs.umn.edu wrote:
Additionally, we think there could perhaps be two paths to account creation, one for Wikipedians and one for researchers. The research path would be laid out with clearer documentation on the requirements projects would need to meet to fit the toolserver and on what the application should contain, which, combined with faster feedback, would make the process easier for researchers.
I don't see why a better-documented or faster approval track need only be provided for researchers. If the existing process is poorly documented or slow, that should be fixed for all users.
I think what the toolserver guys are saying is that they've got the data (e.g., a replica of the master database) and they are willing to expand operations to include larger-scale computations, and so yes, they are willing to become more "research oriented". They just need the extra hardware, of course. I think it's difficult to estimate how much, but here are some applications that I would like to make or see made sooner or later:
* WikiBlame - A Lucene index of the history of all projects that can instantly find the authors of a pasted snippet (a toy sketch of the lookup idea follows this list). I'm not clear on the memory requirements of hosting an app like this after the index is created, but the index will be terabyte-sized at 35% of the text dump.
* WikiBlame for images - an image similarity algorithm over all images in all projects that can find all places a given image is being used. I believe there is a one-time major cpu cost when first analyzing the images and then a much lesser realtime comparison cost. Again, the memory requirements of hosting such an app are unclear.
* A vandalism classifier bot that uses the entire history of a wiki in order to predict whether the current edit is vandalism. Basically, a major extension of existing published work on automatically detecting vandalism, which only used several hundred edits. This would require major cpu resources for training but very little cost for real-time classification.
* Dumps, including extended dump formats such as a natural language parse of the full text of the recent version of a wiki made readily available for researchers.
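As referenced in the WikiBlame item above, here is a toy Python sketch of the underlying "map snippets to authors" lookup, using plain word shingles in memory instead of a Lucene index; it is purely illustrative and ignores the scaling problem that makes the real index terabyte-sized.

    # Toy illustration of the WikiBlame lookup idea: map overlapping word
    # shingles from every revision to that revision's author, then look up a
    # pasted snippet by its shingles. A real system would use Lucene and need
    # terabytes of index; this only shows the shape of the query.
    from collections import defaultdict

    SHINGLE = 5  # words per shingle (arbitrary choice for this sketch)

    def shingles(text):
        words = text.lower().split()
        return {" ".join(words[i:i + SHINGLE])
                for i in range(max(1, len(words) - SHINGLE + 1))}

    def build_index(revisions):
        """revisions: iterable of (author, wikitext) pairs."""
        index = defaultdict(set)
        for author, text in revisions:
            for sh in shingles(text):
                index[sh].add(author)
        return index

    def blame(index, snippet):
        """Return authors ranked by how many of the snippet's shingles they match."""
        counts = defaultdict(int)
        for sh in shingles(snippet):
            for author in index.get(sh, ()):
                counts[author] += 1
        return sorted(counts, key=counts.get, reverse=True)

    # Tiny usage example with made-up revisions:
    idx = build_index([("Alice", "the quick brown fox jumps over the lazy dog"),
                       ("Bob",   "a completely different sentence about foxes")])
    print(blame(idx, "quick brown fox jumps over"))   # -> ['Alice']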
Finally, there are many worthwhile projects that have been presented at past Wikimanias or published in the literature that deserve to be kept up to date as the encyclopedia continues to grow. Permanent hosting for such projects would be a worthwhile goal, as would reaching out to these researchers. If the foundation can afford such an endeavor, the hardware cost is actually not that great. Perhaps datacenter fees are.
On Fri, Mar 13, 2009 at 3:42 PM, Morten Warncke-Wang morten@cs.umn.edu wrote:
Hi all,
Judging by the replies we think we've failed to communicate clearly some of the ideas we wanted to put forward, and we'd like to take the opportunity to try to clear that up.
We did not want to narrow this down to be only about a third-party toolserver. Before we initiated contact we noticed the need for adding more resources to the existing cluster. Therefore we also had in mind the idea of augmenting the toolserver, rather than attempting to create a competitor for it. For instance, this could help allow the toolserver to also host applications requiring some amount of text crunching, which is currently not feasible as far as we can tell.
Additionally, we think there could perhaps be two paths to account creation, one for Wikipedians and one for researchers. The research path would be laid out with clearer documentation on the requirements projects would need to meet to fit the toolserver and on what the application should contain, which, combined with faster feedback, would make the process easier for researchers.
We hope that this clears up some central points in our ideas surrounding a "research oriented toolserver". Currently we are exploring several ideas and this particular one might not become more than a thought and a thread on a mailing list. Nonetheless perhaps there are thoughts here that can become more solid somewhere down the line.
Morten Warncke-Wang, Research Assistant John Riedl, Professor GroupLens Research www.grouplens.org
Brian wrote:
I think what the toolserver guys are saying is that they've got the data (e.g., a replica of the master database) and they are willing to expand operations to include larger-scale computations, and so yes, they are willing to become more "research oriented". They just need the extra hardware, of course. I think it's difficult to estimate how much, but here are some applications that I would like to make or see made sooner or later:
- WikiBlame - A Lucene index of the history of all projects that can
instantly find the authors of a pasted snippet. I'm not clear on the memory requirements of hosting an app like this after the index is created, but the index will be terabyte-size at 35% of the text dump.
Note that WikiTrust can do this too, and will probably go into testing soon. For now, the database for WikiTrust will be off-site, but if it goes live on Wikipedia, the hardware would be run at the main WMF cluster, and not on the toolserver.
- WikiBlame for images - an image similarity algorithm over all images
in all projects that can find all places a given image is being used. I believe there is a one-time major cpu cost when first analyzing the images and then a much lesser realtime comparison cost. Again, the memory requirements of hosting such an app are unclear.
That would be very nice to have...
- A vandalism classifier bot that uses the entire history of a wiki in
order to predict whether the current edit is vandalism. Basically, a major extension of existing published work on automatically detecting vandalism, which only used several hundred edits. This would require major cpu resources for training but very little cost for real-time classification.
Pretty big for a toolserver project. But an excellent research topic!
- Dumps, including extended dump formats such as a natural language
parse of the full text of the recent version of a wiki made readily available for researchers.
Finally, there are many worthwhile projects that have been presented at past Wikimanias or published in the literature that deserve to be kept up to date as the encyclopedia continues to grow. Permanent hosting for such projects would be a worthwhile goal, as would reaching out to these researchers. If the foundation can afford such an endeavor, the hardware cost is actually not that great. Perhaps datacenter fees are.
Please don't forget that the toolserver is NOT run by the Wikimedia Foundation. It's run by Wikimedia Germany, which has maybe a tenth of the foundation's budget. If the foundation is interested in supporting us further, that's great; we just need to keep responsibilities clear: is the foundation running a project, or is the foundation helping us (Wikimedia Germany) to run a project?...
-- daniel
How will WikiTrust accomplish the WikiBlame function? I think I know what WikiTrust is: http://trust.cse.ucsc.edu/
What gives it the function that you can enter a piece of wiki code from the history of any wiki - totally out of context - and it returns the authors?
On Sat, Mar 14, 2009 at 2:02 AM, Daniel Kinzler daniel@brightbyte.de wrote:
Brian wrote:
- WikiBlame - A Lucene index of the history of all projects that can
instantly find the authors of a pasted snippet. I'm not clear on the memory requirements of hosting an app like this after the index is created, but the index will be terabyte-size at 35% of the text dump.
Note that WikiTrust can do this too, and will probably go into testing soon. For now, the database for WikiTrust will be off-site, but if it goes live on Wikipedia, the hardware would be run at the main WMF cluster, and not on the toolserver.
Brian wrote:
How will WikiTrust accomplish the WikiBlame function? I think I know what WikiTrust is: http://trust.cse.ucsc.edu/
What gives it the function that you can enter a piece of wiki code from the history of any wiki - totally out of context - and it returns the authors?
Ah, no - it does need the context. It just provides regular "blame" functionality, as known from svn etc.: for a given article, it can show you which parts of the text have been added by which author. It can also calculate the percentage of contribution based on word count. It's not designed to find out where a piece of text comes from - that could be quite useful. Perhaps the data WikiTrust generates can be useful in building the index for the extended blame function - WikiTrust has good algorithms for tracking pieces of text moving around a page, etc.
-- daniel
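For readers unfamiliar with revision-level blame, here is a small, self-contained Python sketch of the general idea described above (attributing each word of the latest text to the revision that introduced it) using difflib; WikiTrust's actual algorithms, which also track text that moves around the page, are considerably more sophisticated.

    # Rough sketch of per-word blame over a revision history, in the spirit of
    # the "regular blame" functionality described above. WikiTrust's real
    # algorithms (which also follow text moved around the page) are much smarter.
    import difflib

    def blame_history(revisions):
        """revisions: list of (author, text) pairs in chronological order.
        Returns a list of (word, author) pairs for the latest revision."""
        attributed = []  # (word, author) pairs for the current text
        for author, text in revisions:
            new_words = text.split()
            old_words = [w for w, _ in attributed]
            matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
            updated = []
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op == "equal":
                    updated.extend(attributed[i1:i2])          # keep old attribution
                elif op in ("insert", "replace"):
                    updated.extend((w, author) for w in new_words[j1:j2])
                # "delete": those words are gone, nothing to carry forward
            attributed = updated
        return attributed

    history = [("Alice", "wikis are collaboratively edited websites"),
               ("Bob",   "wikis are collaboratively edited websites anyone can change")]
    for word, author in blame_history(history):
        print("%-15s %s" % (word, author))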
Morten Warncke-Wang wrote:
Hi all,
Judging by the replies we think we've failed to communicate clearly some of the ideas we wanted to put forward, and we'd like to take the opportunity to try to clear that up.
We did not want to narrow this down to be only about a third-party toolserver. Before we initiated contact we noticed the need for adding more resources to the existing cluster. Therefore we also had in mind the idea of augmenting the toolserver, rather than attempting to create a competitor for it. For instance, this could help allow the toolserver to also host applications requiring some amount of text crunching, which is currently not feasible as far as we can tell.
That would be excellent.
Additionally, we think there could perhaps be two paths to account creation, one for Wikipedians and one for researchers. The research path would be laid out with clearer documentation on the requirements projects would need to meet to fit the toolserver and on what the application should contain, which, combined with faster feedback, would make the process easier for researchers.
I think this should be done for all accounts. Why only researchers?
We hope that this clears up some central points in our ideas surrounding a "research oriented toolserver". Currently we are exploring several ideas and this particular one might not become more than a thought and a thread on a mailing list. Nonetheless perhaps there are thoughts here that can become more solid somewhere down the line.
In order to develop ideas, it would be useful to get some idea of what kind of resources you think you can contribute, and under what terms and in what timeframe. I know that talking money in public is usually a bad idea, especially if the money isn't really there yet. If you like, contact me in private, preferably at my office address, daniel.kinzler AT wikimedia.de. I'm responsible for toolserver operations, so I suppose it's my job to look into this.
-- daniel
On Tuesday 10 March 2009 01:07:36, phoebe ayers wrote:
their Wikipedia-related research has been put on hold for a few months because of the delay. (It seems like there is a big backlog of account requests right now and only one person working on them?)
hello, I'm DaB., and I'm the lazy guy who normally approves the accounts. I'm sorry that your request took a lot of time; perhaps I can tell you why it took so long: You requested your account at the end of last year. At that time our servers were quite loaded and we were waiting for additions, so I decided not to create new accounts at first. At the beginning of December we planned which new servers we would buy, and we hoped to buy them in December. For some reason that didn't work out, and we didn't buy them until January. So I decided to create no new accounts before the delivery. But it took several weeks until the servers were delivered, one week more to set them up, and another week to check them. Now we have the resources to create new accounts, but then I got the flu (and still have it). I hope that I can create new accounts soon. Daniel was so nice as to offer his help, so it should not take much longer.
And BTW: I saw all the emails, wiki-emails and wiki-messages that you and some others sent; you were not ignored.
Sincerely, DaB.