Thanks for the responses, all.
Daniel and Bilal: the notes about the possible servers at Syracuse and Concordia are very interesting; it sounds like the researchers interested in such things should team up.
Daniel: I am not sure what type of data is needed -- this is not my project (I'm only the messenger!) but I'll pass along your message and send you private details (and encourage the researcher to reply himself).
River: Well, you say that part of the issue with the toolserver is money and time... and this person that I've been talking to is offering to throw money and time at the problem. So, what can they constructively do?
All: Like I said, I am unclear on the technical issues involved, but as for why a separate "research toolserver" might be useful: I see a difference in the type of information a researcher might want to pull (public data, large sets of related page information, full-text mining, ??) and the types of tools that the current toolserver mainly supports (edit-count tools, CatScan, etc.). I also see a difference in how the two groups might be authenticated -- there's a difference between being a trusted Wikipedian or trusted Wikimedia developer and being a trusted, technically competent researcher (for instance, I recognized the affiliation of the person who was trying to apply because I've read their research papers; but if you were going on Wikimedia status alone, they don't have any).
-- Phoebe
phoebe ayers:
> River: Well, you say that part of the issue with the toolserver is money and time... and this person that I've been talking to is offering to throw money and time at the problem. So, what can they constructively do?
i think this is being discussed privately now...
> I see a difference in the type of information a researcher might want to pull (public data, large sets of related page information, full-text mining, ??) and the types of tools that the current toolserver mainly supports (edit-count tools, CatScan, etc.).
so, what is missing from the current toolserver that prevents researchers from working with large data sets?
> I also see a difference in how the two groups might be authenticated -- there's a difference between being a trusted Wikipedian or trusted Wikimedia developer and being a trusted, technically competent researcher
i don't see why access to the toolserver would be restricted to Wikipedia editors. in fact, i'd be happier giving access to a recognised academic expert than some random guy on Wikipedia.
- river.
On Tue, Mar 10, 2009 at 1:27 PM, River Tarnell <river@loreley.flyingparchment.org.uk> wrote:
> phoebe ayers:
>> River: Well, you say that part of the issue with the toolserver is money and time... and this person that I've been talking to is offering to throw money and time at the problem. So, what can they constructively do?
> i think this is being discussed privately now...
If other research groups are interested in contributing to this, who should they be talking to?
<snip>
> i don't see why access to the toolserver would be restricted to Wikipedia editors. in fact, i'd be happier giving access to a recognised academic expert than some random guy on Wikipedia.
The converse of this is that some recognized experts would probably prefer to administer their own server/cluster rather than relying on some random guy with Wikimedia DE (or wherever) to get things done.
-Robert Rohde
Robert Rohde wrote:
> On Tue, Mar 10, 2009 at 1:27 PM, River Tarnell <river@loreley.flyingparchment.org.uk> wrote:
>> phoebe ayers:
>>> River: Well, you say that part of the issue with the toolserver is money and time... and this person that I've been talking to is offering to throw money and time at the problem. So, what can they constructively do?
>> i think this is being discussed privately now...
> If other research groups are interested in contributing to this, who should they be talking to?
Wikimedia Germany. That is, I guess, me. Send mail to daniel dot kinzler at wikimedia dot de. I'll forward it as appropriate.
>> i don't see why access to the toolserver would be restricted to Wikipedia editors. in fact, i'd be happier giving access to a recognised academic expert than some random guy on Wikipedia.
> The converse of this is that some recognized experts would probably prefer to administer their own server/cluster rather than relying on some random guy with Wikimedia DE (or wherever) to get things done.
An academic institution may also be able to get a serious research grant for this -- that would be more complicated if the money were handled via the German chapter. Though it's something we are, of course, also interested in.
Basically, if we could all work on making the toolserver THE ONE PLACE for working with Wikipedia's data, that would be perfect. If, for some reason, it makes sense to build a separate cluster, I propose to give it a distinct purpose and profile: let it provide facilities for full-text research, with low priority for update latency, and high priority on having full text in various forms, with search indexes, word lists, and all the fun.
Regards, Daniel
On Tue, Mar 10, 2009 at 2:18 PM, Daniel Kinzler <daniel@brightbyte.de> wrote:
> Robert Rohde wrote:
>> The converse of this is that some recognized experts would probably prefer to administer their own server/cluster rather than relying on some random guy with Wikimedia DE (or wherever) to get things done.
> An academic institution may also be able to get a serious research grant for this -- that would be more complicated if the money were handled via the German chapter. Though it's something we are, of course, also interested in.
> Basically, if we could all work on making the toolserver THE ONE PLACE for working with Wikipedia's data, that would be perfect. If, for some reason, it makes sense to build a separate cluster, I propose to give it a distinct purpose and profile: let it provide facilities for full-text research, with low priority for update latency, and high priority on having full text in various forms, with search indexes, word lists, and all the fun.
Personally I would favor a physically distinct cluster (regardless of who administers it), more or less with the focus you describe. In particular, I think it is useful to separate "tools" from "analysis". A "tool" aims to provide useful information in near real time based on specific and focused parameters. By contrast, "analysis" often involves running some process systematically through a very large portion of the data with the expectation that it will take a while (for example, I've used dumps to perform large statistical analyses where the processing code might take 24 hours when run against the full edit history of a large wiki). "Tools" need high availability and low lag relative to the live site, but "analysis" doesn't care if it gets out of date and should use scheduling etc. to balance large loads.
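As a minimal sketch of what such an "analysis" job looks like in practice (assuming a local pages-meta-history dump in bz2 form; the file name and the byte-counting stand-in for real analysis are placeholders):

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-pages-meta-history.xml.bz2"   # hypothetical local file

def revisions(path):
    # Stream the dump and yield the wikitext of each revision without
    # holding the whole file in memory.
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]   # ignore the export namespace
            if tag == "text":
                yield elem.text or ""
            elif tag == "page":
                elem.clear()                    # free finished pages

total = 0
for i, text in enumerate(revisions(DUMP), 1):
    total += len(text)                          # stand-in for real analysis
    if i % 100000 == 0:
        print(i, "revisions processed,", total, "bytes of wikitext seen")

Nothing here needs low lag or high availability; it just needs to be allowed to run for a long time.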
-Robert Rohde
Robert Rohde:
> In particular, I think it is useful to separate "tools" from "analysis".
why?
"Tools" need high availability and low lag relative to the live site, but "analysis" doesn't care if it gets out of date and should use scheduling etc. to balance large loads.
what is preventing people from using the current toolserver for this analysis? what do we need to change about the platform that will enable people to run it on the current toolserver?
- river.
I vote for making the toolserver the head node of a much larger Beowulf cluster with a well-configured job scheduler. The data that needs to be crunched is already right there -- it makes sense to put a research cluster there as well.
There will always be a limited supply of resources. Perhaps there should be a public approval system for the resources, where the community gets to pick which jobs should get added to the queue based on public analysis of the code and a description of the computation.
There will be no shortage of participants ;)
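As a toy sketch of what such an approval-gated queue might look like (all names here are invented, and a real setup would sit in front of whatever scheduler the cluster actually runs):

from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    owner: str
    description: str     # public description of the computation
    code_url: str        # link to the code, for public review
    approved: bool = False

class ApprovalQueue:
    def __init__(self):
        self.pending = []          # awaiting community review
        self.runnable = deque()    # approved, waiting for cluster slots

    def submit(self, job):
        self.pending.append(job)

    def approve(self, job):
        job.approved = True
        self.pending.remove(job)
        self.runnable.append(job)  # FIFO once approved

    def next_job(self):
        return self.runnable.popleft() if self.runnable else None

q = ApprovalQueue()
q.submit(Job("brian", "Lucene index of the full revision history",
             "https://example.org/wikiblame-indexer"))  # hypothetical URL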
Brian:
> I vote for making the toolserver the head node of a much larger Beowulf cluster with a well-configured job scheduler.
so the issue is that more CPU is needed to run the research jobs? how much more? do you have an example of a job and what it would require to run here?
- river.
Sure -- creating a Lucene index of the entire revision history of all Wikipedias, for a WikiBlame extension.
More realistically (although I would like to do the above), a natural-language parse of the current revisions of the English Wikipedia. Based on the supposed availability of this hardware, I'd say it could be done in less than a week.
https://wiki.toolserver.org/view/Servers
I have to say the toolserver has grown a lot from that first donated server ^_^
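As a back-of-envelope check on the "less than a week" figure above (every number below is an assumption rather than a measurement -- the article count, per-article parse time, and core count are illustrative only):

articles = 2_800_000        # assumed enwiki article count, early 2009
secs_per_article = 2.0      # assumed cost of one natural-language parse
cores = 32                  # assumed cores available on a research cluster

cpu_hours = articles * secs_per_article / 3600
wall_days = cpu_hours / cores / 24
print(round(cpu_hours), "CPU-hours, about", round(wall_days, 1), "days wall-clock")
# With these assumptions: ~1556 CPU-hours, roughly 2 days -- consistent with
# "less than a week" as long as the per-article cost stays in the low seconds.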
On Wed, Mar 11, 2009 at 2:09 AM, Brian <Brian.Mingus@colorado.edu> wrote:
> Sure -- creating a Lucene index of the entire revision history of all Wikipedias, for a WikiBlame extension.
> More realistically (although I would like to do the above), a natural-language parse of the current revisions of the English Wikipedia. Based on the supposed availability of this hardware, I'd say it could be done in less than a week.
> https://wiki.toolserver.org/view/Servers
> I have to say the toolserver has grown a lot from that first donated server ^_^
I will confess that this server list is significantly more impressive than I expected it to be based on historical recollections.
To answer River's question, I would basically agree with Brian. The starting point is providing full-text history availability; once you have that, there are a number of different projects (like WikiBlame) which would want to pull and process every revision in some way. Some of the code I've worked with would probably take weeks to run single-threaded against enwiki, but that can be made practical if one is willing to throw enough cores at the problem. From an exterior point of view it often seems like the toolserver is significantly lagged or tools are going down, and from that I have generally assumed that it operates relatively close to capacity a lot of the time. Perhaps that is a bad assumption, and there is in fact plenty of spare capacity?
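As a minimal sketch of the "throw cores at it" approach (assuming the per-page work can be written as an independent function; analyse() and the page iterator below are placeholders):

from multiprocessing import Pool

def analyse(page_text):
    # stand-in for the real per-page work (diffing, link extraction, NLP, ...)
    return len(page_text.split())

def page_texts():
    # placeholder: in practice this would stream pages from a dump or database
    yield "Example page one."
    yield "Example page two, slightly longer."

if __name__ == "__main__":
    with Pool(processes=8) as pool:    # assumed core count
        results = pool.imap_unordered(analyse, page_texts(), chunksize=16)
        print(sum(results), "words counted across all pages")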
-Robert Rohde
Robert Rohde:
> The starting point is providing full-text history availability; once you have that, there are a number of different projects (like WikiBlame) which would want to pull and process every revision in some way.
okay, so full text access has been a 'would be nice' thing for a while. i added an item to this year's shopping list for it.
it seems more useful to provide the text in uncompressed form, instead of the MediaWiki internal form that's almost impossible to work with. does that seem reasonable?
> Some of the code I've worked with would probably take weeks to run single-threaded against enwiki, but that can be made practical if one is willing to throw enough cores at the problem.
well, this probably isn't something we could afford ourselves, but if there's enough interest in a batch computing infrastructure, it's probably worth talking to external organisations about this.
> From an exterior point of view it often seems like the toolserver is significantly lagged or tools are going down, and from that I have generally assumed that it operates relatively close to capacity a lot of the time.
that is correct. the way it works is we run at or over capacity for a while, until we can afford new hardware, then things are fast for a while, until we reach capacity again. this repeats every year or so. (interestingly, this is exactly how Wikipedia worked in the first few years.)
- river.
River Tarnell wrote:
> it seems more useful to provide the text in uncompressed form, instead of the MediaWiki internal form that's almost impossible to work with. does that seem reasonable?
The tools should get the text in uncompressed form. The interface to do that is not so important. Given the amount of text, I don't think storing text with some kind of compression is something to discard right away.
A common data access interface would be interesting -- perhaps as a C library to link against, or as a PHP extension. Then implement it for different sources:
- Toolserver text replication
- WikiProxy
- MySQL MediaWiki database
- MediaWiki API
- XML dump
Then applications just need to be designed for the text interface, debugged with a local install, tested with a small dump, deployed on toolserver...
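As a rough sketch of that interface idea (in Python rather than as a C library or PHP extension; only two of the five sources are shown, and all class and method names are invented for illustration):

import bz2, json
import urllib.parse, urllib.request
import xml.etree.ElementTree as ET

class TextSource:
    # Common interface: fetch the current wikitext of a page by title.
    def get_text(self, title):
        raise NotImplementedError

class ApiTextSource(TextSource):
    def __init__(self, api_url="https://en.wikipedia.org/w/api.php"):
        self.api_url = api_url

    def get_text(self, title):
        params = urllib.parse.urlencode({
            "action": "query", "prop": "revisions", "rvprop": "content",
            "titles": title, "format": "json",
        })
        req = urllib.request.Request(self.api_url + "?" + params,
                                     headers={"User-Agent": "textsource-sketch/0.1"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        page = next(iter(data["query"]["pages"].values()))
        return page["revisions"][0]["*"]    # content key in the classic JSON format

class DumpTextSource(TextSource):
    # Loads a (small, pages-articles style) dump into memory; fine for testing.
    def __init__(self, dump_path):
        self.texts = {}
        title = None
        with bz2.open(dump_path, "rb") as f:
            for _, elem in ET.iterparse(f, events=("end",)):
                tag = elem.tag.rsplit("}", 1)[-1]
                if tag == "title":
                    title = elem.text
                elif tag == "text":
                    self.texts[title] = elem.text or ""
                elif tag == "page":
                    elem.clear()

    def get_text(self, title):
        return self.texts[title]

Application code would only see TextSource, so the same tool could be debugged against a small local dump and later pointed at the live API or a replicated database.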
Brian:
> Sure -- creating a Lucene index of the entire revision history of all Wikipedias, for a WikiBlame extension.
> a natural-language parse of the current revisions of the English Wikipedia.
can you estimate what resources (disk/CPU/etc.) would be needed to create and maintain either of these?
- river.
On Wed, Mar 11, 2009 at 2:40 AM, River Tarnell <river@loreley.flyingparchment.org.uk> wrote:
> can you estimate what resources (disk/CPU/etc.) would be needed to create and maintain either of these?
A useful baseline to think about might be the database backup dumper. Proposals that require processing the content of every revision are structurally similar to what is required to build and compress a full-history dump. Obviously dump generation is a months-long process for enwiki right now, but if one is going to add a text service to the toolserver then perhaps there are ways to do that which would cut down on bottlenecks.
-Robert Rohde
I've been trying to do some work mining the full en dump with revision history and was involved in getting together the Syracuse grant proposal. To give you an idea, for me personally, the incentive for a new resource is a need for a server (perhaps a cluster) to support full-text queries at a reasonable speed. People at various research institutions duplicate this effort over and over.
Andrea
Andrea Forte:
> To give you an idea, for me personally, the incentive for a new resource is a need for a server (perhaps a cluster) to support full-text queries at a reasonable speed.
then why not help us do this on the existing toolserver, so everyone can have access to it, instead of duplicating it yet again somewhere else?
there are many toolserver users who would like direct access to text, and the ability to search it.
- river.
Let me know if you have a grant proposal you'd like help with!
Andrea
Andrea Forte:
> Let me know if you have a grant proposal you'd like help with!
well, i'm still not sure what exactly people need. perhaps the various academic people could produce a list of what they want to do on the toolserver and what's missing at the moment? (e.g. fast text access, search, ...)
then we can look at the best way to provide this, including where the money should come from.
- river.