hi,
currently the dump process is a bit broken. what is the Foundation's position on this? why are developer resources allocated to put the server admin log on twitter, but no one has touched dumps in months? is there not enough money to fund this? how much more is needed?
(for an example of a problem, there is *no* successful enwiki dump at all on download.wikimedia.org--the last 5 dumps, going back to 2008-03-12, all failed.)
- river.
On Sun, Feb 22, 2009 at 7:49 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
why are developer resources allocated to put the server admin log on twitter
er... I think that was a personal choice by one of the shell users, I don't think the Foundation said "dudes, we HAVE to put that log on identi.ca/twitter... who cares about anything else, it's microblogging dude!"
2009/2/22 Casey Brown cbrown1023.ml@gmail.com:
On Sun, Feb 22, 2009 at 7:49 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
why are developer resources allocated to put the server admin log on twitter
er... I think that was a personal choice by one of the shell users, I don't think the Foundation said "dudes, we HAVE to put that log on identi.ca/twitter... who cares about anything else, it's microblogging dude!"
It doesn't matter whose decision it was. If it was done on Foundation time (I don't know if it was), then it was a bad priority.
Casey Brown:
On Sun, Feb 22, 2009 at 7:49 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
why are developer resources allocated to put the server admin log on twitter
er... I think that was a personal choice by one of the shell users
the shell user in question is/was a WMF contractor, and i believe it was done on paid time at Brion's request--but anyway, the point is the WMF is clearly past the stage of running around like headless chickens trying to keep the site on-line, so will we see some progress with dumps any time soon?
- river.
I'm not familiar with the details of the data dump process, so I can't comment on whether it's broken or not.
However, one question that I have is whether the dump includes, or should include, all namespaces, or only articles. In the past, there have allegedly been instances in which database dumps have been utilized for purposes such as harvesting oversighted edits in userspace and utilizing the information for purposes of harassment. I am not sure whether there is value to providing dumps of other than the content spaces. Comments?
Newyorkbrad
On Sun, Feb 22, 2009 at 11:02 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
Casey Brown:
On Sun, Feb 22, 2009 at 7:49 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
why are developer resources allocated to put the server admin log on twitter
er... I think that was a personal choice by one of the shell users
the shell user in question is/was a WMF contractor, and i believe it was done on paid time at Brion's request--but anyway, the point is the WMF is clearly past the stage of running around like headless chickens trying to keep the site on-line, so will we see some progress with dumps any time soon?
- river.
2009/2/23 Newyorkbrad (Wikipedia) newyorkbrad@gmail.com:
I'm not familiar with the details of the data dump process, so I can't comment on whether it's broken or not.
It's broken, I don't think there is any dispute there.
However, one question that I have is whether the dump includes, or should include, all namespaces, or only articles. In the past, there have allegedly been instances in which database dumps have been utilized for purposes such as harvesting oversighted edits in userspace and utilizing the information for purposes of harassment. I am not sure whether there is value to providing dumps of other than the content spaces. Comments?
If you want complete statistics, you need all the information. It might be interesting to see how the ratio of non-article-namespace edits to total edits varies over time, for instance.
2009/2/23 Newyorkbrad (Wikipedia) newyorkbrad@gmail.com:
However, one question that I have is whether the dump includes, or should include, all namespaces, or only articles. In the past, there have allegedly been instances in which database dumps have been utilized for purposes such as harvesting oversighted edits in userspace and utilizing the information for purposes of harassment. I am not sure whether there is value to providing dumps of other than the content spaces. Comments?
The value of providing good dumps is forkability, in case WMF is hit by a meteor, hit by a legal meteor, goes collectively insane, etc. Imagine trying to fork Wikipedia without being able to take the project spaces with you.
It's too easy for a nominally "open" project to effectively be proprietised by just not providing the data/code/etc.
(We will gloss over the idea that has occurred to me and several others that a nuke-and-pave of the project spaces might be the only way to fix en:wp's terminal instruction creep.)
See my blog post of a coupla years ago on the subject:
http://davidgerard.co.uk/notes/2007/04/10/disaster-recovery-planning/
- d.
Actually, I was thinking primarily of userspace.
Newyorkbrad
On Mon, Feb 23, 2009 at 2:44 PM, David Gerard dgerard@gmail.com wrote:
2009/2/23 Newyorkbrad (Wikipedia) newyorkbrad@gmail.com:
However, one question that I have is whether the dump includes, or should include, all namespaces, or only articles. In the past, there have allegedly been instances in which database dumps have been utilized for purposes such as harvesting oversighted edits in userspace and utilizing the information for purposes of harassment. I am not sure whether there is value to providing dumps of other than the content spaces. Comments?
The value of providing good dumps is forkability, in case WMF is hit by a meteor, hit by a legal meteor, goes collectively insane, etc. Imagine trying to fork Wikipedia without being able to take the project spaces with you.
It's too easy for a nominally "open" project to effectively be proprietised by just not providing the data/code/etc.
(We will gloss over the idea that has occurred to me and several others that a nuke-and-pave of the project spaces might be the only way to fix en:wp's terminal instruction creep.)
See my blog post of a coupla years ago on the subject:
http://davidgerard.co.uk/notes/2007/04/10/disaster-recovery-planning/
- d.
On Mon, Feb 23, 2009 at 2:44 PM, David Gerard dgerard@gmail.com wrote:
2009/2/23 Newyorkbrad (Wikipedia) newyorkbrad@gmail.com:
However, one question that I have is whether the dump includes, or should include, all namespaces, or only articles. In the past, there have allegedly been instances in which database dumps have been utilized for purposes such as harvesting oversighted edits in userspace and utilizing the information for purposes of harassment. I am not sure whether there is value to providing dumps of other than the content spaces. Comments?
The value of providing good dumps is forkability, in case WMF is hit by a meteor, hit by a legal meteor, goes collectively insane, etc. Imagine trying to fork Wikipedia without being able to take the project spaces with you.
You mean we haven't already gone collectively insane?
To answer Brad's original question: different dumps contain different information. There's the article-only dump that most mirrors, etc. use and there's the larger full-wiki dump. It's the latter that is most prone to failure and tends to kill the overall dump process.
-Chad
2009/2/23 Chad innocentkiller@gmail.com:
On Mon, Feb 23, 2009 at 2:44 PM, David Gerard dgerard@gmail.com wrote:
The value of providing good dumps is forkability, in case WMF is hit by a meteor, hit by a legal meteor, goes collectively insane, etc. Imagine trying to fork Wikipedia without being able to take the project spaces with you.
You mean we haven't already gone collectively insane?
The community is a total write-off, but the WMF seems sane enough to pay the electricity bill on time.
On Mon, Feb 23, 2009 at 2:44 PM, David Gerard dgerard@gmail.com wrote:
It's too easy for a nominally "open" project to effectively be proprietised by just not providing the data/code/etc.
Sure. That's not the issue here, but more people indicating how much they care about good dumps (I do! also full commons dumps, please.) should help.
(We will gloss over the idea that has occurred to me and several others that a nuke-and-pave of the project spaces might be the only way to fix en:wp's terminal instruction creep.)
You often make me smile in the middle of distressing threads.
See my blog post of a coupla years ago on the subject:
http://davidgerard.co.uk/notes/2007/04/10/disaster-recovery-planning/
It would be good to take another look at contingency plans, not only tech backups and alternatives but also ensuring resources go into safety nets such as an endowment.
SJ
A dump with just the article namespace would be grossly incomplete.
Much important information about the validity of the content is on the discussion pages, but not only there: other discussions about articles have been held on user talk pages. Leaving these out would greatly hamper any judgement on the validity of the articles.
teun spaans
On Mon, Feb 23, 2009 at 8:35 PM, Newyorkbrad (Wikipedia) newyorkbrad@gmail.com wrote:
I'm not familiar with the details of the data dump process, so I can't comment on whether it's broken or not.
However, one question that I have is whether the dump includes, or should include, all namespaces, or only articles. In the past, there have allegedly been instances in which database dumps have been utilized for purposes such as harvesting oversighted edits in userspace and utilizing the information for purposes of harassment. I am not sure whether there is value to providing dumps of other than the content spaces. Comments?
Newyorkbrad
2009/2/22 River Tarnell river@loreley.flyingparchment.org.uk:
hi,
currently the dump process is a bit broken. what is the Foundation's position on this? why are developer resources allocated to put the server admin log on twitter, but no one has touched dumps in months? is there not enough money to fund this? how much more is needed?
(for an example of a problem, there is *no* successful enwiki dump at all on download.wikimedia.org--the last 5 dumps, going back to 2008-03-12, all failed.)
According to Brion on wikitech-l nearly a month ago:
"[...] it's a software architecture issue. We'll restart [the enwiki dump] with the new arch when it's ready to go."
So I guess that means it is being worked on, but it doesn't seem to be a high enough priority. This problem has been around for years now; it should have been fixed a long time ago.
Copying the Commons list.
I am interested in hosting (and running some scripts on) copies of the commons media dump on offline regional servers for offline-reading purposes. This is difficult without an image dump.
The last time I looked, I was able to find an image dump from 2007? Now I have a hard time finding that... can someone help me out / point me to a recent torrent?
My 300+G copy from 2007 was deleted in an unfortunate fileserver cleanup, by someone who said "hey, you can always download another copy later." Perhaps I waited too long...
SJ
On Sun, Feb 22, 2009 at 7:49 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
hi,
currently the dump process is a bit broken. what is the Foundation's position on this?
(for an example of a problem, there is *no* successful enwiki dump at all on download.wikimedia.org--the last 5 dumps, going back to 2008-03-12, all failed.)
On 2/23/09 5:31 PM, Samuel Klein wrote:
Copying the Commons list.
I am interested in hosting (and running some scripts on) copies of the commons media dump on offline regional servers for offline-reading purposes. This is difficult without an image dump.
Awesome -- can you work with an rsync3-over-ssh setup? Email me and we'll set it up.
(Reminds me -- Greg, can we make sure we've got yours going again?)
-- brion
It's not at all clear why the english wikipedia dump or other large dumps need to be compressed. It is far more absurd to spend hundreds of days compressing a file than it is to spend tens of days downloading one.
On Tue, Feb 24, 2009 at 10:45 AM, Brion Vibber brion@wikimedia.org wrote:
On 2/23/09 5:31 PM, Samuel Klein wrote:
Copying the Commons list.
I am interested in hosting (and running some scripts on) copies of the commons media dump on offline regional servers for offline-reading purposes. This is difficult without an image dump.
Awesome -- can you work with an rsync3-over-ssh setup? Email me and we'll set it up.
(Reminds me -- Greg, can we make sure we've got yours going again?)
-- brion
On Tue, Feb 24, 2009 at 12:56 PM, Brian Brian.Mingus@colorado.edu wrote:
It's not at all clear why the english wikipedia dump or other large dumps need to be compressed. It is far more absurd to spend hundreds of days compressing a file than it is to spend tens of days downloading one.
There's no reason it needs to take hundreds of days to compress even a petabyte of data. Bzip2 compression can be done in parallel, producing a file which can be decompressed using standard uncompression software.
In any case, there are cost factors to be considered. Depending on the number of people downloading the file, compression might save a significant amount of money.
There also are other, faster ways to make a file smaller besides compression (like delta encoding), which are probably being looked into.
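For illustration, here is a minimal Python sketch of the parallel bzip2 idea Anthony describes (it is not the WMF dump code; tools like pbzip2 already do exactly this, and the file names below are placeholders): each worker compresses one chunk into a complete bzip2 stream, and the concatenated streams can still be read by plain bunzip2.

    import bz2
    from multiprocessing import Pool

    CHUNK = 32 * 1024 * 1024  # 32 MB of uncompressed text per stream

    def compress_chunk(data):
        # each call returns one complete, self-contained bzip2 stream
        return bz2.compress(data)

    def parallel_bzip2(src_path, dst_path, workers=4):
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            with Pool(workers) as pool:
                chunks = iter(lambda: src.read(CHUNK), b"")
                # imap preserves chunk order, so the output is a valid
                # concatenation of bzip2 streams
                for compressed in pool.imap(compress_chunk, chunks):
                    dst.write(compressed)

    if __name__ == "__main__":
        parallel_bzip2("pages-meta-history.xml", "pages-meta-history.xml.bz2")

Standard bunzip2 (and Python's own bz2 module) treat the result as a single file, so nothing special is needed on the reader's side.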
I am of the understanding that the WMF's bandwidth is very cheap.
If you want to consider costs, I think it's appropriate to consider the costs not only to the WMF but to the user. Different compression algorithms have different encode/decode ratios, but if it takes a cluster to compress a file there's a good chance you're going to want one to decompress it. It may in fact be much more user friendly to simply offer an enormous text file for download because users don't have to unpack it.
Our mission is to spread knowledge. Compressing that knowledge has been in the way of spreading it for years now. It's high time we gave up!
On Tue, Feb 24, 2009 at 11:18 AM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 12:56 PM, Brian Brian.Mingus@colorado.edu wrote:
It's not at all clear why the english wikipedia dump or other large dumps need to be compressed. It is far more absurd to spend hundreds of days compressing a file than it is to spend tens of days downloading one.
There's no reason it needs to take hundreds of days to compress even a petabyte of data. Bzip2 compression can be done in parallel, producing a file which can be decompressed using standard uncompression software.
In any case, there are cost factors to be considered. Depending on the number of people downloading the file, compression might save a significant amount of money.
There also are other, faster ways to make a file smaller besides compression (like delta encoding), which are probably being looked into.
On Tue, Feb 24, 2009 at 1:24 PM, Brian Brian.Mingus@colorado.edu wrote:
I am of the understanding that the WMF's bandwidth is very cheap.
Compared to what?
If you want to consider costs, I think it's appropriate to consider the costs not only to the WMF but to the user. Different compression algorithms have different encode/decode ratios, but if it takes a cluster to compress a file there's a good chance you're going to want one to decompress it.
bzip2 decompression speeds on an average CPU almost certainly exceed Internet download speeds.
It may in fact be much more user friendly to simply offer an enormous text file for download because users don't have to unpack it.
I've looked at the numbers and thought about this in detail and I don't think so. What definitely *would* be much more user friendly is to use a compression scheme which allows random access, so that end users don't have to decompress everything all at once in the first place.
The uncompressed full history English Wikipedia dump is reaching (and more likely has already exceeded) the size which will fit on the largest consumer hard drives. So just dealing with such a large file is a problem in itself. And "an enormous text file" is not very useful without an index, so you've gotta import the thing into some sort of database anyway, which, unless you're a database guru, is going to take longer than a simple decompression.
In the long term (and considering how long it's taking just to produce a usable dump, the long term may never come), the most user-friendly dump would already be compressed, indexed, and ready for random access, so a reuser could just download and go (or even download on the fly as needed). It could be done, but I make no bet on whether or not it will be done.
Our mission is to spread knowledge. Compressing that knowledge has been in the way of spreading it for years now. It's high time we gave up!
Clearly something is in the way. I don't think it's the compression, though.
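To make the "compressed, indexed, and ready for random access" idea above concrete, here is a small Python sketch. It assumes the dump is stored as concatenated bzip2 streams (say, one batch of pages per stream, as in the parallel-compression sketch earlier in the thread) and that an index recording each page's stream offset was written at compression time; the index format shown is hypothetical.

    import bz2
    import json

    def read_stream_at(dump_path, offset):
        # decompress only the single bzip2 stream starting at byte `offset`
        dec = bz2.BZ2Decompressor()
        out = bytearray()
        with open(dump_path, "rb") as f:
            f.seek(offset)
            while not dec.eof:
                block = f.read(1024 * 1024)
                if not block:
                    break
                out.extend(dec.decompress(block))
        return bytes(out)  # XML for just the pages in that stream

    def fetch_page_block(dump_path, index_path, title):
        # index: {"Page title": byte offset of its stream, ...} (hypothetical format)
        with open(index_path) as f:
            index = json.load(f)
        return read_stream_at(dump_path, index[title])

A reader then decompresses only the few megabytes around the page it wants instead of the whole archive.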
Anthony wrote:
I've looked at the numbers and thought about this in detail and I don't think so. What definitely *would* be much more user friendly is to use a compression scheme which allows random access, so that end users don't have to decompress everything all at once in the first place.
The uncompressed full history English Wikipedia dump is reaching (and more likely has already exceeded) the size which will fit on the largest consumer hard drives. So just dealing with such a large file is a problem in itself. And "an enormous text file" is not very useful without an index, so you've gotta import the thing into some sort of database anyway, which, unless you're a database guru, is going to take longer than a simple decompression.
In the long term (and considering how long it's taking just to produce a usable dump, the long term may never come), the most user-friendly dump would already be compressed, indexed, and ready for random access, so a reuser could just download and go (or even download on the fly as needed). It could be done, but I make no bet on whether or not it will be done.
I did make indexed, random-access, backwards-compatible XML dumps. http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html
Wouldn't be hard to plug into the dump process (just replace bzip2 in a new DumpPipeOutput), but so far nobody has seemed interested in it.
And there's the added benefit of the offline reader I implemented using those files.
Compression allowing random access is definitely the way to go for large selections.
Ángel, that's an interesting reader you wrote. I cc: a list for offline wikireaders (most designed around mediawiki). A similar idea is in use by schools across Peru[1] to provide offline access to the Spanish Wikipedia, based on wikipedia-iphone code: http://dev.laptop.org/git/projects/wikiserver
It doesn't have the windows/IE dependency but leaves out many of your features like special pages, full template support, and categories.
SJ
[1] the same schools want offline access to images, so a smarter reader that knows to look in turn locally / at a server / online to find images is desired.
On Tue, Feb 24, 2009 at 5:34 PM, Ángel keisial@gmail.com wrote:
Anthony wrote:
I've looked at the numbers and thought about this in detail and I don't think so. What definitely *would* be much more user friendly is to use a compression scheme which allows random access, so that end users don't have to decompress everything all at once in the first place.
I did make indexed, random-access, backwards-compatible XML dumps. http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040812.html
Wouldn't be hard to plug into the dump process (just replace bzip2 in a new DumpPipeOutput), but so far nobody has seemed interested in it.
And there's the added benefit of the offline reader I implemented using those files.
Samuel Klein wrote:
Compression allowing random access is definitely the way to go for large selections.
Ángel, that's an interesting reader you wrote. I cc: a list for offline wikireaders (most designed around mediawiki).
Subscribed.
A similar idea is in use by schools across Peru[1] to provide offline access to the Spanish Wikipedia, based on wikipedia-iphone code: http://dev.laptop.org/git/projects/wikiserver
It doesn't have the windows/IE dependency
The goal was to support both webbrowser and Mozembed but their different link handling and lack of time reduced it. :( Any takers? ;)
but leaves out many of your features like special pages, full template support, and categories.
SJ
[1] the same schools want offline access to images, so a smarter reader that knows to look in turn locally / at a server / online to find images is desired.
I also support images :) As it's running a mediawiki, it just searches for the images in the local folder, and falls back to fetching them from commons using a ForeignAPIRepo.
You can get some samples at http://www.wiki-web.es/mediawiki-offline-reader/ although they're slower than they could be.
On Tue, Feb 24, 2009 at 1:24 PM, Brian Brian.Mingus@colorado.edu wrote:
It may in fact be much more user friendly to simply offer an enormous text file for download because users don't have to unpack it.
Another point, which I forgot to mention. If you have the bandwidth to kill and just want an enormous uncompressed text file, why not just screen-scrape everything?
2009/2/24 Anthony wikimail@inbox.org:
On Tue, Feb 24, 2009 at 1:24 PM, Brian Brian.Mingus@colorado.edu wrote:
It may in fact be much more user friendly to simply offer an enormous text file for download because users don't have to unpack it.
Another point, which I forgot to mention. If you have the bandwidth to kill and just want an enormous uncompressed text file, why not just screen-scrape everything?
Because that involves the servers' CPUs as well as bandwidth.
On Tue, Feb 24, 2009 at 2:16 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/2/24 Anthony wikimail@inbox.org:
On Tue, Feb 24, 2009 at 1:24 PM, Brian Brian.Mingus@colorado.edu wrote:
It may in fact be much more user friendly to simply offer an enormous text file for download because users don't have to unpack it.
Another point, which I forgot to mention. If you have the bandwidth to kill and just want an enormous uncompressed text file, why not just screen-scrape everything?
Because that involves the servers' CPUs as well as bandwidth.
What's the average ratio of CPU-seconds to download seconds for an article? Surely a single machine could handle thousands of simultaneous screen-scrapers doing this 24/7. I don't buy it.
2009/2/24 Anthony wikimail@inbox.org:
What's the average ratio of CPU-seconds to download seconds for an article? Surely a single machine could handle thousands of simultaneous screen-scrapers doing this 24/7. I don't buy it.
I don't know, but people are asked not to crawl the entire site like that, so I guess there is a reason.
Just to be clear, you're suggesting that, in lieu of a compressed dump, people who want the full history of the english wikipedia should use Special:Export to download it, article by article?
It's a truly awful idea.
On Tue, Feb 24, 2009 at 12:06 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 1:24 PM, Brian Brian.Mingus@colorado.edu wrote:
It may in fact be much more user friendly to simply offer an enormous text file for download because users don't have to unpack it.
Another point, which I forgot to mention. If you have the bandwidth to kill and just want an enormous uncompressed text file, why not just screen-scrape everything?
On Tue, Feb 24, 2009 at 2:41 PM, Brian Brian.Mingus@colorado.edu wrote:
Just to be clear, you're suggesting that, in lieu of a compressed dump, people who want the full history of the english wikipedia should use Special:Export to download it, article by article?
No, I wasn't suggesting it as an alternative to compressed dumps at all. And Special:Export doesn't even work for historical versions any more.
Its a truly awful idea.
Providing only uncompressed dumps would be an even more awful idea.
On Tue, Feb 24, 2009 at 2:46 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:41 PM, Brian Brian.Mingus@colorado.edu wrote:
Just to be clear, you're suggesting that, in lieu of a compressed dump, people who want the full history of the english wikipedia should use Special:Export to download it, article by article?
No, I wasn't suggesting it as an alternative to compressed dumps at all. And Special:Export doesn't even work for historical versions any more.
You can still dump full-history for an article. I did a 2,000+ revision dump of an article just a week ago.
-Chad
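For anyone wanting to try what Chad describes, a rough Python sketch of pulling one page's history through Special:Export follows. The parameter names (pages, limit, offset, action=submit) follow the documented Special:Export interface as I remember it, but the behaviour depends on the wiki's export configuration ($wgExportAllowHistory and friends), so treat this as illustrative rather than guaranteed.

    import requests  # assumption: any HTTP library would do

    def export_revisions(title, limit=1000, offset="2001-01-01T00:00:00Z"):
        # fetch up to `limit` revisions of `title`, starting at timestamp `offset`
        resp = requests.get(
            "https://en.wikipedia.org/w/index.php",
            params={
                "title": "Special:Export",
                "action": "submit",
                "pages": title,
                "limit": str(limit),   # any server-side cap applies regardless
                "offset": offset,      # timestamp to continue from on the next call
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.text  # XML in the same schema as the public dumps

    xml = export_revisions("Georgia", limit=100)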
On Tue, Feb 24, 2009 at 2:55 PM, Chad innocentkiller@gmail.com wrote:
On Tue, Feb 24, 2009 at 2:46 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:41 PM, Brian Brian.Mingus@colorado.edu wrote:
Just to be clear, you're suggesting that, in lieu of a compressed dump, people who want the full history of the english wikipedia should use Special:Export to download it, article by article?
No, I wasn't suggesting it as an alternative to compressed dumps at all. And Special:Export doesn't even work for historical versions any more.
You can still dump full-history for an article. I did a 2,000+ revision dump of an article just a week ago.
Yeah, I guess that bug was reintroduced. They had it fixed for a while.
On Tue, Feb 24, 2009 at 3:01 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:55 PM, Chad innocentkiller@gmail.com wrote:
On Tue, Feb 24, 2009 at 2:46 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:41 PM, Brian Brian.Mingus@colorado.edu wrote:
Just to be clear, you're suggesting that, in lieu of a compressed dump, people who want the full history of the english wikipedia should use Special:Export to download it, article by article?
No, I wasn't suggesting it as an alternative to compressed dumps at all. And Special:Export doesn't even work for historical versions any more.
You can still dump full-history for an article. I did a 2,000+ revision dump of an article just a week ago.
Yeah, I guess that bug was reintroduced. They had it fixed for a while.
http://blog.p2pedia.org/2008/09/wgexportallowhistoryfalse.html
On Tue, Feb 24, 2009 at 3:03 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 3:01 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:55 PM, Chad innocentkiller@gmail.com wrote:
On Tue, Feb 24, 2009 at 2:46 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:41 PM, Brian Brian.Mingus@colorado.edu wrote:
Just to be clear, you're suggesting that, in lieu of a compressed dump, people who want the full history of the english wikipedia should use Special:Export to download it, article by article?
No, I wasn't suggesting it as an alternative to compressed dumps at all. And Special:Export doesn't even work for historical versions any more.
You can still dump full-history for an article. I did a 2,000+ revision dump of an article just a week ago.
Yeah, I guess that bug was reintroduced. They had it fixed for a while.
http://blog.p2pedia.org/2008/09/wgexportallowhistoryfalse.html
http://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=1640...
So, yeah, I guess it wasn't fixed for that long. You're not *supposed* to be able to dump more than 1,000 revisions at a time, though :).
On Tue, Feb 24, 2009 at 3:12 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 3:03 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 3:01 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:55 PM, Chad innocentkiller@gmail.com wrote:
On Tue, Feb 24, 2009 at 2:46 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 2:41 PM, Brian Brian.Mingus@colorado.edu wrote:
Just to be clear, you're suggesting that, in lieu of a compressed dump, people who want the full history of the english wikipedia should use Special:Export to download it, article by article?
No, I wasn't suggesting it as an alternative to compressed dumps at all. And Special:Export doesn't even work for historical versions any more.
You can still dump full-history for an article. I did a 2,000+ revision dump of an article just a week ago.
Yeah, I guess that bug was reintroduced. They had it fixed for a while.
http://blog.p2pedia.org/2008/09/wgexportallowhistoryfalse.html
http://wikitech.wikimedia.org/index.php?title=Server_admin_log&diff=1640...
So, yeah, I guess it wasn't fixed for that long. You're not *supposed* to be able to dump more than 1,000 revisions at a time, though :).
Afaik, it's fully open on all the wikis except enwiki (according to the admin log). However, I've yet to find any place in CommonSettings or InitialiseSettings that changes it from the default. It's not so much a bug as it is a configuration choice :)
-Chad
On Tue, Feb 24, 2009 at 3:17 PM, Chad innocentkiller@gmail.com wrote:
Afaik, it's fully open on all the wikis except enwiki (according to the admin log). However, I've yet to find any place in CommonSettings or InitialiseSettings that changes it from the default. It's not so much a bug as it is a configuration choice :)
No, it's definitely a bug, because it sends you all revisions even if you request a limit (say, only 10 of them). In fact, I used to know exactly where the bug was in the code, but I've since forgotten.
On Tue, Feb 24, 2009 at 9:56 AM, Brian Brian.Mingus@colorado.edu wrote:
It's not at all clear why the english wikipedia dump or other large dumps need to be compressed. It is far more absurd to spend hundreds of days compressing a file than it is to spend tens of days downloading one.
Faulty premise. Based on my old-ish hardware and the smaller but still very large ruwiki dump, I'd assume the actual compression of enwiki would take less than a week of processing time. Since my high-end DSL would take multiple weeks to download ~2 TB uncompressed, it is clearly a net time savings to compress it first.
Compression does take substantial time, but my impression is that the hundreds of days comes mostly from communicating with the data store and assembling the XML, and not from compressing the output.
-Robert Rohde
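Robert's point is easy to sanity-check with back-of-the-envelope numbers; all of the figures below are assumptions, only roughly matching the ones mentioned in this thread.

    # rough, made-up but plausible figures
    uncompressed_gb = 2000   # ~2 TB of full-history XML
    ratio = 15               # bzip2-ish compression ratio for wikitext
    compress_mb_s = 10       # single-box bzip2 throughput
    download_mbit_s = 8      # "high end DSL"

    compress_days = uncompressed_gb * 1024 / compress_mb_s / 86400
    raw_days = uncompressed_gb * 1024 * 8 / download_mbit_s / 86400
    packed_days = raw_days / ratio

    print(f"compress once:       {compress_days:.1f} days")   # ~2.4
    print(f"download raw:        {raw_days:.1f} days")        # ~23.7
    print(f"download compressed: {packed_days:.1f} days")     # ~1.6

So even a slow, single-box compressor pays for itself after the first download; the hundreds of days come from elsewhere in the pipeline, as Robert says.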
Excellent -- following up offlist. SJ
On Tue, Feb 24, 2009 at 12:45 PM, Brion Vibber brion@wikimedia.org wrote:
On 2/23/09 5:31 PM, Samuel Klein wrote:
Copying the Commons list.
I am interested in hosting (and running some scripts on) copies of the commons media dump on offline regional servers for offline-reading purposes. This is difficult without an image dump.
Awesome -- can you work with an rsync3-over-ssh setup? Email me and we'll set it up.
(Reminds me -- Greg, can we make sure we've got yours going again?)
-- brion
2009/2/22 River Tarnell river@loreley.flyingparchment.org.uk:
currently the dump process is a bit broken. what is the Foundation's position on this?
Making the full history dump process scale to en.wp, and dumps more reliable in general, is a high-priority project. It was assigned to Ariel, but due to escalating tech support priorities, will need to be re-assigned, probably to Tomasz, after some further internal discussions about a desirable and feasible architecture. Brion is working on a revised project timeline. We made incremental improvements last year, but we'll definitely want to move towards a situation where full public dumps of pretty much everything can be expected at least on a monthly basis.
On 2/22/09 4:49 AM, River Tarnell wrote:
currently the dump process is a bit broken. what is the Foundation's position on this?
Well, you could have just asked me. :)
http://leuksman.com/log/2009/02/24/wikimedia-data-dump-update/
-- brion
Why not make the uncompressed dump available as an Amazon Public Dataset? http://aws.amazon.com/publicdatasets/
You can already find DBPedia and FreeBase there. It's true that the uncompressed dump won't fit on a commercial drive (the largest is a 4-platter 500GB = 2TB drive). Cloud computing seems to be the most economically feasible alternative for all parties involved.
"Typically the data sets in the repository are between 1 GB to 1 TB in size (based on the Amazon EBS volume limit), but we can work with you to host larger data sets as well. You must have the right to make the data freely available."
Clearly a boon!
On Sun, Feb 22, 2009 at 5:49 AM, River Tarnell river@loreley.flyingparchment.org.uk wrote:
hi,
currently the dump process is a bit broken. what is the Foundation's position on this? why are developer resources allocated to put the server admin log on twitter, but no one has touched dumps in months? is there not enough money to fund this? how much more is needed?
(for an example of a problem, there is *no* successful enwiki dump at all on download.wikimedia.org--the last 5 dumps, going back to 2008-03-12, all failed.)
- river.
Brian wrote:
Why not make the uncompressed dump available as an Amazon Public Dataset? http://aws.amazon.com/publicdatasets/
You can already find DBPedia and FreeBase there. It's true that the uncompressed dump won't fit on a commercial drive (the largest is a 4-platter 500GB = 2TB drive). Cloud computing seems to be the most economically feasible alternative for all parties involved.
It depends on the parties--- for me as a user, it's more economically feasible to download the dataset locally and run scripts on my own machine, than to pay for EC2 compute time to run those scripts. But I have free unlimited university bandwidth.
It does seem like there might be some mutual benefits to having a copy at Amazon, for those who do prefer it. Since it would become easy to analyze a full database dump from an Amazon EC2 compute instance, due to it being already available on the filesystem, a number of people might use EC2 to run their analysis scripts. From that perspective, maybe Amazon might be persuaded to help out? Maybe they could donate some money, equipment, or developer time to reengineer the dump process, in return for one part of the reengineering being the addition of a routine sync to their service?
-Mark
On Tue, Feb 24, 2009 at 11:26 PM, Brian Brian.Mingus@colorado.edu wrote:
Why not make the uncompressed dump available as an Amazon Public Dataset? http://aws.amazon.com/publicdatasets/
Which uncompressed dump? The full history English Wikipedia dump doesn't exist, and there doesn't seem to be any demand for this anyway.
You can already find DBPedia and FreeBase there. It's true that the uncompressed dump won't fit on a commercial drive (the largest is a 4-platter 500GB = 2TB drive). Cloud computing seems to be the most economically feasible alternative for all parties involved.
"Cloud computing" might be a good alternative for some reusers, but if so it'd be most economical to just host the cloud at the source, i.e. API/live feed access open to everyone (for free or for a cost). For certain uses there's the toolserver, but access to it is handed out with special permission. For small amounts of traffic there's an API, and there's the live feed which seems to be limited to major corporations with special permission.
The WMF hasn't put any real resources into this for the small time commercial user (big players have the live feed and non-commercial users can probably get toolserver access). Of course, there isn't all that much demand either. If there was, a third party would have set it up by now (I'd personally be willing to set up a pay-for-access toolserver and custom dump service if I could get a commitment from one or more people for a couple hundred a month in funding).
"Typically the data sets in the repository are between 1 GB to 1 TB in
size (based on the Amazon EBS volume limit), but we can work with you to host larger data sets as well. You must have the right to make the data freely available."
Yeah, if there was any demand for this, nothing's stopping someone from setting it up on their own.
What has led you to believe there is no demand for a full dump of the english wikipedia?
On Wed, Feb 25, 2009 at 9:26 AM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 24, 2009 at 11:26 PM, Brian Brian.Mingus@colorado.edu wrote:
Why not make the uncompressed dump available as an Amazon Public Dataset? http://aws.amazon.com/publicdatasets/
Which uncompressed dump? The full history English Wikipedia dump doesn't exist, and there doesn't seem to be any demand for this anyway.
You can already find DBPedia and FreeBase there. It's true that the uncompressed dump won't fit on a commercial drive (the largest is a 4-platter 500GB = 2TB drive). Cloud computing seems to be the most economically feasible alternative for all parties involved.
"Cloud computing" might be a good alternative for some reusers, but if so it'd be most economical to just host the cloud at the source, i.e. API/live feed access open to everyone (for free or for a cost). For certain uses there's the toolserver, but access to it is handed out with special permission. For small amounts of traffic there's an API, and there's the live feed which seems to be limited to major corporations with special permission.
The WMF hasn't put any real resources into this for the small time commercial user (big players have the live feed and non-commercial users can probably get toolserver access). Of course, there isn't all that much demand either. If there was, a third party would have set it up by now (I'd personally be willing to set up a pay-for-access toolserver and custom dump service if I could get a commitment from one or more people for a couple hundred a month in funding).
"Typically the data sets in the repository are between 1 GB to 1 TB in
size (based on the Amazon EBS volume limit), but we can work with you to host larger data sets as well. You must have the right to make the data freely available."
Yeah, if there was any demand for this, nothing's stopping someone from setting it up on their own. _______________________________________________ foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
2009/2/25 Brian Brian.Mingus@colorado.edu:
What has led you to believe there is no demand for a full dump of the english wikipedia?
He didn't say there was no demand, he said there was no demand for having it on Amazon.
Ahh ok. Anyone who wants to do processing on the full history (and there are a lot of these people who exist!) by definition *has* to be willing to throw some money at it. It simply doesn't fit on commercial drives. In fact, it would hardly fit on either of the two raid clusters I have access to. Making it available on Amazon means that, for a fair market rate, you don't have to download or uncompress the data. You can just start your data crunching. I can only speak for academics but there is generally funding available for Amazon EC2 etc... for specific projects. Professors are even known to pay for a fixed amount of processing for ambitious student projects, and these kinds of earmarks are easily fit into grants.
The claim that there is no demand for having it on Amazon is some kind of fallacy that I don't know the name for. It's never been available on Amazon; how could there be demand? Heck, it hasn't been available for several years in the first place, so how could there be a demand for it? People *just want the data*. Many people would be willing to pay a fee. Thus, for an extremely reasonable price they can now create a new Amazon disk image and download it to their own raid cluster if they want. The foundation doesn't have to foot the bill. Or they can find funding for their specific project, or whatever.
I have a rare copy of the last available full text dump. Perhaps I should initiate the process myself.
On Wed, Feb 25, 2009 at 2:20 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/2/25 Brian Brian.Mingus@colorado.edu:
What has led you to believe there is no demand for a full dump of the english wikipedia?
He didn't say there was no demand, he said there was no demand for having it on Amazon.
2009/2/25 Brian Brian.Mingus@colorado.edu:
Ahh ok. Anyone who wants to do processing on the full history (and there are a lot of these people who exist!) by definition *has* to be willing to throw some money at it. It simply doesn't fit on commercial drives. In fact, it would hardly fit on either of the two raid clusters I have access to. Making it available on Amazon means that, for a fair market rate, you don't have to download or uncompress the data. You can just start your data crunching. I can only speak for academics but there is generally funding available for Amazon EC2 etc... for specific projects. Professors are even known to pay for a fixed amount of processing for ambitious student projects, and these kinds of earmarks are easily fit into grants.
Academics usually have access to the necessary computers (or clusters thereof) to do such processing directly. I think Amazon hosting of dumps would appeal mainly to non-academics who only have access to home PCs.
One of the academics I am speaking of wrote the textbook on natural language processing. He has a 3TB raid cluster. Of course, for about a thousand dollars you can create a bigger raid cluster than that using the new 2TB drives, but funding comes and goes. Our 26-node cluster has 26 20GB drives in a glusterfs configuration (disk space isn't key to us, so we skimped). So I'm not sure what you mean by "usually have access." They have to pay for this access, or negotiate for it, or receive grant money specifically for it. Most academics *do not* have what you are describing. This is an exceptionally large dataset.
On Wed, Feb 25, 2009 at 4:45 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/2/25 Brian Brian.Mingus@colorado.edu:
Ahh ok. Anyone who wants to do processing on the full history (and there are a lot of these people who exist!) by definition *has* to be willing to throw some money at it. It simply doesn't fit on commercial drives. In fact, it would hardly fit on either of the two raid clusters I have access to. Making it available on Amazon means that, for a fair market rate, you don't have to download or uncompress the data. You can just start your data crunching. I can only speak for academics but there is generally funding available for Amazon EC2 etc... for specific projects. Professors are even known to pay for a fixed amount of processing for ambitious student projects, and these kinds of earmarks are easily fit into grants.
Academics usually have access to the necessary computers (or clusters thereof) to do such processing directly. I think Amazon hosting of dumps would appeal mainly to non-academics who only have access to home PCs.
On Wed, Feb 25, 2009 at 15:48, Brian Brian.Mingus@colorado.edu wrote:
One of the academics I am speaking of wrote the textbook on natural language processing. He has a 3TB raid cluster. Of course, for about a thousand dollars you can create a bigger raid cluster than that using the new 2TB drives, but funding comes and goes.
Using 1TB hard drives and a bit of creativity, you can build a 9TB storage server for that $1000. Disk space is getting cheaper all the time, and it's one of those cases where you can save a small fortune by building the computer yourself.
--- On Thu, 26/2/09, Brian Brian.Mingus@colorado.edu wrote:
From: Brian Brian.Mingus@colorado.edu Subject: Re: [Foundation-l] dumps To: "Wikimedia Foundation Mailing List" foundation-l@lists.wikimedia.org Date: Thursday, 26 February 2009 12:33 Ahh ok. Anyone who wants to do processing on the full history (and there are a lot of these people who exist!) by definition *has* to be willing to throw some money at it. It simply doesn't fit on commercial drives.
Not necessarily. For instance, WikiXRay is capable of parsing the dump file on the fly, so you don't need to uncompress the whole file if you don't want to, and the result typically fits in a 6-8 GB DB (depending on the amount of data you recover), which fits perfectly in commodity hardware.
On the other hand, I completely agree with you that working with the huge XML file requires specific hardware (we bought a couple of servers for that).
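This is not WikiXRay itself, just a minimal Python sketch of the same "parse on the fly" idea, assuming a bzip2-compressed dump (the file name is a placeholder): stream the XML out of the decompressor and keep only small per-revision metadata, so the full text never has to be uncompressed to disk.

    import bz2
    import xml.etree.ElementTree as ET

    def revision_metadata(dump_path):
        # yield (timestamp, contributor) pairs while streaming the dump
        with bz2.open(dump_path, "rb") as stream:
            context = ET.iterparse(stream, events=("start", "end"))
            _, root = next(context)              # grab the root element
            ts = user = None
            for event, elem in context:
                if event != "end":
                    continue
                tag = elem.tag.rsplit("}", 1)[-1]  # ignore the schema namespace
                if tag == "timestamp":
                    ts = elem.text
                elif tag == "username":
                    user = elem.text
                elif tag == "revision":
                    yield ts, user
                    ts = user = None
                elif tag == "page":
                    root.clear()                 # keep memory flat

    # e.g. tally edits per year without ever materialising the XML:
    # for ts, user in revision_metadata("eswiki-pages-meta-history.xml.bz2"): ...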
People *just want the data*. Many people would be willing to pay a fee.
Probably, but anyway, I would like to avoid paying a fee to access what should be publicly available (at least, until the dump process broke, it was).
Some universities (including ourselves) have offered storage capacity and some bandwidth to distribute mirrors and improve the dump availability, at no cost at all :).
I have a rare copy of the last available full text dump. Perhaps I should initiate the process myself.
Nothing prevents you from doing that (I think), and it could be a stimulus for thinking about subsequent solutions.
Best,
F.
On Wed, Feb 25, 2009 at 2:20 PM, Thomas Dalton thomas.dalton@gmail.com wrote:
2009/2/25 Brian Brian.Mingus@colorado.edu:
What has led you to believe there is no demand for a full dump of the english wikipedia?
He didn't say there was no demand, he said there was no demand for having it on Amazon.
I went ahead and submitted it to Amazon. I'll leave the file up for a week or so if anyone else wants it (18GB):
http://mist.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
Just to emphasize my point - I have never been able to unpack this file. I've got no place to put it!!
On Wed, Feb 25, 2009 at 4:47 PM, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
I have a rare copy of the last available full text dump. Perhaps I should initiate the process myself.
Nothing prevents you from doing that (I think), and it could be a stimulus for thinking about subsequent solutions.
Best,
F.
I can't even ping the host. Typo? -Chad
On Wed, Feb 25, 2009 at 7:24 PM, Brian Brian.Mingus@colorado.edu wrote:
I went ahead and submitted it to Amazon. I'll leave the file up for a week or so if anyone else wants it (18GB):
http://mist.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
Just to emphasize my point - I have never been able to unpack this file. I've got no place to put it!!
On Wed, Feb 25, 2009 at 4:47 PM, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
I have a rare copy of the last available full text dump. Perhaps I should initiate the process myself.
Nothing prevents you from doing that (I think), and it could be a stimulus for thinking about subsequent solutions.
Best,
F.
Yep a typo, here is the right link:
http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
On Wed, Feb 25, 2009 at 5:35 PM, Chad innocentkiller@gmail.com wrote:
I can't even ping the host. Typo? -Chad
On Wed, Feb 25, 2009 at 7:24 PM, Brian Brian.Mingus@colorado.edu wrote:
I went ahead and submitted it to Amazon. I'll leave the file up for a week or so if anyone else wants it (18GB):
http://mist.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
Just to emphasize my point - I have never been able to unpack this file. I've got no place to put it!!
On Wed, Feb 25, 2009 at 4:47 PM, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
I have a rare copy of the last available full text dump. Perhaps I should initiate the process myself.
Nothing prevents you from doing that (I think), and it could be a stimulus for thinking about subsequent solutions.
Best,
F.
Just got a copy, thanks :)
-Chad
On Wed, Feb 25, 2009 at 9:29 PM, Brian Brian.Mingus@colorado.edu wrote:
Yep a typo, here is the right link:
http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
On Wed, Feb 25, 2009 at 5:35 PM, Chad innocentkiller@gmail.com wrote:
I can't even ping the host. Typo? -Chad
On Wed, Feb 25, 2009 at 7:24 PM, Brian Brian.Mingus@colorado.edu wrote:
I went ahead and submitted it to Amazon. I'll leave the file up for a week or so if anyone else wants it (18GB):
http://mist.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
Just to emphasize my point - I have never been able to unpack this file. I've got no place to put it!!
On Wed, Feb 25, 2009 at 4:47 PM, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
I have a rare copy of the last available full text dump. Perhaps I should initiate the process myself.
Nothing prevents you from doing that (I think), and it could be a stimulus for thinking about subsequent solutions.
Best,
F.
Brian wrote:
Yep a typo, here is the right link:
http://grey.colorado.edu/enwiki-20080103-pages-meta-history.xml.7z
People downloading it would like to verify they downloaded it correctly. md5(enwiki-20080103-pages-meta-history.xml.7z) is 20a201afc05a4e5f2f6c3b9b7afa225c
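For anyone grabbing the file, a quick way to check it against that md5 without holding 18 GB in memory (standard library only; the file name and checksum are the ones given above):

    import hashlib

    def md5sum(path, chunk=8 * 1024 * 1024):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    assert md5sum("enwiki-20080103-pages-meta-history.xml.7z") == \
        "20a201afc05a4e5f2f6c3b9b7afa225c"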
Brian wrote:
Ahh ok. Anyone who wants to do processing on the full history (and there are a lot of these people who exist!) by definition *has* to be willing to throw some money at it. It simply doesn't fit on commercial drives.
I've personally never found much of a compelling reason to actually uncompress the dump, rather than working on the stream as it's being decompressed. 7zip decompression is pretty fast, and can use multiple cores on multi-core machines, so it never seems to be a bottleneck, for me at least--- I get somewhere around 30-40 MB/s typically. From what I can tell, the top-end EC2 instances do perform rather better than that, topping out at around 200 MB/s for sequential reads. But I don't personally run anything that can't run 5x slower in return for being free, and I suspect lots of analysis is of that "just let it run for a week, who cares" variety.
I'm not going to argue that nobody could benefit from using EC2 to do their analysis instead, but it's hardly the case that it's impossible to do full-history analysis on commodity hardware.
-Mark
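A sketch of the "work on the stream" approach Mark describes, assuming the 7z command-line tool is installed; its -so switch writes the extracted data to stdout, so nothing uncompressed ever lands on disk. The revision-counting example is just an illustration.

    import subprocess

    def open_7z_stream(path):
        proc = subprocess.Popen(
            ["7z", "e", "-so", path],        # extract to stdout
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
        return proc.stdout                   # readable byte stream of the XML

    # example: count <revision> tags without storing the XML anywhere
    stream = open_7z_stream("enwiki-20080103-pages-meta-history.xml.7z")
    revisions = sum(line.count(b"<revision>") for line in stream)
    print(revisions)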
--- On Wed, 25/2/09, Anthony wikimail@inbox.org wrote:
From: Anthony wikimail@inbox.org Subject: Re: [Foundation-l] dumps To: "Wikimedia Foundation Mailing List" foundation-l@lists.wikimedia.org Date: Wednesday, 25 February 2009 5:26 On Tue, Feb 24, 2009 at 11:26 PM, Brian Brian.Mingus@colorado.edu wrote:
Which uncompressed dump? The full history English Wikipedia dump doesn't exist, and there doesn't seem to be any demand for this anyway.
Mmmm, sorry but then, I'm afraid that you missed some messages over the past year and a half on Wikitech-l, eagerly asking for the whole version of the English dump.
Just to give a straightforward application: people analyzing Wikipedia from a quantitative point of view need the whole dump file, no matter what they want to examine. And believe it or not, the number of scholars (in different disciplines) focusing on this topic is growing steadily (actually, there could be many more of us if we had a stable process, updated with reasonable frequency ;) ).
It's also really difficult for people like me to advocate in favor of this line of research when we have such problems, though we have found ways to live with these limitations so far (better something than nothing).
Best,
F.
On Wed, Feb 25, 2009 at 6:32 PM, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
Mmmm, sorry but then, I'm afraid that you missed some messages over the past year and a half on Wikitech-l, eagerly asking for the whole version of the English dump.
By demand I meant something a bit more than making demands. I meant something more along the lines of being willing to pay for it.
Anyway, I meant demand for the Amazon service when I said "there doesn't seem to be any demand". And no, there's no fallacy to the notion that demand (a willingness to pay for something) can exist before supply.
I firmly believe that if the people who wanted the full history English Wikipedia dump got together and set up a server to host it, it would be no problem for them to get it. If nothing else, if the foundation absolutely refused to help, we could start with the latest valid dump and download the rest article by article. And keeping it up to date, once you have it, is an even easier task.
Hoi, People have publicly said that they were willing to provide hardware or developers to work on this. The thing is that such offers have not been accepted. Thanks, GerardM
2009/2/26 Anthony wikimail@inbox.org
On Wed, Feb 25, 2009 at 6:32 PM, Felipe Ortega glimmer_phoenix@yahoo.es wrote:
Mmmm, sorry but then, I'm afraid that you missed some messages over the past year and a half on Wikitech-l, eagerly asking for the whole version of the English dump.
By demand I meant something a bit more than making demands. I meant something more along the lines of being willing to pay for it.
Anyway, I meant demand for the Amazon service when I said "there doesn't seem to be any demand". And no, there's no fallacy to the notion that demand (a willingness to pay for something) can exist before supply.
I firmly believe that if the people who wanted the full history English Wikipedia dump got together and set up a server to host it, it would be no problem for them to get it. If nothing else, if the foundation absolutely refused to help, we could start with the latest valid dump and download the rest article by article. And keeping it up to date, once you have it, is an even easier task.
On Thu, Feb 26, 2009 at 9:31 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
People have publicly said that they were willing to provide hardware or developers to work on this. The thing is that such offers have not been accepted.
Accepted by whom? Co-lo a box on the Internet, and ask the Foundation for permission to create the dump. A single thread downloading articles to a single server isn't going to impact the project. It probably wouldn't even be *noticed*.
On Thu, Feb 26, 2009 at 9:49 AM, Anthony wikimail@inbox.org wrote:
Accepted by whom? Co-lo a box on the Internet, and ask the Foundation for permission to create the dump. A single thread downloading articles to a single server isn't going to impact the project. It probably wouldn't even be *noticed*.
It would also take even longer than the Wikimedia dumps. Say 500ms average to request a single revision. Multiply by 250,000,000 revisions. I get a figure of four years. You're going to need a lot more than a single thread to get a remotely recent dump. You probably couldn't even keep up with the rate of new revision creation with a single thread blocking on each HTTP request.
On Thu, Feb 26, 2009 at 10:53 AM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Thu, Feb 26, 2009 at 9:49 AM, Anthony wikimail@inbox.org wrote:
Accepted by whom? Co-lo a box on the Internet, and ask the Foundation for permission to create the dump. A single thread downloading articles to a single server isn't going to impact the project. It probably wouldn't even be *noticed*.
It would also take even longer than the Wikimedia dumps.
What's your estimate of how long it's going to take to get the next full history English Wikipedia dump?
Say 500ms average to request a single revision.
Why say that? 500ms is a long time.
Besides, through the API, you can get multiple revisions at once.
Multiply by 250,000,000 revisions.
You'd only need to get revisions since the last successful dump - maybe 150,000,000.
I get a figure of four years. You're going to need a lot more than a single thread to get a remotely recent dump. You probably couldn't even keep up with the rate of new revision creation with a single thread blocking on each HTTP request.
I just got 50 revisions of [[Georgia]] in 6.389 seconds using the API and my slow internet connection. Even at that rate all the revisions since the last dump could be downloaded in seven months, which is much less than the time since the last successful full history dump. Ongoing it'd take 7 hours to download a new day's edits, but more realistically a live feed could be set up at that point.
And this is a worst case scenario. It assumes the WMF doesn't help *at all* aside from allowing a single thread to access its servers.
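A sketch of the batched API fetching Anthony timed, with maxlag and a polite delay so a single thread stays friendly to the servers; continuation parameter names have changed over the years, so treat the details as illustrative rather than exact.

    import time
    import requests  # assumption: any HTTP client would do

    API = "https://en.wikipedia.org/w/api.php"

    def iter_revisions(title, batch=50):
        params = {
            "action": "query", "format": "json",
            "prop": "revisions", "titles": title,
            "rvlimit": batch, "rvdir": "newer",
            "rvprop": "ids|timestamp|user|content",
            "maxlag": 5,                      # back off when the servers are busy
        }
        while True:
            data = requests.get(API, params=params, timeout=60).json()
            if "error" in data:               # e.g. a maxlag backoff response
                time.sleep(5)
                continue
            page = next(iter(data["query"]["pages"].values()))
            for rev in page.get("revisions", []):
                yield rev
            if "continue" not in data:
                break
            params.update(data["continue"])   # e.g. rvcontinue=...
            time.sleep(1)                     # one request per second

    # revs = list(iter_revisions("Georgia"))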
On Thu, Feb 26, 2009 at 12:35 PM, Anthony wikimail@inbox.org wrote:
What's your estimate of how long it's going to take to get the next full history English Wikipedia dump?
I would guess it gets fixed in less than a year, with new dumps every few weeks after that. If it doesn't happen by then, given the moderate priority assigned to it and the number of people Wikimedia now employs for server stuff, I'd be pretty surprised. Although you never know.