Dear all,
First it was Encarta, then printed Britannica. Tomorrow, Knol.[1][2] It is not a good moment for Wikipedia "rivals".
We at Archive Team are attempting to download all the 700,000 Knols.[3] For the sake of history. Join us, #archiveteam EFNET.
Regards, emijrp
[1] http://knol.google.com/k [2] http://news.bbc.co.uk/2/hi/technology/7144970.stm [3] http://db.tt/GNrEh61y
I did some followup. I'm not sure I can help out with Knol anymore, but I discovered that AT is having some trouble making good archives of wikimedia sites.
Theoretically, wikipedia et al SHOULD be easy to reconstitute, right? That's why we're using CC licenses and all. Otherwise, if we drop the ball, WP will be gone. This seems like a priority to me!
The main problem seems to be obtaining commons images: http://archiveteam.org/index.php?title=Wikiteam
So at the very least, we don't appear to have very good documentation. Who could best help Archive Team out? Has anyone done/written documentation on completely restoring 1 or more wikimedia wikis from 'public backup' [1]?
What can we do to help them?
sincerely, Kim Bruning
[1] "Real Men don't make backups. They upload it via ftp and let the world mirror it." - Linus Torvalds
I know from experience that a wiki can be rebuilt from any one of the dumps that are provided; pages-meta-current, for example, contains everything needed to reboot a site except its user database (names/passwords etc.). See http://www.mediawiki.org/wiki/Manual:Moving_a_wiki
On Wed, May 16, 2012 at 11:11:04PM -0400, John wrote:
I know from experience that a wiki can be rebuilt from any one of the dumps that are provided ...
Sure. Does this include all images, including Commons images, converted where needed to operate locally?
I'm thinking about full snapshot-and-later-restore, say 25 or 50 years from now, or in an academic setting, (or FSM-forbid in a worst case scenario <knock on wood>). That's what the AT folks are most interested in.
==Fire Drill==
Has anyone recently set up a full-external-duplicate of (for instance) en.wp? This includes all images, all discussions, all page history (excepting the user accounts and deleted pages).
This would be a useful and important exercise; possibly to be repeated once per year.
I get a sneaky feeling that the first few iterations won't go so well.
I'm sure AT would be glad to help out with the running of these fire drills, as it seems to be in line with their mission.
sincerely, Kim Bruning
Except for files, getting a content clone up is relatively easy and can be done fairly quickly (i.e., less than two weeks for everything). I know there is talk about getting an rsync setup for images.
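For concreteness, mirroring the local uploads of a single small wiki can already be done through the API; the rsync setup would be for the Commons-scale case. A rough sketch, where the target URL, batch size, and output directory are placeholders rather than a recommended setup:

# Rough sketch only: mirror the original files of ONE small wiki via its
# API (list=allimages). Not suitable for Commons-scale mirroring.
import json
import os
import urllib.parse
import urllib.request

API = "https://simple.wikipedia.org/w/api.php"   # illustrative target wiki
OUT = "images"
os.makedirs(OUT, exist_ok=True)

params = {"action": "query", "list": "allimages", "aiprop": "url",
          "ailimit": "50", "format": "json"}
while True:
    url = API + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers={"User-Agent": "mirror-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for img in data["query"]["allimages"]:
        target = os.path.join(OUT, img["name"].replace("/", "_"))
        urllib.request.urlretrieve(img["url"], target)
    if "continue" not in data:
        break
    params.update(data["continue"])   # API-supplied continuation parameters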
On Thu, May 17, 2012 at 12:03:02AM -0400, John wrote:
Except for files, getting a content clone up is relatively easy and can be done fairly quickly (i.e., less than two weeks for everything). I know there is talk about getting an rsync setup for images.
Ouch, 2 weeks. We need the images to be replicable too though. <scratches head>
sincerely, Kim Bruning
That two-week estimate was a worst-case scenario. In the best case we are talking about as little as a few hours for the smaller wikis, up to five days or so for a project the size of enwiki. (See http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for progress on image dumps.)
Take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for exactly how to import an existing dump. I know the process of re-importing a cluster for the Toolserver normally takes just a few days once they have the needed dumps.
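For reference, the procedure in that manual page boils down to streaming the XML into importDump.php and then running the usual rebuild scripts. A minimal sketch, assuming a working MediaWiki install; the install path and dump filename are placeholders:

# Sketch: feed a compressed XML dump into MediaWiki's importDump.php,
# then run the follow-up maintenance scripts the manual recommends.
import bz2
import shutil
import subprocess

MEDIAWIKI = "/var/www/wiki"                              # placeholder path
DUMP = "simplewiki-latest-pages-meta-history.xml.bz2"    # placeholder dump

# importDump.php reads the XML from stdin; decompress on the fly.
importer = subprocess.Popen(
    ["php", f"{MEDIAWIKI}/maintenance/importDump.php"],
    stdin=subprocess.PIPE,
)
with bz2.open(DUMP, "rb") as dump:
    shutil.copyfileobj(dump, importer.stdin)
importer.stdin.close()
importer.wait()

# Rebuild derived data after the bulk import.
for script in ("rebuildrecentchanges.php", "initSiteStats.php"):
    subprocess.run(["php", f"{MEDIAWIKI}/maintenance/{script}"], check=True)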
On Thu, May 17, 2012 at 12:18 AM, John phoenixoverride@gmail.com wrote:
Take a look at http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps for exactly how to import an existing dump ...
Toolserver doesn't have full history, does it?
The Toolserver is a clone of the WMF servers minus files; they run database replication of all wikis. These times depend on the available hardware and may vary, but they should provide a decent estimate.
I'll run a quick benchmark, import the full history of simple.wikipedia to my laptop wiki-on-a-stick, and give an exact duration.
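For what it's worth, the raw throughput side of such a benchmark can be measured without a wiki at all, just by streaming the dump; this times only the parsing, not the MySQL inserts, so it gives an upper bound. A rough sketch, with the filename as a placeholder:

# Sketch: count pages/revisions in a full-history dump and report a rate.
# Parsing only; actual database insertion will be slower than this.
import bz2
import time
import xml.etree.ElementTree as ET

DUMP = "simplewiki-latest-pages-meta-history.xml.bz2"

pages = revisions = 0
start = time.time()
with bz2.open(DUMP, "rb") as stream:
    for _event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the export schema namespace
        if tag == "revision":
            revisions += 1
        elif tag == "page":
            pages += 1
            elem.clear()                    # keep memory use bounded
elapsed = time.time() - start
print(f"{pages} pages, {revisions} revisions in {elapsed:.0f}s "
      f"({revisions / max(elapsed, 1):.0f} rev/s)")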
On Thu, May 17, 2012 at 12:30 AM, John phoenixoverride@gmail.com wrote:
I'll run a quick benchmark, import the full history of simple.wikipedia to my laptop wiki-on-a-stick, and give an exact duration.
Simple.wikipedia is nothing like en.wikipedia. For one thing, there's no need to turn on $wgCompressRevisions with simple.wikipedia.
Is $wgCompressRevisions still used? I haven't followed this in quite a while.
*Simple.wikipedia is nothing like en.wikipedia*: I'd dispute that statement. All WMF wikis are set up basically the same (an odd extension here or there differs, and namespace names vary at times), but for the purpose of recovery simplewiki_p is a very standard example. This issue isn't just about enwiki_p but about *all* WMF wikis. Doing a data recovery for enwiki vs. simplewiki is just a matter of time: for enwiki a 5-day estimate would be fairly standard (depending on server setup), with lower times for smaller databases. Typically you can express it as a rate of X revisions processed per Y time unit, regardless of the project, and that rate should be similar for everything given the same hardware setup.
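The extrapolation being described here is plain linear arithmetic; the figures below are hypothetical placeholders, not measurements of any actual import:

# The linear-rate estimate under discussion, with made-up numbers.
def import_days(revisions, revs_per_second):
    return revisions / revs_per_second / 86400

# e.g. 100 million revisions at a constant 250 revisions/second:
print(f"{import_days(100_000_000, 250):.1f} days")   # ~4.6 days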
On Thu, May 17, 2012 at 12:45 AM, John phoenixoverride@gmail.com wrote:
*Simple.wikipedia is nothing like en.wikipedia*: I'd dispute that statement. All WMF wikis are set up basically the same ... for enwiki a 5-day estimate would be fairly standard (depending on server setup) ...
Are you compressing old revisions, or not? Does the WMF database compress old revisions, or not?
In any case, I'm sorry, a 20 gig mysql database does not scale linearly to a 20 terabyte mysql database.
Anthony, the process is linear: you have PHP inserting X number of rows per Y time frame. Yes, rebuilding the externallinks, links, and langlinks tables will take some additional time and won't scale. However, I have been working with the Toolserver since 2007 and I've lost count of the number of times the TS has needed to re-import a cluster (s1-s7), and even enwiki can be done in a semi-reasonable timeframe. The WMF actually compresses all text blobs, not just old versions.

A complete download and decompression of simple took only 20 minutes on my two-year-old consumer-grade laptop with a standard home cable internet connection; the same download on the Toolserver (minus decompression) was 88s. Yes, importing will take a little longer, but it shouldn't be that big of a deal. There will also be some needed cleanup tasks. However, the main issue, archiving and restoring WMF wikis, isn't really an issue; with moderately recent hardware it is no big deal. I'm putting my money where my mouth is and getting actual, valid stats and figures. Yes, it may not be an exactly 1:1 ratio when scaling up, but given the basics of how importing a dump functions it should remain close to the same ratio.
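For anyone following along: in a MediaWiki database each text blob carries an old_flags field that says how it was stored. A rough sketch of the decode step, following the flag values described in Manual:Text_table; the 'object' and 'external' cases, which WMF production relies on, are deliberately left as stubs:

# Sketch: decode one text-table row according to its old_flags.
import zlib

def decode_text(old_text: bytes, old_flags: str) -> str:
    flags = {f.strip() for f in old_flags.split(",") if f.strip()}
    if "external" in flags:
        raise NotImplementedError("blob lives in External Storage")
    if "object" in flags:
        raise NotImplementedError("serialized PHP HistoryBlob object")
    data = old_text
    if "gzip" in flags:
        # MediaWiki's 'gzip' flag denotes a raw deflate stream (gzdeflate).
        data = zlib.decompress(data, -zlib.MAX_WBITS)
    # No 'utf-8' flag means a legacy encoding; latin-1 is just a safe fallback.
    return data.decode("utf-8" if "utf-8" in flags else "latin-1")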
On Thu, May 17, 2012 at 1:22 AM, John phoenixoverride@gmail.com wrote:
Anthony, the process is linear: you have PHP inserting X number of rows per Y time frame.
Amazing. I need to switch all my databases to MySQL. It can insert X rows per Y time frame, regardless of whether the database is 20 gigabytes or 20 terabytes in size, regardless of whether the average row is 3K or 1.5K, regardless of whether I'm using a thumb drive or a RAID array or a cluster of servers, etc.
Yes, rebuilding the externallinks, links, and langlinks tables will take some additional time and won't scale.
And this is part of the process too, right?
However, I have been working with the Toolserver since 2007 and I've lost count of the number of times the TS has needed to re-import a cluster (s1-s7), and even enwiki can be done in a semi-reasonable timeframe.
Re-importing how? From the compressed XML full history dumps?
The WMF actually compresses all text blobs, not just old versions.
Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate? Is WMF using gzip or object?
A complete download and decompression of simple took only 20 minutes on my two-year-old consumer-grade laptop with a standard home cable internet connection; the same download on the Toolserver (minus decompression) was 88s. Yes, importing will take a little longer, but it shouldn't be that big of a deal.
For the full history English Wikipedia it *is* a big deal.
If you think it isn't, stop playing with simple.wikipedia, and tell us how long it takes to get a mirror up and running of en.wikipedia.
Do you plan to run compressOld.php? Are you going to import everything in plain text first, and *then* start compressing? Seems like an awful lot of wasted hard drive space.
There will also be some needed cleanup tasks. However, the main issue, archiving and restoring WMF wikis, isn't really an issue; with moderately recent hardware it is no big deal. I'm putting my money where my mouth is and getting actual, valid stats and figures. Yes, it may not be an exactly 1:1 ratio when scaling up, but given the basics of how importing a dump functions it should remain close to the same ratio.
If you want to put your money where your mouth is, import en.wikipedia. It'll only take 5 days, right?
On Thu, May 17, 2012 at 1:52 AM, Anthony wikimail@inbox.org wrote:
Amazing. I need to switch all my databases to MySQL. It can insert X rows per Y time frame, regardless of whether the database is 20 gigabytes or 20 terabytes in size ...
When referring to X over Y time, it's an average of, say, 1000 revisions per 1 minute; any X over Y period must be considered with averages in mind, or getting a count wouldn't be possible.
Yes, rebuilding the externallinks, links, and langlinks tables will take some additional time and won't scale.
And this is part of the process too, right?
That does not need to be completed prior to the site going live; it can be done after making the site public, so that part isn't a blocker.
Re-importing how? From the compressed XML full history dumps?
Is http://www.mediawiki.org/wiki/Manual:Text_table still accurate? Is WMF using gzip or object?
For the full history English Wikipedia it *is* a big deal.
If you think it isn't, stop playing with simple.wikipedia, and tell us how long it takes to get a mirror up and running of en.wikipedia.
Do you plan to run compressOld.php? Are you going to import everything in plain text first, and *then* start compressing? Seems like an awful lot of wasted hard drive space.
If you set up your server/hardware correctly it will compress the text information during insertion into the database; compressOld.php is actually designed only for cases where you start with an uncompressed configuration.
If you want to put your money where your mouth is, import en.wikipedia. It'll only take 5 days, right?
If I actually had a server or the disk space to do it I would, just to prove your smartass comments as stupid as they actually are. However, given my current resource limitations (fairly crappy internet connection, older laptops, and lack of HDD space) I tried to select something that could give reliable benchmarks. If you're willing to foot the bill for the new hardware I'll gladly prove my point.
On Thu, May 17, 2012 at 6:06 AM, John phoenixoverride@gmail.com wrote:
If you're willing to foot the bill for the new hardware I'll gladly prove my point.
Given the millions of dollars that Wikipedia has, it should not be a problem to provide such resources for a good cause like this.
On Thu, May 17, 2012 at 2:06 AM, John phoenixoverride@gmail.com wrote:
When referring to X over Y time, it's an average of, say, 1000 revisions per 1 minute; any X over Y period must be considered with averages in mind, or getting a count wouldn't be possible.
The *average* en.wikipedia revision is more than twice the size of the *average* simple.wikipedia revision. The *average* performance of a 20 gig database is faster than the *average* performance of a 20 terabyte database. The *average* performance of your laptop's thumb drive is different from the *average* performance of a(n array of) drive(s) which can handle 20 terabytes of data.
If you set up your server/hardware correctly it will compress the text information during insertion into the database ...
Is this how you set up your simple.wikipedia test? How long does it take to import the data if you're using the same compression mechanism as WMF (which you didn't answer, but I assume is concatenation and compression)? How exactly does this work "during insertion", anyway? Does it intelligently group sets of revisions together to avoid decompressing and recompressing the same revision several times? I suppose it's possible, but that would introduce quite a lot of complication into the import script, slowing things down dramatically.
What about the answers to my other questions?
If I actually had a server or the disk space to do it I would ... If you're willing to foot the bill for the new hardware I'll gladly prove my point.
What you seem to be saying is that you're *not* putting your money where your mouth is.
Anyway, if you want, I'll make a deal with you. A neutral third party rents the hardware at Amazon Web Services (AWS). We import the simple.wikipedia full history (concatenating and compressing during import). We take the ratio of the number of revisions in en.wikipedia to the number in simple.wikipedia. We import the en.wikipedia full history (concatenating and compressing during import). If the ratio of the time it takes to import en.wikipedia vs. simple.wikipedia is greater than or equal to twice the revision ratio, then you reimburse the third party. If the ratio of import time is less than twice the revision ratio (you claim it is linear, therefore it'll be the same ratio), then I reimburse the third party.
Either way, we save the new dump, with the processing already done, and send it to archive.org (and WMF if they're willing to host it). So we actually get a useful result out of this. It's not just for the purpose of settling an argument.
Either of us can concede defeat at any point, and stop the experiment. At that point if the neutral third party wishes to pay to continue the job, s/he would be responsible for the additional costs.
Shouldn't be too expensive. If you concede defeat after 5 days, then your CPU-time costs are $54 (assuming Extra Large High Memory Instance). Including 4 terabytes of EBS (which should be enough if you compress on the fly) for 5 days should be less than $100.
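Spelled out, the wager's decision rule and the quoted CPU figure fit in a few lines; the hourly rate below is simply the one implied by the $54 figure, not a price quote:

# The wager's decision rule as stated above.
def who_reimburses(t_en, t_simple, revs_en, revs_simple):
    time_ratio = t_en / t_simple
    rev_ratio = revs_en / revs_simple
    # At or beyond twice the revision ratio, scaling was clearly worse than linear.
    return "John" if time_ratio >= 2 * rev_ratio else "Anthony"

# The CPU figure quoted above: 5 days on a High-Memory Extra Large instance.
cpu_cost = 5 * 24 * 0.45    # $0.45/hour is the rate implied by the $54 figure
print(cpu_cost)             # 54.0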
I'm tempted to do it even if you don't take the bet.
I'd like to point out that the increasingly technical nature of this conversation probably belongs either on wikitech-l, or off-list, and that the strident nature of the comments is fast approaching inappropriate.
Alex
Wikimedia-l list administrator
On Thu, May 17, 2012 at 7:27 AM, J Alexandr Ledbury-Romanov alexandrdmitriromanov@gmail.com wrote:
I'd like to point out that the increasingly technical nature of this conversation probably belongs either on wikitech-l, or off-list, and that the strident nature of the comments is fast approaching inappropriate.
Really? I think we're really getting somewhere.
In fact, I think someone at WMF should contact Amazon and see if they'll let us conduct the experiment for free, in exchange for us creating the dump for them to host as a public data set (http://aws.amazon.com/publicdatasets/).
In case you got lost in the technical details, the original post was asking "Has anyone recently set up a full-external-duplicate of (for instance) en.wp?" and suggesting that we should do this on a yearly basis as a fire drill.
My latest post was a concrete proposal for doing exactly that.
Please have someone at WMF coordinate this so that there aren't multiple requests made. In my opinion, it should preferably be made by a WMF employee.
Fill out the form at https://aws-portal.amazon.com/gp/aws/html-forms-controller/aws-dataset-inqui...
Tell them you want to create a public data set which is a snapshot of the English Wikipedia. We can coordinate any questions, and any implementation details, on a separate list.
On 17/05/12 12:49, Anthony wrote:
Please have someone at WMF coordinate this so that there aren't multiple requests made. ...
That's a fantastic idea, and it would give en.wikipedia yet another public replica for very little effort. I would imagine that if they are willing to host enwiki, they may also be willing to host most, or all, of the other projects.
It will also mean that running Wikipedia data-munching experiments on EC2 will become much easier.
Neil
On 17 May 2012 12:43, Anthony wikimail@inbox.org wrote:
In fact, I think someone at WMF should contact Amazon and see if they'll let us conduct the experiment for free, in exchange for us creating the dump for them to host as a public data set (http://aws.amazon.com/publicdatasets/).
What dump are you going to create? You are starting from a dump, why can't Amazon just host that?
On Thu, May 17, 2012 at 8:11 AM, Thomas Dalton thomas.dalton@gmail.com wrote:
What dump are you going to create? You are starting from a dump, why can't Amazon just host that?
Because the XML dump is semi-useless - it's compressed in all the wrong places to use for an actual running system.
Anyway, looking at how the AWS Public Data Sets work, it probably would be best not to even create a dump, but just put up the running (object compressed) database.
On Thu, May 17, 2012 at 07:43:09AM -0400, Anthony wrote:
In fact, I think someone at WMF should contact Amazon and see if they'll let us conduct the experiment for free, in exchange for us creating the dump for them to host as a public data set (http://aws.amazon.com/publicdatasets/).
That sounds like an excellent plan. At the same time, it might be useful to get Archive Team involved.
* They have warm bodies. (Always useful; one can never have enough volunteers. ;)
* They have experience with very large datasets.
* They'd be very happy to help (it's their mission).
* Some of them may be able to provide Sufficient Storage(tm) and server capacity. Saves us the Amazon AWS bill.
* We might set a precedent where others might provide their data to AT directly too.
AT's mission dovetails nicely with ours. We provide the sum of all human knowledge to people. AT ensures that the sum of all human knowledge is not subtracted from.
sincerely, Kim Bruning
Hello people, I have completed my first set, uploading the osm/fosm dataset (350 GB unpacked) to archive.org: http://osmopenlayers.blogspot.de/2012/05/upload-finished.html
We can do something similar with Wikipedia. The bucket size of archive.org is 10 GB, so we need to split up the data in a way that is useful. I have done this by putting each object on one line; each file contains the full data records plus the parts that belong to the previous block and the next block, so you are able to process the blocks almost stand-alone (roughly as sketched below).
mike
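A rough illustration of the layout Mike describes: one record per line, with each chunk carrying copies of its neighbours' edge records so it can be processed on its own. The chunk size and overlap below are arbitrary:

# Sketch: split newline-delimited records into chunks, each padded with a few
# records from the previous and next chunk so every chunk works stand-alone.
def chunk_with_overlap(records, chunk_size=100_000, overlap=50):
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    padded = []
    for i, chunk in enumerate(chunks):
        before = chunks[i - 1][-overlap:] if i > 0 else []
        after = chunks[i + 1][:overlap] if i + 1 < len(chunks) else []
        padded.append(before + chunk + after)
    return padded

# Each padded chunk would then be written out as its own archive.org item.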
There is no such 10GB limit, http://archive.org/details/ARCHIVETEAM-YV-6360017-6399947 (238 GB example)
ArchiveTeam/WikiTeam is uploading some dumps to Internet Archive, if you want to join the effort use the mailing list https://groups.google.com/group/wikiteam-discuss to avoid wasting resources.
There is no 10 GB limit, but it is the recommended bucket size if you want to split up the file, according to my recent discussion with the archive.org team; they have been helping me optimize the storage. My idea is to make smaller blocks that can be fetched quickly, so that people reading an article, for example, could load just the data needed to display it, available via JSON(P) or XML/text from a file. We could run Wikipedia in a read-only mode hosted entirely on archive.org, without a database server, by encoding the search binary trees as JSON data also stored on archive.org; the clients can perform the searches themselves. That is my current research on fosm.org, and I hope it can apply to Wikipedia as well. mike
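As a toy version of that "no database server" lookup: ship a pre-sorted title index as static JSON and binary-search it on the client. The file name and layout here are invented for the example:

# Sketch: find which static block holds an article, using only a pre-sorted
# JSON index of the form [["Aardvark", "block-0001"], ...].
import bisect
import json

with open("index.json") as f:          # invented filename/layout
    index = json.load(f)
titles = [title for title, _block in index]

def find_block(title):
    i = bisect.bisect_left(titles, title)
    if i < len(titles) and titles[i] == title:
        return index[i][1]             # the block file to fetch from archive.org
    return None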
On Thu, May 17, 2012 at 12:13 AM, John phoenixoverride@gmail.com wrote:
That two-week estimate was a worst-case scenario. In the best case we are talking about as little as a few hours for the smaller wikis, up to five days or so for a project the size of enwiki. (See http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-May/000491.html for progress on image dumps.)
Where are you getting these figures from?
Are you talking about a full history copy?
Also, what about the copyright issues (especially, attribution)?
Well, to be honest, I am still upset about how much data is deleted from Wikipedia because it is not "notable"; there are so many articles I might be interested in that are lost in the same garbage bin as spam and other things. We should make non-notable (and non-harmful) articles available in the backups as well. mike
On Thu, May 17, 2012 at 1:11 PM, John phoenixoverride@gmail.com wrote:
I know from experience that a wiki can be rebuilt from any one of the dumps that are provided; pages-meta-current, for example, contains everything needed to reboot a site except its user database ...
How would we regain control of our existing usernames in the event that the user database was lost in the move?
-- John Vandenberg
On Thu, May 17, 2012 at 1:14 PM, John Vandenberg jayvdb@gmail.com wrote:
How would we regain control of our existing usernames in the event that the user database was lost in the move?
That would be up to the end project to decide. Ideally they shouldn't, unless you can somehow prove it was you; otherwise there are possible issues with mis-attribution if someone else managed to regain the account.
If both are accessible, I've seen an extension that allows you to claim your username. I saw it in action when Wowpedia forked from the Wikia WoWWiki: they let people claim their old usernames with an edit (and a code in the edit summary, IIRC) on the other wiki.
James
The only issues for preservation/forking of Wikimedia projects by third parties are the missing image dumps (which have been getting created since a few days ago, thanks Ariel) and the usernames/passwords table (not a big problem in an apocalyptic scenario, where articles and images have top priority).
We at WikiTeam are uploading wiki dumps to Internet Archive, and recently some official mirrors of Wikimedia dumps (articles + images) are being created around the globe (currently in 3 different locations).
I think we are taking great steps in the last year.
WARNING: The following post is a work of technical fantasy rather than practical reality.
On the usernames and passwords thing: if we imagine our doomsday scenario (a meteor hits the WMF data centre, the Foundation turns into evil psychopathic Nazis, whatever), one thing that might be useful, and that archive-oriented developers might want to consider, would be some way of 'namespacing' usernames. That way a fork/new version could specify that, say, all the usernames on all the existing content are usernames on en.wikipedia.org, and distinguish those from the usernames on the post-apocalyptic Wikipedia. That way we keep the attribution chain to the old usernames without the issue of identity theft.
It'd also be a good step towards attribution in distributed wikis. This might be for something like a future attempt at Citizendium (or perhaps someone wants to make a version of Wikipedia with pending changes or the image filter or one of the other many things the community cannot agree on).
In addition, it would be useful to be able to distinguish with usernames on sites that reuse Commons images (if I upload an image to Commons with the username 'Tom Morris' and then some non-WMF wiki reuses it, it may be attributing it to the local user 'Tom Morris' rather than the Commons user).
Finally, it'd be potentially useful for wikis which use some Wikipedia content combined with some local content. For instance, I know wikiqueer.org uses Wikipedia content with attribution, and combines the encyclopaedic content of Wikipedia with non-encyclopedic community content that wouldn't meet up with Wikipedia's mission or NPOV (they have the supposedly very controversial POV that LGBT people deserve equal rights).
In all these cases, as well as our potential doomsday scenario, being able to clearly distinguish between local usernames and usernames on other wikis might be quite useful. The inner semantic-web dork in me suggests that perhaps we could consider using something like a uniform resource identifier (URI) to identify users. ;-)
We could also consider the possibility of allowing users to use OpenID or OAuth or whatever the web identity mechanism du jour is to allow loose affiliation of usernames between MediaWiki installs. That way you can establish the link between identities across wikis (of course, if you don't want to, you don't have to).
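As a purely illustrative toy example of the URI idea (the layout and names are invented):

# Sketch: attribute a revision to a home-wiki user via a URI, so a fork's
# local "Tom Morris" cannot be confused with en.wikipedia's "Tom Morris".
def contributor_uri(home_wiki: str, username: str) -> str:
    return f"https://{home_wiki}/wiki/User:{username.replace(' ', '_')}"

revision = {
    "text": "...",
    "contributor": contributor_uri("en.wikipedia.org", "Tom Morris"),
}
print(revision["contributor"])  # https://en.wikipedia.org/wiki/User:Tom_Morris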
On Thu, May 17, 2012 at 8:31 AM, Tom Morris tom@tommorris.org wrote:
We could also consider the possibility of allowing users to use OpenID or OAuth or whatever the web identity mechanism du jour is to allow loose affiliation of usernames between MediaWiki installs. That way you can establish the link between identities across wikis (of course, if you don't want to, you don't have to).
Also, there's http://en.wikipedia.org/wiki/Template:User_committed_identity
But most people don't seem to care about these things.
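For readers who haven't seen it: the committed-identity template amounts to publishing a cryptographic hash of a secret string on one's user page, so that revealing the string later proves you are the same person. A minimal sketch, with SHA-512 as an example digest:

# Sketch: commit to an identity now, prove it later by revealing the secret.
import hashlib

def commitment(secret: str) -> str:
    return hashlib.sha512(secret.encode("utf-8")).hexdigest()

# Published on the user page today:
public_hash = commitment("real name + e-mail + a long random passphrase")

# Years later, on a rebuilt wiki, revealing the same string proves control:
assert commitment("real name + e-mail + a long random passphrase") == public_hash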
On Thursday, 17 May 2012 at 13:34, Anthony wrote:
Also, there's http://en.wikipedia.org/wiki/Template:User_committed_identity
But most people don't seem to care about these things.
Sure, the use cases of Committed Identities are slightly different.