Xmldatadumps-l November 2011

xmldatadumps-l@lists.wikimedia.org

11 participants
8 discussions

Re: [Xmldatadumps-l] [Wikitech-l] Fwd: Old English Wikipedia image dump from 2005

by John Vandenberg

On Fri, Nov 11, 2011 at 11:18 PM, emijrp <emijrp(a)gmail.com> wrote:
> Forwarding...
>
> ---------- Forwarded message ----------
> From: emijrp <emijrp(a)gmail.com>
> Date: 2011/11/11
> Subject: Old English Wikipedia image dump from 2005
> To: wikiteam-discuss(a)googlegroups.com
>
> Hi all;
>
> I want to share with you this Archive Team link[1]. It is an old English Wikipedia image dump from 2005. One of the last ones, probably, before Wikimedia Foundation stopped publishing image dumps. Enjoy.
>
> Regards,
> emijrp
>
> [1] http://www.archive.org/details/wikimedia-image-dump-2005-11

People interested in image dumps may also be interested in my post relating to the GFDL requirements, which I think mean images need to be included in the dumps.

https://meta.wikimedia.org/w/index.php?title=Talk:Terms_of_use&diff=prev&ol…

excerpt:

"..the [GFDL] license requires that someone can download a ''complete'' Transparent copy for one year after the last Opaque copy is distributed. As a result, I believe the BoT needs to ensure that the dumps are available ''and'' that they can be available for one year after WMF turns of the lights on the core servers (it allows 'agents' to provide this service). As Wikipedia contains images, the images are required to be included. .."

discussion continues ..

https://meta.wikimedia.org/wiki/Talk:Terms_of_use#Right_to_Fork

--
John Vandenberg

12 years, 2 months

Stopped mirroring?

by Hydriz Wikipedia

Hi guys,

While I was trying to rsync all the dumps available on the mirror site, I noticed that the latest dump available was made in early October. What happened to the newer dumps on the Brazilian mirror? Has it stopped following updates from dumps.wikimedia.org?

BTW, if any of you were following the mirroring XML dumps talk page, I have just sent an email asking AARNet to mirror our dumps too. They have about 84TB of space available and are located in Australia. Cross your fingers!

Regards,
Hydriz

12 years, 4 months

"add/changes" dumps... highly experimental, you have been warned

by Ariel T. Glenn

Hello folks,

So this has been running all of one day now, and I expect it to break in wild and crazy ways over the next period while we get the bugs out. But, throwing caution to the winds...

I'm generating dumps each day for each non-closed non-private project, of revisions added since the previous day. It uses the standard xml format, writing out stubs and history files.

This is a sort of poor person's incremental dump. What do I mean by that? Well... It doesn't contain a list of deletions, page moves, undeletes. It just dumps the metadata and text for every revision between X1 (last revision dumped the day before) and X2 (last revision in db as of the time it's dumped). The reason for that? Dumping a range of revisions is relatively easy. Accounting for page deletions, moves etc. since the previous dump is hard, so that is an exercise left to the reader :-P

Even with these limitations I'm hoping the data will be useful to folks. These are specifically *not* intended to be kept around forever; we'll keep some reasonable number, 20-30 of them, and then start tossing old ones after that.

A note about the timing of the dumps: they run once a day, there's no progress reporting. An updated index file is published near the end of the day. Also, we dump content with a delay of 12 hours, to allow admins to delete things that might contain sensitive information. This was less of a concern for dumps generated once a week, but daily runs increase the odds of something bad getting dumped.

And speaking of the index file, it's here: http://dumps.wikimedia.org/other/incr/

Guess I'll add some documentation on wikitech too. The code is in my branch in svn, see http://svn.wikimedia.org/viewvc/mediawiki/branches/ariel/xmldumps-backup/in…

I may well be patching things tomorrow at this time for jobs that failed to run, so feel free to point out issues, but also don't be surprised by frequent outages.

Happy trails,
Ariel
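For readers who want a concrete picture of the "range of revisions" idea, here is a minimal sketch in Python. It is not the code from the svn branch mentioned above. The function name `select_incremental`, the in-memory list of revision dicts, and the way the 12-hour delay is turned into an upper bound are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def select_incremental(revisions, last_dumped_id, now, delay=timedelta(hours=12)):
    """Pick the revisions for one daily run (hypothetical helper).

    revisions: iterable of dicts with "id" (int) and "timestamp" (datetime).
    last_dumped_id: X1, the last revision id written by the previous run.
    Returns (batch, new_last_dumped_id), where batch holds every revision
    with X1 < id <= X2, sorted by id.
    """
    revisions = list(revisions)
    # Assumption: X2 is the highest revision id that is at least `delay` old.
    cutoff = now - delay
    eligible_ids = [r["id"] for r in revisions if r["timestamp"] <= cutoff]
    upper = max(eligible_ids, default=last_dumped_id)
    batch = sorted(
        (r for r in revisions if last_dumped_id < r["id"] <= upper),
        key=lambda r: r["id"],
    )
    # Never move the marker backwards if nothing new is eligible yet.
    return batch, max(upper, last_dumped_id)


if __name__ == "__main__":
    sample = [
        {"id": 1, "timestamp": datetime(2011, 11, 13, 9, 0)},
        {"id": 2, "timestamp": datetime(2011, 11, 14, 10, 0)},
        {"id": 3, "timestamp": datetime(2011, 11, 15, 6, 0)},
    ]
    batch, marker = select_incremental(sample, 1, datetime(2011, 11, 15, 12, 0))
    print([r["id"] for r in batch], marker)
```

As the message says, a selection like this knows nothing about deletions, moves or undeletes, so anyone applying such a batch to a local copy still has to handle those separately.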

12 years, 5 months

Re: [Xmldatadumps-l] Xmldatadumps-l Digest, Vol 21, Issue 2 - proposals for categories of Commons

by burslem

> Ariel:
>> Providing multiple terabyte sized files for download doesn't make any kind of sense to me.
>> However, if we get concrete proposals for categories of Commons images people really want
>> and would use, we can put those together. I think this has been said before on wikitech-l if not here.

The Picture of the Year (POTY) collections are truly stunning! I am not very interested in having terabytes of random snapshots on my computer; instead I find smaller collections of "best of the best" much more suitable for the public. This way, they will be accessible to those with smaller amounts of disk space, and those people will be equally impressed.

I doubt there are many people interested in the image tarballs at all; they're just going for the principle of accessibility. Presumably Wikipedia has plenty of back-up capabilities and there are enough gurus doing everything to prevent possible data loss. Offering public back-ups has no additional value from this perspective. Most of us probably use the Wikipedia XML dumps for offline usage or research. I have not yet come across image research projects requiring tens of terabytes of images to be successful!

I say, if the POTY downloads are popular according to statistics, why not compile a couple more years? The thing I'm talking about is hosted at http://dumps.wikimedia.org/other/poty/ .

12 years, 5 months

No new German dump

by Andreas Meier

Hello, today the ja dump finished, so a new de dump should start. Best regards, Andreas

12 years, 5 months

inter-page links in the data dump

by Greg Morrison

I am interested in looking at the links between pages on Wikipedia for scientific research. I have been to http://en.wikipedia.org/wiki/Wikipedia:Database_download which suggested that the latest pages-articles is likely the one people want. However, I'm unclear on some things.

(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different files, and I can't tell whether one of them contains only link information. Is there a description of what each file contains?

(2) The enwiki-latest-pages-articles.xml file uncompresses to 31.55GB. Is it correct that this contains the current snapshot of all pages and articles in Wikipedia? (I only ask because this seems small.)

(3) If I am constrained to use latest-pages-articles.xml, I'm unclear on the method used to denote a link. It would appear that links are denoted by [[link]] or [[link | word]]. Such patterns would be fairly easy to find using perl. However, I've noticed some odd cases, such as "[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first to formulate ...... in his work".<ref name="EB1910" />]]"

If I must search through the pages-articles file, and if the [[ ]] notation is overloaded, is there a description of the patterns that are used in this file? I.e. a way for me to ensure that I'm only grabbing links, not figure captions or some other content.

Thanks for your help!
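The nested case in question (3) is why a single regular expression tends to fall short: a [[File:...]] span can itself contain ordinary [[...]] links in its caption. The following Python sketch is one possible approach, not an official parser. The function name `extract_links` and the list of skipped prefixes are illustrative assumptions, and it deliberately ignores templates, <nowiki> sections and other wikitext features.

```python
# Namespace prefixes treated as non-article links (illustrative, incomplete list).
SKIP_PREFIXES = ("file:", "image:", "category:")


def extract_links(wikitext):
    """Return the targets of [[...]] links, including links nested in captions."""
    targets = []
    stack = []  # offsets just past each currently open "[["
    i = 0
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "[[":
            stack.append(i + 2)
            i += 2
        elif pair == "]]" and stack:
            start = stack.pop()
            body = wikitext[start:i]
            # The link target is everything before the first pipe.
            target = body.split("|", 1)[0].strip()
            if target and not target.lower().startswith(SKIP_PREFIXES):
                targets.append(target)
            i += 2
        else:
            i += 1
    return targets


if __name__ == "__main__":
    sample = (
        '[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the first"]] '
        "was an [[anarchism|anarchist]]; see also [[link]]."
    )
    print(extract_links(sample))
```

Because inner links close before the span that encloses them, the caption link is reported while the enclosing File: span is dropped. Targets are returned as written, so normalising case, underscores and section anchors is left to the caller.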

12 years, 5 months

dumps halted for updates

by Ariel T. Glenn

Some OS security updates were installed and there are code updates that have just been deployed, so dumps are halted for a little while; I'll be resuming them later today. Ariel

12 years, 5 months

dump titles from other namespaces than 0?

by Ariel T. Glenn

A while back (over 2 years ago, urk!) we had a request for dumps of titles of things other than articles [1]. I haven't seen that request repeated, but I'm wondering how useful that would be to folks and which namespaces we should dump, if we were going to add a few. Article talk pages? Other? Ariel [1] https://bugzilla.wikimedia.org/show_bug.cgi?id=19542

12 years, 5 months
