I am interested in looking at the links between webpages on Wikipedia
for scientific research. I have been to a page that suggested the latest
pages-articles dump is likely the one people want. However, I'm unclear
on some things.
(1) http://dumps.wikimedia.org/enwiki/latest/ has a lot of different
files, and I can't tell whether any of them contains only link
information. Is there a description of what each file contains?
(2) The enwiki-latest-pages-articles.xml file uncompresses to
31.55 GB. Is it correct that this contains the current snapshot of all
pages and articles in Wikipedia? (I only ask because this seems
surprisingly small.)
(3) If I am constrained to use latest-pages-articles.xml, I'm unclear
on the method used to denote a link. It would appear that links are
denoted by [[link]] or [[link | word]]. Such patterns would be fairly
easy to find using perl. However, I've noticed some odd cases, such as
"[[File:WilliamGodwin.jpg|left|thumb|[[William Godwin]], "the
first to formulate ...... in his
If I must search through the pages-articles file, and if the [[ ]]
notation is overloaded, is there a description of the patterns that
are used in this file? I.e., a way for me to ensure that I'm grabbing
only links, not figure captions or some other content.
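For concreteness, this is the kind of extraction I have in mind (a rough
Python sketch; the regex and the namespace prefixes are my own guesses,
not anything documented):

    import re

    # Innermost [[...]] spans only, i.e. spans with no nested brackets.
    # The link target may not contain '|', '[' or ']'; any display text
    # after the first '|' is ignored.
    LINK_RE = re.compile(r'\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]')

    # Namespace prefixes to skip -- illustrative, not exhaustive.
    SKIP_PREFIXES = ('file:', 'image:', 'category:')

    def extract_links(wikitext):
        """Return plain article link targets found in a page's wikitext."""
        links = []
        for match in LINK_RE.finditer(wikitext):
            target = match.group(1).strip()
            if target.lower().startswith(SKIP_PREFIXES):
                continue
            links.append(target)
        return links

    # The William Godwin case above: the outer [[File:...]] span contains
    # nested brackets, so it never matches; the inner [[William Godwin]]
    # is still found on its own.
    print(extract_links('[[File:W.jpg|left|thumb|[[William Godwin]], "..."]]'))

One would feed this the text of each <page> element (via a streaming XML
parser, say) rather than running regexes over the whole 31 GB file at once.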
Thanks for your help!
I'd like to know if there is a webpage that is more of a walk-through on
how to create a mirror of Wikipedia. A few people have asked me how
this can be done here at CLEF2011 and I was not able to give them a
good answer.
-- とある白い猫 (To Aru Shiroi Neko)
I've shot all but a couple jobs, and those will die before I go to bed.
Tomorrow after the first round of deployments and fixes has gone around
and the dust has settled, I'll crank the dumps back up until it's time
for the second round.
As people prepare for the "het deployment" of mw 1.18 you are going to
see some interruptions of the dumps while I get these hosts ready for
the switch. As a side effect of one of the configuration changes, the
host running the "large-ish" wikis is temporarily not running any jobs.
I should be starting those jobs again on Monday. The four "small wiki"
processes should be ok.
When we get around to the actual deployment, I will likely stop all jobs
until deployment is complete, as I cannot begin to guess what the output
would look like if the codebase were switched out underneath the dumpers
in the middle of a run.
Just like the scripts to preserve wikis, I'm working on a new script to
download all Wikimedia Commons images, packed by day. But I have limited
spare time. It is sad that volunteers have to do this without any help
from the Foundation.
I also started an effort on meta: (with low activity) to mirror the XML dumps.
If you know about universities or research groups that work with
Wiki[pm]edia XML dumps, they would be a possible successful target to mirror
them.
If you want to download the texts onto your PC, you only need 100 GB free
and to run this Python script.
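The idea is roughly this (a sketch, not the actual script; the file shown
is just the latest pages-articles dump mentioned in the earlier thread):

    import urllib.request

    URL = ("http://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")

    def fetch(url, dest):
        """Stream a dump to disk in 1 MiB chunks, never holding it in memory."""
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            while True:
                chunk = resp.read(1 << 20)
                if not chunk:
                    break
                out.write(chunk)

    fetch(URL, "enwiki-latest-pages-articles.xml.bz2")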
I heard that the Internet Archive saves the XML dumps quarterly or so, but
there has been no official announcement. Also, I heard about the Library of
Congress wanting to mirror the dumps, but no news for a long time.
L'Encyclopédie has an "uptime" of 260 years and growing. Will
Wiki[pm]edia projects reach that?
2011/6/2 Fae <faenwp(a)gmail.com>
> I'm taking part in an images discussion workshop with a number of
> academics tomorrow and could do with a statement about the WMF's long
> term commitment to supporting Wikimedia Commons (and other projects)
> in terms of the public availability of media. Is there an official
> published policy I can point to that includes, say, a 10 year or 100
> year commitment?
> If it exists, this would be a key factor for researchers choosing
> where to share their images with the public.
> Guide to email tags: http://j.mp/faetags
The September en wikipedia dumps are done. Folks who use them, note
that this is the first run with the generation of a pile of smaller
files. The naming scheme, as you will have noticed, has an additional
string: -p<first-page-id-contained>p<last-page-id-contained>. Expect the
specific groupings to change from one run to the next; they're time-based,
rather than based on the number of pages or revisions.
You may notice a gap of a few numbers between files; this would indicate
that those pages were deleted and not included in the dump at all.
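If it helps, spotting those gaps from the file names could look something
like this (a sketch; the two names below are made up for illustration):

    import re

    # The -p<first>p<last> suffix described above.
    RANGE_RE = re.compile(r'-p(\d+)p(\d+)')

    def page_range(filename):
        """Return the (first, last) page ids encoded in a dump file name."""
        m = RANGE_RE.search(filename)
        if m is None:
            raise ValueError("no -p<first>p<last> suffix in %r" % filename)
        return int(m.group(1)), int(m.group(2))

    names = sorted([
        "enwiki-20110901-pages-articles1.xml-p000000010p000002289.bz2",
        "enwiki-20110901-pages-articles2.xml-p000002292p000004575.bz2",
    ])
    prev_last = None
    for name in names:
        first, last = page_range(name)
        if prev_last is not None and first > prev_last + 1:
            # Ids in between belong to deleted pages, absent from the dump.
            print("gap: page ids %d-%d missing" % (prev_last + 1, first - 1))
        prev_last = last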
Since there were no issues with the network, database servers, broken MW
deployments etc., the run finished without any need for restarts of a
particular step; this is probably the fastest we'll ever see it run, in
a little under 8 days.
Any issues, please let me know. I expect people will need a script to
download these files easily; didn't someone on this list have a tool in
the works?
...another dump. August is done, July 7z are done, the last of the May
history and 7z are done. That brings us up to date.
I expect to test new code with production of many small files, as
previously discussed on this list, starting within the next few days.
This test will be for en wikipedia only, as that's the dump that's
hardest to run to completion. The results might be a perfectly good
dump, or not. Even if they are, I do not plan to try running en
wikipedia dumps twice a month, so don't get your hopes up. (Who would
process all that data every two weeks anyway?)