Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other one can read. But when it comes
to decompressing, only pbzip2-compressed archives work well with
pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for those users (see the sketch below).
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
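For what it's worth, on point 2: pbzip2 writes its output as a series
of independent bzip2 streams, so any compliant decompressor can read
it. A minimal sketch of an unchanged downstream consumer, assuming
Python 3.3+ (whose bz2 module reads multi-stream files transparently)
and the usual dump file name:

    import bz2

    # pbzip2 output is a concatenation of independent bzip2 streams;
    # bz2.open (Python 3.3+) decompresses multi-stream files
    # transparently, so consumers need no changes.
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt",
                  encoding="utf-8") as f:
        for line in f:
            pass  # process the XML exactly as before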
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
I created mw:Extension:InterwikiExistence
(https://www.mediawiki.org/wiki/Extension:InterwikiExistence),
which imports a file of Wikipedia page titles and then polls the Wikipedia
API to keep the local list of Wikipedia's pages up to date. But it helps if
it knows what timestamp to start its polling at. Specifically, system
administrators installing the extension need to know the date/time at which
the AllPages snapshot gzipped as
enwiki-latest-all-titles.gz
(http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles.gz)
was generated; that way, the API poll function can begin with the right
value in the rcstart (recentchanges) and lestart (logevents) parameters.
Can system administrators rely on the "Last Modified" date/time of the file
as that snapshot date/time? Or is a better date/time to use listed
somewhere else? Thanks.
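In case it helps to make this concrete, here is the kind of bootstrap
I have in mind - a minimal sketch, assuming the "Last Modified" header
is indeed the snapshot time, Python 3, and the third-party requests
library:

    import requests
    from email.utils import parsedate_to_datetime

    URL = ("http://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-all-titles.gz")

    # Fetch only the headers; Last-Modified is set by the dump server.
    resp = requests.head(URL, allow_redirects=True)
    last_modified = parsedate_to_datetime(resp.headers["Last-Modified"])

    # The API expects ISO 8601 timestamps, e.g. 2013-10-01T00:00:00Z,
    # usable as the rcstart and lestart values.
    print(last_modified.strftime("%Y-%m-%dT%H:%M:%SZ"))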
Good evening, I hope this is an appropriate place for this question.
I've been trying to import the current enwiki dump from October. Using
MWDumper, it runs fine until hitting 247,000 records, and then MySQL
throws one of the following two errors:
Error 2006 (I think it was): MySQL server has gone away
or
Error 2013: Lost connection to MySQL server
I've tried several things on the MySQL end, including increasing
innodb_log_file_size and max_allowed_packet - all suggestions from a
few web resources. Still to no avail.
I'm not sure where the true issue is or how to go about correcting it.
Any help would be greatly appreciated.
I'm running the latest version of MediaWiki, importing the October
2013 dump, with MySQL 5.6.12 via WAMP and MWDumper 1.16.
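For reference, these are the sorts of my.ini settings I've been
adjusting - the values below are just the guesses those web resources
suggested, not known-good numbers:

    [mysqld]
    # large revision texts can exceed the default packet size
    max_allowed_packet = 256M
    innodb_log_file_size = 512M
    # give long-running import statements more time before the server
    # drops the connection
    wait_timeout = 28800
    net_read_timeout = 600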
thanks in advance
Hi
I have a set of list page titles that I've extracted from the
categorylinks dump (where "cl_from" is of type "page").
http://www.mediawiki.org/wiki/Manual:Categorylinks_table
Now I want to extract the CONTENT of the page from the pages dump
enwiki-latest-pages-articles.xml
Although there are guidelines on how editors should title these pages
(http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lists):
("The titles of list articles typically begin with the type of list it is (List of, Index of, etc.), followed by the article's subject; like: List of vegetable oils.")
the majority of the time this rule is not followed. So my concrete question is:
- If I am consuming the pages-articles.xml dump (D) page by page, and I have a list of pages (L) that I've extracted from the categorylinks (C) dump, how can I check whether a page in dump D is a member of L? The titles do not match up directly.
For instance, if I have the page title "List of the longest Asian rivers" (http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers), then what in that page's content (http://en.wikipedia.org/w/index.php?title=List_of_the_longest_Asian_rivers&…) can tell me it is the same page, "List of the longest Asian rivers"? Non-list pages appear to place the title as the first token with ''' markings.
Any suggestions of a robust solution would be much appreciated.
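To make the question concrete, here is roughly what I have in mind - a
minimal sketch, assuming the <title> element of each <page> in the dump
is authoritative (so there should be no need to parse the ''' first
token), that categorylinks titles use underscores where the XML uses
spaces, and a hypothetical list_titles.txt holding L:

    import xml.etree.ElementTree as ET

    # namespace of the 2013 dumps; check the <mediawiki> root element
    NS = "{http://www.mediawiki.org/xml/export-0.8/}"

    def normalize(title):
        # categorylinks stores underscores; the XML dump uses spaces
        return title.replace("_", " ").strip()

    # L: the titles extracted from the categorylinks dump
    with open("list_titles.txt", encoding="utf-8") as f:
        wanted = {normalize(t) for t in f}

    # D: stream the pages dump and test membership on the <title> tag
    for _, elem in ET.iterparse("enwiki-latest-pages-articles.xml"):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            if title and normalize(title) in wanted:
                text = elem.findtext(NS + "revision/" + NS + "text")
                # ... process the list page's content here ...
            elem.clear()  # keep memory bounded

The point being that the dump already carries the canonical title for
each page, so the matching never has to look at the wikitext itself.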
Best
Paul A. Houle, 15/10/2013 00:35:
> I’d like to see the Commons backups available in the AMZN S3 cloud, even
> if it is only as “requester pays”. Frankly, my experience is that
> getting data from the Internet Archive is so slow that I wonder if they
> are on the Moon.
When did you last try? They recently increased their bandwidth.
> My infovore framework
> AMZN has had a policy of offering free S3 storage for public data sets –
> I’d like to see them take this program to the next level with data sets
> of this nature.
It seems anyone can request it; in any case, I sent an inquiry.
The datasets they have (XML dumps) are very outdated:
https://aws.amazon.com/datasets/Encyclopedic/4182
https://aws.amazon.com/datasets/Encyclopedic/2506
https://aws.amazon.com/datasets/Encyclopedic/2596
Anybody can ask other mirrors as well (I already asked GARR); some ideas are at
https://sourceforge.net/apps/trac/sourceforge/wiki/Mirrors
Nemo
Hey,
I'm setting up the database from the Wikipedia XML dumps. As you know,
if we import all the revision dumps, the database may grow to
approximately 22-25 terabytes. This size is huge; what is the
workaround for it?
If it is necessary to import all the XML dumps into the DB, then I
think common Linux filesystems such as ext4 only support files up to
16 TB. So how can we import into a single MySQL server deployed on
Linux?
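One workaround I am considering - just a sketch, assuming InnoDB - is
to give each table its own tablespace file, so that no single file has
to approach the filesystem's per-file limit:

    [mysqld]
    # store each table in its own .ibd file instead of one shared
    # ibdata file, so no single file holds the whole dataset
    innodb_file_per_table = 1

Even then, a single huge table (e.g. the revision text) could hit the
per-file limit on its own, so it may additionally need partitioning;
with file-per-table enabled, each partition gets its own file.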
Could someone also tell me how Wikipedia itself currently manages such
huge data in terms of software and hardware solutions?
Regards,
Imran