Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are compatible in both
directions: each one can create archives that the other can read. When
it comes to decompressing, however, only pbzip2-compressed archives are
a good fit for pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
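For illustration, a rough sketch of the commands involved (the filename is
just a placeholder; -p sets the number of CPUs to use):

  # compress with 8 CPUs instead of one
  pbzip2 -p8 enwiki-latest-pages-articles.xml

  # regular bunzip2 users are unaffected
  bunzip2 enwiki-latest-pages-articles.xml.bz2

  # pbzip2 -d decompresses its own archives in parallel
  pbzip2 -d -p8 enwiki-latest-pages-articles.xml.bz2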
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Commercial register: AG Fürth, HRB-9201
Good morning campers! :-)
At the POTY collection link http://dumps.wikimedia.org/other/poty/
you'll notice that the 2009 files have been added.
For people who have wanted older (2002 through 2006) dumps, one or two
dumps of the projects for most of that period are now online at our new
archive link: http://dumps.wikimedia.org/archive/
I'd love to get a couple dumps of each of the projects for the years
2005, 2007 and 2008. If you have an old mirror on a hard drive
gathering dust in your closet, give me a shout. Thanks!
Hmm, and one other thing while I'm at it, I've been cleaning up our
dumps documentation on wikitech, and there's now a rough outline of the
dumps history here: http://wikitech.wikimedia.org/view/Dumps/History
If people remember any milestones that should be added there, or if you
see any glaring errors, please edit, or send me mail with the
corrections.
Ariel
On Fri, Nov 11, 2011 at 11:18 PM, emijrp <emijrp(a)gmail.com> wrote:
> Forwarding...
>
> ---------- Forwarded message ----------
> From: emijrp <emijrp(a)gmail.com>
> Date: 2011/11/11
> Subject: Old English Wikipedia image dump from 2005
> To: wikiteam-discuss(a)googlegroups.com
>
>
> Hi all;
>
> I want to share with you this Archive Team link[1]. It is an old English
> Wikipedia image dump from 2005. One of the last ones, probably, before
> Wikimedia Foundation stopped publishing image dumps. Enjoy.
>
> Regards,
> emijrp
>
> [1] http://www.archive.org/details/wikimedia-image-dump-2005-11
People interested in image dumps may also be interested in my post
relating to the GFDL requirements, which I think mean that images need
to be included in the dumps.
https://meta.wikimedia.org/w/index.php?title=Talk:Terms_of_use&diff=prev&ol…
excerpt:
"..the [GFDL] license requires that someone can download a
''complete'' Transparent copy for one year after the last Opaque copy
is distributed. As a result, I believe the BoT needs to ensure that
the dumps are available ''and'' that they can be available for one
year after WMF turns off the lights on the core servers (it allows
'agents' to provide this service). As Wikipedia contains images, the
images are required to be included. .."
discussion continues ..
https://meta.wikimedia.org/wiki/Talk:Terms_of_use#Right_to_Fork
--
John Vandenberg
Hello!
Taking as an example "Дом" from the Russian Wiktionary, the templates are not
rendered (the MediaWiki software doesn't find them either, as the red links show):
Морфологические и синтаксические свойства
Шаблон:сущ ru m ina 1c(1) <http://localhost/mediawiki-1.18.0/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%…>
Шаблон:морфо <http://localhost/mediawiki-1.18.0/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%…>
Let's take the example of the template "сущ ru m ina 1c(1)" <http://localhost/mediawiki-1.18.0/index.php?title=%D0%A8%D0%B0%D0%B1%D0%BB%…>
I have some questions:
- in the SQL generated from mwdumper the following command does not return
any result:
grep "'сущ ru m ina 1c(1)" ruktionary.sql" ruwiktionary.sql
- The string is also not to be found in the generated DB, as a consequence
More generally:
- Is there some special option for mwdumper, or a special dump to import, to
also get the templates?
- Is there any way to log the requests sent by mediawiki to the DB?
Thanks a lot in advance
Sebastien
On 9 January 2012 22:05, Platonides <platonides(a)gmail.com> wrote:
> On 08/01/12 23:03, Sébastien Druon wrote:
>
>> It seems that mwdumper did not import the templates (select * from page
>> where page_title like 'Шаблон:%' does not return anything), though they
>> are present in the xml dump.
>> Is there some special option to use?
>>
>
> The pages aren't stored with the namespace name as a literal, but with
> the namespace number.
> Try SELECT page_title FROM page WHERE page_namespace=10 LIMIT 5;
>
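For anyone doing a similar check: as far as I know, the page table stores
titles without the namespace prefix and with underscores instead of spaces,
so a query along these lines (template name taken from the earlier mail)
should show whether the template was imported:

  SELECT page_id, page_title
  FROM page
  WHERE page_namespace = 10
    AND page_title = 'сущ_ru_m_ina_1c(1)';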
by WOHLFAHRT Roland (Gemeindeinformatikzentrum Kärnten GI Z-K GmbH)
Hi everybody
I have a migration issue and tried your maintenance script rebuildImages.php.
See http://meta.wikimedia.org/wiki/Talk:Data_dumps#rebuildImages.php_fails
Environment:
- Windows 2008 r2 with xampp
- Mediawiki 1.18.0
I did the migration this way because I wanted a "fresh" mediawiki installation.
I found no good place to ask for support, because the script has no article on mediawiki.org.
Perhaps you can give me a small hint or workaround to get my wiki up and running. Thanks a lot.
Cheers,
Roland
Hello,
How is it possible to get the list of all the entries (words) of a
wiktionary?
For example, for the Russian Wiktionary, I want to get the list of all the
Russian entries (no other languages).
Thanks a lot in advance,
Sebastien
Hi!
What is the best/easiest way to get a parsed version (including template
resolution) of all entries of a wiktionary (for example, separate HTML files
for each entry)?
Thanks in advance
Sebastien
Dear list,
do you provide more information about the pages-logging dump somewhere?
While parsing it we came across some questions that we are trying to
clarify:
* Why are there only ~40 million logs (at over 450 million revisions)?
Which logs does the pages-logging dump / the "public" logging table
(not) contain? (We double-checked the number of logs on the database,
using our Toolserver account (logging table).)
* Can you give us more information about the TextElement that is defined
in the XML Schema?
Definition: <element name="text" type="mw:TextType"/>
In pages-logging it sometimes occurs as:
<text deleted="deleted" />
How is this being used?
Kind regards,
Katja Mueller