Hi,
I don't know if this issue has come up already; in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results produced a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other can read. But when it comes to
decompression, only pbzip2-compressed archives are good for pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for these people.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host (see the sketch below).
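A minimal Python sketch (the file names are placeholders) of how a
consumer could opportunistically prefer the parallel tool and fall back
to plain bzip2:

import shutil
import subprocess

def decompress_dump(archive, output):
    """Decompress a .bz2 dump, preferring parallel pbzip2 if installed."""
    # pbzip2 -d parallelizes only the multi-stream archives that pbzip2
    # itself creates; plain bzip2 reads those archives too, so falling
    # back is always safe.
    tool = shutil.which("pbzip2") or shutil.which("bzip2")
    if tool is None:
        raise RuntimeError("neither pbzip2 nor bzip2 found on PATH")
    with open(output, "wb") as out:
        subprocess.run([tool, "-dc", archive], stdout=out, check=True)

decompress_dump("enwiki-pages-articles.xml.bz2", "enwiki-pages-articles.xml")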
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Hi,
I am trying to extract translations from Wiktionaries in different languages.
Currently I use the "All pages, current versions only" dump. Is there a way
to find out the language template tags (is that the correct term?) for each
Wiktionary and each language?
For example:
This is the Hungarian page 'karcsu' (slim, slender)
http://hu.wiktionary.org/wiki/karcs%C3%BA (the edit page:
http://hu.wiktionary.org/w/index.php?title=karcs%C3%BA&action=edit)
The translation table always (?) starts like this:
{{-ford-}}
{{trans-top}}
*{{en}}: {{t|en|slim}}, {{t|en|slender}}
Where {{-ford-}} comes from the word forditas (translation in Hungarian; I
skipped the accents). The translations look like the third line and
(hopefully) contain the other languages' wiki codes (en, fr, de).
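Assuming the {{t|xx|word}} pattern quoted above holds, a rough Python
sketch for pulling translation pairs out of such wikitext (the optional
[+-] in the pattern is a guess covering the {{t+|...}} / {{t-|...}}
variants seen on some Wiktionaries):

import re

# Matches translation templates of the form {{t|<lang>|<word>}}.
TRANS_RE = re.compile(r"\{\{t[+-]?\|([a-z-]+)\|([^|}]+)")

wikitext = """{{-ford-}}
{{trans-top}}
*{{en}}: {{t|en|slim}}, {{t|en|slender}}"""

for lang, word in TRANS_RE.findall(wikitext):
    print(lang, word)  # -> en slim / en slender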
Also, on the page 'slim' in the Hungarian Wiktionary there are some tags
which nobody would understand unless they are Hungarian and have
learned some Hungarian grammar.
http://hu.wiktionary.org/wiki/slim and
http://hu.wiktionary.org/w/index.php?title=slim&action=edit
The first line is:
{{engmell|comp=slimmer|sup=slimmest|pron=/slɪm/|audio=us}}
Where 'engmell' is derived from 'english melleknev', melleknev meaning
adjective in Hungarian. The rest is similarly confusing.
It gets even more confusing if I look at other Wiktionaries. It seems that
there are no standards that all Wiktionaries follow.
Is this meta-information available somewhere?
I hope I managed to explain it clearly and I am asking on the right list.
Thank you in advance,
Judit Acs
Hello,
I am new to this list and have a question about importing XML dumps from
Wikipedia (http://dumps.wikimedia.org/enwiki/20121101/) into an offline
MediaWiki database. I have locally installed XAMPP on Windows 8 and replaced
the included 32-bit MySQL version with the latest 64-bit version. I then
installed MediaWiki 1.20.0 with an empty database.
When trying to import an XML dump (the Nov 2012 dump) with importDump.php in the
maintenance folder of the MediaWiki installation, I get the following error
after about 2 seconds:
"WikiRevision given a null title in import. You may need to adjust
$wgLegalTitleChars." which is thrown at line 1032 in Import.php, because
some $title seems to be null. Replacing the exception with "$this->title =
null" (evil ^^) leads to other errors.
xml2sql and mwdumper seem to be outdated as I cannot get them working with
the current dumps. Special:Import is not an option due to the size of the
XML files.
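For reference, a streaming scan along these lines could at least
identify the offending records before import (a Python 3 sketch; adjust
NS to whatever schema version the dump's root element declares):

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.7/}"  # check your dump's root element

def pages_with_null_title(path):
    """Yield ids of <page> elements whose <title> is missing or empty."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                if not elem.findtext(NS + "title"):
                    yield elem.findtext(NS + "id")
                elem.clear()  # keep memory bounded on multi-GB dumps

for pid in pages_with_null_title("enwiki-20121101-pages-articles.xml.bz2"):
    print("page with null/empty title, id =", pid)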
Any help would be appreciated :)
P.S. It's not the missing + in $wgLegalTitleChars, which is what a Google
search on that error suggests.
Best Regards
Chris
Yay, the network (or NFS) performance issues on your.org seem to have
been straightened out, and last month's full dump is available; this
month's is running now.
Ariel
This dump is failing and due to our MediaWiki config setup on the
production cluster we don't get the exception message so I have no idea
what the problem is. I'll do some live hacks and look at this tomorrow.
Thanks for your patience.
Ariel
I am looking to create a script for producing manual dumps of those
wikis that either don't or won't publish their own dumps and that I
don't have server access to. To that end I am writing a Python dump
creator; however, I would like to ensure that my format matches the
existing one. I could reverse-engineer it by looking at multiple
different dumps, but that takes a lot of time and is not foolproof.
Is there documentation on exactly how the XML dumps are formatted,
and if so, where can I get it?
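For what it's worth, the format is pinned down by the export XSD that
every dump references in its root element (e.g.
http://www.mediawiki.org/xml/export-0.8.xsd). As a rough, trimmed-down
Python sketch of the page/revision skeleton such a generator has to
emit (see the XSD for the full <siteinfo> and <revision> field sets):

from xml.sax.saxutils import escape

def write_minimal_dump(out, pages):
    """Write a skeletal MediaWiki XML dump; pages yields tuples of
    (title, ns, page_id, rev_id, timestamp, text)."""
    out.write('<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"'
              ' version="0.8" xml:lang="en">\n')
    # Real dumps open with a <siteinfo> block (sitename, namespaces, ...)
    # whose required children are listed in the XSD.
    for title, ns, page_id, rev_id, timestamp, text in pages:
        out.write("  <page>\n")
        out.write("    <title>%s</title>\n" % escape(title))
        out.write("    <ns>%d</ns>\n" % ns)
        out.write("    <id>%d</id>\n" % page_id)
        out.write("    <revision>\n")
        out.write("      <id>%d</id>\n" % rev_id)
        out.write("      <timestamp>%s</timestamp>\n" % timestamp)
        out.write('      <text xml:space="preserve">%s</text>\n' % escape(text))
        out.write("    </revision>\n")
        out.write("  </page>\n")
    out.write("</mediawiki>\n")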
Hi,
I am trying to create a project that has the above-mentioned information
and can be used to correct the metadata for songs.
Is there a data dump available that contains song data like song title,
singer, duration, music director, genre, album/movie, language and country
information?
If there is no such dump available, is there any tool to extract that
information from the entire pages-articles.xml dump of the English
Wikipedia?
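One approach, sketched under the assumption that the songs of interest
carry an {{Infobox song}} or {{Infobox single}} template (the infobox
field names vary between articles, so the ones below are only examples):

import bz2
import re
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.8/}"  # match your dump's schema
INFOBOX = re.compile(r"\{\{Infobox (?:song|single)(.*?)\n\}\}", re.S | re.I)
FIELD = re.compile(r"\|\s*(\w+)\s*=\s*(.+)")

def songs(path):
    """Yield (page title, infobox fields) for pages with a song infobox."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag != NS + "page":
                continue
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            m = INFOBOX.search(text)
            if m:
                yield elem.findtext(NS + "title"), dict(FIELD.findall(m.group(1)))
            elem.clear()  # free memory as we stream

for title, fields in songs("enwiki-pages-articles.xml.bz2"):
    print(title, fields.get("Artist"), fields.get("Genre"))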
If this is not the correct mailing list, could you please point me to the
right mailing list for data dumps of song information?
Thanks and regards,
Venkatesh