Hi,
I don't know if this issue has come up already; in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results produced a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other can read. But when it comes to
decompression, only pbzip2-compressed archives are good for pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for these people.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host (see the sketch below).
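A minimal Python sketch (the file names are placeholders) of how a
consumer could opportunistically prefer the parallel tool and fall back
to plain bzip2:

import shutil
import subprocess

def decompress_dump(archive, output):
    """Decompress a .bz2 dump, preferring parallel pbzip2 if installed."""
    # pbzip2 -d parallelizes only the multi-stream archives that pbzip2
    # itself creates; plain bzip2 reads those archives too, so falling
    # back is always safe.
    tool = shutil.which("pbzip2") or shutil.which("bzip2")
    if tool is None:
        raise RuntimeError("neither pbzip2 nor bzip2 found on PATH")
    with open(output, "wb") as out:
        subprocess.run([tool, "-dc", archive], stdout=out, check=True)

decompress_dump("enwiki-pages-articles.xml.bz2", "enwiki-pages-articles.xml")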
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Hi,
I am trying to extract translations from Wiktionaries in different languages.
Currently I use the "All pages, current versions only" dump. Is there a way
to find out the language template tags (is that the correct term?) for each
Wiktionary and each language?
For example:
This is the Hungarian page 'karcsu' (slim, slender)
http://hu.wiktionary.org/wiki/karcs%C3%BA (the edit page:
http://hu.wiktionary.org/w/index.php?title=karcs%C3%BA&action=edit)
The translation table always (?) starts like this:
{{-ford-}}
{{trans-top}}
*{{en}}: {{t|en|slim}}, {{t|en|slender}}
Where {{-ford-}} comes from the word forditas (translation in Hungarian; I
skipped the accents). The translations look like the third line and
(hopefully) contain the other languages' wiki codes (en, fr, de).
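Assuming the {{t|xx|word}} pattern quoted above holds, a rough Python
sketch for pulling translation pairs out of such wikitext (the optional
[+-] in the pattern is a guess covering the {{t+|...}} / {{t-|...}}
variants seen on some Wiktionaries):

import re

# Matches translation templates of the form {{t|<lang>|<word>}}.
TRANS_RE = re.compile(r"\{\{t[+-]?\|([a-z-]+)\|([^|}]+)")

wikitext = """{{-ford-}}
{{trans-top}}
*{{en}}: {{t|en|slim}}, {{t|en|slender}}"""

for lang, word in TRANS_RE.findall(wikitext):
    print(lang, word)  # -> en slim / en slender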
Also, on the page 'slim' in the Hungarian Wiktionary there are some tags
which nobody would understand unless they are Hungarian and have
learned some Hungarian grammar.
http://hu.wiktionary.org/wiki/slim and
http://hu.wiktionary.org/w/index.php?title=slim&action=edit
The first line is:
{{engmell|comp=slimmer|sup=slimmest|pron=/slɪm/|audio=us}}
Where 'engmell' is derived from 'english melleknev', melleknev meaning
adjective in Hungarian. The rest is similarly confusing.
It gets even more confusing if I look at other Wiktionaries. It seems that
there are no standards that all Wiktionaries follow.
Is this meta-information available somewhere?
I hope I managed to explain it clearly and I am asking on the right list.
Thank you in advance,
Judit Acs
Hello,
I am new to this list and have a question about importing XML dumps from
Wikipedia (http://dumps.wikimedia.org/enwiki/20121101/) into an offline
MediaWiki database. I have locally installed XAMPP on Windows 8 and replaced
the included 32-bit MySQL version with the latest 64-bit version. I then
installed MediaWiki 1.20.0 with an empty database.
When trying to import an XML dump (the Nov 2012 dump) with importDump.php in the
maintenance folder of the MediaWiki installation, I get the following error
after about 2 seconds:
"WikiRevision given a null title in import. You may need to adjust
$wgLegalTitleChars." which is thrown at line 1032 in Import.php, because
some $title seems to be null. Replacing the exception with "$this->title =
null" (evil ^^) leads to other errors.
xml2sql and mwdumper seem to be outdated as I cannot get them working with
the current dumps. Special:Import is not an option due to the size of the
XML files.
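For reference, a streaming scan along these lines could at least
identify the offending records before import (a Python 3 sketch; adjust
NS to whatever schema version the dump's root element declares):

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.7/}"  # check your dump's root element

def pages_with_null_title(path):
    """Yield ids of <page> elements whose <title> is missing or empty."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                if not elem.findtext(NS + "title"):
                    yield elem.findtext(NS + "id")
                elem.clear()  # keep memory bounded on multi-GB dumps

for pid in pages_with_null_title("enwiki-20121101-pages-articles.xml.bz2"):
    print("page with null/empty title, id =", pid)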
Any help would be appreciated :)
P.S. It's not the missing + in $wgLegalTitleChars, which is what a Google
search on that error suggests.
Best Regards
Chris
Yay, the network (or NFS) performance issues on your.org seem to have
been straightened out, and last month's full dump is available; this
month's is running now.
Ariel
This dump is failing and due to our MediaWiki config setup on the
production cluster we don't get the exception message so I have no idea
what the problem is. I'll do some live hacks and look at this tomorrow.
Thanks for your patience.
Ariel
I am looking to create a script for producing manual dumps of those
wikis that either don't or won't publish their own dumps and that I
don't have server access to. To that end I am writing a Python dump
creator; however, I would like to ensure that my format matches the
existing one. I could reverse-engineer it by looking at multiple
different dumps, but that takes a lot of time and is not foolproof.
Is there documentation on exactly how the XML dumps are formatted,
and if so, where can I get it?
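For what it's worth, the format is pinned down by the export XSD that
every dump references in its root element (e.g.
http://www.mediawiki.org/xml/export-0.8.xsd). As a rough, trimmed-down
Python sketch of the page/revision skeleton such a generator has to
emit (see the XSD for the full <siteinfo> and <revision> field sets):

from xml.sax.saxutils import escape

def write_minimal_dump(out, pages):
    """Write a skeletal MediaWiki XML dump; pages yields tuples of
    (title, ns, page_id, rev_id, timestamp, text)."""
    out.write('<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"'
              ' version="0.8" xml:lang="en">\n')
    # Real dumps open with a <siteinfo> block (sitename, namespaces, ...)
    # whose required children are listed in the XSD.
    for title, ns, page_id, rev_id, timestamp, text in pages:
        out.write("  <page>\n")
        out.write("    <title>%s</title>\n" % escape(title))
        out.write("    <ns>%d</ns>\n" % ns)
        out.write("    <id>%d</id>\n" % page_id)
        out.write("    <revision>\n")
        out.write("      <id>%d</id>\n" % rev_id)
        out.write("      <timestamp>%s</timestamp>\n" % timestamp)
        out.write('      <text xml:space="preserve">%s</text>\n' % escape(text))
        out.write("    </revision>\n")
        out.write("  </page>\n")
    out.write("</mediawiki>\n")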
Hi,
I am trying to create a project that has the above-mentioned information
and can be used to correct the metadata for songs.
Is there a data dump available that contains song data like song title,
singer, duration, music director, genre, album/movie, language and country
information?
If there is no such dump available, is there any tool to extract that
information from the entire pages-articles.xml dump of the English
Wikipedia?
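One approach, sketched under the assumption that the songs of interest
carry an {{Infobox song}} or {{Infobox single}} template (the infobox
field names vary between articles, so the ones below are only examples):

import bz2
import re
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.8/}"  # match your dump's schema
INFOBOX = re.compile(r"\{\{Infobox (?:song|single)(.*?)\n\}\}", re.S | re.I)
FIELD = re.compile(r"\|\s*(\w+)\s*=\s*(.+)")

def songs(path):
    """Yield (page title, infobox fields) for pages with a song infobox."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag != NS + "page":
                continue
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            m = INFOBOX.search(text)
            if m:
                yield elem.findtext(NS + "title"), dict(FIELD.findall(m.group(1)))
            elem.clear()  # free memory as we stream

for title, fields in songs("enwiki-pages-articles.xml.bz2"):
    print(title, fields.get("Artist"), fields.get("Genre"))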
If this is not the correct mailing list, could you please point me to the
right mailing list for data dumps of song information?
Thanks and regards,
Venkatesh