I don't know whether this issue has come up already - in case it did
and was dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to
compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible for
compression: each one can create archives the other can read. When it
comes to decompression, however, pbunzip2 only handles pbzip2-compressed
archives well.
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to better
use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so
everything should keep working as usual for them.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
machine.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
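For illustration, here is a minimal sketch (Python, calling the tools
via subprocess) of how a dump could be compressed with pbzip2 where it
is installed, falling back to plain bzip2 otherwise. The flags shown are
the standard ones, but treat the exact invocation as an assumption
rather than a recipe:

    import shutil
    import subprocess
    import sys

    def compress_dump(path, threads=4):
        """Compress `path` with pbzip2 if installed, otherwise with bzip2.

        pbzip2 splits the input into independent blocks, so the resulting
        archive can later be decompressed in parallel by pbunzip2 while
        remaining readable by ordinary bunzip2.
        """
        if shutil.which("pbzip2"):
            # -p selects the number of processors, -9 the block size / level.
            cmd = ["pbzip2", "-9", "-p{}".format(threads), path]
        else:
            cmd = ["bzip2", "-9", path]
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        compress_dump(sys.argv[1])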
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com - Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek - Registered office: Fürth
Commercial register: AG Fürth, HRB 9201
I'm working on a research project aimed at crawling web data in nearly
digitally extinct languages, and I would like to use the test wikipedias'
articles as input. Unfortunately, I haven't been able to find the dumps
of these wikipedias. Do they exist somewhere, or am I missing something? If
not, are you planning to create a dump? I would like to avoid having to
download all of them manually.
Thanks in advance,
Computer and Automation Research Institute of the
Hungarian Academy of Sciences
Folks who are interested in downloading tarballs of media for their
particular project can now do so from:
In this directory you will see two subdirectories, "fulls" and "incrs".
The way this works is that once a month, near the beginning of the month,
we will produce a series of "full" tarballs for each project, covering
media uploaded locally and media stored on commons. During the month, at
least once but hopefully twice, we will produce incremental tarballs
containing the files uploaded locally or included from commons since the
"full" tarball was produced.
No tarballs are being produced for commons itself, given that it's 14T
and there would be no separate locally uploaded/remotely uploaded lists.
Instead please use rsync to get those files directly from:
Also, please bear in mind that for author and license information you
should download the corresponding pages-meta-current.xml.bz2 file from
http://dumps.wikimedia.org/ or your local mirror, and check the
corresponding File: description pages for your project or commons.
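As a rough illustration of that last step, here is a small Python sketch
that streams a pages-meta-current.xml.bz2 dump and yields the File:
description pages (namespace 6), where author and license information
normally lives. The file name is just an example, and the tag handling
assumes the standard MediaWiki XML export format:

    import bz2
    import xml.etree.ElementTree as ET

    def iter_file_pages(dump_path):
        # Stream the dump and yield (title, wikitext) for pages in ns 6 (File:).
        with bz2.BZ2File(dump_path) as f:
            title = ns = text = None
            for _event, elem in ET.iterparse(f):
                tag = elem.tag.rsplit("}", 1)[-1]   # strip the export XML namespace
                if tag == "title":
                    title = elem.text
                elif tag == "ns":
                    ns = elem.text
                elif tag == "text":
                    text = elem.text
                elif tag == "page":
                    if ns == "6":
                        yield title, text
                    elem.clear()                    # keep memory usage flat

    for title, text in iter_file_pages("commonswiki-pages-meta-current.xml.bz2"):
        print(title)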
And now it's time as usual for the Big Fat Disclaimer:
We're still running these manually; it's possible that we won't make the
planned schedule for a given run or runs, network connectivity might be
unstable, etc. etc. It's also possible we will restructure the fulls so
that they take a lot less space and rely on three series of tarballs for
each project.
Thanks once again to your.org for donating the time, space and bandwidth
to make this possible.
For a dissertation study, I am trying to find a reliable data source from
which I can extract user account events, specifically the creation and
blocking of user accounts, with usernames, the event name, and timestamps.
Is such data available? If yes, could anybody point me to where I can get it from?
Thanks a lot
On Sunday, I posted the following to the Analytics mailing list,
but didn't see any response there, so I'm reposting here.
At the Berlin hackathon, I improved the script I wrote in December
for compiling statistics on external links. My goal is to learn how many
links Wikipedia has to a particular website, and to monitor this over time.
I figure this might be interesting for GLAM cooperations.
This is found in the external links table, but since I want to filter out
links from talk and project pages, I need to join it with the page table,
where I can find the namespace. I've tried the join on the German Toolserver,
and it works fine for the minor wikis, but it tends to time out (beyond
30 minutes) for the ten largest Wikipedias. This is not because I fail to
use indexes, but because I want to run a substring operation on millions
of rows. Even an optimized query takes some time.
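For context, the query in question looks roughly like this (a sketch
only; I am assuming the standard externallinks/page schema here, and the
connection details and the exact domain-extraction expression are
placeholders):

    import pymysql   # any MySQL driver will do; connection details are placeholders

    conn = pymysql.connect(read_default_file="~/.my.cnf", db="dewiki_p")

    # Count external links per domain, restricted to some content namespaces.
    # The SUBSTRING_INDEX() call is the per-row string operation that makes
    # this slow on the largest wikis, even when the join itself uses indexes.
    query = """
        SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(el_to, '/', 3), '/', -1) AS domain,
               COUNT(*) AS links
        FROM externallinks
        JOIN page ON page_id = el_from
        WHERE page_namespace IN (0, 6, 100)
        GROUP BY domain
        ORDER BY links DESC
    """

    with conn.cursor() as cur:
        cur.execute(query)
        for domain, links in cur.fetchall():
            print(domain, links)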
As a faster alternative, I have downloaded the database dumps, and processed
them with regular expressions. Since the page ID is a small integer, counting
from 1 up to a few million, and all I want to know for each page ID is
whether or not it belongs to a content namespace, I can make do with a bit vector
of a few hundred kilobytes. When this is loaded, and I read the dump of the
external links table, I can see if the page ID is of interest, truncate the
external link down to the domain name, and use a hash structure to count the
number of links to each domain. It runs fast and has a small RAM footprint.
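My script is Perl, but the idea translates into a short Python sketch
like the one below. The file names, the namespace list and the very
loose parsing of the SQL dumps are simplifications; it is meant to show
the two passes (bit vector, then counting), not to be the actual
implementation:

    import gzip
    import re
    from collections import Counter
    from urllib.parse import urlsplit

    CONTENT_NS = {0, 6, 100}               # main, File:, Portal: -- adjust per wiki

    content = bytearray(512 * 1024)        # one bit per page ID; grown on demand

    def mark(pid):
        byte = pid >> 3
        if byte >= len(content):
            content.extend(bytes(byte + 1 - len(content)))
        content[byte] |= 1 << (pid & 7)

    def marked(pid):
        byte = pid >> 3
        return byte < len(content) and content[byte] & (1 << (pid & 7))

    # Pass 1: read the page table dump; rows look like (page_id, namespace, 'title', ...).
    page_row = re.compile(r"\((\d+),(-?\d+),'")
    with gzip.open("dewiki-latest-page.sql.gz", "rt", errors="replace") as f:
        for line in f:
            for page_id, ns in page_row.findall(line):
                if int(ns) in CONTENT_NS:
                    mark(int(page_id))

    # Pass 2: read the externallinks dump, keep links whose source page is marked,
    # truncate each URL to its domain name, and count links per domain.
    counts = Counter()
    link_row = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)'")
    with gzip.open("dewiki-latest-externallinks.sql.gz", "rt", errors="replace") as f:
        for line in f:
            for el_from, url in link_row.findall(line):
                if marked(int(el_from)):
                    counts[urlsplit(url).netloc] += 1

    for domain, n in counts.most_common(20):
        print(n, domain)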
In December 2011 I downloaded all the database dumps I could find, and
uploaded the resulting statistics to the Internet Archive, see e.g.
One problem, though, is that I don't get links to Wikisource or Wikiquote
this way, because they are not in the external links table. Instead they are
interwiki links, found in the iwlinks table. The improvement I made in Berlin
is that I now also read the interwiki prefix table and the iwlinks table.
It works fine.
One issue here is the definition of content namespaces. Back in December,
I decided to count links found in namespaces 0 (main), 6 (File:),
Portal, Author and Index. Since then, the concept of "content namespaces"
has been introduced, as part of refining the way MediaWiki counts articles
in some projects (Wiktionary, Wikisource), where the normal definition
(all wiki pages in the main namespace that contain at least one link)
doesn't make sense. When Wikisource, using the ProofreadPage extension,
adds a lot of scanned books in the Page: namespace, this should count as
content, despite these pages not being in the main namespace, and whether
or not the pages contain any link (which they most often do not).
One problem is that I can't see which namespaces are "content" namespaces
in any of the database dumps. I can only see this from the API.
The API only provides the current value, which can change over time. I can't
get the value that was in effect when the database dump was generated.
Another problem is that I want to count links that I find in the File:
(ns=6) and Portal: (mostly ns=100) namespaces, but these aren't marked as
content namespaces by the API. Shouldn't they be?
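For reference, this is roughly how the namespace information can be read
from the API (a sketch; as far as I can tell, content namespaces are the
entries carrying a "content" key in the siteinfo reply, but the exact
JSON layout may vary between MediaWiki versions):

    import json
    from urllib.request import urlopen

    # Ask the API which namespaces exist and which are flagged as "content".
    url = ("https://en.wikipedia.org/w/api.php"
           "?action=query&meta=siteinfo&siprop=namespaces&format=json")
    data = json.loads(urlopen(url).read().decode("utf-8"))

    content_ns = []
    for ns_id, info in data["query"]["namespaces"].items():
        if "content" in info:                # content namespaces carry this key
            content_ns.append(int(ns_id))

    # For most wikis this prints just [0]; File: (6) and Portal: (100) are absent.
    print(sorted(content_ns))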
Is anybody else doing similar things? Do you have opinions on what should
count as content? Should I submit my script (300 lines of Perl) somewhere?
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
I'm finding strange issues when trying to decompress the 7z version of this dump for the French Wikipedia:
At some point, around 3M revisions, the 7z process stalls. After a long
time (a few hours) it recovers normal execution, but then it stalls again
around 55M revisions and never resumes normal progress.
Maybe there are some issues with the frwiki dumps, since I can see that
subsequent runs are experiencing failures (in May and June).
I'm now checking with the previous dump (http://dumps.wikimedia.org/frwiki/20120404/). I'll let you know in case I find any more problems.