Hi,
don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are compatible in both directions: each one can create
archives that the other one can read. But when it comes to decompressing,
only pbzip2-compressed archives are handled well by pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to better
usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2, so
everything should run as usual for these people (see the sketch after
this list).
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the host.
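To illustrate point 2 from the consumer side, here is a minimal sketch,
assuming a pbzip2-compressed dump named pages-articles.xml.bz2 (the
filename is only an example). pbzip2 writes a series of concatenated
bzip2 streams, and a standard decompressor such as Python's bz2 module
reads them transparently, just like plain bunzip2 does:

    # Minimal sketch: read a pbzip2-compressed dump with a standard
    # bzip2 decompressor. The filename is an example, not a real dump name.
    # Python's bz2 module (3.3+) handles the multi-stream files that
    # pbzip2 produces, just like plain bunzip2 does.
    import bz2

    with bz2.open("pages-articles.xml.bz2", "rt", encoding="utf-8") as dump:
        for line in dump:
            if "<title>" in line:
                print(line.strip())
                break  # only a demo, stop at the first title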
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units
Managing Director: Richard Jelinek
Registered office: Fürth
Commercial register: AG Fürth, HRB-9201
We were shuffling db passwords around this morning, and that change has
not made it out completely to the dump hosts. I'm taking the
opportunity to clean up the way that configuration is handled in the
scripts and we'll be back in business a little bit later today.
Ariel
On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chintoanu(a)skobbler.com>
wrote:
> I have a list of about 1.8 million images which I have to download from
> commons.wikimedia.org. Is there any simple way to do this which doesn't
> involve an individual HTTP hit for each image?
You mean full size originals, not thumbs scaled to a certain size, right?
You should rsync from a mirror[0] (rsync allows specifying a list of files
to copy) and then fill in the missing images from upload.wikimedia.org;
for upload.wikimedia.org I'd say you should throttle yourself to 1 cache
miss per second (you can check the headers on a response to see if it was
a hit or a miss, and back off when you get a miss), and you shouldn't use more than
one or two simultaneous HTTP connections. In any case, make sure you have
an accurate UA string with contact info (email address) so ops can contact
you if there's an issue.
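A minimal sketch of the fill-in step, under some assumptions of mine: the
requests library, the X-Cache response header as the hit/miss indicator,
and arbitrary backoff times. (For the mirror step itself, rsync's
--files-from option takes the list of files.)

    # Sketch: fetch missing originals one at a time from upload.wikimedia.org,
    # backing off when the response looks like a cache miss. The X-Cache header
    # check and the sleep times are assumptions for illustration.
    import time
    import requests  # third-party: pip install requests

    HEADERS = {
        # Accurate UA string with contact info so ops can reach you.
        "User-Agent": "example-image-fetcher/0.1 (mailto:you@example.org)",
    }

    def fetch(urls):
        with requests.Session() as session:
            session.headers.update(HEADERS)
            for url in urls:
                resp = session.get(url, timeout=60)
                resp.raise_for_status()
                yield url, resp.content
                x_cache = resp.headers.get("X-Cache", "")
                time.sleep(2.0 if "miss" in x_cache.lower() else 0.2)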
At the moment there's only one mirror and it's ~6-12 months out of date, so
there may be a substantial amount to fill in. And of course you should be
getting checksums from somewhere (the API?) and verifying them. If your
images are all missing from the mirror then it should take around 40 days
at 0.5 img/sec, but I guess you could probably do it in less than 10 days if
you have a fast enough pipe (it depends on whether you get a lot of misses or hits).
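For the checksum part, one possibility (an assumption on my part, not the
only way) is to ask the API for each file's SHA-1 via prop=imageinfo and
compare it against a hash of the local file, roughly like this:

    # Sketch: compare a downloaded original against the SHA-1 reported by
    # the Commons API. File names and error handling are illustrative only.
    import hashlib
    import requests  # third-party: pip install requests

    API = "https://commons.wikimedia.org/w/api.php"

    def api_sha1(filename):
        """SHA-1 (hex) that the API reports for File:<filename>."""
        params = {
            "action": "query",
            "titles": "File:" + filename,
            "prop": "imageinfo",
            "iiprop": "sha1",
            "format": "json",
        }
        resp = requests.get(API, params=params, timeout=60)
        resp.raise_for_status()
        page = next(iter(resp.json()["query"]["pages"].values()))
        return page["imageinfo"][0]["sha1"]

    def verify(local_path, filename):
        with open(local_path, "rb") as f:
            local_sha1 = hashlib.sha1(f.read()).hexdigest()
        return local_sha1 == api_sha1(filename)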
See also [1], but not all of that applies because upload.wikimedia.org isn't
MediaWiki, so e.g. there's no maxlag param.
-Jeremy
[0]
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
[1] https://www.mediawiki.org/wiki/API:Etiquette
Dear all,
I'd like to extract information from the infobox and textual definition,
but I realize that some entries are not fully contained in the French
dump, compared with the interactive pages.
For instance, the entry "Los Angeles" (the city in California, not
the one in Chile) is incomplete.
Only the first two paragraphs are there. The infobox and the rest of the
page are missing.
I noticed a call to the "q" template, like this: {{q|Los Angeles}}, which
states that this entry is of good quality.
My questions are:
* do you think that it is a bug?
* is there a relation between the "q" call and the fact that the entry
is partial?
* where is it possible to get the infobox for Los Angeles?
Thanks in advance,
Gil Francopoulo
Tagmatica/Spotter/CNRS
Hi,
Are the values in the pr_id and log_id columns equivalent? I'm trying to
select all changes in edit protection status for Wikipedia articles, but
the page_restrictions table doesn't contain a timestamp, and the logging
table doesn't specify the kind of protection, so I'm trying to join them
somehow...
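For illustration, the kind of join I'm attempting looks roughly like this
(pymysql and the join condition pr_page = log_page are assumptions of mine;
that condition is exactly what I'm unsure about):

    # Rough illustration of the join I'm attempting (pymysql: pip install pymysql).
    # Connection details and the join condition (pr_page = log_page) are
    # placeholders/assumptions, not something I know to be correct.
    import pymysql

    conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                           database="wikidb", charset="utf8mb4")
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT pr.pr_page, pr.pr_type, pr.pr_level, l.log_timestamp
                FROM page_restrictions AS pr
                JOIN logging AS l ON l.log_page = pr.pr_page
                WHERE l.log_type = 'protect'
            """)
            for row in cur.fetchall():
                print(row)
    finally:
        conn.close()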
Thanks!
--
Xavi
Hi,
after a month of work on my GSoC project, Incremental Dumps [1], I think I
now have something worth sharing and talking about, though it's still far
from complete.
What the code can do now is to read a pages-history XML dump and create the
various kinds of dumps (pages/stub, current/history) in the new format from
that.
It can then convert a dump in the new format back to XML.
The XML output is almost the same as existing XML dumps, but there are some
differences [2].
The new format also now has a detailed specification [3] (it describes the
current version; the format is still in flux and can change daily).
If you want, you can also try running the code. [4]
It's not production-quality yet (e.g. it doesn't report errors properly),
but it should work.
Compilation instructions are in the README file.
Any comments or questions are welcome.
Petr Onderka
User:Svick
[1]: http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps
[2]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/XML_…
[3]:
http://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/File_format/Spec…
[4]: https://github.com/wikimedia/operations-dumps-incremental/tree/gsoc
Hi all,
Back in March, I remember there was some experimentation with releasing dumps in
TSV format. I downloaded a bunch of files and just recently imported the
pagelinks table without any problem. Are there any plans to continue
releasing TSV dumps?
Best,
--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gciampag(a)indiana.edu
I need the Wikipedia dump from 2011-07-22 (from which DBpedia 3.7 was
extracted). It is no longer available from the official Wikipedia
dumps page. Can you please point me to a place to download it from?
Preferably a non-torrent version.
Thanks,
Mohamed
Hi, it seems like all dump processes failed and stopped sometime
yesterday. What happened, and is there any prognosis for when the
dumping will resume?
Regards,
- Byrial