Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible:
each one can create archives the other one can read. But when it comes
to decompressing, only pbzip2-compressed archives work well with
pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run as usual for those users (see the sketch below).
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
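For what it's worth, on point 2: pbzip2 writes its output as a series
of independent bzip2 streams, so any compliant decompressor can read
it. A minimal sketch of an unchanged downstream consumer, assuming
Python 3.3+ (whose bz2 module reads multi-stream files transparently)
and the usual dump file name:

    import bz2

    # pbzip2 output is a concatenation of independent bzip2 streams;
    # bz2.open (Python 3.3+) decompresses multi-stream files
    # transparently, so consumers need no changes.
    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt",
                  encoding="utf-8") as f:
        for line in f:
            pass  # process the XML exactly as before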
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
I created mw:Extension:InterwikiExistence
(https://www.mediawiki.org/wiki/Extension:InterwikiExistence),
which imports a file of Wikipedia page titles and then polls the Wikipedia
API to keep the local list of Wikipedia's pages up to date. But it helps if
it knows what timestamp to start its polling at. Specifically, system
administrators installing the extension need to know the date/time at which
the AllPages snapshot gzipped as
enwiki-latest-all-titles.gz
(http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles.gz)
was generated; that way, the API poll function can begin with the right
value in the rcstart (recentchanges) and lestart (logevents) parameters.
Can system administrators rely on the "Last Modified" date/time of the file
as that snapshot date/time? Or is a better date/time to use listed
somewhere else? Thanks.
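In case it helps to make this concrete, here is the kind of bootstrap
I have in mind - a minimal sketch, assuming the "Last Modified" header
is indeed the snapshot time, Python 3, and the third-party requests
library:

    import requests
    from email.utils import parsedate_to_datetime

    URL = ("http://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-all-titles.gz")

    # Fetch only the headers; Last-Modified is set by the dump server.
    resp = requests.head(URL, allow_redirects=True)
    last_modified = parsedate_to_datetime(resp.headers["Last-Modified"])

    # The API expects ISO 8601 timestamps, e.g. 2013-10-01T00:00:00Z,
    # usable as the rcstart and lestart values.
    print(last_modified.strftime("%Y-%m-%dT%H:%M:%SZ"))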
Good evening, I hope this is an appropriate place for this question.
I've been trying to import the current enwiki dump from October. Using
MWDumper, it runs fine until hitting 247,000 records, and then MySQL
throws one of the following two errors:
Error 2006 (I think it was): MySQL server has gone away
or
Error 2013: Lost connection to MySQL server
I've tried several things on the MySQL end, including increasing
innodb_log_file_size and max_allowed_packet - all suggestions from a
few web resources. Still to no avail.
I'm not sure where the true issue is or how to go about correcting it.
Any help would be greatly appreciated.
I'm running the latest version of MediaWiki, importing the October
2013 dump, with MySQL 5.6.12 via WAMP and MWDumper 1.16.
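For reference, these are the sorts of my.ini settings I've been
adjusting - the values below are just the guesses those web resources
suggested, not known-good numbers:

    [mysqld]
    # large revision texts can exceed the default packet size
    max_allowed_packet = 256M
    innodb_log_file_size = 512M
    # give long-running import statements more time before the server
    # drops the connection
    wait_timeout = 28800
    net_read_timeout = 600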
thanks in advance
Hi
I have a set of list page titles that I've extracted from the
categorylinks dump (where "cl_from" is of type "page").
http://www.mediawiki.org/wiki/Manual:Categorylinks_table
Now I want to extract the CONTENT of the page from the pages dump
enwiki-latest-pages-articles.xml
Although there are guidelines on how editors should title these pages
(http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lists):
("The titles of list articles typically begin with the type of list it is (List of, Index of, etc.), followed by the article's subject; like: List of vegetable oils.")
the majority of the time this rule is not followed. So my concrete question is:
- If I am consuming the pages-articles.xml dump (D) page by page, and I have a list of pages (L) that I've extracted from the categorylinks (C) dump, how can I check whether a page in dump D is a member of L? The titles do not match up directly.
For instance, if I have the page title "List of the longest Asian rivers" (http://en.wikipedia.org/wiki/List_of_the_longest_Asian_rivers), then what in that page's content (http://en.wikipedia.org/w/index.php?title=List_of_the_longest_Asian_rivers&…) can tell me it is the same page, "List of the longest Asian rivers"? Non-list pages appear to place the title as the first token with ''' markings.
Any suggestions of a robust solution would be much appreciated.
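To make the question concrete, here is roughly what I have in mind - a
minimal sketch, assuming the <title> element of each <page> in the dump
is authoritative (so there should be no need to parse the ''' first
token), that categorylinks titles use underscores where the XML uses
spaces, and a hypothetical list_titles.txt holding L:

    import xml.etree.ElementTree as ET

    # namespace of the 2013 dumps; check the <mediawiki> root element
    NS = "{http://www.mediawiki.org/xml/export-0.8/}"

    def normalize(title):
        # categorylinks stores underscores; the XML dump uses spaces
        return title.replace("_", " ").strip()

    # L: the titles extracted from the categorylinks dump
    with open("list_titles.txt", encoding="utf-8") as f:
        wanted = {normalize(t) for t in f}

    # D: stream the pages dump and test membership on the <title> tag
    for _, elem in ET.iterparse("enwiki-latest-pages-articles.xml"):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            if title and normalize(title) in wanted:
                text = elem.findtext(NS + "revision/" + NS + "text")
                # ... process the list page's content here ...
            elem.clear()  # keep memory bounded

The point being that the dump already carries the canonical title for
each page, so the matching never has to look at the wikitext itself.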
Best
Paul A. Houle, 15/10/2013 00:35:
> I’d like to see the Commons backups available in the AMZN S3 cloud, even
> if it is only as “requester pays”. Frankly, my experience is that
> getting data from the Internet Archive is so slow that I wonder if they
> are on the Moon.
When did you last try? They recently increased their bandwidth.
> My infovore framework
> AMZN has had a policy of offering free S3 storage for public data sets –
> I’d like to see them take this program to the next level with data sets
> of this nature.
It seems anyone can request it; in any case, I sent an inquiry.
The datasets they have (XML dumps) are very outdated:
https://aws.amazon.com/datasets/Encyclopedic/4182
https://aws.amazon.com/datasets/Encyclopedic/2506
https://aws.amazon.com/datasets/Encyclopedic/2596
Anybody can ask other mirrors as well (I already asked GARR); some ideas are at
https://sourceforge.net/apps/trac/sourceforge/wiki/Mirrors
Nemo
Hey,
I'm setting up the database from the Wikipedia XML dumps. As you know,
if we import all the revision dumps, the database may grow to
approximately 22-25 terabytes. This size is huge; what is the
workaround for it?
If it is necessary to import all the XML dumps into the DB, then I
think common Linux filesystems such as ext4 only support files up to
16 TB. So how can we import into a single MySQL server deployed on
Linux?
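One workaround I am considering - just a sketch, assuming InnoDB - is
to give each table its own tablespace file, so that no single file has
to approach the filesystem's per-file limit:

    [mysqld]
    # store each table in its own .ibd file instead of one shared
    # ibdata file, so no single file holds the whole dataset
    innodb_file_per_table = 1

Even then, a single huge table (e.g. the revision text) could hit the
per-file limit on its own, so it may additionally need partitioning;
with file-per-table enabled, each partition gets its own file.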
Could someone also tell me how Wikipedia itself currently manages such
huge data in terms of software and hardware solutions?
Regards,
Imran