Hi,
don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible for
compression: each one can create archives that the other can read. But
when it comes to decompressing, only pbzip2-compressed archives are
well suited for pbunzip2.
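To illustrate, the compatibility matrix described above can be
reproduced with something along these lines (file names are
placeholders and the thread count is chosen arbitrarily):
(shell) bzip2  -c pages.xml > pages-bzip2.xml.bz2       # single-stream archive, as the dumps are built today
(shell) pbzip2 -p4 -c pages.xml > pages-pbzip2.xml.bz2  # multi-stream archive, 4 compression threads
(shell) bunzip2  -c pages-pbzip2.xml.bz2 > /dev/null    # plain bunzip2 reads the pbzip2 archive fine
(shell) pbunzip2 -c pages-bzip2.xml.bz2  > /dev/null    # the bzip2-made archive: the case the bug report is about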
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that only because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com    | Geschäftsführer: Richard Jelinek
Human Language Technology Experts | Sitz der Gesellschaft: Fürth
69216618 Mind Units               | Registergericht: AG Fürth, HRB-9201
Dear Ariel,
0) Context
I am working towards the next release for WP-MIRROR (0.8) and that entails
updating `mwxml2sql' from 1.24 to 1.26.
1) Database Schema for `page' table
There are three fields in the `page' table that require attention:
`page_counter', `page_no_title_convert', and `page_lang'.
These fields are present (YES) or not (NO) in three different places:
MediaWiki 1.26:
(shell) less maintenance/tables.sql
`page_counter' NO
`page_no_title_convert' NO
`page_lang' YES
XML Dump:
(shell) zless simplewiki-20150603-page.sql.gz
`page_counter' YES
`page_no_title_convert' YES
`page_lang' NO
mwxml2sql:
(shell) mwxml2sql -m 1.24 -s simplewiki-20150603-stub-articles.xml.gz ...
(shell) less simplewiki-20150603-page.sql-1.24.gz
`page_counter' YES
`page_no_title_convert' NO
`page_lang' NO
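To double-check the above without paging through the files by hand,
something like this should work (a rough sketch; it assumes the CREATE
TABLE statement sits near the top of the dump file, as it usually does):
(shell) grep -E 'page_(counter|no_title_convert|lang)' maintenance/tables.sql
(shell) zcat simplewiki-20150603-page.sql.gz | \
        sed -n '/CREATE TABLE/,/ENGINE/p' | \
        grep -E 'page_(counter|no_title_convert|lang)'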
2) XML Dumps
Are there plans to remove `page_counter' and `page_no_title_convert' from
the XML dumps?
Are there plans to add `page_lang' to the XML dumps?
3) mwxml2sql
Do you have any suggestions for how I should update `mwxml2sql' for
MediaWiki 1.26?
Sincerely Yours,
Kent
Dear Ariel,
An update for the `mwxml2sql' utility has been pushed to gerrit for your
review.
1) Features
Extend the maximum allowed MediaWiki version to 1.26.
Extend the maximum allowed XML dump schema to 0.10.
2) Database schemata
The `page' table has been updated:
o page_counter is removed;
o page_links_updated is repositioned; and
o page_lang is added.
The `revision' table has been updated:
o rev_comment is resized (was tinyblob, now is varbinary(767)).
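For reference, the schema delta boils down to roughly the following
(a sketch only: the database name `simplewiki' and the exact column
types and positions reflect my reading of MediaWiki 1.26's
maintenance/tables.sql, so please double-check against that file):
(shell) mysql --host=localhost --user=root --password simplewiki <<'SQL'
-- page table: drop the removed field, reposition page_links_updated, add page_lang
ALTER TABLE page DROP COLUMN page_counter;
ALTER TABLE page MODIFY page_links_updated varbinary(14) DEFAULT NULL AFTER page_touched;
ALTER TABLE page ADD COLUMN page_lang varbinary(35) DEFAULT NULL;
-- revision table: widen rev_comment from tinyblob to varbinary(767)
ALTER TABLE revision MODIFY rev_comment varbinary(767) NOT NULL DEFAULT '';
SQL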
3) Test
(shell) mwxml2sql -m 1.22,1.23,1.24,1.25,1.26 \
-s simplewiki-20150603-stub-articles.xml.gz \
-t simplewiki-20150603-pages-articles.xml.bz2 \
-f simplewiki-20150603.gz
(shell) zcat simplewiki-20150603-createtables.sql-1.26.gz | \
mysql --host=localhost --user=root --password
(shell) zcat simplewiki-20150603-page.sql-1.26.gz | \
mysql --host=localhost --user=root --password
(shell) zcat simplewiki-20150603-revision.sql-1.26.gz | \
mysql --host=localhost --user=root --password
(shell) zcat simplewiki-20150603-text.sql-1.26.gz | \
mysql --host=localhost --user=root --password
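As a quick sanity check after loading (assuming the create-tables
script put everything into a database called `simplewiki'; substitute
whatever name it actually creates):
(shell) mysql --host=localhost --user=root --password simplewiki \
        -e 'SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision; SELECT COUNT(*) FROM text;'
The row counts should roughly match the page and revision counts
reported for the simplewiki dump run.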
Sincerely Yours,
Kent
Hi!
I have been trying to get a list of all .svg files on Wikimedia Commons.
Of course I could just crawl through Commons, but:
* if there were any way to provide a dump with the names of the
actually existing .svg files, that would be a tremendous help for me;
* and in my estimation it would reduce the download size, the CPU
burden, and most importantly the HD burden by about 70% to 80%
compared to browsing and parsing through Commons. (Whereas the CPU
usage and heat problems caused by the HD burden on my notebook would
be much more adverse than the burden on Wikipedia's servers, I assume.
I have already lost two HDs over the years while downloading larger
amounts of files in one go.)
Though I have asked in various places, so far I haven't found a good
solution.
One suggestion was to download commonswiki-20150417-all-titles, which I did.
But this file contains deleted names and renamed names, and some names
have a "File:" prefix or a similar indicator at the start while others
don't.
Checking just a small sample resulted in 5 correct names and around 7
deleted and 7 renamed ones. I have asked in various places, and one
person in particular tried to help me, but even he couldn't solve this.
Apart from that I didn't get much feedback.
Is it possible to get such a dump? Or to get another dump that I could
use to update and crosscheck the all-titles file?
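If there is no ready-made list, would filtering the image table dump be
an acceptable substitute? Something like the following untested sketch
(the filename is guessed from the 20150417 run, and I assume the first
field of each row in that dump is the file name, stored with
underscores and without the "File:" prefix):
(shell) wget https://dumps.wikimedia.org/commonswiki/20150417/commonswiki-20150417-image.sql.gz
(shell) zcat commonswiki-20150417-image.sql.gz | \
        grep -oiP "\('\K[^']*\.svg(?=')" > commons-svg-names.txt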
Greetings
D. Hansen
As people know, we're running all stubs dumps across all wikis first
and then doubling back to run the other dump steps. As we try to hash
out the details so that the order and contents of the dump stages work
out best for folks, please have a look at the discussion on the task
below and weigh in there. Thanks!
https://phabricator.wikimedia.org/T89273
Hi,
I am a PhD student in the Computer Science Department, University of
California, Santa Barbara. I would like to ask if anyone knows whether
Wikipedia still keeps the old English Wikipedia text snapshots.
From the website http://dumps.wikimedia.org/enwiki/, it seems that the
English snapshots are kept only for roughly the last 10 months. Does
anyone know if the snapshots from 201304 to 201404 are still available?
I downloaded these datasets previously to write a research paper, but
because of a disk failure I lost them... (such bad luck!). So I am
writing to ask whether I can still download the old versions. If so,
could anyone please share the links to those old datasets? It would be
a great help!
Thanks a lot!
Best regards,
Xin
--
Xin Jin,
PhD Candidate,
Computer Science Department,
University of California, Santa Barbara
To catch everyone who would stop reading right after the updates, let me
put the question first.
Who uses the abstract dumps? Anyone here? Anyone you know? Please
forward this to other lists where there might be users of these dumps.
We're trying to figure out whether we need to keep generating them or not.
Now the updates.
We got more space for the dumps server, which means we don't need to
reduce the number of dumps kept for some time. You'll also see other
items showing up there soon-ish, not part of the xml dumps.
We've long had a request to run stubs early on in the dumps process so
that stats can be produced right away, and we finally have that going.
As of this month all dump runs will be done in stages, stubs first, then
tables, then page logs, and then the rest. I'm open to negotiation
about the order of jobs after the stubs, if folks have other
preferences.
We've worked around the eternal php memory leak(s), which lets us now
run 7 workers for small wikis at once. This means we'll get through
those dumps quicker.
Nemo_bis did some testing with an option to 7zip that gives much faster
compression with a relatively small increase in size. I've adopted that
everywhere, and we should see the difference, primarily on the big
wikis, from this month on.
New code brings new bugs. This month's stub and page log runs for
smaller wikis may have a duplicate entry at the end, the last item
appearing twice. This has been fixed for all future runs. It shouldn't
have a real impact on stats but folks importing from these dumps should
be aware.
Happy June,
Ariel