Hi,
I don't know if this issue has come up already - in case it did and was
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results gathered a few hours before that.
The results indicate that bzip2 and pbzip2 are compatible in both
directions: each one can create archives the other can read. When it
comes to decompressing, however, only pbzip2-compressed archives work
properly with pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host (see the sketch below).
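To illustrate point 3: pbzip2 writes many independent bz2 streams back
to back, one per input block, so a parallel decompressor can hand each
stream to a different CPU, while plain bzip2 writes one long stream
that a naive parallel decompressor cannot split. A minimal Python
sketch of the idea (just an illustration, not from my test setup):

    import bz2

    def multistream_compress(data, block_size=900000):
        # Compress each block as its own bz2 stream, as pbzip2 does.
        streams = [bz2.compress(data[i:i + block_size])
                   for i in range(0, len(data), block_size)]
        return b"".join(streams)

    payload = b"example " * 500000
    archive = multistream_compress(payload)
    # Standard decompressors accept concatenated streams, so regular
    # bunzip2 (and Python's bz2 module) read this output just fine.
    assert bz2.decompress(archive) == payload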
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Ack, sorry for the (no subject); again in the right thread:
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and goes at 30 MB/s on my box,
which is still 8x faster than the status quo (going by a 1 GB
benchmark).
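For reference, a minimal sketch of that setting via Python's lzma
module (file names here are just placeholders; plain xz -3 on the
command line does the same thing):

    import lzma
    import shutil

    # Compress with LZMA preset 3 -- the xz -3 / 7za -mx=3 level,
    # i.e. a 4 MB dictionary.
    with open("pages-articles.xml", "rb") as src, \
            lzma.open("pages-articles.xml.xz", "wb", preset=3) as dst:
        shutil.copyfileobj(src, dst)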
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
Dear Ariel,
Thank you for your guidance. I pushed another change to gerrit for
review that should address the issue of the new `page_links_updated'
field.
Sincerely Yours,
Kent
On 2/7/14, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> Last reply: I double-checked the content/format model stuff, and the
> only nagging question I have remaining is how well it works with
> non-text handlers. But that would not be a new issue, and the code for
> the base case is certainly correct. So I think we are down to just the
> page_links_updated variable for > 1.22 and that would do it.
>
> Ariel
>
> On 05-02-2014, Wed, at 02:43 -0500, wp mirror wrote:
>> Dear Ariel,
>>
>> I have been reading your code for `mwxml2sql-0.0.2' with a view
>> towards updating it for mediawiki-1.23 LTS.
>>
>> 0) Support status
>>
>> Currently, the version info for `mwxml2sql' states the following:
>>
>> (shell)$ mwxml2sql --version
>> mwxml2sql 0.0.2
>> Supported input schema versions: 0.4 through 0.8.
>> Supported output MediaWiki versions: 1.5 through 1.21.
>>
>> 1) Current input schema version
>>
>> Currently, your XML dump files have the following header:
>>
>> (shell)$ head -n 1 zuwiki-20140121-pages-articles.xml
>> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/
>> http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8"
>> xml:lang="zu">
>>
>> From this I gather that the XML schema is still 0.8, and that
>> `mwxml2sql' needs no update on that head.
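>> As a minimal illustration (using the file from the example above),
>> the schema version can be pulled out of that header line like so:
>>
>>     import re
>>
>>     with open("zuwiki-20140121-pages-articles.xml") as f:
>>         header = f.readline()
>>     # The xmlns URL embeds the export schema version, e.g. "0.8".
>>     match = re.search(r'export-(\d+\.\d+)/', header)
>>     print(match.group(1) if match else "no schema version found")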
>>
>> 2) Current output MediaWiki version
>>
>> I reviewed the database schema for the `page', `revision', and `text'
>> tables:
>>
>> <https://www.mediawiki.org/wiki/Manual:Page_table>,
>> <https://www.mediawiki.org/wiki/Manual:Revision_table>, and
>> <https://www.mediawiki.org/wiki/Manual:Text_table>
>>
>> It appears that the most recent changes to the schema for these three
>> tables occurred for mediawiki versions 1.21, 1.21, and 1.19,
>> respectively.
>>
>> From this I gather that the database schema used for mediawiki 1.23
>> LTS is the same as that used for mediawiki 1.21; and therefore
>> `mwxml2sql' needs no update on that head.
>>
>> 3) Recommended updates
>>
>> From a review of your code, I concluded that two minor changes would be
>> useful.
>>
>> 3.1) mwxml2sql.c
>>
>> The following three lines:
>>
>> (shell)$ grep 21 mwxml2sql.c
>> fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.21.\n\n");
>> /* we know MW 1.5 through MW 1.21 even though there is no MW 1.21 yet */
>> if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 21) {
>>
>> should read
>>
>> fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.23.\n\n");
>> /* we know MW 1.5 through MW 1.23 */
>> if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 23) {
>>
>> 3.2) mwxmlelts.c
>>
>> The following line:
>>
>> (shell)$ grep 21 mwxmlelts.c
>> <generator>MediaWiki 1.21wmf6</generator>
>>
>> should read
>>
>> <generator>MediaWiki 1.23wmf10</generator>
>>
>> 4) Request
>>
>> Please let me know if you agree with the above assessment. If you do,
>> I would be happy to submit the changes to
>> <https://gerrit.wikimedia.org/> for review.
>>
>> Sincerely Yours,
>> Kent
>
>
>
Folks who use things like the adds-changes dumps will have noticed or
even reported that they and other similar jobs have been broken.
Maintenance scripts broke on the snapshot hosts when the scap rewrite
broke on them, and I've been trying to get things moved to the eqiad
snapshot hosts (where scap is not broken) ever since. Today everything
should be running from these other hosts, but I'll likely find things
that need to be fixed up over the next day or two.
In the meantime I cleaned up the adds-changes process so it's one
script, and I can run it on the back dates to fill in the missing data.
I'm in the process of doing that now; it will take a few days to catch
up, since an average run is about 2.5 hours.
Happy trails,
Ariel
Dear Gnosygnu,
I would like to thank you very much for your e-mail of 2014-Jan-11.
It was very helpful and I have made much progress towards WP-MIRROR
0.7.
0) Status
Using dump files from <http://ftpmirror.your.org/>, WP-MIRROR 0.7 now
builds a mirror of `simplewiki', `simplewiktionary', and `wikidata'.
In the process, WP-MIRROR installs and configures `mediawiki 1.23'
and about 40 extensions. A page rendered by the mirror now looks very
similar to the same page rendered by WMF.
1) Problems
There are a couple of anomalies.
1.1) Wikidata
Infoboxes are now rendered. The property fields are populated with
data drawn from <http://www.wikidata.site>. However, in the
navigation bar, under the ``In other languages'' link, the expected
list of interlanguage links still does not appear.
Here is the portion of `LocalSettings.php' that pertains to `wikidata'.
# Wikidata
putenv( "MW_INSTALL_PATH=$IP" );
#define( 'WB_EXPERIMENTAL_FEATURES', true );
$wgEnableWikibaseRepo = true;
$wgEnableWikibaseClient = true;
$wmgUseWikibaseRepo = true;
$wmgUseWikibaseClient = true;
require_once( "$IP/extensions/Wikidata/Wikidata.php" );
require_once( "$IP/extensions/Wikidata/extensions/Wikibase/repo/ExampleSettings.php"
);
$wgWBSettings['repoUrl'] = 'http://www.wikidata.site';
$wgWBSettings['repoScriptPath'] = '/w';
$wgWBSettings['repoArticlePath'] = '/wiki/$1';
$wgWBSettings['siteGlobalID'] = $wgDBname;
$wgWBSettings['siteGroup'] = $project;
$wgWBSettings['sort'] = 'code';
$wgWBSettings['sortPrepend'] = array ( 'en', 'simple', );
$wgWBSettings['repoDatabase'] = 'wikidatawiki';
$wgWBSettings['changesDatabase'] = 'wikidatawiki';
1.2) Gadgets
Under <http://simple.wikipedia.site/wiki/Special:Gadgets> there is no
list of gadgets. I can see that the underlying database does contain
them (or rather, links to WMF where they can be found).
The relevant portion of `LocalSettings.php' is:
require_once( "$IP/extensions/Gadgets/Gadgets.php" );
2) Advance news
WP-MIRROR 0.7 will be released (probably this month) as a DEB package.
It will not depend upon the DEB package for `mediawiki 1.19' LTS as
WP-MIRROR 0.6 did. Rather, it will depend upon a new DEB package of
the latest development version (currently 1.23).
To that end, I have set up a tool chain that pulls from the Git
repositories at <https://gerrit.wikimedia.org/r/p/mediawiki/> and
generates four DEB packages:
`wp-mirror-mediawiki_1.23-1_all.deb'
`wp-mirror-mediawiki-extensions_1.23-1_all.deb'
`wp-mirror-mediawiki-extensions-math-texvc_1.23-1_amd64.deb'
`wp-mirror-mediawiki-extensions-scribunto-lua_1.23-1_amd64.deb'
The second of these packages contains over forty of the extensions
listed on <http://en.wikipedia.org/wiki/Special:Version>.
Every effort has been made to avoid stepping on the namespace of the
DEB packages currently distributed by Debian. For example, the first
two packages install into `/usr/share/wp-mirror-mediawiki/' rather
than into `/usr/share/mediawiki/'.
I have set up a Debian repository hosted by the Free Software Foundation.
3) Help requested
Any advice concerning the above-mentioned anomalies would be appreciated.
Sincerely Yours,
Kent
Hello,
I am a researcher and a member of a project that aims to collect
controversial scientific discussions that took place around a set of
wiki pages. We want to start from these pages and collect their
history (the various diffs), the discussions around these pages
(including the history of those discussions), and the discussion pages
of all authors who participated (with the history of those pages as
well). After data collection, we will build a structured corpus and
run analyses on these discussions.
But we ran into a real problem when working with the wiki dumps,
because it seems that some data are missing. Here are some details.
I used the French Wikipedia dumps below:
"frwiki-20140208-pages-meta-history1.xml" (509 GB, which has all pages
and their full history)
"frwiki-20140208-pages-meta-current.xml" (19 GB, which has the current
pages and current discussion pages)
I ran into trouble with missing revisions and missing text:
*Missing revisions*
Starting with the article on the French word "Chiropratique" at
http://fr.wikipedia.org/wiki/Chiropratique
I found that its history has 500+ revisions, but when I extracted this
page from "frwiki-20140208-pages-meta-history1.xml", its history
contained only 6 revisions (see the attached file
"page-Chiropratique.xml"), and these are not the most recent revisions;
they are the first six.
The same problem occurs for the user page "Utilisateur:Albin"
(http://fr.wikipedia.org/wiki/Utilisateur:Albin): its history has
9 revisions, but I found only 5 revisions in
"frwiki-20140208-pages-meta-history1.xml" (see the attached file
"page-Utilisateur:Albin.xml").
*Missing text*
I have another problem with "frwiki-20140208-pages-meta-current.xml". I
tried to extract "
Discussion:Apple"(http://fr.wikipedia.org/wiki/Discussion:Apple). In
this dump, i got last revision of course, but the page has missing text
(see Attached-file "page-Discussion:Apple.xml")
Are these data really missing from the dumps, or did we miss something?
Is there a better way to collect the data we are seeking?
Thank you in advance for your cooperation.
--
Kun JIN
Laboratoire de Recherche sur le Langage (LRL)
Université Blaise Pascal (Clermont 2)
kun.jin(a)univ-bpclermont.fr
Tel : +33 3 4 73 34 68 35
Adresse: Université Blaise Pascal,
Maison des Sciences de l'Homme - LRL,
4 rue Ledru
63057 Clermont-Ferrand cedex 1