Hi,
I don't know if this issue has come up already; if it has and was
dismissed, I beg your pardon. If it hasn't:
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each can create archives
that the other can read. For decompression, however, only
pbzip2-compressed archives are good for pbunzip2.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
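To make the compatibility claim easy to check on a small file, here is a
minimal sketch. The helper name and the thread count are made up for
illustration; `-p', `-c' and `-d' are the usual pbzip2/bzip2 switches, and
the sketch falls back to plain bzip2 when pbzip2 is not installed.

```shell
#!/bin/sh
# Round-trip check: compress with pbzip2 (if present), decompress with
# plain bzip2, and compare the result against the original file.
# Hypothetical helper for illustration; not part of either tool.
roundtrip_check() {
    src=$1
    if command -v pbzip2 >/dev/null 2>&1; then
        compressor="pbzip2 -p4"     # 4 threads: an arbitrary choice
    else
        compressor="bzip2"          # fallback so the sketch still runs
    fi
    # -c writes the archive to stdout and leaves the input file alone.
    # $compressor is left unquoted on purpose so "-p4" becomes its own word.
    $compressor -c "$src" > "$src.bz2" || return 1
    # bzip2 -d -c is equivalent to bunzip2 -c; cmp -s is silent and
    # returns 0 only when the decompressed stream equals the original.
    bzip2 -d -c "$src.bz2" | cmp -s - "$src"
}
```

For example, `roundtrip_check sample.xml && echo "bunzip2 can read it"',
where `sample.xml' stands for any local test file.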
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek
Human Language Technology Experts Sitz der Gesellschaft: Fürth
69216618 Mind Units Registergericht: AG Fürth, HRB-9201
Dear Ariel,
Some time ago, I generated a DEB package named
`mwxml2sql_0.0.2-1_amd64.deb'. It works very well, and was the source
of the patches that I submitted upstream. Now that I have learned
that you welcome such patches, it occurs to me that you might want the
DEB package as well. If so, then there are a number of things we
should discuss.
0) Naming
Your other DEB packages have names like:
mediawiki_1.19.6-1_all.deb
mediawiki-extensions-base_3.3_all.deb
mediawiki-math_1.0+git20120528-7_amd64.deb
For naming consistency, would you like the `mwxml2sql' package to be
renamed to something like the following?
mediawiki-mwxml2sql_0.0.2-1_amd64.deb
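For reference, the file names above all follow the pattern
<name>_<version>_<architecture>.deb. A minimal shell sketch that splits
such a name into its three parts; the function name is invented here.

```shell
#!/bin/sh
# Split a Debian binary package file name of the form
#   <name>_<version>_<architecture>.deb
# Hypothetical helper for illustration only.
split_deb_name() {
    base=${1%.deb}          # drop the .deb suffix
    name=${base%%_*}        # text before the first underscore
    rest=${base#*_}         # text after the first underscore
    version=${rest%_*}      # text before the last underscore
    arch=${rest##*_}        # text after the last underscore
    printf '%s %s %s\n' "$name" "$version" "$arch"
}

split_deb_name mediawiki-mwxml2sql_0.0.2-1_amd64.deb
# prints: mediawiki-mwxml2sql 0.0.2-1 amd64
```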
1) ITP
Debian policy requires that new packages first be announced with an
Intent-To-Package (ITP) bug report. Then a `Debian Developer' may or
may not step forward to sponsor the package for inclusion in a Debian
distribution.
Do you have someone in-house who is serving as a `Debian Maintainer'?
If so, could you introduce us?
2) Architectures
All my systems are AMD64. Since `mwxml2sql' contains C programs and
Debian is a binary distribution, a set of `mwxml2sql' DEB packages
should be prepared, one for each architecture. Do you have a way of
generating DEB packages for other architectures?
Sincerely Yours,
Kent
Dear list members,
I would like some advice on how to submit a `mediawiki'-related DEB
package. Jeremy Baron recommended that I contact this mailing list.
0) New utilities
Ariel T. Glenn at WMF wrote a set of utilities, `mwxml2sql', that help
convert XML dump files into a format that can be readily loaded into
the database for a local instance of MediaWiki. These utilities are
written in C, and offer some performance advantage over existing
utilities such as `importDump.php'.
The upstream source code may be found at
<https://gerrit.wikimedia.org/r/#/admin/projects/operations/dumps>.
1) Reason for packaging
I wrote `wp-mirror', a free utility for mirroring any desired
set of WMF wikis. This I distribute as a DEB package. My next
release, wp-mirror-0.6, is focused on performance improvement; and,
among other things, will make use of Ariel's utilities.
To facilitate the handling of dependencies, I decided to package
Ariel's utilities.
2) DEB package
I prepared a DEB package which is now named
`mediawiki-mwxml2sql_0.0.2-1_amd64.deb'. It builds correctly with
`debuild' and with `pbuilder'. `Lintian' only complains that it does
not close any ITP bug.
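That complaint should go away once the first changelog entry closes the
ITP bug. A minimal sketch, written as a shell heredoc; the bug number,
maintainer identity, date and output directory are placeholders, since
the real bug number is not yet known.

```shell
#!/bin/sh
# Write a first debian/changelog entry that closes the ITP bug.
# NNNNNN, the maintainer line and the date are placeholders.
outdir=$(mktemp -d)
cat > "$outdir/changelog" <<'EOF'
mediawiki-mwxml2sql (0.0.2-1) unstable; urgency=low

  * Initial release (Closes: #NNNNNN)

 -- Maintainer Name <maintainer@example.org>  Tue, 01 Jan 2013 00:00:00 +0000
EOF
echo "wrote $outdir/changelog"
```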
3) Patches
I patched Ariel's source code and Makefile, so that man pages could be
generated using `help2man'. I submitted the patch upstream, and Ariel
graciously applied it. One more patch is under review (a few typos).
4) ITP
I submitted an Intent-To-Package (ITP) bug to Debian, but have not yet
received the bug number.
Do you know anyone who would like to sponsor the package?
Sincerely Yours,
Kent
On 5/28/13, Jeremy Baron <jeremy(a)tuxmachine.com> wrote:
> On May 28, 2013 12:34 AM, "Ariel T. Glenn" <ariel(a)wikimedia.org> wrote:
>> On Mon, 27-05-2013, at 21:00 -0400, wp mirror wrote:
>> From looking at http://packages.debian.org/sid/mediawiki-extensions-base it
>> seems we want to get in contact with Romain Beauxis or Thorsten Glaser
>> and see how to proceed.
>
> pkg-mediawiki-devel(a)lists.alioth.debian.org is the place to mail.
>
>> Hmm I really have no idea what will happen to some of these on a 32-bit
>> system, I should check that out in a vm sometime...
>
> sounds like you just need tests in the Debian package and then Debian can
> run those for you on all archs/ports.
>
> -Jeremy
>
I'm seeing some issues with the history phase of the wikidata dumps
taking a huge amount of memory and causing the server they are on to
swap. I've killed the jobs and left fewer workers running on that host
for now; I'll investigate in depth tomorrow.
Ariel
Hi there,
Is this list still active? Where does one find XML data dumps from
Wikipedia?
Christine Bush
On Tuesday, May 21, 2013, wrote:
> Send Xmldatadumps-l mailing list submissions to
> xmldatadumps-l(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
> or, via email, send a message with subject or body 'help' to
> xmldatadumps-l-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> xmldatadumps-l-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Xmldatadumps-l digest..."
>
Dear Ariel,
I submitted the patches upstream.
0) Review
The changes may be found at <https://gerrit.wikimedia.org/r/64343/>.
1) Git
In the hope that others may find it useful, I am posting the sequence
of Git commands used, organized as a Makefile.
#-----------------------------------------------------------------------------+
# Makefile for submitting patches to the Wikimedia Foundation |
# Copyright (C) 2013 Dr. Kent L. Miller. All rights reserved. |
# |
# This program is free software: you can redistribute it and/or modify |
# it under the terms of the GNU General Public License as published by |
# the Free Software Foundation, either version 3 of the License, or (at |
# your option) any later version. |
# |
# This program is distributed in the hope that it will be useful, but |
# WITHOUT ANY WARRANTY; without even the implied warranty of |
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU |
# General Public License for more details. |
# |
# You should have received a copy of the GNU General Public License |
# along with this program. If not, see <http://www.gnu.org/licenses/>. |
#-----------------------------------------------------------------------------+
GERRIT  = wpmirrordev@gerrit.wikimedia.org
PORT    = 29418
WMFLABS = wpmirrordev@bastion.wmflabs.org

.PHONY: all clone hooks pull checkout branch edit diff add commit rebase review shell purge clean

all: clone hooks pull checkout branch edit diff add commit rebase review

clone:
	# create directory `dumps' and initialize a repository in it
	# copy all commit objects and head references (from remote to local)
	# add `remote repository reference' named `origin' (saves typing)
	# add `remote heads' named `origin/[head-name]'
	# add `HEAD' to track `origin/master'
	git clone ssh://$(GERRIT):$(PORT)/operations/dumps

hooks:
	# get `commit-msg' hook to add `Change-Id' to the commit message
	scp -p -P $(PORT) $(GERRIT):hooks/commit-msg dumps/.git/hooks/.
	cd dumps; git review -s

pull:
	# list `remote heads'
	cd dumps; git branch -r
	# setup tracking branch `ariel'
	cd dumps; git branch --track ariel origin/ariel
	# add new commit objects (if any)
	# update `remote heads'
	cd dumps; git fetch origin
	# update `local heads' (`master' and `ariel') to `remote-heads'
	# merge `origin/HEAD' into `HEAD'
	cd dumps; git pull origin

checkout:
	# point `HEAD' to `ariel's commit object
	cd dumps; git checkout ariel
	cd dumps; git status

branch:
	# create head `wpmirrordev'
	# point `wpmirrordev' to `ariel's commit object
	cd dumps; git branch wpmirrordev ariel
	# point `HEAD' to `wpmirrordev's commit object
	cd dumps; git checkout wpmirrordev
	cd dumps; git status

edit:
	# apply patched files
	cp temp/* dumps/xmlfileutils/.

diff:
	# diff files (but not added files) against `HEAD'
	cd dumps; git diff
	# list changed files against `HEAD'
	cd dumps; git status

add:
	# stage the files to be committed
	cd dumps/xmlfileutils; git add mwxml2sql.c sql2txt.c sqlfilter.c
	cd dumps/xmlfileutils; git add Makefile
	# diff added files against `HEAD'
	cd dumps; git diff --cached
	# list changed files against `HEAD'
	cd dumps; git status

commit:
	# create `commit object'
	# point `HEAD' to the new `commit object'
	cd dumps; git commit -m "Fix for compatibility with help2man and Debian Policy"
	# list all commits from `HEAD' back to initial commit
	cd dumps; git log

rebase:
	# add new commit objects (if any)
	# update `remote head' `origin/ariel'
	# merge `origin/ariel' into `ariel'
	# point `ariel' to `origin/ariel's commit object
	cd dumps; git pull origin ariel
	# rebase `wpmirrordev' branch on updated `ariel' head
	cd dumps; git rebase ariel

review:
	# push changes to Gerrit
	cd dumps; git review -R ariel

#-----------------------------------------------------------------------------+

shell:
	ssh -A $(WMFLABS)

purge:
	rm -r dumps

clean:
	rm -f *~
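
The branch handling in the targets above can also be tried offline. The
following is a minimal sketch in which a throwaway local repository
stands in for Gerrit; the paths, file contents and demo identity are
invented, and the topic branch is rebased onto the freshly fetched
`origin/ariel'.

```shell
#!/bin/sh
# Offline sketch of the branch handling used by the targets above.
# A throwaway local repository stands in for Gerrit; all paths and the
# identity below are made up for the demonstration.
set -e
work=$(mktemp -d)

# 1. Fake "remote" with a `master' head and an `ariel' head.
git init -q "$work/remote"
cd "$work/remote"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"
git config user.email "demo@example.org"
echo "base" > README
git add README
git commit -q -m "Initial commit"
git checkout -q -b ariel
mkdir xmlfileutils
echo "all:" > xmlfileutils/Makefile
git add xmlfileutils/Makefile
git commit -q -m "Add xmlfileutils"
git checkout -q master

# 2. Clone it, then create a local `ariel' that tracks `origin/ariel'.
git clone -q "$work/remote" "$work/dumps"
cd "$work/dumps"
git config user.name "Demo User"
git config user.email "demo@example.org"
git branch -q --track ariel origin/ariel

# 3. The topic branch starts from `ariel', not from `master'.
git checkout -q -b wpmirrordev ariel
echo "install:" >> xmlfileutils/Makefile
git commit -q -a -m "Fix for compatibility with help2man and Debian Policy"

# 4. Fetch upstream and rebase the topic branch onto `origin/ariel'.
git fetch -q origin
git rebase -q origin/ariel

# Number of commits outstanding relative to `origin/ariel'.
git rev-list --count origin/ariel..wpmirrordev
```

Starting the topic branch from `ariel' keeps the set of outstanding
commits down to the patch itself.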
Sincerely Yours,
Kent
Dear Ariel,
I am having some trouble submitting the patches. On-line examples
that I have seen do not cover the case of submitting patches to a
branch. Here is what I have tried so far:
0) clone, hooks, and review setup
(shell) git clone ssh://wpmirrordev@gerrit.wikimedia.org:29418/operations/dumps
(shell) scp -p -P 29418 wpmirrordev@gerrit.wikimedia.org:hooks/commit-msg ~/dumps/.git/hooks/.
(shell) cd dumps
(shell) git review -s
(shell) git status
1) pull the ariel branch
(shell) git pull origin master
(shell) git branch ariel
(shell) git pull origin ariel # throws errors
(shell) git status # emits a long list of revisions
2) commit
(shell) git commit -a # needed to quell errors from the `pull'
(shell) git checkout ariel
(shell) git status
3) create branch for the patches
(shell) git branch wpmirrordev master
(shell) git checkout wpmirrordev
(shell) git status
4) apply patches
(shell) cp ../patched-files/* xmlfileutils/.
(shell) git diff
(shell) git status
5) commit the patches
(shell) cd xmlfileutils
(shell) git add mwxml2sql.c sql2txt.c sqlfilter.c Makefile
(shell) git diff --cached
(shell) git commit
(shell) git status
6) rebase
(shell) git pull origin master
(shell) git rebase master
(shell) cat .gitreview
[gerrit]
host=gerrit.wikimedia.org
port=29418
project=operations/dumps.git
<<<<<<< HEAD
defaultbranch=master
=======
defaultbranch=ariel
>>>>>>> 3b82bbea24f999f1a5af721d37ec0684615bc3ae
(shell) git checkout ariel # need to fix error in .gitreview
7) submit for review
(shell) git review -R ariel
Enter passphrase for key '/home/wikimedia/.ssh/id_rsa':
Creating a git remote called "gerrit" that maps to:
ssh://wpmirrordev@gerrit.wikimedia.org:29418/operations/dumps.git
Enter passphrase for key '/home/wikimedia/.ssh/id_rsa':
You have more than one commit that you are about to submit.
The outstanding commits are:
bf268a2 (HEAD, origin/master, origin/HEAD, gerrit/master, ariel) pep8 whitespaces fixing
6dff615 pep8: E302 expected 2 blank lines, found 1
671beaa .pep8 configuration file
1235eaa Merge "Add .gitreview file"
9a814c2 README is obsolete :(
67a61fa Add .gitreview file
47b6db7 add CC-BY_SA license for text, plus pointer to terms of use
689fa7c dump iwlinks table
e4bc572 Kill .cvsignore, svn ignore is doing the same
25f4a46 svn:eol-style native
Is this really what you meant to do?
Type 'yes' to confirm: yes
Enter passphrase for key '/home/wikimedia/.ssh/id_rsa':
X11 forwarding request failed on channel 0
remote: Processing changes: refs: 1, done
To ssh://wpmirrordev@gerrit.wikimedia.org:29418/operations/dumps.git
! [remote rejected] HEAD -> refs/publish/ariel/ariel (no new changes)
error: failed to push some refs to 'ssh://wpmirrordev@gerrit.wikimedia.org:29418/operations/dumps.git'
make: *** [review] Error 1
Any help is welcome.
Sincerely Yours,
Kent
Dear Ariel,
0) INTRO
I am close to releasing WP-MIRROR 0.6. It will exhibit reliability
and performance improvements in all areas of operation.
As a part of the development process, I have been testing `mwxml2sql'
with a view towards using it to replace `importDump.php' in WP-MIRROR
0.6. These tests have worked out well.
There are however some issues that I should discuss with you.
1) Packaging
I distribute WP-MIRROR as a DEB package. In order to use `mwxml2sql',
I would have to package your tools as a separate DEB package. This I
have done. However, in the process, I had to apply some patches; and
the question now arises as to how to submit them upstream.
2) Makefile
I have patched the `Makefile' that you distribute with `mwxml2sql'
because: a) the `install' target must use `install' rather than `mv';
and b) it lacked a `deinstall' target. Both changes are required by
Debian policy.
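To illustrate the difference, here is a minimal shell sketch of an
install step and its matching removal step. The staging directory, the
prefix and the stand-in binary are made up; a real package build would
supply its own DESTDIR.

```shell
#!/bin/sh
# Contrast `install' with `mv' for an install step, and show a matching
# removal step. The staging directory and the stand-in binary are
# invented for the demonstration.
set -e
build=$(mktemp -d)
DESTDIR=$(mktemp -d)
PREFIX=/usr

# Stand-in for a compiled tool.
printf '#!/bin/sh\necho "mwxml2sql stub"\n' > "$build/mwxml2sql"

# install creates the target directory, copies the file and sets its
# mode, leaving the build tree intact (mv would remove the built file).
install -d "$DESTDIR$PREFIX/bin"
install -m 0755 "$build/mwxml2sql" "$DESTDIR$PREFIX/bin/mwxml2sql"

# What a `deinstall' target would undo.
deinstall() {
    rm -f "$DESTDIR$PREFIX/bin/mwxml2sql"
}
```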
3) Man pages
Man pages are also required by Debian policy. To that end, I have
written man pages for `mwxml2sql', `sql2txt', and `sqlfilter'.
However, the better approach would be to patch those tools so that man
pages can be generated automatically using `help2man'. The latter
approach has the benefit of eliminating duplication and hence helps
keep code and documentation in sync.
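As a rough sketch of what `help2man' needs: the program has to answer
`--help' and `--version' in a conventional layout. The stub tool below is
invented for the demonstration and is not the real `mwxml2sql'; the man
page is only generated if `help2man' happens to be installed.

```shell
#!/bin/sh
# help2man builds a man page from a program's --help and --version
# output. The stub tool below is invented for the demonstration; it is
# not the real mwxml2sql.
demo=$(mktemp -d)
cat > "$demo/mwxml2sql-stub" <<'EOF'
#!/bin/sh
case "$1" in
    --version) echo "mwxml2sql-stub 0.0.2" ;;
    --help)    echo "Usage: mwxml2sql-stub [OPTION]... FILE"
               echo "Convert an XML dump into SQL (stub for illustration)."
               echo
               echo "  --help     display this help and exit"
               echo "  --version  output version information and exit" ;;
    *)         echo "try --help" >&2; exit 1 ;;
esac
EOF
chmod +x "$demo/mwxml2sql-stub"

# Generate the page only when help2man is available.
if command -v help2man >/dev/null 2>&1; then
    help2man --no-info --output="$demo/mwxml2sql-stub.1" \
        "$demo/mwxml2sql-stub" || echo "help2man failed" >&2
fi
```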
4) Upstream
I would like to know:
a) if patches are welcome upstream; and, if so,
b) what procedures you prefer for submitting, reviewing, and applying
patches; and
c) whether you would prefer that I submit the man pages I wrote, or
submit patches to your utilities to make them compatible with
`help2man'.
Sincerely Yours,
Kent
Hello, I am French. I am looking for a simple link to download all the
Wikipedia images. I already use the XML dumps in my local MediaWiki, but I
don't have the images. Is it also possible to get reduced-size images? I
think the full set of images amounts to several gigabytes.
Thanks to all
--
guigui777