Hi,
I don't know if this issue has come up already - in case it did and
was dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its
decompressing sibling (pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results obtained a few hours before that.
The results indicate the following: bzip2 and pbzip2 are mutually
compatible as far as compression goes - each one can create archives
that the other can read. When it comes to decompressing, however, only
pbzip2-compressed archives can be decompressed by pbunzip2 in parallel.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything would keep working for those people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host (see the sketch below).
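For illustration, this is roughly how it would look - the file name is
hypothetical, and `pbzip2 --help' documents the exact options of your
version:

(shell)$ pbzip2 -p8 -9 enwiki-pages-articles.xml      # compress using 8 CPUs
(shell)$ pbunzip2 -p8 enwiki-pages-articles.xml.bz2   # parallel decompression
(shell)$ bunzip2 enwiki-pages-articles.xml.bz2        # plain bunzip2 also works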
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't
that interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek

PetaMem GmbH - www.petamem.com      Geschäftsführer: Richard Jelinek
Human Language Technology Experts   Sitz der Gesellschaft: Fürth
69216618 Mind Units                 Registergericht: AG Fürth, HRB-9201
Ack, sorry for the (no subject); again in the right thread:
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and runs at 30 MB/s on my box,
which is still 8x faster than the status quo (going by a 1 GB
benchmark).
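For reference, the corresponding invocations would be roughly the
following (file name hypothetical; flags as documented in the xz and
7za man pages):

(shell)$ xz -3 -k enwiki-pages-meta-history.xml    # writes .xz, keeps the input
(shell)$ 7za a -mx=3 enwiki-history.7z enwiki-pages-meta-history.xml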
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
Dear Ariel,
Thank you for your guidance. I pushed another change to gerrit for
review that should address the issue of the new `page_links_updated'
field.
Sincerely Yours,
Kent
On 2/7/14, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> Last reply: I double-checked the content/format model stuff, and the
> only nagging question I have remaining is how well it works with
> non-text handlers. But that would not be a new issue, and the code for
> the base case is certainly correct. So I think we are down to just the
> page_links_updated variable for > 1.22 and that would do it.
>
> Ariel
Hello,
I am a researcher on a project that aims at collecting the
controversial scientific discussions that happened around a set of wiki
pages. We want to start from these pages and collect their history (the
various diffs), the discussions around these pages (including the
history of those discussions), and the discussion pages of all authors
who participated (with the history of those pages as well). After data
collection, we will build a structured corpus and run analyses on these
discussions.
But we ran into a real problem when working with the wiki dumps,
because data seem to be missing. Here are some details.
I used the French Wikipedia dumps below:
"frwiki-20140208-pages-meta-history1.xml" (509 GB, which has pages
together with their full revision history)
"frwiki-20140208-pages-meta-current.xml" (19 GB, which has only the
current revision of each page, including discussion pages)
I ran into trouble with missing revisions and missing text:
*Missing revision*
Starting with the article on the French word "Chiropratique" at
http://fr.wikipedia.org/wiki/Chiropratique
I found that its on-site history shows 500+ revisions, but the copy of
this page that I extracted from
"frwiki-20140208-pages-meta-history1.xml" contains only 6 revisions
(see attached file "page-Chiropratique.xml"), and these are not the
most recent revisions - they are the first six.
Same problem for the user page "Utilisateur:Albin"
(http://fr.wikipedia.org/wiki/Utilisateur:Albin): its on-site history
shows 9 revisions, but I found only 5 revisions in
"frwiki-20140208-pages-meta-history1.xml" (see attached file
"page-Utilisateur:Albin.xml").
*Missing text*
I have another problem with "frwiki-20140208-pages-meta-current.xml". I
tried to extract "Discussion:Apple"
(http://fr.wikipedia.org/wiki/Discussion:Apple). In this dump I did get
the last revision, of course, but the page text is incomplete (see
attached file "page-Discussion:Apple.xml").
Are these data really missing from the dumps, or did we miss something?
Is there a better way to collect the data we are seeking?
Thank you in advance for your cooperation.
--
Kun JIN
Laboratoire de Recherche sur le Langage (LRL)
Université Blaise Pascal (Clermont 2)
kun.jin(a)univ-bpclermont.fr
Tel : +33 3 4 73 34 68 35
Adresse: Université Blaise Pascal,
Maison des Sciences de l'Homme - LRL,
4 rue Ledru
63057 Clermont-Ferrand cedex 1
wp mirror, 23/02/2014 15:26:
> c) Third best, would be to patch `mwxml2sql'. This I also favor, but
> would like some guidance from its author, Ariel Glenn, before I start
> hacking.
This seems the most likely. Probably, mwxml2sql has to be fixed so that
it does whatever importDump.php/Special:Import do. Only if they both
have the same problem with full page names in <title> should the export
be changed. This is just my guess; at any rate, do file a bug if there
is a difference in behaviour.
Nemo
Dear Nemo,
Thanks for enlightening me regarding <title>. I did not know that it
was intended to be a compound of the namespace word and the
`page_title' field.
Still, I have some thoughts on this matter.
1) importDump.php
As of WP-MIRROR 0.6, `importDump.php' is no longer used.
The disadvantage of `importDump.php' is that it is slow. Importation
of `enwiki' takes about two months, which is longer than the interval
between XML dumps.
The advantage of `importDump.php' is that it handles any idiosyncrasy
(such as compound <title> entries) in the XML dumps.
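For completeness, the stock invocation is essentially the following (a
sketch, with a hypothetical dump file name; the recent-changes tables
need rebuilding afterwards):

(shell)$ php maintenance/importDump.php < enwiki-pages-articles.xml
(shell)$ php maintenance/rebuildrecentchanges.php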
2) mwxml2sql
As of WP-MIRROR 0.6, `mwxml2sql' is used to convert the XML dump into
a set of SQL dumps (for the `page', `revision', and `text' tables)
which can then be loaded directly into the underlying database tables.
The advantage of `mwxml2sql' is that it is very fast. And, when used
in conjunction with MySQL 5.5 fast index creation, one can load
`enwiki' in 80% less time (sketched below).
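A minimal sketch of the fast-index-creation trick, shown here for the
`page' table (index names as in MediaWiki's maintenance/tables.sql; the
database and file names are examples): drop the secondary indexes, load
the data, then recreate the indexes in a single pass:

(shell)$ mysql enwiki -e "ALTER TABLE page DROP INDEX name_title, DROP INDEX page_random"
(shell)$ mysql enwiki < enwiki-page.sql
(shell)$ mysql enwiki -e "ALTER TABLE page ADD UNIQUE INDEX name_title (page_namespace, page_title), ADD INDEX page_random (page_random)"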
The disadvantage is that it faithfully copies the <title> field into
the SQL statement for INSERTing the `page_title' field. We now know
that this results in pages from the Template and other namespaces not
being found by MediaWiki, which then renders them as red-links.
3) First Normal Form
One issue in the back of my mind concerns the recent changes in the
XML schema. As of `export-0.6.xsd.gz' we note that ``Version 0.6 adds
a separate namespace tag''. To my mind, the presence of the <ns>
field should obviate the need to include a namespace word (e.g.
`Category:', `Template:', etc.) within the <title> field; for example,
<ns>10</ns> plus <title>Ndash</title> carries all the information of
the compound <title>Template:Ndash</title>.
The principle is known as first normal form (1NF), which basically
means that the contents of a field should be atomic rather than
compound.
4) Solution
Granted that the objective is to faithfully mirror the WMF database
tables, the issue before us is this: where along the tool chain
should the patch be made?
a) My instinct is to correct the issue upstream (in the XML dump
generation phase): copy the WMF `page_namespace' field to the <ns>
field, copy the WMF `page_title' field to the <title> field, and
thereby adhere to the principles of database normalization.
b) Second best would be to patch WP-MIRROR 0.7 to normalize the XML
dump prior to feeding it into `mwxml2sql'. This I have done.
c) Third best would be to patch `mwxml2sql'. This I also favor, but
would like some guidance from its author, Ariel Glenn, before I start
hacking.
d) A last resort would be to write an SQL query to clean up compound
`page_title' entries in the mirror's database (a sketch of such a
query follows below). But I really would rather not load unnormalized
data in the first place.
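For what it is worth, the clean-up query for option (d) would look
roughly like this, one namespace at a time (a sketch; namespace 10 is
Template, and `simplewiki' is just an example database name):

(shell)$ mysql simplewiki -e "UPDATE page SET page_title = SUBSTRING(page_title, LENGTH('Template:') + 1) WHERE page_namespace = 10 AND page_title LIKE 'Template:%'"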
Sincerely Yours,
Kent
On 2/22/14, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
> wp mirror, 22/02/2014 23:40:
>> Still, it would be nice if the dump files could be fixed.
>
> Fixed? <title> is the full page name as it's supposed to be. Either
> you're doing something wrong with the import, or the import
> script/special page has a bug (not uncommon, but needs a bug report with
> steps to reproduce). I see nothing to blame on the export side.
>
> Nemo
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
Dear Sir or Madam,
I am not sure to which person or list I should address this question.
0) Objective
I am in the process of building DEB packages for WP-MIRROR 0.7, the
latest development version of MediaWiki 1.23, and a set of MediaWiki
extensions.
The objective is this: a page rendered by a mirror should look the
same as that page rendered by the WMF site.
1) Problem
In the process of testing mirrors, I noticed that many templates were
not expanding, and instead being rendered as red-links.
2) Example
To illustrate, consider the Ndash template, which appears on many
pages such as <http://simple.wikipedia.org/wiki/August>. It appears
in the underlying database:
mysql> select page_id, page_title, rev_len, old_text
    -> from simplewiki.page, simplewiki.revision, simplewiki.text
    -> where page_id=rev_page and rev_text_id=old_id
    -> and page_title like 'Template:Ndash' limit 10\G
*************************** 1. row ***************************
page_id: 132985
page_title: Template:Ndash
rev_len: 65
old_text: –<noinclude>
[[Category:Formatting templates]]
</noinclude>
1 row in set (0.25 sec)
3) Special:ExpandTemplates
To test the above example ``Template:Ndash'', I use Special:ExpandTemplates.
3.1) Input text
Today is the {{CURRENTDAY}} day.</br>
This server is {{SERVER}}, script path {{SCRIPTPATH}}, current MW
version {{CURRENTVERSION}}.</br>
This site is {{SITENAME}}. Full page name is {{FULLPAGENAME}}.</br>
<table>
<tr><th>Template</th><th>Expanded</th><th>page_id</th><th>rev_len</th></tr>
<tr><td>Ndash</td><td>{{Ndash}}</td><td>{{PAGEID:
Ndash}}</td><td>{{PAGESIZE: Ndash}}</td></tr>
<tr><td>Template:Ndash</td><td>{{Template:Ndash}}</td>
<td>{{PAGEID: Template:Ndash}}</td><td>{{PAGESIZE:
Template:Ndash}}</td></tr>
<tr><td>Template:Template:Ndash</td><td>{{Template:Template:Ndash}}</td>
<td>{{PAGEID: Template:Template:Ndash}}</td><td>{{PAGESIZE:
Template:Template:Ndash}}</td></tr>
</table>
3.2) <http://simple.wikipedia.org/wiki/Special:ExpandTemplates> Preview
Here is the result from the WMF site:
Today is the 21 day.
This server is //simple.wikipedia.org, script path /w, current MW
version 1.23wmf14 (f8b9201).
This site is Wikipedia. Full page name is My template.
Template                  Expanded                  page_id  rev_len
Ndash                     –                         0        0
Template:Ndash            –                         132985   65
Template:Template:Ndash   Template:Template:Ndash   0        0
Both {{Ndash}} and {{Template:Ndash}} expand as expected.
3.3) <http://simple.wikipedia.site/wiki/Special:ExpandTemplates> Preview
Here is the result from the mirrored site:
Today is the 21 day.
This server is http://simple.wikipedia.site, script path /w, current
MW version 1.23alpha.
This site is simplewiki. Full page name is My template.
Template                  Expanded         page_id  rev_len
Ndash                     Template:Ndash   0        0
Template:Ndash            Template:Ndash   0        0
Template:Template:Ndash   –                132985   65
Only {{Template:Template:Ndash}} expands!
4) Question
Why do I need to prepend an extra ``Template:'' to make the templates
work on the mirror?
Better yet: could someone tell me where in the MediaWiki core I can
find the code that takes a template inclusion (e.g. {{Ndash}} or
{{Template:Ndash}}) and converts it into an SQL query that SELECTs the
template expansion from the underlying database?
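My guess - and this is an assumption on my part; the template handling
seems to live around Parser::braceSubstitution() in
includes/parser/Parser.php - is that for {{Ndash}} MediaWiki
effectively performs a lookup of the form:

(shell)$ mysql simplewiki -e "SELECT page_id, page_latest FROM page WHERE page_namespace = 10 AND page_title = 'Ndash'"

where namespace 10 is the Template namespace. I would like to confirm
where exactly that namespace/title split happens.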
Sincerely Yours,
Kent
Dear Ariel,
I have been reading your code for `mwxml2sql-0.0.2' with a view
towards updating it for mediawiki-1.23 LTS.
0) Support status
Currently, the version info for `mwxml2sql' states the following:
(shell)$ mwxml2sql --version
mwxml2sql 0.0.2
Supported input schema versions: 0.4 through 0.8.
Supported output MediaWiki versions: 1.5 through 1.21.
1) Current input schema version
Currently, your XML dump files have the following header:
(shell)$ head -n 1 zuwiki-20140121-pages-articles.xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/
http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8"
xml:lang="zu">
From this I gather that the XML schema is still 0.8, and that
`mwxml2sql' needs no update on that head.
2) Current output MediaWiki version
I reviewed the database schema for the `page', `revision', and `text' tables:
<https://www.mediawiki.org/wiki/Manual:Page_table>,
<https://www.mediawiki.org/wiki/Manual:Revision_table>, and
<https://www.mediawiki.org/wiki/Manual:Text_table>
It appears that the most recent changes to the schema for these three
tables occurred for mediawiki versions 1.21, 1.21, and 1.19,
respectively.
From this I gather that the database schema used for mediawiki 1.23
LTS is the same as that used for mediawiki 1.21; and therefore
`mwxml2sql' needs no update on that head.
3) Recommended updates
From a review of your code, I concluded that two minor changes would be useful.
3.1) mwxml2sql.c
The following three lines:
(shell)$ grep 21 mwxml2sql.c
fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.21.\n\n");
/* we know MW 1.5 through MW 1.21 even though there is no MW 1.21 yet */
if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 21) {
should read
fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.23.\n\n");
/* we know MW 1.5 through MW 1.23 */
if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 23) {
3.2) mwxmlelts.c
The following line:
(shell)$ grep 21 mwxmlelts.c
<generator>MediaWiki 1.21wmf6</generator>
should read
<generator>MediaWiki 1.23wmf10</generator>
4) Request
Please let me know if you agree with the above assessment. If you do,
I would be happy to submit the changes to
<https://gerrit.wikimedia.org/> for review.
Sincerely Yours,
Kent
A recent update to the mediawiki multiversion scripts broke the abstract
dumps; a bug report and a fix have been submitted so I expect this to
get taken care of by Monday at the latest and hopefully over the
weekend. In the meantime no new jobs for small wikis will be produced;
I'll start those up again once the fix is in, as well as rerunning the
abstract dumps where they failed. Currently running jobs will run to
completion.
Ariel