Hi,
I don't know if this issue has come up already - in case it did and
was dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its
decompressing sibling (pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test
results obtained a few hours before that.
The results indicate the following: bzip2 and pbzip2 are mutually
compatible as far as compression goes - each one can create archives
that the other can read. When it comes to decompressing, however, only
pbzip2-compressed archives can be decompressed by pbunzip2 in parallel.
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better use of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything would keep working for those people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host (see the sketch below).
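For illustration, this is roughly how it would look - the file name is
hypothetical, and `pbzip2 --help' documents the exact options of your
version:

(shell)$ pbzip2 -p8 -9 enwiki-pages-articles.xml      # compress using 8 CPUs
(shell)$ pbunzip2 -p8 enwiki-pages-articles.xml.bz2   # parallel decompression
(shell)$ bunzip2 enwiki-pages-articles.xml.bz2        # plain bunzip2 also works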
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And all that just because pbunzip2 is slightly buggy. Isn't
that interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek

PetaMem GmbH - www.petamem.com      Geschäftsführer: Richard Jelinek
Human Language Technology Experts   Sitz der Gesellschaft: Fürth
69216618 Mind Units                 Registergericht: AG Fürth, HRB-9201
Ack, sorry for the (no subject); again in the right thread:
> For external uses like XML dumps integrating the compression
> strategy into LZMA would however be very attractive. This would also
> benefit other users of LZMA compression like HBase.
For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
That has a 4 MB buffer, compression ratios within 15-25% of
current 7zip (or histzip), and runs at 30 MB/s on my box,
which is still 8x faster than the status quo (going by a 1 GB
benchmark).
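For reference, the corresponding invocations would be roughly the
following (file name hypothetical; flags as documented in the xz and
7za man pages):

(shell)$ xz -3 -k enwiki-pages-meta-history.xml    # writes .xz, keeps the input
(shell)$ 7za a -mx=3 enwiki-history.7z enwiki-pages-meta-history.xml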
Trying to get quick-and-dirty long-range matching into LZMA isn't
feasible for me personally and there may be inherent technical
difficulties. Still, I left a note on the 7-Zip boards as folks
suggested; feel free to add anything there:
https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
Thanks for the reply,
Randall
Dear Ariel,
Thank you for your guidance. I pushed another change to gerrit for
review that should address the issue of the new `page_links_updated'
field.
Sincerely Yours,
Kent
On 2/7/14, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
> Last reply: I double-checked the content/format model stuff, and the
> only nagging question I have remaining is how well it works with
> non-text handlers. But that would not be a new issue, and the code for
> the base case is certainly correct. So I think we are down to just the
> page_links_updated variable for > 1.22 and that would do it.
>
> Ariel
Hello,
I am a researcher on a project that aims at collecting the
controversial scientific discussions that happened around a set of wiki
pages. We want to start from these pages and collect their history (the
various diffs), the discussions around these pages (including the
history of those discussions), and the discussion pages of all authors
who participated (with the history of those pages as well). After data
collection, we will build a structured corpus and run analyses on these
discussions.
But we ran into a real problem when working with the wiki dumps,
because data seem to be missing. Here are some details.
I used the French Wikipedia dumps below:
"frwiki-20140208-pages-meta-history1.xml" (509 GB, which has pages
together with their full revision history)
"frwiki-20140208-pages-meta-current.xml" (19 GB, which has only the
current revision of each page, including discussion pages)
I ran into trouble with missing revisions and missing text:
*Missing revision*
Starting with the article on the French word "Chiropratique" at
http://fr.wikipedia.org/wiki/Chiropratique
I found that its on-site history shows 500+ revisions, but the copy of
this page that I extracted from
"frwiki-20140208-pages-meta-history1.xml" contains only 6 revisions
(see attached file "page-Chiropratique.xml"), and these are not the
most recent revisions - they are the first six.
Same problem for the user page "Utilisateur:Albin"
(http://fr.wikipedia.org/wiki/Utilisateur:Albin): its on-site history
shows 9 revisions, but I found only 5 revisions in
"frwiki-20140208-pages-meta-history1.xml" (see attached file
"page-Utilisateur:Albin.xml").
*Missing text*
I have another problem with "frwiki-20140208-pages-meta-current.xml". I
tried to extract "Discussion:Apple"
(http://fr.wikipedia.org/wiki/Discussion:Apple). In this dump I did get
the last revision, of course, but the page text is incomplete (see
attached file "page-Discussion:Apple.xml").
Are these data really missing from the dumps, or did we miss something?
Is there a better way to collect the data we are seeking?
Thank you in advance for your cooperation.
--
Kun JIN
Laboratoire de Recherche sur le Langage (LRL)
Université Blaise Pascal (Clermont 2)
kun.jin(a)univ-bpclermont.fr
Tel : +33 3 4 73 34 68 35
Adresse: Université Blaise Pascal,
Maison des Sciences de l'Homme - LRL,
4 rue Ledru
63057 Clermont-Ferrand cedex 1
wp mirror, 23/02/2014 15:26:
> c) Third best, would be to patch `mwxml2sql'. This I also favor, but
> would like some guidance from its author, Ariel Glenn, before I start
> hacking.
This seems the most likely. Probably, mwxml2sql has to be fixed so that
it does whatever importDump.php/Special:Import do. Only if they both
have the same problem with full page names in <title> should the export
be changed. This is just my guess; at any rate, do file a bug if there
is a difference in behaviour.
Nemo
Dear Nemo,
Thanks for enlightening me regarding <title>. I did not know that it
was intended to be a compound of the namespace word and the
`page_title' field.
Still, I have some thoughts on this matter.
1) importDump.php
As of WP-MIRROR 0.6, `importDump.php' is no longer used.
The disadvantage of `importDump.php' is that it is slow. Importation
of `enwiki' takes about two months, which is longer than the interval
between XML dumps.
The advantage of `importDump.php' is that it handles any idiosyncrasy
(such as compound <title> entries) in the XML dumps.
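For completeness, the stock invocation is essentially the following (a
sketch, with a hypothetical dump file name; the recent-changes tables
need rebuilding afterwards):

(shell)$ php maintenance/importDump.php < enwiki-pages-articles.xml
(shell)$ php maintenance/rebuildrecentchanges.php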
2) mwxml2sql
As of WP-MIRROR 0.6, `mwxml2sql' is used to convert the XML dump into
a set of SQL dumps (for the `page', `revision', and `text' tables)
which can then be loaded directly into the underlying database tables.
The advantage of `mwxml2sql' is that it is very fast. And, when used
in conjunction with MySQL 5.5 fast index creation, one can load
`enwiki' in 80% less time (sketched below).
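A minimal sketch of the fast-index-creation trick, shown here for the
`page' table (index names as in MediaWiki's maintenance/tables.sql; the
database and file names are examples): drop the secondary indexes, load
the data, then recreate the indexes in a single pass:

(shell)$ mysql enwiki -e "ALTER TABLE page DROP INDEX name_title, DROP INDEX page_random"
(shell)$ mysql enwiki < enwiki-page.sql
(shell)$ mysql enwiki -e "ALTER TABLE page ADD UNIQUE INDEX name_title (page_namespace, page_title), ADD INDEX page_random (page_random)"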
The disadvantage is that it faithfully copies the <title> field into
the SQL statement for INSERTing the `page_title' field. We now know
that this results in pages from the Template and other namespaces not
being found by MediaWiki, which then renders them as red-links.
3) First Normal Form
One issue in the back of my mind concerns the recent changes in the
XML schema. As of `export-0.6.xsd.gz' we note that ``Version 0.6 adds
a separate namespace tag''. To my mind, the presence of the <ns>
field should obviate the need to include a namespace word (e.g.
`Category:', `Template:', etc.) within the <title> field; for example,
<ns>10</ns> plus <title>Ndash</title> carries all the information of
the compound <title>Template:Ndash</title>.
The principle is known as first normal form (1NF), which basically
means that the contents of a field should be atomic rather than
compound.
4) Solution
Granted that the objective is to faithfully mirror the WMF database
tables, the issue before us is this: where along the tool chain
should the patch be made?
a) My instinct is to correct the issue upstream (in the XML dump
generation phase): copy the WMF `page_namespace' field to the <ns>
field, copy the WMF `page_title' field to the <title> field, and
thereby adhere to the principles of database normalization.
b) Second best would be to patch WP-MIRROR 0.7 to normalize the XML
dump prior to feeding it into `mwxml2sql'. This I have done.
c) Third best would be to patch `mwxml2sql'. This I also favor, but
would like some guidance from its author, Ariel Glenn, before I start
hacking.
d) A last resort would be to write an SQL query to clean up compound
`page_title' entries in the mirror's database (a sketch of such a
query follows below). But I really would rather not load unnormalized
data in the first place.
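For what it is worth, the clean-up query for option (d) would look
roughly like this, one namespace at a time (a sketch; namespace 10 is
Template, and `simplewiki' is just an example database name):

(shell)$ mysql simplewiki -e "UPDATE page SET page_title = SUBSTRING(page_title, LENGTH('Template:') + 1) WHERE page_namespace = 10 AND page_title LIKE 'Template:%'"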
Sincerely Yours,
Kent
On 2/22/14, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
> wp mirror, 22/02/2014 23:40:
>> Still, it would be nice if the dump files could be fixed.
>
> Fixed? <title> is the full page name as it's supposed to be. Either
> you're doing something wrong with the import, or the import
> script/special page has a bug (not uncommon, but needs a bug report with
> steps to reproduce). I see nothing to blame on the export side.
>
> Nemo
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
Dear Sir or Madam,
I am not sure to which person or list I should address this question.
0) Objective
I am in the process of building DEB packages for WP-MIRROR 0.7, the
latest development version of MediaWiki 1.23, and a set of MediaWiki
extensions.
The objective is this: a page rendered by a mirror should look the
same as that page rendered by the WMF site.
1) Problem
In the process of testing mirrors, I noticed that many templates were
not expanding, and instead being rendered as red-links.
2) Example
To illustrate, consider the Ndash template, which appears on many
pages such as <http://simple.wikipedia.org/wiki/August>. It appears
in the underlying database:
mysql> select page_id, page_title, rev_len, old_text
    -> from simplewiki.page, simplewiki.revision, simplewiki.text
    -> where page_id=rev_page and rev_text_id=old_id
    -> and page_title like 'Template:Ndash' limit 10\G
*************************** 1. row ***************************
page_id: 132985
page_title: Template:Ndash
rev_len: 65
old_text: –<noinclude>
[[Category:Formatting templates]]
</noinclude>
1 row in set (0.25 sec)
3) Special:ExpandTemplates
To test the above example ``Template:Ndash'', I use Special:ExpandTemplates.
3.1) Input text
Today is the {{CURRENTDAY}} day.</br>
This server is {{SERVER}}, script path {{SCRIPTPATH}}, current MW
version {{CURRENTVERSION}}.</br>
This site is {{SITENAME}}. Full page name is {{FULLPAGENAME}}.</br>
<table>
<tr><th>Template</th><th>Expanded</th><th>page_id</th><th>rev_len</th></tr>
<tr><td>Ndash</td><td>{{Ndash}}</td><td>{{PAGEID:
Ndash}}</td><td>{{PAGESIZE: Ndash}}</td></tr>
<tr><td>Template:Ndash</td><td>{{Template:Ndash}}</td>
<td>{{PAGEID: Template:Ndash}}</td><td>{{PAGESIZE:
Template:Ndash}}</td></tr>
<tr><td>Template:Template:Ndash</td><td>{{Template:Template:Ndash}}</td>
<td>{{PAGEID: Template:Template:Ndash}}</td><td>{{PAGESIZE:
Template:Template:Ndash}}</td></tr>
</table>
3.2) <http://simple.wikipedia.org/wiki/Special:ExpandTemplates> Preview
Here is the result from the WMF site:
Today is the 21 day.
This server is //simple.wikipedia.org, script path /w, current MW
version 1.23wmf14 (f8b9201).
This site is Wikipedia. Full page name is My template.
Template                  Expanded                  page_id  rev_len
Ndash                     –                         0        0
Template:Ndash            –                         132985   65
Template:Template:Ndash   Template:Template:Ndash   0        0
Both {{Ndash}} and {{Template:Ndash}} expand as expected.
3.3) <http://simple.wikipedia.site/wiki/Special:ExpandTemplates> Preview
Here is the result from the mirrored site:
Today is the 21 day.
This server is http://simple.wikipedia.site, script path /w, current
MW version 1.23alpha.
This site is simplewiki. Full page name is My template.
Template                  Expanded         page_id  rev_len
Ndash                     Template:Ndash   0        0
Template:Ndash            Template:Ndash   0        0
Template:Template:Ndash   –                132985   65
Only {{Template:Template:Ndash}} expands!
4) Question
Why do I need to prepend an extra ``Template:'' to make the templates
work on the mirror?
Better yet: could someone tell me where in the MediaWiki core I can
find the code that takes a template inclusion (e.g. {{Ndash}} or
{{Template:Ndash}}) and converts it into an SQL query that SELECTs the
template expansion from the underlying database?
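My guess - and this is an assumption on my part; the template handling
seems to live around Parser::braceSubstitution() in
includes/parser/Parser.php - is that for {{Ndash}} MediaWiki
effectively performs a lookup of the form:

(shell)$ mysql simplewiki -e "SELECT page_id, page_latest FROM page WHERE page_namespace = 10 AND page_title = 'Ndash'"

where namespace 10 is the Template namespace. I would like to confirm
where exactly that namespace/title split happens.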
Sincerely Yours,
Kent
Dear Ariel,
I have been reading your code for `mwxml2sql-0.0.2' with a view
towards updating it for mediawiki-1.23 LTS.
0) Support status
Currently, the version info for `mwxml2sql' states the following:
(shell)$ mwxml2sql --version
mwxml2sql 0.0.2
Supported input schema versions: 0.4 through 0.8.
Supported output MediaWiki versions: 1.5 through 1.21.
1) Current input schema version
Currently, your XML dump files have the following header:
(shell)$ head -n 1 zuwiki-20140121-pages-articles.xml
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/
http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8"
xml:lang="zu">
From this I gather that the XML schema is still 0.8, and that
`mwxml2sql' needs no update on that head.
2) Current output MediaWiki version
I reviewed the database schema for the `page', `revision', and `text' tables:
<https://www.mediawiki.org/wiki/Manual:Page_table>,
<https://www.mediawiki.org/wiki/Manual:Revision_table>, and
<https://www.mediawiki.org/wiki/Manual:Text_table>
It appears that the most recent changes to the schema for these three
tables occurred for mediawiki versions 1.21, 1.21, and 1.19,
respectively.
From this I gather that the database schema used for mediawiki 1.23
LTS is the same as that used for mediawiki 1.21; and therefore
`mwxml2sql' needs no update on that head.
3) Recommended updates
From a review of your code, I concluded that two minor changes would be useful.
3.1) mwxml2sql.c
The following three lines:
(shell)$ grep 21 mwxml2sql.c
fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.21.\n\n");
/* we know MW 1.5 through MW 1.21 even though there is no MW 1.21 yet */
if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 21) {
should read
fprintf(stderr,"Supported output MediaWiki versions: 1.5 through 1.23.\n\n");
/* we know MW 1.5 through MW 1.23 */
if (mwv->major != 1 || mwv->minor < 5 || mwv->minor > 23) {
3.2) mwxmlelts.c
The following line:
(shell)$ grep 21 mwxmlelts.c
<generator>MediaWiki 1.21wmf6</generator>
should read
<generator>MediaWiki 1.23wmf10</generator>
4) Request
Please let me know if you agree with the above assessment. If you do,
I would be happy to submit the changes to
<https://gerrit.wikimedia.org/> for review.
Sincerely Yours,
Kent
A recent update to the mediawiki multiversion scripts broke the abstract
dumps; a bug report and a fix have been submitted so I expect this to
get taken care of by Monday at the latest and hopefully over the
weekend. In the meantime no new jobs for small wikis will be produced;
I'll start those up again once the fix is in, as well as rerunning the
abstract dumps where they failed. Currently running jobs will run to
completion.
Ariel