Hi,
don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that bzip2 and pbzip2 are mutually compatible for
compression: each one can create archives that the other can read. But
when it comes to decompressing, only pbzip2-compressed archives are
well suited for pbunzip2.
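To illustrate, the compatibility matrix described above can be
reproduced with something along these lines (file names are
placeholders and the thread count is chosen arbitrarily):
(shell) bzip2  -c pages.xml > pages-bzip2.xml.bz2       # single-stream archive, as the dumps are built today
(shell) pbzip2 -p4 -c pages.xml > pages-pbzip2.xml.bz2  # multi-stream archive, 4 compression threads
(shell) bunzip2  -c pages-pbzip2.xml.bz2 > /dev/null    # plain bunzip2 reads the pbzip2 archive fine
(shell) pbunzip2 -c pages-bzip2.xml.bz2  > /dev/null    # the bzip2-made archive: the case the bug report is about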
I propose compressing the archives with pbzip2 for the following
reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
host.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that only because pbunzip2 is slightly buggy. Isn't that
interesting? :-)
cheers,
--
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com    | Geschäftsführer: Richard Jelinek
Human Language Technology Experts | Sitz der Gesellschaft: Fürth
69216618 Mind Units               | Registergericht: AG Fürth, HRB-9201
Dear Ariel,
0) Context
I am working towards the next release for WP-MIRROR (0.8) and that entails
updating `mwxml2sql' from 1.24 to 1.26.
1) Database Schema for `page' table
There are three fields in the `page' table that require attention:
`page_counter', `page_no_title_convert', and `page_lang'.
These fields are present (YES) or not (NO) in three different places:
MediaWiki 1.26:
(shell) less maintenance/tables.sql
`page_counter' NO
`page_no_title_convert' NO
`page_lang' YES
XML Dump:
(shell) zless simplewiki-20150603-page.sql.gz
`page_counter' YES
`page_no_title_convert' YES
`page_lang' NO
mwxml2sql:
(shell) mwxml2sql -m 1.24 -s simplewiki-20150603-stub-articles.xml.gz ...
(shell) less simplewiki-20150603-page.sql-1.24.gz
`page_counter' YES
`page_no_title_convert' NO
`page_lang' NO
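To double-check the above without paging through the files by hand,
something like this should work (a rough sketch; it assumes the CREATE
TABLE statement sits near the top of the dump file, as it usually does):
(shell) grep -E 'page_(counter|no_title_convert|lang)' maintenance/tables.sql
(shell) zcat simplewiki-20150603-page.sql.gz | \
        sed -n '/CREATE TABLE/,/ENGINE/p' | \
        grep -E 'page_(counter|no_title_convert|lang)'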
2) XML Dumps
Are there plans to remove `page_counter' and `page_no_title_convert' from
the XML dumps?
Are there plans to add `page_lang' to the XML dumps?
3) mwxml2sql
Do you have any suggestions for how I should update `mwxml2sql' for
MediaWiki 1.26?
Sincerely Yours,
Kent
Dear Ariel,
An update for the `mwxml2sql' utility has been pushed to gerrit for your
review.
1) Features
Extend the maximum allowed MediaWiki version to 1.26.
Extend the maximum allowed XML dump schema to 0.10.
2) Database schemata
The `page' table has been updated:
o page_counter is removed;
o page_links_updated is repositioned; and
o page_lang is added.
The `revision' table has been updated:
o rev_comment is resized (was tinyblob, now is varbinary(767)).
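For reference, the schema delta boils down to roughly the following
(a sketch only: the database name `simplewiki' and the exact column
types and positions reflect my reading of MediaWiki 1.26's
maintenance/tables.sql, so please double-check against that file):
(shell) mysql --host=localhost --user=root --password simplewiki <<'SQL'
-- page table: drop the removed field, reposition page_links_updated, add page_lang
ALTER TABLE page DROP COLUMN page_counter;
ALTER TABLE page MODIFY page_links_updated varbinary(14) DEFAULT NULL AFTER page_touched;
ALTER TABLE page ADD COLUMN page_lang varbinary(35) DEFAULT NULL;
-- revision table: widen rev_comment from tinyblob to varbinary(767)
ALTER TABLE revision MODIFY rev_comment varbinary(767) NOT NULL DEFAULT '';
SQL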
3) Test
(shell) mwxml2sql -m 1.22,1.23,1.24,1.25,1.26 \
-s simplewiki-20150603-stub-articles.xml.gz \
-t simplewiki-20150603-pages-articles.xml.bz2 \
-f simplewiki-20150603.gz
(shell) zcat simplewiki-20150603-createtables.sql-1.26.gz | \
mysql --host=localhost --user=root --password
(shell) zcat simplewiki-20150603-page.sql-1.26.gz | \
mysql --host=localhost --user=root --password
(shell) zcat simplewiki-20150603-revision.sql-1.26.gz | \
mysql --host=localhost --user=root --password
(shell) zcat simplewiki-20150603-text.sql-1.26.gz | \
mysql --host=localhost --user=root --password
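As a quick sanity check after loading (assuming the create-tables
script put everything into a database called `simplewiki'; substitute
whatever name it actually creates):
(shell) mysql --host=localhost --user=root --password simplewiki \
        -e 'SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision; SELECT COUNT(*) FROM text;'
The row counts should roughly match the page and revision counts
reported for the simplewiki dump run.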
Sincerely Yours,
Kent
Hi!
I have been trying to get a list of all .svg files on Wikimedia Commons.
Of course I could just crawl through Commons, but:
* if there were any way to provide a dump with the names of the
actually existing .svg files, that would be a tremendous help for me;
* and in my estimation it would reduce the download size, the CPU
burden, and most importantly the HD burden by about 70% to 80%
compared to browsing and parsing through Commons. (Whereas the CPU
usage and heat problems caused by the HD burden on my notebook would
be much more adverse than the burden on Wikipedia's servers, I assume.
I have already lost two HDs over the years while downloading larger
amounts of files in one go.)
Though I have asked in various places, so far I haven't found a good
solution.
One suggestion was to download commonswiki-20150417-all-titles, which I did.
But this file contains deleted names and renamed names, and some names
have a "File:" prefix or a similar indicator at the start while others
don't.
Checking just a small sample resulted in 5 correct names and around 7
deleted and 7 renamed ones. I have asked in various places, and one
person in particular tried to help me, but even he couldn't solve this.
Apart from that I didn't get much feedback.
Is it possible to get such a dump? Or to get another dump that I could
use to update and crosscheck the all-titles file?
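If there is no ready-made list, would filtering the image table dump be
an acceptable substitute? Something like the following untested sketch
(the filename is guessed from the 20150417 run, and I assume the first
field of each row in that dump is the file name, stored with
underscores and without the "File:" prefix):
(shell) wget https://dumps.wikimedia.org/commonswiki/20150417/commonswiki-20150417-image.sql.gz
(shell) zcat commonswiki-20150417-image.sql.gz | \
        grep -oiP "\('\K[^']*\.svg(?=')" > commons-svg-names.txt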
Greetings
D. Hansen
As people know, we're running all stubs dumps across all wikis first
and then doubling back to run the other dump steps. As we try to hash
out the details so that the order and contents of the dump stages work
out best for folks, please have a look at the discussion on the task
below and weigh in there. Thanks!
https://phabricator.wikimedia.org/T89273
Hi,
I am a PhD student in the Computer Science Department, University of
California, Santa Barbara. I would like to ask if anyone knows whether
Wikipedia still keeps the old English Wikipedia text snapshots.
From the website http://dumps.wikimedia.org/enwiki/, it seems that the
English snapshots are kept only for roughly the last 10 months. Does
anyone know if the snapshots from 201304 to 201404 are still available?
I downloaded these datasets previously to write a research paper, but
because of a disk failure I lost them... (such bad luck!). So I am
writing to ask whether I can still download the old versions. If so,
could anyone please share the links to those old datasets? It would be
a great help!
Thanks a lot!
Best regards,
Xin
--
Xin Jin,
PhD Candidate,
Computer Science Department,
University of California, Santa Barbara
To catch everyone who would stop reading right after the updates, let me
put the question first.
Who uses the abstract dumps? Anyone here? Anyone you know? Please
forward this to other lists where there might be users of these dumps.
We're trying to figure out whether we need to keep generating them or not.
Now the updates.
We got more space for the dumps server, which means we don't need to
reduce the number of dumps kept for some time. You'll also see other
items showing up there soon-ish, not part of the xml dumps.
We've long had a request to run stubs early on in the dumps process so
that stats can be produced right away, and we finally have that going.
As of this month all dump runs will be done in stages, stubs first, then
tables, then page logs, and then the rest. I'm open to negotiation
about the order of jobs after the stubs, if folks have other
preferences.
We've worked around the eternal php memory leak(s), which lets us now
run 7 workers for small wikis at once. This means we'll get through
those dumps quicker.
Nemo_bis did some testing with an option to 7zip that gives much faster
compression with a relatively small increase in size. I've adopted that
everywhere, and we should see the difference, primarily on the big
wikis, from this month on.
New code brings new bugs. This month's stub and page log runs for
smaller wikis may have a duplicate entry at the end, the last item
appearing twice. This has been fixed for all future runs. It shouldn't
have a real impact on stats but folks importing from these dumps should
be aware.
Happy June,
Ariel