Xmldatadumps-l April 2013

xmldatadumps-l@lists.wikimedia.org

12 participants
10 discussions

by Richard Jelinek

Hi,

I don't know whether this issue has come up already. If it has and was dismissed, I beg your pardon. If it hasn't, I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the XML dumps instead of bzip2.

Why? Because its sibling (pbunzip2) has a bug that bunzip2 doesn't. :-) Strange? Read on.

A few hours ago I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804), together with some test results obtained a few hours before that. The results indicate that bzip2 and pbzip2 are mutually compatible: each can read the archives the other creates. But when it comes to uncompressing, only pbzip2-compressed archives are good for pbunzip2.

I propose compressing the archives with pbzip2 for the following reasons:

1) If your archiving machines are SMP systems, this could lead to better use of system resources (i.e. faster compression).

2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run as usual for these people.

3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host.

So to sum up: it's a no-lose, two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-)

cheers,

--
Dipl.-Inf. Univ. Richard C. Jelinek

PetaMem GmbH - www.petamem.com
Human Language Technology Experts
69216618 Mind Units

Managing Director: Richard Jelinek
Registered office: Fürth
Register court: AG Fürth, HRB-9201
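The compatibility described in this message can be illustrated with a minimal sketch. It assumes, as pbzip2 is generally described, that the parallel compressor splits the input into chunks, compresses each chunk as an independent bzip2 stream, and concatenates the streams. This is not pbzip2's actual code; the chunk size, the sample data and the helper name `compress_in_chunks` are made up for illustration.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_in_chunks(data, chunk_size):
    """Compress each chunk as its own bzip2 stream (hypothetical helper)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # The bz2 module releases the GIL while compressing, so threads can overlap.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(bz2.compress, chunks))


# Made-up sample data standing in for a dump file.
data = b"<page>example</page>\n" * 50000
streams = compress_in_chunks(data, 256 * 1024)

# Concatenated streams form a valid multi-stream bzip2 file, which a
# sequential decompressor reads from start to end like any other archive.
archive = b"".join(streams)
sequential = bz2.decompress(archive)

# Because the stream boundaries are known here, each stream can also be
# decompressed independently and in parallel.
with ThreadPoolExecutor() as pool:
    parts = list(pool.map(bz2.decompress, streams))
parallel = b"".join(parts)
```

A file written by plain bzip2 is a single stream, so a decompressor that parallelises per stream has nothing to split. That is one plausible reading of why only pbzip2-made archives benefit from pbunzip2.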

8 years, 4 months

[Fwd: Re: possible gsoc idea, comments?]

by Ariel T. Glenn

Ok, my 'reply all' is failing me in this mail user agent. Anyways, third time's a charm...

11 years

Encoding with French dump

by Yannick Guigui

Hi everybody,

I'm French, so please excuse my English. Please correct me if this is not the right place for my request for help.

I have a problem with my French dump of Wikipedia using the XML dump: accented characters come out wrong. When I installed MediaWiki I chose InnoDB. This is my MySQL configuration:

    Enter password:
    Welcome to the MySQL monitor. Commands end with ; or \g.
    Your MySQL connection id is 179
    Server version: 5.5.8-log MySQL Community Server (GPL)

    mysql> status
    c:/wamp/bin/mysql/mysql5.5.8/bin/mysql.exe Ver 14.14 Distrib 5.5.8, for Win32 (x86)

    Connection id:          179
    Current database:
    Current user:           root@localhost
    SSL:                    Not in use
    Using delimiter:        ;
    Server version:         5.5.8-log MySQL Community Server (GPL)
    Protocol version:       10
    Connection:             localhost via TCP/IP
    Server characterset:    latin1
    Db characterset:        latin1
    Client characterset:    cp850
    Conn. characterset:     cp850
    TCP port:               3306
    Uptime:                 3 hours 47 min 6 sec

    Threads: 8  Questions: 35648  Slow queries: 3  Opens: 976  Flush tables: 1  Open tables: 50  Queries per second avg: 2.616

I'm using MWDumper. This is the script I run:

    set class=mwdumper.jar;driver_mysql.jar
    set data="frwikis_fr.xml"
    java -client -classpath %class% org.mediawiki.dumper.Dumper "--output=mysql://127.0.0.1/my_wiki?user=root&password=" "--format=sql:1.5" %data% --default-character-set=utf8
    pause

I don't know the Java language, but with this the transfer to the SQL database works. However, the accented characters are wrong when I try to retrieve articles. What can I do?

Thanks a lot

--
guigui777
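The status output in this message shows latin1 and cp850 character sets, while the XML dumps are UTF-8. A minimal sketch of what that kind of mismatch does to accented text, assuming the bytes are stored or read under the wrong character set somewhere along the way. This is an illustration of the symptom, not a diagnosis of this particular setup. The sample string and the helper names are made up.

```python
def misread_utf8(text, wrong_charset):
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_charset)


def repair(garbled, wrong_charset):
    """Undo misread_utf8, provided no bytes were lost in between."""
    return garbled.encode(wrong_charset).decode("utf-8")


original = "Événement à Orléans"  # made-up sample text

# The same UTF-8 bytes seen through the two charsets from the status output.
as_latin1 = misread_utf8(original, "latin-1")
as_cp850 = misread_utf8(original, "cp850")

print(as_latin1)
print(as_cp850)
print(repair(as_latin1, "latin-1"))
```

If the retrieved articles show two odd characters in place of each accented letter, a mismatch of this kind is a likely suspect, and aligning the connection and table character sets with UTF-8 is the usual direction to look in.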

11 years, 1 month

GSoC 2013 - Incremental XML Dumps

by Wyatt Winters

Hello everyone! My name is Wyatt, and I would like to present to you the first draft of my GSoC proposal, available here: https://www.mediawiki.org/wiki/User:Wywin and on the official Melange. On the Melange, should I clean out the Mediawiki syntax, and convert it to look nice in their formatting, or is leaving it wiki-fied ok? I am not particularly familiar with mailing lists and their specific etiquette, so please correct me if I do anything too outrageous. I look forward to your feedback, and hopefully working with you in the future, whether I am accepted for GSoC or not! Wyatt Winters

11 years, 1 month

wikidatawiki -- toooo many edits

by Ariel T. Glenn

Hello dumps users and developers,

You may have noticed that the wikidata pages-logging xml dump step has taken days for the last couple of runs. In fact for the most recent run, it did not complete properly, as the database handling the query was upgraded in the middle to MariaDB. So the short version is, if you are using that file, go get a new copy:

http://dumps.wikimedia.org/wikidatawiki/20130417/wikidatawiki-20130417-page…

If I don't have a patch in by next run, I have a workaround I will run by hand that takes 2 hours or less, as opposed to 4 days.

The long version is that the pages-logging file is already about half the size of en wp's table, and that the number of edits per minute is much larger, see: https://wikipulse.herokuapp.com/

There's a lot of deletion and a lot of churn too due to the dispatch mechanism. Also, they apparently have RCPatrol enabled and a pile of bots, which means that 99% of the log consists of entries like 'bot X editing Y marked it as autopatrolled'. These things in combination turn out to be the perfect storm for my simple select query, causing it to start at normal speed and then get ever slower. I suppose in another couple of months it would take so long to run that it would never finish...

Ariel

11 years, 1 month

Encoding problem in French dump

by Yannick Guigui

Hi,

Sorry for my English, I usually speak French. I have a problem with the encodings in the French dump. I use the MediaWiki API to retrieve the text of articles, but there are still sequences such as %C3%A8, \u00e8 and \ufffd in place of accented characters (é, è, ...). How can I resolve this problem? I hope I posted this in the right place.

Thanks

--
guigui777
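The three sequences quoted in this message are different things, and a short sketch may help tell them apart: %C3%A8 is the percent-encoded UTF-8 form of è as it appears in URLs, \u00e8 is the JSON escape for the same character, and \ufffd is the Unicode replacement character that a decoder emits when it meets bytes it cannot interpret. The sketch uses only the Python standard library and does not call the MediaWiki API; the variable names are made up.

```python
import json
from urllib.parse import unquote

# Percent-encoding: the two UTF-8 bytes of "è" written as %XX pairs.
from_url = unquote("%C3%A8")

# JSON escape: a JSON parser turns \u00e8 back into the character.
from_json = json.loads('"\\u00e8"')

# Replacement character: the Latin-1 byte for "è" is not valid UTF-8 on its
# own, so a lenient UTF-8 decoder substitutes U+FFFD. The original character
# cannot be recovered from U+FFFD afterwards.
lost = b"\xe8".decode("utf-8", errors="replace")

print(from_url, from_json, repr(lost))
```

The first two forms are reversible with a URL decoder or a JSON parser. A \ufffd usually means the text was already damaged by decoding with the wrong character set before it was retrieved.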

11 years, 1 month

Is Chinese Variants dump available

by Jiang BIAN

Hi,

Chinese Wikipedia supports a few variants (zh-cn, zh-tw, zh-hk), and the same wikitext is rendered differently under each of them, e.g. "software" in zh-cn [1] and "software" in zh-tw [2]. But it seems no HTML is included in the zhwiki dump file. Do you know where I can get the HTML version of articles on Chinese Wikipedia?

Thanks

[1] http://zh.wikipedia.org/zh-cn/%E8%BD%AF%E4%BB%B6
[2] http://zh.wikipedia.org/zh-tw/%E8%BD%AF%E4%BB%B6

--
Jiang BIAN

This email may be confidential or privileged. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it went to the wrong person. Thanks.

11 years, 1 month

Finding images within dumps

by Keith Schacht

Hi, I've downloaded the latest set of wikimedia dumps. I'm trying to understand where to find images within these dumps. I've studied the database schema and it seems to make sense, but then I take a single example such as: http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG And I grep the dumps 'image', 'imagelinks', and 'page' looking for 'Carrizo_2a.JPG' and it's not found. I tried this on both the SQL and XML dumps. Are these dumps not complete? Am I misunderstanding the structure? Thanks in advance, Keith
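One detail worth ruling out when grepping for a title: the SQL table dumps store titles with underscores (Carrizo_2a.JPG), while the XML dumps write titles with spaces (File:Carrizo 2a.JPG). A minimal sketch of a search that tries both spellings over a gzip-compressed dump follows. The helper names and the in-memory sample are made up, and the sketch does not by itself explain why a given file is absent from a dump.

```python
import gzip
import io


def title_variants(file_name):
    """Return the underscore and space spellings of a title."""
    return {file_name.replace(" ", "_"), file_name.replace("_", " ")}


def find_title(stream, file_name):
    """Return the 1-based numbers of lines mentioning either spelling."""
    needles = [v.encode("utf-8") for v in sorted(title_variants(file_name))]
    hits = []
    for line_number, line in enumerate(stream, start=1):
        if any(needle in line for needle in needles):
            hits.append(line_number)
    return hits


# Made-up stand-in for a compressed dump; a real run would pass a file path
# to gzip.open instead of an in-memory buffer.
sample = gzip.compress(
    b"<title>File:Carrizo 2a.JPG</title>\n<title>Other page</title>\n"
)
with gzip.open(io.BytesIO(sample), "rb") as dump:
    hits = find_title(dump, "Carrizo_2a.JPG")
print(hits)
```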

11 years, 1 month

I need Database tables Mapping to DB Dumps

by Imran Latif

Hi,

I'm doing a research project on Wikipedia, so I need the Wikipedia data. I decided to use the Wikipedia database dumps for this purpose, but there are too many files there and I don't know which file populates which table. Could you please provide some information that maps each dump file to the exact DB table? Your prompt response is much appreciated.

Regards,
Imran
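For the SQL dump files, the mapping asked about here can usually be read off the file name, which follows the pattern wiki-date-table.sql.gz, so the last component names the table the file populates. A minimal sketch, assuming that naming pattern. The regular expression, the helper name and the example dates are illustrative. The XML files do not map to a single table and are left out.

```python
import re

# Assumed naming pattern: <wiki>-<YYYYMMDD>-<table>.sql.gz
DUMP_NAME = re.compile(
    r"^(?P<wiki>[a-z0-9_]+)-(?P<date>\d{8})-(?P<table>[a-z0-9_]+)\.sql\.gz$"
)


def table_for_dump(filename):
    """Return the table name encoded in an SQL dump file name, or None."""
    match = DUMP_NAME.match(filename)
    return match.group("table") if match else None


# Example file names with made-up dates.
examples = [
    "enwiki-20130403-categorylinks.sql.gz",
    "enwiki-20130403-imagelinks.sql.gz",
    "enwiki-20130403-pages-articles.xml.bz2",
]
for name in examples:
    print(name, "->", table_for_dump(name))
```

Files that return None here, such as the pages-articles XML, carry page content that has to go through an import tool before it ends up in database tables.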

11 years, 1 month

making imports suck less, the sequel

by Ariel T. Glenn

Hello folks, it's time for more alpha code around making imports suck less. The point of these tools, which augment the last ones published, is to allow folks to generate sql from a subset of page content, using the sql table dumps we provide and a downloaded (by one of these scripts or some other means) XML file of page content for import. I wanted a way to take importDump.php out of the loop, if the user finds that the script is too slow, too picky, too whatever. So this is one of those ways. The idea here is to get people thinking about how we can make small (or large) chunks of content more available to people. These are really meant to be demos of an idea, with the hope that others (you!) will find better ways to implement it, or even better ideas. Even so, please play, test, report bugs, submit patches, write new tools, etc. See the code below: https://gerrit.wikimedia.org/r/#/c/58568/ Ariel

11 years, 1 month
