I don't know if this issue came up already - in case it did and has been
dismissed, I beg your pardon. In case it didn't...
I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used
to compress the XML dumps instead of bzip2. Why? Because its sibling
(pbunzip2) has a bug that bunzip2 doesn't have. :-)
Strange? Read on.
A few hours ago, I filed a bug report for pbzip2 (see
https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test
results obtained a few hours before that.
The results indicate that:
bzip2 and pbzip2 are mutually compatible: each one can create
archives the other one can read. But when it comes to decompressing,
only pbzip2-compressed archives are good for pbunzip2.
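For anyone who wants to verify these compatibility claims locally, here
is a minimal round-trip check sketched in Python. Purely illustrative:
it assumes bzip2, pbzip2, bunzip2 and pbunzip2 are all on the PATH, and
'sample.xml' is a hypothetical test file, not part of the dumps.

    import shutil
    import subprocess

    SAMPLE = "sample.xml"  # hypothetical test file; any largish file will do

    # Compress a copy of the sample with each tool, then try to read the
    # result back with both the single-threaded and the parallel decompressor.
    for compressor in ("bzip2", "pbzip2"):
        name = compressor + "-test.xml"
        shutil.copy(SAMPLE, name)
        subprocess.run([compressor, "-f", name], check=True)  # leaves name + ".bz2"
        for decompressor in ("bunzip2", "pbunzip2"):
            # Decompress to stdout and throw the output away; the exit code
            # tells us whether the archive was readable.
            result = subprocess.run([decompressor, "-c", name + ".bz2"],
                                    stdout=subprocess.DEVNULL)
            print("%s -> %s: %s" % (compressor, decompressor,
                                    "ok" if result.returncode == 0 else "FAILED"))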
I propose compressing the archives with pbzip2 for the following reasons:
1) If your archiving machines are SMP systems, this could lead to
better usage of system resources (i.e. faster compression).
2) Compression with pbzip2 is harmless for regular users of bunzip2,
so everything should run for these people as usual.
3) pbzip2-compressed archives can be decompressed with pbunzip2 with a
speedup that scales nearly linearly with the number of CPUs in the
machine.
So to sum up: it's a no-lose, two-win situation if you migrate to
pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that
ironic?
Dipl.-Inf. Univ. Richard C. Jelinek
PetaMem GmbH - www.petamem.com            Managing director: Richard Jelinek
Human Language Technology Experts         Registered office: Fürth
69216618 Mind Units                       Register court: AG Fürth, HRB-9201
Is there any way to distinguish between categories like History or
Literature, for example, and what I would think of as categories that are
used for internal housekeeping, like "Unprintworthy_redirects" or
"Nonindexed_pages"? They're not hidden categories, but conceptually there
is a clear difference between housekeeping categories and categories that
define fields of knowledge. But is there anything in the tables that
would let me tell them apart?
I am currently planning to process the latest French dump. I would like
to ask whether somebody has already found or used a good OpenNLP French
sentence detection model. If yes, please let me know where to find one.
Thanks in advance,
I have the files from the February run for en wikipedia converted here:
In the sqlfiles directory are the page, revision and text tables in SQL
format for MediaWiki 1.20, and in the tabfiles directory are all the
tables needed for a mirror (I omitted images, oldimages and a couple
others) converted to tab delimited format for use with LOAD DATA INFILE.
The contents may be garbage etc. etc. so be forewarned. Please check
them out and let me know how they are.
While I'll leave the files there for a while, they won't be there
forever, so don't be surprised if they disappear in the future.
Missing is a script to write the tab delimited files to a fifo in
reasonably sized chunks. I am told that Percona has something like this,
if it turns out to make a difference in import speed/memory.
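In the meantime, here is a rough sketch of what such a script might look
like in Python. This is not the Percona tool, just an illustration: the
chunk size and the file/fifo paths are made up, and the fifo is assumed
to already exist (e.g. created with mkfifo).

    import sys

    CHUNK_LINES = 100000  # arbitrary; tune for your memory/latency trade-off

    def feed_fifo(tabfile_path, fifo_path):
        # Copy a tab delimited dump file into a named pipe in bounded chunks,
        # so the MySQL side (LOAD DATA INFILE reading from the fifo) never
        # needs the whole table buffered at once.
        with open(tabfile_path, "rb") as src, open(fifo_path, "wb") as fifo:
            chunk = []
            for line in src:
                chunk.append(line)
                if len(chunk) >= CHUNK_LINES:
                    fifo.writelines(chunk)
                    fifo.flush()
                    chunk = []
            if chunk:
                fifo.writelines(chunk)

    if __name__ == "__main__":
        # usage: python feed_fifo.py page.tab /tmp/page.fifo
        feed_fifo(sys.argv[1], sys.argv[2])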
Don't forget to make sure your client and server character sets are set
up correctly, and that you've disabled foreign key checks etc. etc. before
attempting to shovel the data in.
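To make that setup concrete, a sketch of the session settings using
PyMySQL. The host, credentials, database, table and file names here are
placeholders, and whether you want utf8mb4 or binary as the connection
character set depends on how your tables were created.

    import pymysql

    # The server must also allow LOAD DATA LOCAL INFILE (local_infile=1).
    conn = pymysql.connect(host="localhost", user="wikiadmin", password="secret",
                           database="wikimirror", local_infile=True)
    with conn.cursor() as cur:
        # Match client and server character sets, and switch off the checks
        # that slow a bulk import to a crawl.
        cur.execute("SET NAMES utf8mb4")   # or 'binary', depending on your schema
        cur.execute("SET foreign_key_checks = 0")
        cur.execute("SET unique_checks = 0")
        cur.execute("SET autocommit = 0")
        cur.execute("LOAD DATA LOCAL INFILE 'page.tab' INTO TABLE page")
        conn.commit()
    conn.close()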
Forwarding in case there are folks on this list interested. What I
really want is something sourceforge-like: it knows which mirror has a
copy of the file and has the most bandwidth available for the user.
So partly due to recent work by folks like Kent on creating local WP
mirrors using the import process, and partly from helping walk someone
through the process for the zillionth time, I have come to the
realization that This Process Sucks (TM). I am not taking on the whole
stack, but I am trying to make a dent in at least part of it. To that
end:
1) mwdumper available from download.wikimedia.org is now the current
version and should run without a bunch of fancy tricks. Thanks to Chad
for fixing up the Jenkins build. I tried it on a recent en wikipedia
current pages dump and it seemed to work, though I did not test
importing the output.
2) I have a couple of tools for *nix users importing into a MySQL
database:
* Somewhat equivalent to mwdumper is 'mwxml2sql', a name chosen before I
saw that there was a long abandoned xml2sql tool available in the wild.
Input: stubs and page content XML files; output: SQL files for each of
the page, revision and text tables, reading 0.4 xsd through 0.7 and
writing MW 1.5 through 1.20 output, as specified by the user. Many specific
combinations are untested (e.g. I spent most work on 0.7 xsd to MW
1.20).
* Converting an SQL dump file to a tab delimited format suitable for
'LOAD DATA INFILE' is now possible via 'sql2txt' (also *nix platforms).
I tested these on a smallish non-Latin-character-set wiki dump; a test
on en wikipedia is in the works, but loading all those other tables, even
via LOAD DATA INFILE, takes some time.
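Not one of the tools above, but a sanity check I find handy: count pages
and revisions straight from a stubs file and compare the numbers with
SELECT COUNT(*) on the imported page and revision tables. A rough sketch
in Python (the dump file name is just an example):

    import bz2
    from xml.etree.ElementTree import iterparse

    DUMP = "enwiki-latest-stub-meta-history.xml.bz2"  # example file name

    pages = revisions = 0
    with bz2.open(DUMP, "rb") as stream:
        for event, elem in iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the export-format namespace
            if tag == "revision":
                revisions += 1
            elif tag == "page":
                pages += 1
                elem.clear()  # keep memory bounded on big dumps

    print("%d pages, %d revisions" % (pages, revisions))
    # compare with: SELECT COUNT(*) FROM page; SELECT COUNT(*) FROM revision;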
Link to source:
So what I would love from folks is:
Test, find bugs, ask for features, tell me where other pain points are
in the import process. If you find bugs/want features and write a
patch, and you have a Gerrit account, feel free to submit a changeset
right there and add me as a reviewer. If you have a patch and don't have
an account, get one :-)
Once I know these are actually useful, I will try to make a dent in the
pages on Meta and elsewhere that describe, sometimes referring to
information several years old, how to import the dumps. Ah yeah and I'll
put up static binaries for linux/freebsd then too.
Good morning folks,
People monitoring the dumps progress page will have noticed that dumps
from last night and this morning are broken due to a fatal error from
within MediaWiki. Dumps will be off-line until this is resolved,
hopefully late today.
Maybe you should use regular expressions that detect a long series of
numbers without spaces between them.
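Something along those lines in Python. Note that these garbled titles
are hex strings, so the pattern allows a-f as well as digits; the
minimum length of 20 characters is an arbitrary threshold.

    import re

    # An even-length run of nothing but lowercase hex digits, at least 10 bytes long.
    HEXISH = re.compile(r"^(?:[0-9a-f]{2}){10,}$")

    def looks_like_hex_title(title):
        return bool(HEXISH.match(title))

    print(looks_like_hex_title("4567797074e280934d6f726f63636f5f72656c6174696f6e73"))  # True
    print(looks_like_hex_title("Egypt–Morocco_relations"))                              # False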
On Monday, February 11, 2013, wrote:
> Today's Topics:
> 1. Weird page titles in page table (Robert Crowe)
> 2. Re: Weird page titles in page table (Ariel T. Glenn)
> Message: 1
> Date: Sun, 10 Feb 2013 14:08:56 -0800
> Subject: [Xmldatadumps-l] Weird page titles in page table
> Message-ID: <010301ce07db$39477660$abd66320$@com>
> Content-Type: text/plain; charset="us-ascii"
> I'm seeing rows in the page table that have weird titles, and I'd like to be
> able to identify and filter them out, but I don't see properties that seem
> to identify them. For example:
> page.page_id = 21441554
> page.page_title = 4567797074e280934d6f726f63636f5f72656c6174696f6e73
> What should I look for to identify pages like that?
> Message: 2
> Date: Mon, 11 Feb 2013 07:51:03 +0200
> Subject: Re: [Xmldatadumps-l] Weird page titles in page table
> Message-ID: <1360561863.18140.5.camel(a)trouble.localdomain>
> Content-Type: text/plain; charset="UTF-8"
> On Sun, 10-02-2013 at 14:08 -0800, Robert Crowe wrote:
> > I'm seeing rows in the page table that have weird titles, and I'd like to be
> > able to identify and filter them out, but I don't see properties that seem
> > to identify them. For example:
> > page.page_id = 21441554
> > page.page_title = 4567797074e280934d6f726f63636f5f72656c6174696f6e73
> > What should I look for to identify pages like that?
> Which dump is this from?
On 11/02/13 00:58, Robert Crowe wrote:
> Weird. Why is it that only some of the titles display as hex? I'm using phpMyAdmin, and the column is varbinary(255).
Maybe it's only doing so when it contains non-ASCII chars?
(in the case of “Egypt–Morocco_relations”, that would be the en-dash)
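And for what it's worth, that hex string from the earlier message
decodes straight back to the title; a one-liner in Python:

    # The page_title bytes rendered as hex by phpMyAdmin, decoded back to UTF-8.
    raw = "4567797074e280934d6f726f63636f5f72656c6174696f6e73"
    print(bytes.fromhex(raw).decode("utf-8"))  # Egypt–Morocco_relations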