Xmldatadumps-l January 2013

xmldatadumps-l@lists.wikimedia.org

11 participants
11 discussions

by Richard Jelinek

Hi, don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't... I hereby propose, that pbzip2 (https://launchpad.net/pbzip2) is used to compress the xml dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on. A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results done even some few hours before that. The results indicate that: bzip2 and pbzip2 are vice-versa compatible each one can create archives, the other one can read. But if it is for uncomressing, only pbzip2 compressed archives are good for pbunzip2. I propose compressing the archives with pbzip2 for the following reasons: 1) If your archiving machines are SMP systems this could lead to a better usage of system ressources (i.e. faster compression). 2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual. 3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host. So to sum up: It's a no loose and two win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-) cheers, -- Dipl.-Inf. Univ. Richard C. Jelinek PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek Human Language Technology Experts Sitz der Gesellschaft: Fürth 69216618 Mind Units Registergericht: AG Fürth, HRB-9201

8 years, 3 months

Processing french dump

by Benoit Lelong

Hi all, I am currently planning to process the last french dump. I would like to ask if somebody has already found or used a good OpenNLP french sentence detection model. If yes please let me know where to find one. Thanks in advance, Best regards, Benoit.

11 years, 1 month

probable outage warning (move to new datacenter)

by Ariel T. Glenn

As you may have heard, we're going to be switching to the Ashburn datacenter as the primary data center this week. The first window is due to start in a few hours. Databases may be read only, things may die. I am not going to do anything to the dumps during the transition except shoot them if needed. Once things have settled down, by the end of the week, if we have had issues I'll be able to address them in an orderly fashion, but not while craziness is going on. Hold on tight, the ride starts soon :-) Ariel

11 years, 2 months

Question about the meaning of " ''' " mark in dumps

by Chong Wang

Hi, everyone. I am doing a research project about wikipedia searching. I downloaded wiki dumps from this page: http://dumps.wikimedia.org/enwiki/20121001/ But I have a quick question about a symbol in the dumps. I am wondering the meaning of " ''' " in all wiki pages. For example, '''Port St. Lucie''' is a city in St. Lucie county, Florida. I thought the phrase between ''' is the title of the wiki page. But I saw, in the same page, several other phrases, like "city council" and "city Manager" are also quoted by '''. So could you help on this? Thanks in advance. Chong Wang

11 years, 2 months

Which files do I need?

by Robert Crowe

I'd like to mirror just the category structure of the English Wikipedia, and I'm wondering which of the dump files I need to start with. I don't need the page content, just the page names, and only for the most current revision. I need the categories and category members, and I'd like to exclude hidden categories. I also need to distinguish redirects, because I don't want to treat them as separate pages. As much as possible I'd like to work with SQL files, but I can crunch through XML if necessary. So which files do I need to download? I may also need some help in understanding the schemas. Thanks, Robert

11 years, 3 months

Mirrors without 'latest' directory

by wp mirror

Dear Ariel, 0) WP-MIRROR WP-MIRROR 0.6 now works with dumps from your.org. I am turning my attention to the other mirror sites. 1) LATEST I read with interest the thread about `latest' directories that began with <http://lists.wikimedia.org/pipermail/xmldatadumps-l/2012-October/000610.html>. I have some additional questions. The mirror sites at C3SL and Masaryk Univ. do not have a `latest' directory in the project directories that I looked at. Compare for example: (shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/ | tail -n 2 drwxr-xr-x 242 2013/01/04 07:52:13 20130102 drwxr-xr-x 1101 2013/01/03 18:48:34 latest (shell)$ rsync wikipedia.c3sl.ufpr.br::wikipedia/enwiki/ | tail -n 2 drwxr-xr-x 61440 2012/11/10 10:47:05 20121101 drwxr-xr-x 61440 2012/12/10 09:21:34 20121201 WP-MIRROR looks for the `latest' directory on the assumption that any links found there point to complete files (i.e. no partials). Whereas files found in dated directories may be partials. For example, the most recent `imagelinks': This file is complete: (shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20121201/ | grep imagelinks -rw-r--r-- 356437362 2012/12/01 07:08:54 enwiki-20121201-imagelinks.sql.gz This file is a partial: (shell)$ rsync dumps.wikimedia.your.org::wikimedia-dumps/enwiki/20130102/ | grep imagelinks -rw-r--r-- 20 2013/01/02 07:47:35 enwiki-20130102-imagelinks.sql.gz The `latest' link points to the complete file: (shell)$ rsync -a dumps.wikimedia.your.org::wikimedia-dumps/enwiki/latest/ | grep image lrwxrwxrwx 40 2013/01/02 03:52:49 enwiki-latest-image.sql.gz -> ../20130102/enwiki-20130102-image.sql.gz So I am wondering what algorythm I should use if I want WP-MIRROR to pull dump files from C3SL and Masaryk U. Can you help with the following questions? 2) C3SL In the absence of a `latest' directory, can I be sure that all the files found there are complete files (i.e. not partials)? Is the mirroring process atomic? 3) Masaryk Univ. Several issues: a) No `latest' directories; b) no `enwiki'; and c) most recent dumps date from November: (shell)$ rsync ftp.fi.muni.cz::pub/wikimedia/zuwiki/ | tail -n 2 drwxr-xr-x 4096 2012/10/23 14:04:02 20121023 drwxr-xr-x 4096 2012/11/05 15:02:33 20121105 Will they be catching up? Sincerely Yours, Kent

11 years, 3 months

Encoding issue in the last ZH dump

by Mathieu Poumeyrol

All, I've been struggling to track this for a few hours. This file is a SQL dump, the headers says itf UTF-8. http://dumps.wikimedia.org/zhwiki/20130102/zhwiki-20130102-langlinks.sql.gz but: $ isutf8 zh-langlinks.sql zh-langlinks.sql: line 204, char 2361, byte offset 520707: invalid UTF-8 code $ head -204 zh-langlinks.sql | tail -1 | head -c 520750 | tail -c 50 | hexdump -C 00000000 64 69 61 3a 43 6f f6 72 64 69 6e 61 74 69 65 20 |dia:Co.rdinatie | 00000010 65 78 74 65 72 6e 65 20 70 75 62 6c 69 63 69 74 |externe publicit| 00000020 65 69 74 2f 69 6e 74 65 72 6e 61 74 69 6f 6e 61 |eit/internationa| 00000030 61 6c |al| 00000032 There might be other occurencies, but one is enough to make my import scripts crash, so... you guys are warned. -- K.

11 years, 3 months

interwiki cdb file tool

by Ariel T. Glenn

Well it's time for more alpha code, and I'm a bit behind on my mail from the weekend, so if there is stuff I should be replying to, that will happen tomorrow. As of MW 1.19 we ise interwiki.cdb on the projects, instead of the sql table. This makes life harder for folks setting up their own copies. So here's some docs and a tool, not vetted by anyone at all yet: http://www.mediawiki.org/wiki/Interwiki_cache#Setting_this_up_on_your_own_w… Please follow the link to the 'cheap(er) way' if you are willing to be a guineau pig. Also if you see errors or know something that was left out, feel free to edit. Hey, it's a wiki! Thanks, Ariel

11 years, 3 months

media tarballs for the month *not* ready

by Ariel T. Glenn

Don't start downloading them yet even if you see them showing up; some of them may be corrupt. I'll send an update when they are ready to go, there was a hardware issue at our hoster's end. Thanks. Ariel

11 years, 3 months

[WP-MIRROR] Questions regarding Metalink and SPDY

by wp mirror

Dear Jeremy, Happy New Year, and thanks for your e-mail of 2012-12-31. 0) ADMINISTRATIVE I fixed the capitalization of "Wikimedia" in both documentation and home page. I am now subscribed to this list and have read the last two years of postings. 1) SPDY WP-MIRROR 0.5 and prior versions, obtains image files from <http://upload.wikimedia.org/>. SPDY would reduce latency. WP-MIRROR 0.6 (not yet released) uses HTTP/1.1 persistent connections. WP-MIRROR 0.6 has built-in profiling, and the image downloading process now uses 64% less (wall clock) time. Therefore SPDY may not provide much advantage. Thanks also for informing me of the image tarballs. Conclusion: I will not pursue SPDY, for lack of a requirement. Action Item: WP-MIRROR 0.6 will make use of image tarballs. 2) METALINK WP-MIRROR 0.5 and prior version, had to deal with thousands of corrupt image files. Most of these were partial downloads. cURL would time-out and leave corrupt files. I currently deal with that by validating the images. Validation, however, consumes a lot of time. So I am looking for ways to improve the reliability of downloading. Metalink was brought to my attention by Jason Skomorowski. Relevant documents are RFC 5854, RFC 6249. From the later we have: "This document describes a mechanism by which the benefit of mirrors can be automatically and more effectively realized. All the information about a download, including mirrors, cryptographic hashes, digital signatures, and more can be transferred in coordinated HTTP header fields, hereafter referred to as a "Metalink". This Metalink transfers the knowledge of the download server (and mirror database) to the client. Clients can fall back to other mirrors if the current one has an issue. With this knowledge, the client is enabled to work its way to a successful download even under adverse circumstances. All this can be done without complicated user interaction, and the download can be much more reliable and efficient. In contrast, a traditional HTTP redirect to a mirror conveys only minimal information -- one link to one server -- and there is no provision in the HTTP protocol to handle failures. Furthermore, in order to provide better load distribution across servers and potentially faster downloads to users, Metalink/HTTP facilitates multi-source downloads, where portions of a file are downloaded from multiple mirrors (and, optionally, Peer-to-Peer) simultaneously. Upon connection to a Metalink/HTTP server, a client will receive information about other sources of the same resource and a cryptographic hash of the whole resource. The client will then be able to request chunks of the file from the various sources, scheduling appropriately in order to maximize the download rate." The benefit to WP-MIRROR would be much more reliable downloads, that would obviate the file validation process. The benefit to folks on this e-main list are: a) Your mirror sites would get more traffic (Ariel mentioned that they are getting very little); b) the download process (for metalink capable clients) would be robust against the outage of any one mirror; and c) metalink capable clients are now common (cURL, kget, ...). I understand that the idea for metalink originated in those who posted GNU/Linux distributions in .iso format. With each new .iso release, there would be a surge of downloading, causing many partial downloads (i.e. much wasted bandwidth). Metalink helped spread the load; and, by transporting hashes, improved download integrity. Conclusion: I will table the issue of metalink, for lack of an immediate requirement. Action Item: WP-MIRROR 0.6 will incorporate your list of dump/tarball mirror sites as a configurable parameter. 3) RSYNC Thanks for letting me know that dumps and tarballs are available using rsync. I much prefer rsync over http and ftp. I mirror the Debian archive, and recently switched from apt-mirror which uses wget, to ftpsync which uses rsync; and am very happy with the results. Action Item: WP-MIRROR 0.6 will make use of rsync. Ariel raised some other points with I shall address in a separate email. Sincerely Yours, Kent

11 years, 3 months

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Xmldatadumps-l January 2013