Xmldatadumps-l June 2012

xmldatadumps-l@lists.wikimedia.org

12 participants
11 discussions

by Richard Jelinek

Hi, don't know if this issue came up already - in case it did and has been dismissed, I beg your pardon. In case it didn't... I hereby propose that pbzip2 (https://launchpad.net/pbzip2) be used to compress the xml dumps instead of bzip2. Why? Because its sibling (pbunzip2) has a bug bunzip2 hasn't. :-) Strange? Read on. A few hours ago, I filed a bug report for pbzip2 (see https://bugs.launchpad.net/pbzip2/+bug/922804) together with some test results obtained a few hours before that. The results indicate that bzip2 and pbzip2 are mutually compatible: each one can create archives that the other one can read. But when it comes to uncompressing, only pbzip2-compressed archives are good for pbunzip2. I propose compressing the archives with pbzip2 for the following reasons: 1) If your archiving machines are SMP systems, this could lead to better usage of system resources (i.e. faster compression). 2) Compression with pbzip2 is harmless for regular users of bunzip2, so everything should run for these people as usual. 3) pbzip2-compressed archives can be uncompressed with pbunzip2 with a speedup that scales nearly linearly with the number of CPUs in the host. So to sum up: It's a no-lose and two-win situation if you migrate to pbzip2. And that just because pbunzip2 is slightly buggy. Isn't that interesting? :-) cheers, -- Dipl.-Inf. Univ. Richard C. Jelinek PetaMem GmbH - www.petamem.com Geschäftsführer: Richard Jelinek Human Language Technology Experts Sitz der Gesellschaft: Fürth 69216618 Mind Units Registergericht: AG Fürth, HRB-9201
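The compatibility Richard describes can be illustrated with a small sketch. It assumes, as pbzip2 is generally documented to do, that a parallel compressor cuts the input into blocks and writes each block as an independent bzip2 stream, one after another. The sketch does not call pbzip2 itself; it uses Python's standard `bz2` module, and the function names `compress_in_blocks` and `split_streams` are made up for this illustration.

```python
import bz2


def compress_in_blocks(data: bytes, block_size: int) -> bytes:
    """Compress each block as its own bzip2 stream and concatenate the streams.

    This mimics (under the assumption stated above) the layout a parallel
    compressor produces: the blocks are independent, so they could be
    compressed on different CPUs.
    """
    streams = []
    for start in range(0, len(data), block_size):
        streams.append(bz2.compress(data[start:start + block_size]))
    return b"".join(streams)


def split_streams(archive: bytes) -> list[bytes]:
    """Decode a multi-stream archive one stream at a time.

    Each stream is decoded with a fresh decompressor; whatever follows the
    end of the stream is handed to the next one. Independent streams are
    what makes parallel decompression possible in principle.
    """
    parts = []
    remaining = archive
    while remaining:
        decompressor = bz2.BZ2Decompressor()
        parts.append(decompressor.decompress(remaining))
        remaining = decompressor.unused_data
    return parts


if __name__ == "__main__":
    payload = b"wikipedia dump " * 1000
    archive = compress_in_blocks(payload, 4096)
    # An ordinary multi-stream-aware reader still recovers the whole payload.
    print(bz2.decompress(archive) == payload)
    print(len(split_streams(archive)), "independent streams")
```

A file written by plain bzip2 is a single stream, so a reader that parallelizes per stream would have nothing to split, which is consistent with point 3 applying only to pbzip2-compressed archives.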

8 years, 3 months

Re: [Xmldatadumps-l] kernel update needed on the snapshot hosts

by Andreas Meier

Hello, at the moment there are only two jobs running for the bigger wikipedias. Best regards, Andreas

11 years, 10 months

wikipedia incubator dumps

by Ács Judit

Hi everyone, I'm working on a research project aimed at crawling web data in near digitally extinct languages and I would like to use the test wikipedias' articles as input. Unfortunately I haven't been able to find the dumps of these wikipedias. Do they exist somewhere, am I missing something? If not, are you planning to create a dump? I would like to avoid having to download all of them manually. Thanks in advance, Judit Acs Computer and Automation Research Institute of Hungarian Academy of Sciences

11 years, 10 months

kernel update needed on the snapshot hosts

by Ariel T. Glenn

I'll be letting processes complete on them one at a time without restarting new ones, so we can do the update and the reboot. Ariel

11 years, 10 months

media tarballs announcement

by Ariel T. Glenn

Folks who are interested in downloading tarballs of media for their particular project can now do so from: http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/ In this directory you will see two subdirectories, "fulls" and "incrs". The way this works is that once a month near the beginning of the month we will produce a series of tarballs for each project of media uploaded locally and media stored on commons. During the month, at least once but hopefully twice, we will produce tarballs containing the files uploaded locally or included from commons since the "full" tarball date. No tarballs are being produced for commons itself, given that it's 14T and there would be no separate locally uploaded/remotely uploaded lists. Instead please use rsync to get those files directly from: rsync://ftpmirror.your.org/wikimedia-images/wikipedia/commons/ Also, please bear in mind that for author and license information you should download the corresponding pages-meta-current.xml.bz2 file from http://dumps.wikimedia.org/ or your local mirror, and check the corresponding File: description pages for your project or commons. And now it's time as usual for the Big Fat Disclaimer: We're still running these manually, it's possible that we won't make the planned schedule for a given run or runs, network connectivity might be unstable, etc. etc. It's also possible we will restructure the fulls so that they take a lot less space and rely on three series of tarballs for some projects. Thanks once again to your.org for donating the time, space and bandwidth to make this possible. Ariel

11 years, 10 months

regarding latest wikipedia dump

by rajesh pandey

Hi, I want to download the recent changes made in the wikipedia dump. Can you please tell me the way so that I can download only the changes made. Thanks!!! Regards Rajesh Pandey

11 years, 10 months

anonymous user account logs (account created / account blocked)

by Gregor Martynus

Hi, for a dissertation study, I am trying to find a reliable data source from which I can extract user account events, specifically creation and blocking of user accounts, with usernames, the event name and timestamps. Is such data available? If yes, could anybody point me to where I can get it from? Thanks a lot -- Gregor

11 years, 10 months

External links statistics

by Lars Aronsson

On Sunday, I posted the following to the Analytics mailing list, but didn't see any response there, so I'm reposting here. At the Berlin hackathon, I improved the script I wrote in December for compiling statistics on external links. My goal is to learn how many links Wikipedia has to a particular website, and to monitor this over time. I figure this might be interesting for GLAM cooperations. This is found in the external links table, but since I want to filter out links from talk and project pages, I need to join it with the page table, where I can find the namespace. I've tried the join on the German Toolserver, and it works fine for the minor wikis, but it tends to time out (beyond 30 minutes) for the ten largest Wikipedias. This is not because I fail to use indexes, but because I want to run a substring operation on millions of rows. Even an optimized query takes some time. As a faster alternative, I have downloaded the database dumps, and processed them with regular expressions. Since the page ID is a small integer, counting from 1 up to a few millions, and all I want to know for each page ID is whether or not it belongs to a content namespace, I can do with a bit vector of a few hundred kilobytes. When this is loaded, and I read the dump of the external links table, I can see if the page ID is of interest, truncate the external link down to the domain name, and use a hash structure to count the number of links to each domain. It runs fast and has a small RAM footprint. In December 2011 I downloaded all the database dumps I could find, and uploaded the resulting statistics to the Internet Archive, see e.g. http://archive.org/details/Wikipedia_external_links_statistics_201101 One problem though is that I don't get links to Wikisource, Wikiquote this way, because they are not in the external links table. Instead they are interwiki links, found in the iwlinks table.
The improvement I made in Berlin is that I now also read the interwiki prefix table and the iwlinks table. It works fine. One issue here, is the definition of content namespaces. Back in December, I decided to count links found in namespaces 0 (main), 6 (File:), Portal, Author and Index. Since then, the concept of "content namespaces" has been introduced, as part of refining the way MediaWiki counts articles in some projects (Wiktionary, Wikisource), where the normal definition (all wiki pages in the main namespace that contain at least one link) doesn't make sense. When Wikisource, using the ProofreadPage extension, adds a lot of scanned books in the Page: namespace, this should count as content, despite these pages not being in the main namespace, and whether or not the pages contain any link (which they most often do not). One problem is that I can't see which namespaces are "content" namespaces in any of the database dumps. I can only see this from the API, http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespa… The API only provides the current value, which can change over time. I can't get the value that was in effect when the database dump was generated. Another problem is that I want to count links that I find in the File: (ns=6) and Portal: (mostly ns=100) namespaces, but these aren't marked as content namespaces by the API. Shouldn't they be? Is anybody else doing similar things? Do you have opinions on what should count as content? Should I submit my script (300 lines of Perl) somewhere? -- Lars Aronsson (lars(a)aronsson.se) Aronsson Datateknik - http://aronsson.se Project Runeberg - free Nordic literature - http://runeberg.org/
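The bit-vector counting approach described in the message above can be sketched as follows. This is not Lars's Perl script, only a minimal Python illustration of the same idea. It assumes the page and external-link rows have already been extracted from the dumps into `(page_id, namespace)` and `(page_id, url)` pairs; the parsing step, the function names, and the default namespace set `{0, 6}` are assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption for this sketch: main (0) and File: (6) count as content.
CONTENT_NAMESPACES = frozenset({0, 6})


def build_content_bitmap(pages, max_page_id, content_namespaces=CONTENT_NAMESPACES):
    """Return a bit vector with bit N set when page ID N is in a content namespace.

    `pages` is an iterable of (page_id, namespace) pairs. One bit per page ID
    keeps the structure to a few hundred kilobytes for a few million pages.
    """
    bits = bytearray(max_page_id // 8 + 1)
    for page_id, namespace in pages:
        if namespace in content_namespaces:
            bits[page_id >> 3] |= 1 << (page_id & 7)
    return bits


def is_content(bits, page_id):
    """Look up one page ID; IDs outside the vector are treated as non-content."""
    byte_index = page_id >> 3
    if page_id < 0 or byte_index >= len(bits):
        return False
    return bool(bits[byte_index] & (1 << (page_id & 7)))


def count_domains(links, bits):
    """Count links per host name, keeping only links from content pages.

    `links` is an iterable of (page_id, url) pairs.
    """
    counts = Counter()
    for page_id, url in links:
        if not is_content(bits, page_id):
            continue
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # skip URLs that cannot be parsed
        if host:
            counts[host] += 1
    return counts


if __name__ == "__main__":
    pages = [(1, 0), (2, 1), (3, 6)]
    links = [(1, "http://runeberg.org/a"), (2, "http://runeberg.org/b"),
             (3, "http://runeberg.org/c")]
    bitmap = build_content_bitmap(pages, max_page_id=3)
    print(count_domains(links, bitmap).most_common())
```

Passing the namespace set as a parameter leaves the open question from the message, which namespaces should count as content, up to the caller.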

11 years, 10 months

Problems with frwiki dumps

by Felipe Ortega

Hello. I'm finding strange issues when trying to decompress the 7z version of this dump for the French Wikipedia: http://dumps.wikimedia.org/frwiki/20120430/ At some point around 3M revisions the 7z process stalls. After a long time (a few hours) it resumes normal execution, but then stalls again around 55M revisions and never recovers. Maybe there are some issues with the frwiki dumps, since I can see that subsequent processes are experiencing failures (in May and June). I'm now checking with the previous dump (http://dumps.wikimedia.org/frwiki/20120404/). I'll let you know in case I find any more problems. Best, Felipe.

11 years, 10 months

Dump from 2010?

by Pablo Mendes

Hi all, I was just time travelling at http://dumps.wikimedia.org/enwiki/ and the oldest dump I could find was: 20110526. I am building an evaluation dataset that needs the state of Wikipedia in 2010. Is there a way I could get my hands on a 2010 version of pages-articles.xml.bz2 ? Thank you, Pablo PS: for anybody interested in navigating wikipedia's history nicely: https://bugzilla.wikimedia.org/show_bug.cgi?id=34778

11 years, 10 months
