Xmldatadumps-l

xmldatadumps-l@lists.wikimedia.org

720 discussions

Re: [Xmldatadumps-l] [Xmldatadumps-admin-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
by Tomasz Finc 16 Mar '10

16 Mar '10

Tomasz Finc wrote: > New full history en wiki snapshot is hot off the presses! > > It's currently being checksummed which will take a while for 280GB+ of > compressed data but for those brave souls willing to test please grab it > from > > http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-hi… > > and give us feedback about its quality. This run took just over a month > and gained a huge speed up after Tims work on re-compressing ES. If we > see no hiccups with this data snapshot, I'll start mirroring it to other > locations (internet archive, amazon public data sets, etc). > > For those not familiar, the last successful run that we've seen of this > data goes all the way back to 2008-10-03. That's over 1.5 years of > people waiting to get access to these data bits. > > I'm excited to say that we seem to have it :) So now that we've had it for a couple of days .. can I get a status report from someone about its quality? Even if you had no issues please let us know so that we start mirroring. --tomasz

1 0

Re: [Xmldatadumps-l] [Xmldatadumps-admin-l] [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
by Tomasz Finc 11 Mar '10

11 Mar '10

Thankfully due to an awesome volunteer we'll be able to get that 2008 snapshot in our archive. I'll mail out when it shows up in our snail mail. --tomasz Erik Zachte wrote: > I'm thrilled. Big thanks to Tim and Tomasz for pulling this off. > For the record the 2008-10-03 dump existed for a short while only. > It evaporated before wikistats and many others could parse it, > so now we can finally catch up from 3.5 (!) years backlog. > > Erik Zachte > >> -----Original Message----- >> From: wikitech-l-bounces(a)lists.wikimedia.org [mailto:wikitech-l- >> bounces(a)lists.wikimedia.org] On Behalf Of Tomasz Finc >> Sent: Thursday, March 11, 2010 4:11 >> To: Wikimedia developers; xmldatadumps-admin-l(a)lists.wikimedia.org; >> xmldatadumps(a)lists.wikimedia.org >> Subject: [Wikitech-l] 2010-03-11 01:10:08: enwiki Checksumming pages- >> meta-history.xml.bz2 :D >> >> New full history en wiki snapshot is hot off the presses! >> >> It's currently being checksummed which will take a while for 280GB+ of >> compressed data but for those brave souls willing to test please grab >> it >> from >> >> http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages- >> meta-history.xml.bz2 >> >> and give us feedback about its quality. This run took just over a month >> and gained a huge speed up after Tims work on re-compressing ES. If we >> see no hiccups with this data snapshot, I'll start mirroring it to >> other >> locations (internet archive, amazon public data sets, etc). >> >> For those not familiar, the last successful run that we've seen of this >> data goes all the way back to 2008-10-03. That's over 1.5 years of >> people waiting to get access to these data bits. >> >> I'm excited to say that we seem to have it :) >> >> --tomasz >> >> _______________________________________________ >> Wikitech-l mailing list >> Wikitech-l(a)lists.wikimedia.org >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l > > > > _______________________________________________ > Xmldatadumps-admin-l mailing list > Xmldatadumps-admin-l(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-admin-l

6 11

suscribe
by jcms 11 Mar '10

11 Mar '10

-- Este mensaje le ha llegado mediante el servicio de correo electronico que ofrece Infomed para respaldar el cumplimiento de las misiones del Sistem a Nacional de Salud. La persona que envia este correo asume el compromiso de usar el servicio a tales fines y cumplir con las regulaciones establecidas Infomed: http://www.sld.cu/

1 0

2010-03-11 01:10:08: enwiki Checksumming pages-meta-history.xml.bz2 :D
by Tomasz Finc 11 Mar '10

11 Mar '10

New full history en wiki snapshot is hot off the presses! It's currently being checksummed which will take a while for 280GB+ of compressed data but for those brave souls willing to test please grab it from http://download.wikipedia.org/enwiki/20100130/enwiki-20100130-pages-meta-hi… and give us feedback about its quality. This run took just over a month and gained a huge speed up after Tims work on re-compressing ES. If we see no hiccups with this data snapshot, I'll start mirroring it to other locations (internet archive, amazon public data sets, etc). For those not familiar, the last successful run that we've seen of this data goes all the way back to 2008-10-03. That's over 1.5 years of people waiting to get access to these data bits. I'm excited to say that we seem to have it :) --tomasz

1 0

Wikipedia history dumps
by Lev Muchnik 09 Oct '09

09 Oct '09

Hi Guys, I've know, there are plans to optimize the dump engine so that it could handle the complete English history dump. Is it going to happen any time soon? I personally need it for research, and I've seen quite a few people around waiting for it as well. Thanks, Lev

1 0

SUSCRIBE
by cecilca 25 Sep '09

25 Sep '09

1 0

Looking for files forgotten by the wiki
by Platonides 21 Sep '09

21 Sep '09

Tomasz, can you grep old logging for an upload entry of File:Olympic Highway - Moorong.jpg (uploaded 16 jul 2009 <http://commons.wikimedia.org/w/index.php?title=File:Olympic_Highway_-_Mooro…>) or File:Renoir, Pierre-Auguste - The Two Sisters, On the Terrace.jpg (14 jul 2009 <http://commons.wikimedia.org/w/index.php?title=File:Renoir,_Pierre-Auguste_…>, not the one for 15 jul 2009) ? Dumps prior to 20090804 are not publicly available. The objective is to look for evidence about the disappeared upload logs of those files (bug 20744). It'd be something like gzip -d < commonswiki-200907*-pages-logging.xml.gz|grep -A 10 -B 10 Moorong.jpg Presence of Olympic Highway - Moorong.jpg into image.sql would also be interesting.

2 1

Re: [Xmldatadumps-l] Very slow import of XML dumps
by Felipe Ortega 03 Sep '09

03 Sep '09

> For example, I had to manually increase the number of > threads for 7ZIP to speed it up, as you can see. It will Sorry, I meant PIGZ :-). Fire fingers. F. > > _______________________________________________ > > Xmldatadumps-l mailing list > > Xmldatadumps-l(a)lists.wikimedia.org > > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l > > > > > >

1 0

Very slow import of XML dumps
by Bilal Abdul Kader 03 Sep '09

03 Sep '09

Greetings, I am trying to import the French wiki (full history xml) on a Ubuntu machine with quad-core trendy CPU and 16 GB RAM. The import query is the following: java -Xmn256M -Xms396M -Xmx512M -XX:+DisableExplicitGC -verbose:gc -XX:NewSize=32m -XX:MaxNewSize=64m -XX:SurvivorRatio=6 -XX:+UseParallelGC -XX:GCTimeRatio=9 -XX:AdaptiveSizeDecrementScaleFactor=1 -jar mwdumper.jar --format=sql:1.5 frwiki-20090810-pages-meta-history.xml.bz2 | mysql -u wiki -p frwikiLatest I have disabled the autocommit for mysql, disabled foreign key checks and unique checks. I have set the pool size, buffer log size, and the buffer size to large values as recommended for mysql good performance. After around 3 minutes of running the above command, I have got: 6 pages (0.083/sec), 1,000 revs (13.889/sec) 8 pages (0.038/sec), 2,000 revs (9.378/sec) 13 pages (0.041/sec), 3,000 revs (9.458/sec) The source file is on its own physical disk and the mysql data folder is on another physical disks. Both disks are very fast. *Any suggestions on how to improve the speed? * another issue is that the InnoDB (page, revision, text) do not show the number of the records although the size of the table is non-zero. I think the might be related to disable keys query. *Is that correct? * bilal

3 3

Marking Redirects in Snapshots
by Tomasz Finc 28 Jul '09

28 Jul '09

We've started marking redirects within each one of the archive snapshots. Starting on 7/28/09 each history and article snapshot will contain a <page> .. <redirect /> <revision> .. </page> entry so that everyone can easily identify which articles are in fact simply redirects. This came as a request from Erik Zachte to further improve on our stats collection and has allowed us to surface more user contributor stats that are not filled with articles lacking significant content. --tomasz

1 0

Jump to page:

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Xmldatadumps-l