Hi guys,
Just wanted to be sure this got checked out. From #mediawiki:
<Manny> hi. out of curiosity, are there any known issues with the recent
batch of dumps? I noticed that some supposedly completed dumps seem to
have ended with "Please provide a User-Agent header"
Domas banned all UA-less requests earlier tonight, so it didn't seem
random or nonsensical.
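If anyone's scripts are hitting this, a minimal sketch of the fix (Python 3
standard library; the URL and the User-Agent string below are only
placeholders) is to send an explicit User-Agent with every request:

import urllib.request

# UA-less requests are now rejected, so identify the client explicitly.
# Use a string that names your tool and gives a contact address.
req = urllib.request.Request(
    "http://download.wikimedia.org/enwiki/",
    headers={"User-Agent": "my-dump-fetcher/0.1 (contact: me@example.org)"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.getcode(), resp.headers.get("Content-Type"))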
Thanks,
-Mike
Date: Wed, 17 Feb 2010 13:47:47 +1100
From: John Vandenberg <jayvdb(a)gmail.com>
Subject: Re: [Wikitech-l] User-Agent:
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
On Wed, Feb 17, 2010 at 1:00 PM, Anthony <wikimail(a)inbox.org> wrote:
> On Wed, Feb 17, 2010 at 11:57 AM, Domas Mituzas <midom.lists(a)gmail.com> wrote:
>> Probably everything looks easier from your armchair. I'd love to have that
>> view! :)
>>
>
> Then stop volunteering.
Did you miss the point?
The graphs provided in this thread clearly show that the solution had
a positive & desired effect.
A few negative side-effects have been put forward, such as preventing
browsing without a UA, but Domas has also indicated that other tech
team members can overturn the change if they don't like it.
--
John Vandenberg
Hi,
Don't forget that some normal traffic was blocked by this unannounced change, e.g. Google's translate service. How much of the traffic reduction came from services like this? Some of the reduced traffic being cited as proof of the strategy's success is coming from valid services. Be careful, or soon you will be saying "you are either with Wikimedia or with the terrorists". :)
cheers,
Jamie
Hi,
I was looking at the enwiki dump progress and noticed that the file size of the
enwiki pages-meta-history.xml.bz2 has decreased from 255 GB on 20100125 down to
105 GB on 20100203. Is it possible that old page revision data is being lost,
given the smaller archive size?
2009-12-03 12:53:43 in-progress: All pages with complete page edit history (.bz2)
  2010-01-25 16:02:21: enwiki 14833408 pages (3.231/sec), 284292000 revs
  (61.930/sec), 54.7% prefetched, ETA 2010-02-03 02:34:19 [max 329446505]
  "These dumps can be *very* large, uncompressing up to 20 times the archive
  download size. Suitable for archival and statistical use, most mirror sites
  won't want or need this."
  pages-meta-history.xml.bz2  255.1 GB (written)

2010-02-03 17:28:43 in-progress: All pages with complete page edit history (.bz2)
  2010-02-16 00:32:55: enwiki 747550 pages (0.704/sec), 95964000 revs
  (90.340/sec), 95.8% prefetched, ETA 2010-03-19 12:10:50 [max 341714004]
  (same "very large" note as above)
  pages-meta-history.xml.bz2  105.1 GB (written)
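A rough way to check whether revisions are actually missing, rather than just
compressed better, would be to stream both archives and count <page> and
<revision> elements. A sketch of what I mean (Python 3, standard library only;
the file name is just a placeholder):

import bz2
import xml.etree.ElementTree as ET

def count_pages_revisions(path):
    """Stream a pages-meta-history .bz2 dump and count pages/revisions."""
    pages = revisions = 0
    with bz2.open(path, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace, if any
            if tag == "revision":
                revisions += 1
            elif tag == "page":
                pages += 1
                elem.clear()  # keep memory bounded on huge dumps
    return pages, revisions

print(count_pages_revisions("enwiki-pages-meta-history.xml.bz2"))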
cheers,
Jamie
Date: Tue, 16 Feb 2010 09:34:41 -0800
From: Brion Vibber <brion(a)pobox.com>
Subject: Re: [Wikitech-l] [mwdumper] new maintainer?
To: wikitech-l(a)lists.wikimedia.org
On 2/16/10 7:03 AM, Jamie Morken wrote:
> Ok, the simple question: how many people prefer XML or sql dumps?
I think we have a FAQ on this...
http://meta.wikimedia.org/wiki/Download#What_happened_to_the_SQL_dumps.3F
You *do* realize that such "SQL dumps" would have to be invented from
whole cloth and couldn't just be dumped from the actual databases, right?
The raw databases include dozens of alternate clusters and have data
from different revisions compressed together, including deleted items
and private data, and can't simply be released by WMF even if someone
actually wanted to figure out how to replicate Wikimedia's exact storage
cluster layout to do a data import.
Most likely, if they were created at all, they'd be generated by running the
XML through a tool like mwdumper...
-- brion
Hi Brion,
I have not tried mwdumper yet; I have been looking at the various XML-to-SQL
conversion tools and reading about people's use of them, but I will have to
give it a try and see for myself. It still seems like an overly complex task to
recreate an SQL database, in my opinion. Also, when the Wikimedia dumps used to
be in SQL format I think there were fewer dump problems than there are now,
although maybe the main issue is simply the growth of the file sizes. I would
bet it is simpler to make an SQL dump than an XML dump, and the older MediaWiki
dumps were in SQL format. To produce SQL dumps directly, I think the process
would be to merge the SQL databases and then make sure the private data is
erased? That might be simpler than exporting to XML and then using mwdumper to
get back to SQL. Also, there is a bottleneck somewhere in the dump system (dump
failures etc.); maybe it is the XML part? I will get back to you after I try
mwdumper and/or:
php importDump.php <17gigabytefail> :)
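To make sure I understand what the conversion step actually involves, here is a
toy sketch of turning page/revision XML into INSERT statements (Python 3; the
table name, the simplified schema and the namespace URI are my own assumptions,
so a real import should still go through mwdumper or importDump.php):

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # schema version may differ per dump

def esc(s):
    # Very naive SQL string escaping - illustration only, not for real imports.
    return "'" + (s or "").replace("\\", "\\\\").replace("'", "\\'") + "'"

def xml_to_inserts(path):
    """Yield one simplified INSERT statement per page in a pages-articles dump."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                page_id = elem.findtext(NS + "id")
                title = elem.findtext(NS + "title")
                rev = elem.find(NS + "revision")
                text = rev.findtext(NS + "text") if rev is not None else ""
                yield ("INSERT INTO page_demo (page_id, title, text) "
                       f"VALUES ({page_id}, {esc(title)}, {esc(text)});")
                elem.clear()

for stmt in xml_to_inserts("enwiki-pages-articles.xml.bz2"):
    print(stmt)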
cheers,
Jamie
On Tue, 16 Feb 2010 21:31:35, Domas Mituzas <midom.lists(a)gmail.com> wrote:
>
> Random strings are easy to identify, fixed strings are easy to verify.
>
>
Maybe this is naive of me, but that sounds like an interesting problem. It
seems to me that randomized strings that are made of real words are kind of
difficult to identify... So, in the spirit of learning something cool (not
in the spirit of challenging what you're saying), how would one identify
random strings?
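To make the question concrete, here is a minimal sketch of one heuristic I can
imagine (purely my guess, not necessarily what Domas means): a truly random
string has a near-uniform character distribution, so its Shannon entropy per
character tends to be higher than that of strings built from real words:

import math
from collections import Counter

def entropy_per_char(s):
    """Shannon entropy in bits per character of a non-empty string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Word-based strings repeat common letters, so they tend to score lower than
# uniformly random strings; the signal is weak on short inputs, though.
for s in ("wikipediadatabasedump", "Mozilla/5.0", "xq9vB2kL7pZ0aY4m"):
    print(repr(s), round(entropy_per_char(s), 2))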
Thanks,
Aerik
--
http://eventfeed.org - An Initiative Promoting Syndication of Events
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!
Wikimedia Secure Info wrote:
> Hi!
>
>
>> Really? Were you doing this work as a contractor, or as a volunteer?
>>
> Volunteer.
>
>
>> Someone's gotta be in charge of the contractors and/or the volunteers, no?
>>
> Dunno, Cary maybe? :) On the other hand, even if they are in charge, it doesn't mean that they are my bosses :-
Domas:
Have I given you a Barnstar lately? Thank you for your hard labors. I
know it doesn't feed the kittehs, but:
[ASCII-art barnstar]
--
Cary Bass
Volunteer Coordinator, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
Hi
Almost one month ago I reported a bug in mwdumper which seems to me to be critical.
I simply can't use mwdumper with the itwiki XML dumps:
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137
I have extracted the problematic part of the XML, but so far I have not had any
response to the bug report, and I guess that Brion, the Bugzilla maintainer for
mwdumper, no longer has time for it.
mwdumper is an almost mandatory tool for spreading our content, and for this
reason I wanted to bring it up on the mailing list.
Maybe someone with Java skills is interested in helping me resolve this mwdumper bug?
Regards
Emmanuel
Le mar 16/02/10 14:13, "Jamie Morken" jmorken(a)shaw.ca a écrit:
> What is the benefit of the database dumps being archived/distributed in xml
> format instead of sql format? Converting the xml to sql takes a long time
> for big wiki's and people seem to have problems with this step, so why
> isn't the sql format available for download instead of the xml format?
* You are DB-neutral, so you do not need a separate version for MySQL, for Postgres...
* You can apply filters easily (see the sketch below)
* The XML is still useful after a DB schema upgrade
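To illustrate the filtering point, a rough sketch (Python 3; the schema
namespace and the title-prefix criterion are just examples) that copies
selected <page> elements into a new, smaller XML file:

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # adjust to the dump's schema version

def filter_pages(in_path, out_path, title_prefix):
    """Write a reduced dump containing only pages whose title starts with title_prefix."""
    with bz2.open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        fout.write(b"<pages>\n")  # minimal wrapper, not a complete dump header
        for _, page in ET.iterparse(fin, events=("end",)):
            if page.tag == NS + "page":
                if (page.findtext(NS + "title") or "").startswith(title_prefix):
                    fout.write(ET.tostring(page))
                page.clear()
        fout.write(b"</pages>\n")

filter_pages("frwiki-pages-articles.xml.bz2", "paris-pages.xml", "Paris")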
Emmanuel
Hi,
What is the benefit of the database dumps being archived/distributed in XML format instead of SQL format? Converting the XML to SQL takes a long time for big wikis, and people seem to have problems with this step, so why isn't the SQL format available for download instead of the XML format?
cheers,
Jamie
mwdumper is also essential for anyone willing to replicate a wiki locally for
any purpose. There are alternatives such as xml2sql or importDump.php, but
mwdumper is usually the best in terms of correctness and completeness, and
sometimes speed.
bilal
==
Verily, with hardship comes ease.
Hi!
I blocked Apple's MacOSX RSS syndicator - it probably wastes petabytes of diskspace on unsuspecting user machines, as opening Wikipedia's RSS feed in Safari will automatically add it to the syndication list and resync it constantly.
That isn't too painful with other feeds, but Wikipedia's "recent changes" feed is huge. Oh, and it also stressed our side :-)
Domas