Hi guys,
Just wanted to be sure this got checked out. From #mediawiki:
<Manny> hi. out of curiosity, are there any known issues with the recent
batch of dumps? I noticed that some supposedly completed dumps seem to
have ended with "Please provide a User-Agent header"
Domas banned all UA-less requests earlier tonight, so it didn't seem
random or nonsensical.
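If anyone's scripts are hitting this, a minimal sketch of the fix (Python 3
standard library; the URL and the User-Agent string below are only
placeholders) is to send an explicit User-Agent with every request:

import urllib.request

# UA-less requests are now rejected, so identify the client explicitly.
# Use a string that names your tool and gives a contact address.
req = urllib.request.Request(
    "http://download.wikimedia.org/enwiki/",
    headers={"User-Agent": "my-dump-fetcher/0.1 (contact: me@example.org)"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.getcode(), resp.headers.get("Content-Type"))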
Thanks,
-Mike
Date: Wed, 17 Feb 2010 13:47:47 +1100
From: John Vandenberg <jayvdb(a)gmail.com>
Subject: Re: [Wikitech-l] User-Agent:
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
On Wed, Feb 17, 2010 at 1:00 PM, Anthony <wikimail(a)inbox.org> wrote:
> On Wed, Feb 17, 2010 at 11:57 AM, Domas Mituzas <midom.lists(a)gmail.com> wrote:
>> Probably everything looks easier from your armchair. I'd love to have that
>> view! :)
>>
>
> Then stop volunteering.
Did you miss the point?
The graphs provided in this thread clearly show that the solution had
a positive & desired effect.
A few negative side-effects have been put forward, such as preventing
browsing without a UA, but Domas has also indicated that other tech
team members can overturn the change if they don't like it.
--
John Vandenberg
Hi,
Don't forget that some normal traffic was blocked by this unannounced change, e.g. Google's translate service. How much of the traffic reduction came from services like this? Some of the reduced traffic being cited as proof of the strategy's success is coming from valid services. Be careful, or soon you will be saying "you are either with Wikimedia or with the terrorists". :)
cheers,
Jamie
Hi,
I was looking at the enwiki dump progress and noticed that the file size of the
enwiki pages-meta-history.xml.bz2 has decreased from 255 GB on 20100125 down to
105 GB on 20100203. Is it possible that old page revision data is being lost,
given the smaller archive size?
2009-12-03 12:53:43 in-progress: All pages with complete page edit history (.bz2)
  2010-01-25 16:02:21: enwiki 14833408 pages (3.231/sec), 284292000 revs
  (61.930/sec), 54.7% prefetched, ETA 2010-02-03 02:34:19 [max 329446505]
  "These dumps can be *very* large, uncompressing up to 20 times the archive
  download size. Suitable for archival and statistical use, most mirror sites
  won't want or need this."
  pages-meta-history.xml.bz2  255.1 GB (written)

2010-02-03 17:28:43 in-progress: All pages with complete page edit history (.bz2)
  2010-02-16 00:32:55: enwiki 747550 pages (0.704/sec), 95964000 revs
  (90.340/sec), 95.8% prefetched, ETA 2010-03-19 12:10:50 [max 341714004]
  (same "very large" note as above)
  pages-meta-history.xml.bz2  105.1 GB (written)
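A rough way to check whether revisions are actually missing, rather than just
compressed better, would be to stream both archives and count <page> and
<revision> elements. A sketch of what I mean (Python 3, standard library only;
the file name is just a placeholder):

import bz2
import xml.etree.ElementTree as ET

def count_pages_revisions(path):
    """Stream a pages-meta-history .bz2 dump and count pages/revisions."""
    pages = revisions = 0
    with bz2.open(path, "rb") as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace, if any
            if tag == "revision":
                revisions += 1
            elif tag == "page":
                pages += 1
                elem.clear()  # keep memory bounded on huge dumps
    return pages, revisions

print(count_pages_revisions("enwiki-pages-meta-history.xml.bz2"))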
cheers,
Jamie
Date: Tue, 16 Feb 2010 09:34:41 -0800
From: Brion Vibber <brion(a)pobox.com>
Subject: Re: [Wikitech-l] [mwdumper] new maintainer?
To: wikitech-l(a)lists.wikimedia.org
On 2/16/10 7:03 AM, Jamie Morken wrote:
> Ok, the simple question: how many people prefer XML or sql dumps?
I think we have a FAQ on this...
http://meta.wikimedia.org/wiki/Download#What_happened_to_the_SQL_dumps.3F
You *do* realize that such "SQL dumps" would have to be invented from
whole cloth and couldn't just be dumped from the actual databases, right?
The raw databases include dozens of alternate clusters and have data
from different revisions compressed together, including deleted items
and private data, and can't simply be released by WMF even if someone
actually wanted to figure out how to replicate Wikimedia's exact storage
cluster layout to do a data import.
Most likely, if they were created at all, they'd be generated by running the
XML through a tool like mwdumper...
-- brion
Hi Brion,
I have not tried mwdumper yet; I have been looking at the various XML-to-SQL
conversion tools and reading about people's use of them, but I will have to
give it a try and see for myself. It still seems like an overly complex task to
recreate an SQL database, in my opinion. Also, when the Wikimedia dumps used to
be in SQL format I think there were fewer dump problems than there are now,
although maybe the main issue is simply the growth of the file sizes. I would
bet it is simpler to make an SQL dump than an XML dump, and the older MediaWiki
dumps were in SQL format. To produce SQL dumps directly, I think the process
would be to merge the SQL databases and then make sure the private data is
erased? That might be simpler than exporting to XML and then using mwdumper to
get back to SQL. Also, there is a bottleneck somewhere in the dump system (dump
failures etc.); maybe it is the XML part? I will get back to you after I try
mwdumper and/or:
php importDump.php <17gigabytefail> :)
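To make sure I understand what the conversion step actually involves, here is a
toy sketch of turning page/revision XML into INSERT statements (Python 3; the
table name, the simplified schema and the namespace URI are my own assumptions,
so a real import should still go through mwdumper or importDump.php):

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # schema version may differ per dump

def esc(s):
    # Very naive SQL string escaping - illustration only, not for real imports.
    return "'" + (s or "").replace("\\", "\\\\").replace("'", "\\'") + "'"

def xml_to_inserts(path):
    """Yield one simplified INSERT statement per page in a pages-articles dump."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                page_id = elem.findtext(NS + "id")
                title = elem.findtext(NS + "title")
                rev = elem.find(NS + "revision")
                text = rev.findtext(NS + "text") if rev is not None else ""
                yield ("INSERT INTO page_demo (page_id, title, text) "
                       f"VALUES ({page_id}, {esc(title)}, {esc(text)});")
                elem.clear()

for stmt in xml_to_inserts("enwiki-pages-articles.xml.bz2"):
    print(stmt)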
cheers,
Jamie
On Tue, 16 Feb 2010 21:31:35, Domas Mituzas <midom.lists(a)gmail.com> wrote:
>
> Random strings are easy to identify, fixed strings are easy to verify.
>
>
Maybe this is naive of me, but that sounds like an interesting problem. It
seems to me that randomized strings that are made of real words are kind of
difficult to identify... So, in the spirit of learning something cool (not
in the spirit of challenging what you're saying), how would one identify
random strings?
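To make the question concrete, here is a minimal sketch of one heuristic I can
imagine (purely my guess, not necessarily what Domas means): a truly random
string has a near-uniform character distribution, so its Shannon entropy per
character tends to be higher than that of strings built from real words:

import math
from collections import Counter

def entropy_per_char(s):
    """Shannon entropy in bits per character of a non-empty string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Word-based strings repeat common letters, so they tend to score lower than
# uniformly random strings; the signal is weak on short inputs, though.
for s in ("wikipediadatabasedump", "Mozilla/5.0", "xq9vB2kL7pZ0aY4m"):
    print(repr(s), round(entropy_per_char(s), 2))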
Thanks,
Aerik
--
http://eventfeed.org - An Initiative Promoting Syndication of Events
http://www.wikidweb.com - the Wiki Directory of the Web
http://tagthis.info - Hosted Tagging for your website!
Wikimedia Secure Info wrote:
> Hi!
>
>
>> Really? Were you doing this work as a contractor, or as a volunteer?
>>
> Volunteer.
>
>
>> Someone's gotta be in charge of the contractors and/or the volunteers, no?
>>
> Dunno, Cary maybe? :) On the other hand, even if they are in charge, it doesn't mean that they are my bosses :-
Domas:
Have I given you a Barnstar lately? Thank you for your hard labors. I
know it doesn't feed the kittehs, but:
[ASCII-art barnstar]
--
Cary Bass
Volunteer Coordinator, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
Hi
Almost one month ago I reported a bug in mwdumper which seems to me to be critical.
I simply can't use mwdumper with the itwiki XML dumps:
https://bugzilla.wikimedia.org/show_bug.cgi?id=22137
I have extracted the problematic part of the XML, but so far I have not had any
response to the bug report, and I guess that Brion, the Bugzilla maintainer for
mwdumper, no longer has time for it.
mwdumper is an almost mandatory tool for spreading our content, and for this
reason I wanted to bring it up on the mailing list.
Maybe someone with Java skills is interested in helping me resolve this mwdumper bug?
Regards
Emmanuel
Le mar 16/02/10 14:13, "Jamie Morken" jmorken(a)shaw.ca a écrit:
> What is the benefit of the database dumps being archived/distributed in xml
> format instead of sql format? Converting the xml to sql takes a long time
> for big wiki's and people seem to have problems with this step, so why
> isn't the sql format available for download instead of the xml format?
* You are DB-neutral, so you do not need a separate version for MySQL, for Postgres...
* You can apply filters easily (see the sketch below)
* The XML is still useful after a DB schema upgrade
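To illustrate the filtering point, a rough sketch (Python 3; the schema
namespace and the title-prefix criterion are just examples) that copies
selected <page> elements into a new, smaller XML file:

import bz2
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.4/}"  # adjust to the dump's schema version

def filter_pages(in_path, out_path, title_prefix):
    """Write a reduced dump containing only pages whose title starts with title_prefix."""
    with bz2.open(in_path, "rb") as fin, open(out_path, "wb") as fout:
        fout.write(b"<pages>\n")  # minimal wrapper, not a complete dump header
        for _, page in ET.iterparse(fin, events=("end",)):
            if page.tag == NS + "page":
                if (page.findtext(NS + "title") or "").startswith(title_prefix):
                    fout.write(ET.tostring(page))
                page.clear()
        fout.write(b"</pages>\n")

filter_pages("frwiki-pages-articles.xml.bz2", "paris-pages.xml", "Paris")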
Emmanuel
Hi,
What is the benefit of the database dumps being archived/distributed in XML format instead of SQL format? Converting the XML to SQL takes a long time for big wikis, and people seem to have problems with this step, so why isn't the SQL format available for download instead of the XML format?
cheers,
Jamie
mwdumper is also essential for anyone willing to replicate a wiki locally for
any purpose. There are alternatives such as xml2sql or importDump.php, but
mwdumper is usually the best in terms of correctness and completeness, and
sometimes speed.
bilal
==
Verily, with hardship comes ease.
Hi!
I blocked Apple's MacOSX RSS syndicator - it probably wastes petabytes of diskspace on unsuspecting user machines, as opening Wikipedia's RSS feed in Safari will automatically add it to the syndication list and resync it constantly.
That isn't too painful with other feeds, but Wikipedia's "recent changes" feed is huge. Oh, and it also stressed our side :-)
Domas