>On Mon, Sep 19, 2011 at 12:53 PM, Asher Feldman <afeldman [at] wikimedia> wrote:
>
>> Since the primary use case here seems to be offline analysis and it may not
>> be of much interest to mediawiki users outside of wmf, can we store the
>> checksums in new tables (i.e. revision_sha1) instead of running large
>> alters, and implement the code to generate checksums on new edits via an
>> extension?
>>
>> Checksums for most old revs can be generated offline and populated before
>> the extension goes live. Since nothing will be using the new table yet,
>> there'd be no issues with things like gap lock contention on the revision
>> table from mass populating it.
>>
>
> That's probably the simplest solution; adding a new empty table will be very
> quick. It may make it slower to use the field though, depending on what all
> uses/exposes it.
>
> During stub dump generation for instance this would need to add a left outer
> join on the other table, and add things to the dump output (and also needs
> an update to the XML schema for the dump format). This would then need to be
> preserved through subsequent dump passes as well.
>
> -- brion
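For what it's worth, a rough sketch of what the extension half of that suggestion could look like. The RevisionInsertComplete hook is an existing MediaWiki hook, but the revision_sha1 table and its rs_* columns are hypothetical names used here only for illustration:

    $wgHooks['RevisionInsertComplete'][] = 'efStoreRevisionSha1';

    /**
     * On every new revision, store its SHA-1 in a separate (hypothetical)
     * revision_sha1 table instead of altering the core revision table.
     */
    function efStoreRevisionSha1( $revision, $data, $flags ) {
        $dbw = wfGetDB( DB_MASTER );
        $dbw->insert(
            'revision_sha1',
            array(
                'rs_rev_id' => $revision->getId(),
                'rs_sha1'   => sha1( $revision->getText( Revision::RAW ) ),
            ),
            __METHOD__
        );
        return true;
    }

Stub dump generation would then have to pick the value up with a LEFT JOIN against that table, as Brion points out above.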
Can we resist the temptation to implement schema changes as new tables
purely to make life easier for Wikimedia? Core schema changes are certainly
enough of a hurdle to warrant serious discussion, but they are not the
totally-intractable mess that they used to be. 1.19 already includes index
changes to the user and logging tables; it will already require the full
game of musical chairs with the db slaves. Implementing this as a new
column does not actually make things any more complicated, it would just
mean that an operation that would take three hours before might now take
five.
It may or may not be an architecturally-better design to have it as a
separate table, although considering how rapidly MW's 'architecture' changes
I'd say keeping things as simple as possible is probably a virtue. But that
is the basis on which we should be deciding it. This is a big project which
still retains its enthusiasm because we recognise that it has equally big
potential to provide interesting new features far beyond the immediate
use cases we can construct now (dump validation and 'something to do with
reversions'). Let's not hamstring it at birth based on the operational
pressures of the one MediaWiki end user who is best placed to overcome said
issues.
--HM
We've just released our puppet repository into a public git
repository. For more information, see the blog post about this:
http://blog.wikimedia.org/2011/09/19/ever-wondered-how-the-wikimedia-server…
As noted in the blog post, we are releasing this to treat operations
like a software development project. Users with Labs access can push
changes to the repository. We aren't currently ready to start giving
out Labs access en masse, but will hopefully have a process ready by
the New Orleans hack-a-thon.
More info to come about Labs later.
- Ryan
For a while now we've had this pattern for deprecating interfaces:
* Mark it deprecated with @deprecated when you deprecate it
* In one or two releases, add wfDeprecated
* A release or two (or more) after that, remove the interface entirely
The rationale for not adding wfDeprecated right away seems to be that we
don't want to spew warnings about recently deprecated methods at developers.
((Although after a discussion with Krinkle in #mediawiki we had a hard
time justifying even that.))
The fault of that pattern, however, is that releases take time to come out:
by the time the next release, or even the release after that, comes around,
developers have forgotten about the deprecation and forget to add the
wfDeprecated call for code they changed months ago and no longer care about.
So the interfaces continue to be used without any notices to tell people
they're still relying on a deprecated interface which may only be
half-functional.
So, I've come up with an idea for a new pattern and committed a change
to wfDeprecated.
wfDeprecated now accepts a second arg, the version in which the method
was deprecated. You can start using it like so:
`wfDeprecated( __METHOD__, '1.19' );`
When you use the version arg the notice will include information on when
the method was deprecated.
Additionally there is a new config setting $wgDeprecationReleaseLimit.
If it is set to a release string alongside $wgDevelopmentWarnings, then
deprecation notices for releases later than the limit will not be output.
E.g., if you set $wgDeprecationReleaseLimit to "1.18" then `wfDeprecated(
__METHOD__, '1.17' );` and `wfDeprecated( __METHOD__, '1.18' );` will
generate notices, but `wfDeprecated( __METHOD__, '1.19' );` calls will
stay silent.
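For example, a development wiki might configure this along the following
lines in LocalSettings.php:

    $wgDevelopmentWarnings = true;        // turn deprecation notices on
    $wgDeprecationReleaseLimit = '1.18';  // stay silent about anything deprecated after 1.18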
Additionally, I've taken branches into account: if you're working in a
branch (not a release branch), please use the pattern:
`wfDeprecated( __METHOD__, '1.19-branchname' );`
where 1.19 is whatever version trunk is currently at (i.e. the trunk you
are merging from), and branchname is the name of your branch.
This pattern makes it easy to search and replace that string when you merge
the branch into trunk. It also avoids a trap: you start a branch while trunk
is at 1.19 and add wfDeprecated calls tagged 1.19, but by the time you
actually merge, trunk is at 1.20, and you then have to go through every
wfDeprecated tagged 1.19 to work out which calls you added yourself and
which were genuinely deprecated in 1.19.
Now, instead of waiting for a later release, I encourage developers to add
the wfDeprecated call right away when they deprecate something. If recent
deprecations are too much noise for anyone, they can simply adjust
$wgDeprecationReleaseLimit to control which notices they get.
https://secure.wikimedia.org/wikipedia/mediawiki/wiki/Special:Code/MediaWik…
--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]
Hi,
When browsing through [mediawiki]/trunk [1] I see a lot of things that have
piled up over the years. Now that we have /trunk/tools [2] and the separate
[wikimedia] repository, perhaps some things should be re-organized.
A few examples:
* /trunk/backup: This is the script for generating data dumps for
Wikimedia's public wikis.
Perhaps move this to /tools or perhaps to trunk of the [wikimedia] repo.
* /trunk/mwdumper: Tool for extracting sets of pages from an MW dump file
If this works for regular mediawiki dumps, probably a good one to be moved
to /tools
* /trunk/lucene-search-2: Lucene-search 2.1: search extension for MediaWiki
Sounds like something that should be in a /libs/ directory of
/trunk/extensions/MWSearch or something
* /trunk/wikiSDK: Attempt at creating a developer friendly SDK for MW
Again, perfect for /tools probably
* /trunk/wap: WAP Wikipedia. Quick and dirty hack for wap.es.wikipedia.org
Doesn't appear to be used anymore. But since it's used as a front end to
MediaWiki, perhaps it should go into /tools, or if Wikimedia specific, into
[wikimedia]
* /trunk/wmfmailadmin: Simple mail account maintenance script.
Fairly obvious. [wikimedia]
Etc. Take a look at the complete list here:
http://svn.wikimedia.org/viewvc/mediawiki/trunk/
Just mentioning this here on the mailing list in case there are some
special dirs that need special treatment before I make a bold move.
--
Krinkle
[1] http://svn.wikimedia.org/viewvc/mediawiki/trunk/
[2] http://svn.wikimedia.org/viewvc/mediawiki/trunk/tools/
(descriptions copied from the README files and/or the initial commit
messages)
Hi,
While I don't like the idea of introducing more and more testing tools, I can
still see an interesting use case here: as of now, we have no way to test
whether a given layout (HTML, JS, CSS) is really rendered the way we want it
to be, since both Selenium and QUnit base their tests on the DOM, right?
Sikuli, on the other hand, seems to be based on screenshots, and there we
could detect broken layout. There is also some kind of similarity algorithm
(which I hope is configurable) so that one test could be used in different
browsers even if the rendering is not identical to the pixel.
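To make that idea concrete: the following is not Sikuli's API, just a rough
PHP/GD sketch of comparing a rendered screenshot against a reference image
with a configurable similarity threshold; the file names and the 0.95
threshold are made up.

    // Rough illustration only: fraction of identical pixels between a
    // reference screenshot and the current rendering.
    function screenshotSimilarity( $referenceFile, $currentFile ) {
        $ref = imagecreatefrompng( $referenceFile );
        $cur = imagecreatefrompng( $currentFile );
        $w = min( imagesx( $ref ), imagesx( $cur ) );
        $h = min( imagesy( $ref ), imagesy( $cur ) );
        $same = 0;
        for ( $x = 0; $x < $w; $x++ ) {
            for ( $y = 0; $y < $h; $y++ ) {
                if ( imagecolorat( $ref, $x, $y ) === imagecolorat( $cur, $x, $y ) ) {
                    $same++;
                }
            }
        }
        return $same / ( $w * $h );
    }

    if ( screenshotSimilarity( 'reference.png', 'current.png' ) < 0.95 ) {
        echo "Rendered layout deviates from the reference screenshot\n";
    }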
The question is, do we have the need for testing screen layout?
Cheers,
Markus
P.S.: CCing wikitech, since this might be of broader interest.
-----Ursprüngliche Nachricht-----
Von: Sumana Harihareswara [mailto:sumanah@wikimedia.org]
Gesendet: Dienstag, 30. August 2011 14:02
An: Markus Glaser; Chad Horohoe; Timo Tijhof
Betreff: automated testing with Sikuli?
http://sikuli.org/
Have any of you run across Sikuli before? Just wanted to point it out to you. It might face the same problems as Selenium, though.
--
Sumana Harihareswara
Volunteer Development Coordinator
Wikimedia Foundation
Hi,
It might be good to keep a private hash in parallel with the MD5 public hash.
cheers,
Jamie
----- Original Message -----
From: wikitech-l-request(a)lists.wikimedia.org
Date: Sunday, September 18, 2011 3:12 pm
Subject: Wikitech-l Digest, Vol 98, Issue 30
To: wikitech-l(a)lists.wikimedia.org
> Send Wikitech-l mailing list submissions to
> wikitech-l(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> or, via email, send a message with subject or body 'help' to
> wikitech-l-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> wikitech-l-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Wikitech-l digest..."
>
>
> Today's Topics:
>
> 1. Re: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Anthony)
> 2. Fwd: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Anthony)
> 3. Re: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Chad)
> 4. Re: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Anthony)
> 5. Re: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Chad)
> 6. Re: Fwd: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Roan Kattouw)
> 7. Re: Adding MD5 / SHA1 column to revision
> table (discussing r94289) (Platonides)
> 8. Re: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Anthony)
> 9. Re: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Anthony)
> 10. Re: Fwd: Adding MD5 / SHA1 column to revision table
> (discussing r94289) (Anthony)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 18 Sep 2011 16:57:22 -0400
> From: Anthony <wikimail(a)inbox.org>
> Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CAPreJLR8Rhut8gdqizxmDuo5-CAd3Yi_S-G478071LD=fg9XTQ(a)mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-7
>
> On Sun, Sep 18, 2011 at 2:33 AM, Ariel T. Glenn
> <ariel(a)wikimedia.org> wrote:
> > On Sat 17-09-2011 at 22:55 -0700, Robert Rohde wrote:
> >> On Sat, Sep 17, 2011 at 4:56 PM, Anthony
> <wikimail(a)inbox.org> wrote:
> >
> > <snip>
> >
> >> > For offline analyses, there's no need to change the online database tables.
> >>
> >> Need? That's debatable, but one of the major motivators is the desire
> >> to have hash values in database dumps (both for revert checks and for
> >> checksums on correct data import / export). Both of those are
> >> "offline" uses, but it is beneficial to have that information
> >> precomputed and stored rather than frequently regenerated.
> >
> > If we don't have it in the online database tables, this defeats the
> > purpose of having the value in there at all for generating the XML dumps.
> >
> > Recall that the dumps are generated in two passes; in the first pass we
> > retrieve from the db and record all of the metadata about revisions, and
> > in the second (time-consuming) pass we re-use the text of the revisions
> > from a previous dump file if the text is in there. We want to compare
> > the hash of that text against what the online database says the hash is;
> > if they don't match, we want to fetch the live copy.
>
> Well, this is exactly the type of use in which collisions do matter.
> Do you really want the dump to not record the correct data when some
> miscreant creates an intentional collision?
>
>
>
> ------------------------------
>
> Message: 2
> Date: Sun, 18 Sep 2011 17:00:32 -0400
> From: Anthony <wikimail(a)inbox.org>
> Subject: [Wikitech-l] Fwd: Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CAPreJLQr4tTyBkrhwc5Lnf6Xw93eYDv02jAtCNtzRp910_CZ-Q(a)mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Sun, Sep 18, 2011 at 1:55 AM, Robert Rohde
> <rarohde(a)gmail.com> wrote:
> > If collision attacks really matter we should use SHA-1.
>
> If collision attacks really matter you should use, at least, SHA-256, no?
>
> > However, do
> > any of the proposed use cases care about whether someone might
> > intentionally inject a collision? In the proposed uses I've looked at
> > it, it seems irrelevant. The intentional collision will get flagged
> > as a revert and the text leading to that collision would be discarded.
> > How is that a bad thing?
>
> Well, what if the checksum of the initial page hasn't been calculated
> yet? Then some miscreant sets the page to spam which collides, and
> then the spam gets reverted. The good page would be the one that gets
> thrown out.
>
> Maybe that's not feasible. Maybe it is. Either way, I'd feel very
> uncomfortable about the fact that someday someone might decide to use
> the checksums in some way in which collisions would matter.
>
> Now I don't know how important the CPU differences in calculating the
> two versions would be. If they're significant enough, then fine, use
> MD5, but make sure there are warnings all over the place about its
> use.
>
> (As another possibility, what if someone writes a bot to detect
> certain reverts? I can see spammers/vandals having a field day with
> this sort of thing.)
>
> >> For offline analyses, there's no need to change the online database tables.
> >
> > Need? That's debatable, but one of the major motivators is the desire
> > to have hash values in database dumps (both for revert checks and for
> > checksums on correct data import / export). Both of those are
> > "offline" uses, but it is beneficial to have that information
> > precomputed and stored rather than frequently regenerated.
>
> Why not in a separate file? There's no need to get permission from
> anyone or mess with the schema to generate a file with revision ids
> and checksums. If WMF won't host it at the regular dump location
> (which I can't see why they wouldn't), you could host it at
> archive.org.
>
>
>
> ------------------------------
>
> Message: 3
> Date: Sun, 18 Sep 2011 17:30:52 -0400
> From: Chad <innocentkiller(a)gmail.com>
> Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CADn73rM9R26GnyXGAFEC6_8Jb3AbT6ML0sVYyR-E4cXaZ2WR3g(a)mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
> <rnnelson(a)clarkson.edu> wrote:
> > It is meaningless to talk about cryptography without a threat
> model, just as Robert says. Is anybody actually attacking us? Or
> are we worried about accidental collisions?
> >
>
> I believe it began as accidental collisions, then everyone promptly
> put on their tinfoil hats and started talking about a hypothetical
> vandal who has the time and desire to generate hash collisions.
>
> -Chad
>
>
>
> ------------------------------
>
> Message: 4
> Date: Sun, 18 Sep 2011 17:47:51 -0400
> From: Anthony <wikimail(a)inbox.org>
> Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CAPreJLSmMi4qqZLZmY3LzOmEO-8JgqjpJgmLfh-8+gPx4v5Rmg(a)mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Sun, Sep 18, 2011 at 5:30 PM, Chad
> <innocentkiller(a)gmail.com> wrote:
> > On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
> > <rnnelson(a)clarkson.edu> wrote:
> >> It is meaningless to talk about cryptography without a threat
> model, just as Robert says. Is anybody actually attacking us? Or
> are we worried about accidental collisions?
> >>
> >
> > I believe it began as accidental collisions, then everyone promptly
> > put on their tinfoil hats and started talking about a hypothetical
> > vandal who has the time and desire to generate hash collisions.
>
> Having run a wiki which I eventually abandoned due to various "Grawp
> attacks", I can assure you that there's nothing hypothetical
> about it.
>
>
>
> ------------------------------
>
> Message: 5
> Date: Sun, 18 Sep 2011 17:50:12 -0400
> From: Chad <innocentkiller(a)gmail.com>
> Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CADn73rMGkgSPG4nbvB34EKfKp99d5LWcNuALQLrpUta45YaHiA(a)mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> On Sun, Sep 18, 2011 at 5:47 PM, Anthony <wikimail(a)inbox.org> wrote:
> > On Sun, Sep 18, 2011 at 5:30 PM, Chad
> > <innocentkiller(a)gmail.com> wrote:
> >> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
> >> <rnnelson(a)clarkson.edu> wrote:
> >>> It is meaningless to talk about cryptography without a
> threat model, just as Robert says. Is anybody actually attacking
> us? Or are we worried about accidental collisions?
> >>>
> >>
> >> I believe it began as accidental collisions, then everyone promptly
> >> put on their tinfoil hats and started talking about a hypothetical
> >> vandal who has the time and desire to generate hash collisions.
> >
> > Having run a wiki which I eventually abandoned due to various "Grawp
> > attacks", I can assure you that there's nothing hypothetical
> about it.
> >
>
> For those of us who do not know...what the heck is a Grawp attack?
> Does it involve generating hash collisions?
>
> -Chad
>
>
>
> ------------------------------
>
> Message: 6
> Date: Mon, 19 Sep 2011 00:00:11 +0200
> From: Roan Kattouw <roan.kattouw(a)gmail.com>
> Subject: Re: [Wikitech-l] Fwd: Adding MD5 / SHA1 column to revision
> table (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CALoQHwEOyjQhzRKJM_efPCz7OrG=GbubZ7wMGyqmzrwBGZE6zg(a)mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Sun, Sep 18, 2011 at 11:00 PM, Anthony
> <wikimail(a)inbox.org> wrote:
> > Now I don't know how important the CPU differences in calculating the
> > two versions would be. If they're significant enough, then fine, use
> > MD5, but make sure there are warnings all over the place about its
> > use.
> >
> I ran some benchmarks on one of the WMF machines. The input I used is
> a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to
> upload to Commons recently. For each benchmark, I hashed the file 25
> times and computed the average running time.
>
> MD5: 393 ms
> SHA-1: 404 ms
> SHA-256: 1281 ms
>
> Note that the input size is many times higher than $wgMaxArticleSize,
> which is set to 2000 KB at WMF. For historical reasons, we have some
> revisions in our history that are larger; Ariel would be able to tell
> you how large, but I believe nothing in there is larger than 10 MB. So
> I decided to run the numbers for more realistic sizes as well, using
> the first 2 MB and 10 MB, respectively, of my OGV file.
>
> For 2 MB (averages of 1000 runs):
>
> MD5: 5.66 ms
> SHA-1: 5.85 ms
> SHA-256: 18.56 ms
>
> For 10 MB (averages of 200 runs):
>
> MD5: 28.6 ms
> SHA-1: 29.47 ms
> SHA-256: 93.49 ms
>
> So yes, SHA-256 is a few times (just over 3x) more expensive to
> compute than SHA-1, which in turn is only a few percent slower than
> MD5. However, on the largest possible size we allow for new revisions
> it takes < 20ms. It sounds like that's an acceptable worst case for
> on-the-fly population, since saves and parses are slow anyway,
> especially for 2 MB of wikitext. The 10 MB case is only relevant for
> backfilling, which we could do from a maintenance script, and < 100ms
> is definitely acceptable there.
>
> Roan Kattouw (Catrope)
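A comparison along these lines is easy to reproduce with a few lines of PHP;
the file path and iteration count below are arbitrary, and this is not Roan's
actual script:

    // Time MD5, SHA-1 and SHA-256 over the same input and report the
    // average per-hash runtime in milliseconds.
    $data = file_get_contents( '/tmp/sample.ogv' ); // any large test file
    $runs = 25;

    foreach ( array( 'md5', 'sha1', 'sha256' ) as $algo ) {
        $start = microtime( true );
        for ( $i = 0; $i < $runs; $i++ ) {
            hash( $algo, $data );
        }
        $avg = ( microtime( true ) - $start ) / $runs * 1000;
        printf( "%s: %.1f ms\n", $algo, $avg );
    }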
>
>
>
> ------------------------------
>
> Message: 7
> Date: Mon, 19 Sep 2011 00:07:32 +0200
> From: Platonides <Platonides(a)gmail.com>
> Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: wikitech-l(a)lists.wikimedia.org
> Message-ID: <j55pnn$o1j$1(a)dough.gmane.org>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Chad wrote:
> > For those of us who do not know...what the heck is a Grawp attack?
> > Does it involve generating hash collisions?
> >
> > -Chad
>
> It's the name of a wikipedia vandal.
> http://en.wikipedia.org/wiki/User:Grawp
>
>
>
>
>
> ------------------------------
>
> Message: 8
> Date: Sun, 18 Sep 2011 18:01:47 -0400
> From: Anthony <wikimail(a)inbox.org>
> Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CAPreJLRWeY5EhaxZb+wACoX4r5PpenW7fPMiEWrkiwNb=XaRUw(a)mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Sun, Sep 18, 2011 at 5:50 PM, Chad
> <innocentkiller(a)gmail.com> wrote:
> > On Sun, Sep 18, 2011 at 5:47 PM, Anthony
> <wikimail(a)inbox.org> wrote:
> >> On Sun, Sep 18, 2011 at 5:30 PM, Chad
> <innocentkiller(a)gmail.com> wrote:
> >>> On Sun, Sep 18, 2011 at 7:24 AM, Russell N. Nelson - rnnelson
> >>> <rnnelson(a)clarkson.edu> wrote:
> >>>> It is meaningless to talk about cryptography without a
> threat model, just as Robert says. Is anybody actually attacking
> us? Or are we worried about accidental collisions?
> >>>>
> >>>
> >>> I believe it began as accidental collisions, then everyone promptly
> >>> put on their tinfoil hats and started talking about a hypothetical
> >>> vandal who has the time and desire to generate hash collisions.
> >>
> >> Having run a wiki which I eventually abandoned due to various "Grawp
> >> attacks", I can assure you that there's nothing hypothetical about it.
> >>
> >
> > For those of us who do not know...what the heck is a Grawp attack?
> > Does it involve generating hash collisions?
>
> It does not involve generating hash collisions, but it involves
> finding various bugs in mediawiki and using them to vandalise, often
> by injecting javascript. The best description I could find was at
> Encyclopedia Dramatica, which seems to be taken down (there's a cache
> if you do a google search for "grawp wikipedia"). There's also a
> description at http://en.wikipedia.org/wiki/User:Grawp , which does
> not do justice to the "mad hacker skillz" of this individual and his
> intent on finding bugs in mediawiki and exploiting them.
>
> If you did something as lame as relying on no one generating an MD5
> collision (*), it would happen. If you use SHA-1, it may or may not
> happen, depending on how quickly computers get faster, and how many
> further attacks are made on the algorithm. If you use SHA-256 (**),
> it's significantly less likely to happen, and you'll probably have a
> warning in the form of an announcement on Slashdot that SHA-256 has
> been broken, before it happens.
>
> (*) Something which I have done myself on my home computer in a couple
> minutes, and apparently now can be done in a couple seconds.
>
> (**) Which, incidentally, is possibly the single most secure hash for
> Wikimedia to use at the current time. SHA-512 is significantly more
> "broken" than SHA-256, and the more theoretically secure hashes have
> received much less scrutiny than SHA-256. If you want to be more
> secure than SHA-256, you should combine SHA-256 with some other
> hashing algorithm.)
>
>
>
> ------------------------------
>
> Message: 9
> Date: Sun, 18 Sep 2011 18:06:21 -0400
> From: Anthony <wikimail(a)inbox.org>
> Subject: Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table
> (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CAPreJLQ0YUq9j8zr52Lme2eo=ijyjN6x6CssF=xcFcsTo6YTUg(a)mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Sun, Sep 18, 2011 at 6:01 PM, Anthony <wikimail(a)inbox.org> wrote:
> > There's also a
> > description at http://en.wikipedia.org/wiki/User:Grawp , which does
> > not do justice to the "mad hacker skillz" of this individual and his
> > intent on finding bugs in mediawiki and exploiting them.
>
> (and/or the Grawp copycats - personally I don't know if it was "Grawp"
> himself or a copycat that attacked my wiki)
>
>
>
> ------------------------------
>
> Message: 10
> Date: Sun, 18 Sep 2011 18:12:34 -0400
> From: Anthony <wikimail(a)inbox.org>
> Subject: Re: [Wikitech-l] Fwd: Adding MD5 / SHA1 column to revision
> table (discussing r94289)
> To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
> Message-ID:
> <CAPreJLR7gd=jNMnrZ-bxyB0RPx7sdPOSzygKALx8S1Wg8N2Q1w(a)mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Sun, Sep 18, 2011 at 6:00 PM, Roan Kattouw
> <roan.kattouw(a)gmail.com> wrote:
> > On Sun, Sep 18, 2011 at 11:00 PM, Anthony
> <wikimail(a)inbox.org> wrote:
> >> Now I don't know how important the CPU differences in calculating the
> >> two versions would be. If they're significant enough, then fine, use
> >> MD5, but make sure there are warnings all over the place about its
> >> use.
> >>
> > I ran some benchmarks on one of the WMF machines. The input I used is
> > a 137.5 MB (144,220,582 bytes) OGV file that someone asked me to
> > upload to Commons recently. For each benchmark, I hashed the file 25
> > times and computed the average running time.
> >
> > MD5: 393 ms
> > SHA-1: 404 ms
> > SHA-256: 1281 ms
>
> Did you try any of the non-secure hash functions? If you're going to
> go with MD5, might as well go with the significantly faster CRC-64.
>
> If you're just using it to detect reverts, then you can run the CRC-64
> check first, and then confirm with a check of the entire message.
>
>
>
> ------------------------------
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
> End of Wikitech-l Digest, Vol 98, Issue 30
> ******************************************
>
Hi All,
This is my first post to wikitech-l! The 2011 Fundraiser draws near, and
we're trying to set up efficient code review for the analytics source, which
has a fairly high throughput. I have one code-review volunteer already, but
I wanted to see if anyone else would be interested in helping to review as we
approach, and ultimately kick off, the Fundraiser.
The source can be found here:
http://svn.wikimedia.org/svnroot/wikimedia/trunk/fundraiser-analysis/
The analytics reporting interface has been implemented using Django
(../fundraiser-analysis/web_reporting) and is hosted on Wikimedia servers.
The data processing portion of the source, found mostly
in ../fundraiser-analysis/classes/DataLoader.py
and ../fundraiser-analysis/classes/DataReporting.py, interacts with the
MySQL back-end database where banner and landing page impressions are
stored. The visualization is generated using matplotlib and flot
(http://code.google.com/p/flot/), a jQuery-based library for data
visualization.
If you think you might be interested in lending a hand, fire me a
message!
--
Ryan Faulkner
Data Analyst - Community Department
Wikimedia Foundation
mobile: (415) 793-5086
office: (415) 839-6885 ext 6726
ArchiveLinks was created as a GSoC project to address the problem of linkrot on Wikipedia. In articles we often cite or link to external URLs, but anything could happen to content on
other sites -- if they move, change, or simply vanish, the value of the citation is lost. ArchiveLinks rewrites external links in Wikipedia articles, so there is a '[cached]' link immediately afterwards which points to the web archiving service of your choice. This can even preserve the exact time that the link was added, so for sites which archive multiple versions of content (such as the Internet Archive) it will even link to a copy of the page that was made around the time the article was written.
Next, ArchiveLinks publishes a feed of recently added external links via the API, so your favorite remote archiving service can crawl them in a timely fashion. We have been talking with the Internet Archive about this; they are eager to get a list of the recent external links from Wikipedia, since they believe our community will probably be linking to some of the most important and useful content on the web.
ArchiveLinks also contains a simple spidering system if you want to cache the links yourself, and display them through MediaWiki.
We completed almost all of our planned features (https://secure.wikimedia.org/wikipedia/mediawiki/wiki/User:Kevin_Brown/Arch…) and the next step is to campaign to get this adopted on Wikipedia. A lot of people are enthusiastic about the concept, but it is likely we will get more input on exactly what the "cached" link should look like, and it will take some time to get a security review. At the same time, we are working with the Internet Archive to set up a test site for them to crawl the feed (perhaps from the Toolserver, before it is deployed on Wikipedia). Once the feed is set up on the Toolserver, the Internet Archive will start archiving all links that appear in it. That will effectively leave producing the "cached" link in the deployed version of MediaWiki as the last step toward fixing linkrot everywhere it is possible.
(Thanks to Neil Kandalgaonkar for writing the majority of this email).
On Sun, Sep 18, 2011 at 1:55 AM, Robert Rohde <rarohde(a)gmail.com> wrote:
> If collision attacks really matter we should use SHA-1.
If collision attacks really matter you should use, at least, SHA-256, no?
> However, do
> any of the proposed use cases care about whether someone might
> intentionally inject a collision? In the proposed uses I've looked at
> it, it seems irrelevant. The intentional collision will get flagged
> as a revert and the text leading to that collision would be discarded.
> How is that a bad thing?
Well, what if the checksum of the initial page hasn't been calculated
yet? Then some miscreant sets the page to spam which collides, and
then the spam gets reverted. The good page would be the one that gets
thrown out.
Maybe that's not feasible. Maybe it is. Either way, I'd feel very
uncomfortable about the fact that someday someone might decide to use
the checksums in some way in which collisions would matter.
Now I don't know how important the CPU differences in calculating the
two versions would be. If they're significant enough, then fine, use
MD5, but make sure there are warnings all over the place about its
use.
(As another possibility, what if someone writes a bot to detect
certain reverts? I can see spammers/vandals having a field day with
this sort of thing.)
>> For offline analyses, there's no need to change the online database tables.
>
> Need? That's debatable, but one of the major motivators is the desire
> to have hash values in database dumps (both for revert checks and for
> checksums on correct data import / export). Both of those are
> "offline" uses, but it is beneficial to have that information
> precomputed and stored rather than frequently regenerated.
Why not in a separate file? There's no need to get permission from
anyone or mess with the schema to generate a file with revision ids
and checksums. If WMF won't host it at the regular dump location
(which I can't see why they wouldn't), you could host it at
archive.org.
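To illustrate how little is needed for the separate-file approach, here is a
rough sketch (not an existing maintenance script) that streams a
pages-meta-history XML dump and prints one "rev_id,sha1" line per revision;
the input filename and output format are just examples:

    $reader = new XMLReader();
    $reader->open( 'pages-meta-history.xml' ); // example input dump
    $doc = new DOMDocument();

    $more = $reader->read();
    while ( $more ) {
        if ( $reader->nodeType === XMLReader::ELEMENT && $reader->name === 'revision' ) {
            // Copy the whole <revision> element into a DOM tree to read its children.
            $rev = $doc->importNode( $reader->expand(), true );
            // The first <id> inside <revision> is the revision id.
            $revId = $rev->getElementsByTagName( 'id' )->item( 0 )->nodeValue;
            $textNode = $rev->getElementsByTagName( 'text' )->item( 0 );
            $text = $textNode ? $textNode->nodeValue : '';
            echo $revId . ',' . sha1( $text ) . "\n";
            $more = $reader->next(); // jump past this <revision> subtree
        } else {
            $more = $reader->read();
        }
    }
    $reader->close();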