[Wikimedia-l] Multivariate Fundraising Tests (Re: compromise?)

Fri Dec 28 23:02:44 UTC 2012

Matt, I have specific answers to most of your questions, but I don't
know whether others on wikimedia-l would be interested in them, and
I'm not sure about the specifics of a couple terms you used relative
to what I remember of the testing harness, so I'll reply in more
detail off-list with some questions about the terms over the weekend.

For now, I think the banner text message has always been the most
important part of any appeal. Suppose you were to take all 300 of the
existing volunteer submissions (and accept more -- e.g. "How much you
donate may help determine how much we pay our programmers" would be
incredibly effective, and I hope you will measure it) and run them all
without any javascript, pull-down, landing page, or other changes over
a one week period, with about 3000 impressions each at random times of
day and days of week. You would then have plenty to work with.  That's
about a million impressions (300 x 3000 = 900,000), or a 0.3%
impressions test, which I believe will give you well over 95%
confidence in the results.
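
As a rough check on the arithmetic above, here is a minimal sketch of
the impression count and of how tightly 3000 impressions pin down a
single banner's donation rate. The 1% donation rate is purely a
hypothetical placeholder, not a measured figure, and the interval uses
the ordinary normal approximation rather than anything taken from the
fundraising team's testing harness.

```python
import math


def total_impressions(banners: int, per_banner: int) -> int:
    """Total impressions needed to show every banner the same number of times."""
    return banners * per_banner


def ci_half_width(rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a
    proportion; z = 1.96 corresponds to roughly 95% confidence."""
    return z * math.sqrt(rate * (1.0 - rate) / n)


BANNERS = 300        # existing volunteer submissions, from the post
PER_BANNER = 3000    # impressions per banner, from the post
ASSUMED_RATE = 0.01  # HYPOTHETICAL donation rate, for illustration only

total = total_impressions(BANNERS, PER_BANNER)
half = ci_half_width(ASSUMED_RATE, PER_BANNER)

print(f"total impressions: {total}")
print(f"95% interval at assumed rate: {ASSUMED_RATE:.4f} +/- {half:.4f}")
```

Under that assumed rate the interval is a bit under +/- 0.4 percentage
points per banner, so how much confidence the test gives depends on the
real donation rate and on how far apart the banners turn out to be.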

That would not account for banner fatigue, which may be significant
all the way from timezone-to-timezone up to year-to-year, but I have
no ideas about how to account for that other than to do a multivariate
test shortly before beginning fundraising in earnest.

On Fri, Dec 28, 2012 at 3:46 PM, Matthew Walker <mwalker at wikimedia.org> wrote:
> James,
>
> On Fri, Dec 28, 2012 at 2:11 PM, James Salsman <jsalsman at gmail.com> wrote:
>>
>> I mean as in the tests done May 16, September 20, and October 9
>> reported at
>> http://meta.wikimedia.org/wiki/Fundraising_2012/We_Need_A_Breakthrough
>> without adjusting the best performing pull-down delivery combined
>> banner/landing page from the beginning of this month
>
>
> I obviously cannot speak for what Zack will end up doing but let's talk shop
> for a moment on how this would be implemented.
>
> The tests you indicated play banner, landing page impressions, and donation
> amount against each other. It appears that everyone saw a collection of
> random banners (ie: the test was not bucketed.) Are these the same variables
> you want to test?
>
> Regardless of the answer to the above; how do you propose we normalize our
> tests across time of day, day of week, and day of month factors - we've seen
> evidence that these all play a role. I don't know how many banner variations
> we actually have to test but it's likely we won't be able to test them all
> at the same time (In fact with the current weighting setup we can only test
> 30 banners at a time). Do we just take each group as it stands -- find the
> best performers in the group and then test the winners against each other?
>
> An additional consideration is that we have four buckets to play with; buckets
> are independent so we could potentially test 120 banners at a time to four
> different groups. Presumably if we did this we would want a couple of
> control banners in each to normalize with?
>
> Another thing to consider is how long we have to run these
> tests to gain statistical significance. At least a day, I'm guessing. Are we
> going to account for banner fatigue at all? IE: show banners during only the
> first 10 visits like we just did with this most recent campaign?
>
> --
> ~Matt Walker
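
The batching question in the quoted message (at most 30 banners per
bucket, four independent buckets, a couple of control banners in each)
can be sketched as below. This is only an illustration under stated
assumptions: plan_rounds, the banner names, and the two control names
are all hypothetical, and nothing here reflects how the real weighting
setup is configured.

```python
import random


def plan_rounds(banners, controls, slots_per_bucket=30, buckets=4, seed=0):
    """Split banners into test rounds.

    Each round has up to `buckets` independent buckets; each bucket holds
    the shared control banners plus as many test banners as still fit in
    `slots_per_bucket`. Banners are shuffled first so that grouping does
    not follow submission order.
    """
    rng = random.Random(seed)
    shuffled = list(banners)
    rng.shuffle(shuffled)

    per_bucket = slots_per_bucket - len(controls)  # test banners per bucket
    per_round = per_bucket * buckets               # test banners per round

    rounds = []
    for start in range(0, len(shuffled), per_round):
        chunk = shuffled[start:start + per_round]
        round_buckets = []
        for b in range(buckets):
            group = chunk[b * per_bucket:(b + 1) * per_bucket]
            if group:  # the last round may not fill every bucket
                round_buckets.append(list(controls) + group)
        rounds.append(round_buckets)
    return rounds


# Hypothetical names, for illustration only.
all_banners = [f"banner_{i:03d}" for i in range(300)]
control_banners = ["control_a", "control_b"]

rounds = plan_rounds(all_banners, control_banners)
for number, round_buckets in enumerate(rounds, start=1):
    sizes = [len(bucket) for bucket in round_buckets]
    print(f"round {number}: bucket sizes {sizes}")
```

Because the same controls appear in every bucket, each banner's result
can be expressed relative to the controls shown alongside it, which is
one way to compare winners from rounds run at different times.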