James,
On Fri, Dec 28, 2012 at 2:11 PM, James Salsman jsalsman@gmail.com wrote:
I mean as in the tests done May 16, September 20, and October 9 reported at http://meta.wikimedia.org/wiki/Fundraising_2012/We_Need_A_Breakthrough without adjusting the best performing pull-down delivery combined banner/landing page from the beginning of this month
I obviously cannot speak for what Zack will end up doing but let's talk shop for a moment on how this would be implemented.
The tests you indicated play banner, landing page impressions, and donation amount against each other. It appears that everyone saw a collection of random banners (i.e., the test was not bucketed). Are these the same variables you want to test?
Regardless of the answer to the above, how do you propose we normalize our tests across time of day, day of week, and day of month? We've seen evidence that these all play a role. I don't know how many banner variations we actually have to test, but it's likely we won't be able to test them all at the same time (in fact, with the current weighting setup we can only test 30 banners at a time). Do we just take each group as it stands -- find the best performers in the group and then test the winners against each other?
An additional consideration is that we have four buckets to play with; buckets are independent, so we could potentially test 120 banners at a time across four different groups. Presumably, if we did this, we would want a couple of control banners in each bucket to normalize against?
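To make the bucket arithmetic concrete, here is a minimal sketch of how banners might be spread across buckets while reserving slots for shared controls. The function name, the banner and control names, and the round-robin assignment are all hypothetical illustrations, not the actual weighting setup; only the 30-banner limit and the four buckets come from the message above.

```python
def assign_banners(banners, controls, n_buckets=4, slots_per_bucket=30):
    """Spread test banners over buckets; every bucket also carries all controls.

    Hypothetical helper: the limits mirror the numbers in the email
    (30 banners per bucket, 4 buckets), nothing else is from the real system.
    """
    free_slots = slots_per_bucket - len(controls)
    if free_slots <= 0:
        raise ValueError("controls leave no room for test banners")
    if len(banners) > free_slots * n_buckets:
        raise ValueError("too many banners for one round of testing")

    # Each bucket starts with the same control banners so results can be
    # normalized across buckets.
    buckets = [list(controls) for _ in range(n_buckets)]
    # Round-robin assignment keeps bucket sizes balanced.
    for i, banner in enumerate(banners):
        buckets[i % n_buckets].append(banner)
    return buckets


controls = ["control_a", "control_b"]            # assumed control banners
banners = [f"banner_{i}" for i in range(112)]    # 4 buckets x 28 free slots
buckets = assign_banners(banners, controls)
```

With two controls per bucket, the capacity for one round drops from 120 to 112 distinct test banners.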
Something else to consider is how long we have to run these tests to reach statistical significance. At least a day, I'm guessing. Are we going to account for banner fatigue at all, i.e., show banners during only the first 10 visits, like we just did with this most recent campaign?
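As a rough way to frame the "how long" question, the sketch below estimates the impressions each banner would need for a two-sided two-proportion z-test, then converts that to days. The donation rates (0.5% vs 0.6%) and the 50,000 impressions per banner per day are made-up placeholders, not measured figures; the formula is the standard normal-approximation sample size, not necessarily what the fundraising team uses.

```python
from math import ceil
from statistics import NormalDist


def impressions_per_banner(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate impressions per banner to detect p_control vs p_variant.

    Uses the normal-approximation formula for comparing two proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = (p_control - p_variant) ** 2
    return ceil((z_alpha + z_power) ** 2 * variance / effect)


# Hypothetical rates and traffic, for illustration only.
needed = impressions_per_banner(0.005, 0.006)
days = ceil(needed / 50_000)
```

The smaller the difference between two banners, the more impressions (and therefore days) are needed to tell them apart.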
Matt, I have specific answers to most of your questions, but I don't know whether others on wikimedia-l would be interested in them, and I'm not sure about the specifics of a couple of terms you used relative to what I remember of the testing harness, so I'll reply in more detail off-list over the weekend, with some questions about those terms.
For now, I think the banner text message has always been the most important part of any appeal. If you were to take all 300 of the existing volunteer submissions (and accept more -- e.g. "How much you donate may help determine how much we pay our programmers" would be incredibly effective, and I hope you will measure it) and run them without any javascript, pull-down, landing page, or other changes over a one-week period, with about 3000 impressions each at random times of day and days of week, you would have plenty to work with. That's about a million impressions (300 x 3000 = 900,000), or a 0.3% impressions test, which I believe will give you well over 95% confidence in the results.
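The arithmetic here can be checked with a short sketch. The 300 banners, 3000 impressions each, and the 0.3% figure come from the paragraph above; the 0.5% donation rate is an assumed placeholder, and the margin of error uses a simple normal approximation. How much confidence the test actually gives depends on the real rates and on how far apart the banners perform.

```python
from math import sqrt
from statistics import NormalDist

banners = 300
impressions_each = 3000
total_impressions = banners * impressions_each   # 900,000, "about a million"

# If 900,000 impressions is 0.3% of traffic, the week's total is implied.
implied_weekly_traffic = total_impressions / 0.003


def margin_of_error(rate, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(rate * (1 - rate) / n)


# Assumed 0.5% donation rate, purely for illustration.
moe = margin_of_error(0.005, impressions_each)
```

Under that assumed rate, each banner's measured rate would carry a margin of roughly a quarter of a percentage point at 3000 impressions.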
That would not account for banner fatigue, which may be significant all the way from timezone-to-timezone up to year-to-year, but I have no ideas about how to account for that other than to do a multivariate test shortly before beginning fundraising in earnest.
On Fri, Dec 28, 2012 at 3:46 PM, Matthew Walker mwalker@wikimedia.org wrote:
-- ~Matt Walker
wikimedia-l@lists.wikimedia.org