I have followed that process, been subscribed to
https://phabricator.wikimedia.org/T44259 which I just reread
and thus rather surprised by your comment. I have never
seen any technical reason mentioned in the bug. It would
have been very helpful, because someone might have come up
with a fix in the two years when it was "on our roadmap" un-
til you overcame them.
It wasn't my intention to dig up this history, just to point out that the
real story is always more complex. That applies to whatever explanation I
give here, as well. It's from my perspective, and the nuance is endless.
Anyway, I'm more than happy to try and shed light; hopefully this helps with
future work we do together.
The technical challenge was, basically, moving off of our udp2log based
logging infrastructure to Kafka. I think it's fair to say that the
Analytics team didn't have the full trust and confidence of WMF until Toby
started turning that around. We were subjected to some painful agile
coaching and were not allowed to implement the correct solution (Kafka)
fully; we were working with a patchwork system that still had single points
of failure and data loss. Once we gained that trust, it still took a while
to sort out how to tune Kafka so it reliably received traffic logs from all
of our caching centers, and let us know when it had loss or duplication of
data. This work was in really good shape, if memory serves, by the end of
summer 2014. I incorrectly summarized that solely as a technical
challenge; it was a pretty tricky technical challenge combined with an
organizational one. For the latter, if it helps, Sue and Erik both
acknowledged responsibility and things were much smoother after that. (I
always had tremendous respect for the two of them, but that acknowledgement
was pretty amazing, and unique in my 12 years of experience.)
At that point, October 2014, some of us, myself included, wanted to start
work on the pageview API. We didn't get push-back so much as a strong push
to focus on Event Logging instead. The Event Logging system, developed by
Ori, was also experiencing some pretty serious growing pains. Outages were
becoming very frequent due to the increased traffic and lack of automated
monitoring and management. Over the next few months we improved
performance and upgraded it to use Kafka as well, and solved those
problems. Looking back, that's still a bittersweet choice for me. This
work on Event Logging was absolutely key to the experiments that led to
Visual Editor's successful roll-out in 2015. As one of many examples, this
dashboard would not have been possible without a stable Event Logging
platform:
https://edit-analysis.wmflabs.org/compare/. And perhaps this
was Toby's strategic vision that I didn't see at the time, and was very
important for us to keep our newly gained trust and independence within
WMF. But, of course, it meant we had to delay the pageview API yet again.
That's the 6-month delay I mentioned. And we didn't leave the community
hanging: we made the higher quality raw data available with mobile traffic
in this new dataset:
http://dumps.wikimedia.org/other/pagecounts-all-sites/
and also gave Henrik some support with stats.grok.se.
Some of these things are mentioned on the epic T44259
<https://phabricator.wikimedia.org/T44259>, but some I didn't even truly
understand at the time, and some might not have been constructive to
mention. I'm personally all ears at this point. What of this should we
have noted on the task? Like I said above, there's lots of detail, but at
some point it would feel like I'm a news reporter instead of an engineer :)
Also, I'm not sure I would have seen it the same way. Even a few months
ago when we released the pageview API I was still a bit bitter that the
Event Logging work was prioritized, and now I think that was me being
short-sighted to some extent.
Instead, I read for example Toby's comment at Magnus's blog
(http://magnusmanske.de/wordpress/?p=173#comment-290):
| […]
| We’ve been prioritizing and working on these projects as our
| resources allow and it’s important to understand that the
| team has not been idle. While we’ve done a less than stel-
| lar job in communicating our progress to the community, in-
| formation on what we’ve been doing is available via our
| planning pages on mediawiki. In the future, we will be more
| proactive in communicating with the community regarding our
| goals and projects.
as meaning that there were no technical obstacles, but lim-
ited resources that were directed to other projects (and ap-
parently none that matched the popularity of a pageviews
API).
Both can be true, and are true. The challenge was great: from what I
understand, what we accomplished took Twitter orders of magnitude more money
and people, a fact which makes me look at my teammates with complete awe
(they're amazing). And, as I explained above, we also had to prioritize
other work.
My interpretation may have been biased by Magnus's report above that:
| […]
| Like others, I have tried to get the Foundation to provide
| the page view data in a more accessible and local (as in
| toolserver/Labs) way. Like others, I failed. The last it-
| eration was a video meeting with the Analytics team (newly
| restarted, as the previous Analytics team didn’t really work
| out for a reason; I didn’t inquire too deeply), which ended
| with a promise to get this done Real Soon Now™, and the gen-
| erous offer to use the page view data from their hadoop
| cluster. Except the cluster turned out to be empty; I then
| was encouraged to import the view data myself. (No, this is
| not a joke. I have the emails to prove it.) As much as I
| enjoy working with and around the Wikiverse, I do have nei-
| ther the time, the bandwidth, nor the inclination to do your
| paid jobs for you, thank you very much.
| […]
which seems to indicate that it was indeed a problem of WMF
allocating (human) resources.
No, that video call was my fault. I felt like I was sitting on burning hot
coals and I couldn't stand having some of the data in the cluster and not
being able to make it available publicly any more. So I tried to offer
Magnus access to the cluster and dedicate my volunteer time to help him get
to the data. This mainly failed because I lost my volunteer time to a
personal crisis that I can't get into here (it had absolutely nothing to do
with the foundation, it was just unfortunate timing from Magnus's point of
view).
I hope that helps. Above all, getting this project done is my proudest
professional moment, and I think in some sense the delays only made it
better when it finally came out. The members of the Analytics team that
are involved in the pageview API now are ten times smarter and better
equipped to handle the project than I would have been by myself in October
2014.
Respectfully,
Dan