Over the last couple of months, I have worked on introducing a dependency
injection mechanism into MediaWiki core (don't fear, no auto-wiring). My
proposal is described in detail at <https://phabricator.wikimedia.org/T124792>
(yes, it's a long read; TL;DR: read the top section and search the rest if you have a question).
Before we discuss this again on IRC at the RFC meeting on Wednesday (March 23,
2pm PST / 22:00 CEST due to daylight confusion), I would like to invite you to
review the proposal as well as the patches that are up on gerrit. In particular,
any feedback would be appreciated on:
* Introduce top level service locator
* Allow reset of global services <https://gerrit.wikimedia.org/r/#/c/270020/>
* WIP: Make storage layer services injectable.
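The actual implementation under review is PHP, so this is only a language-neutral sketch of the pattern the patches introduce: a top-level container that instantiates services lazily from factory callables and can be reset for tests (the names below are hypothetical, not the real MediaWiki class or service names):

```python
class ServiceContainer:
    """Minimal service locator: services are registered as factory
    callables and instantiated lazily on first access."""

    def __init__(self):
        self._factories = {}
        self._instances = {}

    def define(self, name, factory):
        self._factories[name] = factory

    def get(self, name):
        # Instantiate on first use, then reuse the same instance.
        if name not in self._instances:
            self._instances[name] = self._factories[name](self)
        return self._instances[name]

    def reset(self):
        # Drop instances but keep the wiring -- roughly what "reset of
        # global services" enables for test isolation.
        self._instances.clear()


# Hypothetical wiring, loosely modeled on the storage-layer change.
services = ServiceContainer()
services.define("LoadBalancer", lambda c: object())
services.define("RevisionStore",
                lambda c: ("RevisionStore", c.get("LoadBalancer")))

store = services.get("RevisionStore")
assert store is services.get("RevisionStore")       # cached instance
services.reset()
assert store is not services.get("RevisionStore")   # fresh after reset
```

The point of the reset hook is that tests can swap or rebuild services without process-global state leaking between test cases.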
Perhaps also have a look at the documentation included in the change, in
particular the migration section.
Before commenting on design choices on gerrit, please have a look at T124792 and
see whether I have written something about the issue in question there. I would
like to focus conceptual discussion on the RFC ticket on phabricator, rather
than on gerrit. On gerrit, we can talk about the implementation.
I very much want this to move forward. Perhaps we can even get the first bits of
this merged at the hackathon. So, criticize away!
Thanks for your help!
PS: phabricator event page (still blank, we'll fix that soon):
On Wed, Mar 23, 2016 at 1:06 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:
> Dan Andreescu, 23/03/2016 15:58:
>> *Clean-up:* Analytics data on dumps was crammed into /other with
>> unrelated datasets. We made a new page to receive current and future
>> datasets  and linked to it from /other and /. Please let us know if
>> anything there looks confusing or opaque and I'll be happy to clarify.
> I assume the old URLs will redirect to the new ones, right?
Good question. We actually didn't change any old URLs, so if you're trying
to get to other/pagecounts-ez, other/pagecounts-raw and all that, they're
all still there, just linked to from /analytics. We did it this way
because we figured people had scripts that depended on those URLs. We
thought about moving and symlinking, but it's unlikely that we'll ever be
able to delete the other/** location.
So mainly we just have a new page where we can do a better job of focusing
on the analytics datasets.
cc-ing our friends in research and wikitech (sorry I forgot initially)
We're happy to announce a few improvements to Analytics data releases on
> * We are releasing a new dataset, an estimate of Unique Devices accessing
> our projects 
> * We are officially making available a better Pageviews dataset 
> * We are deprecating two older pageview statistics datasets
> * We moved Analytics data from /other to /analytics 
> Details follow:
> *Unique Devices:* Since 2009, the Wikimedia Foundation had used comScore
> to report data about unique web visitors. In January 2016, however, we
> decided to stop reporting comScore numbers because of limitations in the
> methodology; those limitations translated into misreported mobile usage.
> We are now ready to replace the comScore numbers with the Unique Devices
> dataset. While unique devices do not equal unique visitors, they are a
> good proxy for that metric, meaning that a major increase in the number
> of unique devices is likely to come from an increase in distinct users.
> We understand that counting uniques raises serious privacy concerns, so
> we count unique devices in a privacy-conscious way: it does not involve
> any cookie by which your browser history could be tracked.
> We invite you to explore this new dataset and hope it’s helpful for the
> Wikimedia community in better understanding our projects. This data can
> help measure the reach of Wikimedia projects on the web.
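(The announcement doesn't spell out the counting method here; as a purely hypothetical sketch of cookie-light unique counting, a server can store only a "last seen" date client-side and count a device as new whenever that date is missing or falls before the current reporting period. No persistent identifier that could link requests into a browsing history is needed.)

```python
from datetime import date

def count_uniques(requests, period):
    """Count unique devices for a reporting period (year, month).

    `requests` is a list of last-seen dates sent back by clients,
    or None for a device with no stored date. Only the date is ever
    stored client-side -- no device identifier.
    """
    uniques = 0
    for last_seen in requests:
        if last_seen is None or (last_seen.year, last_seen.month) < period:
            uniques += 1  # first request from this device this period
    return uniques

# Three requests in March 2016: a brand-new device, one last seen in
# February, and one already seen this month.
reqs = [None, date(2016, 2, 10), date(2016, 3, 5)]
assert count_uniques(reqs, (2016, 3)) == 2
```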
> *Pageviews:* This is the best-quality data available for counting the
> number of pageviews our projects receive at the article and project level.
> We've upgraded from pagecounts-raw to pagecounts-all-sites, and now to
> pageviews, in order to filter out more spider traffic and measure something
> closer to what we think is a real user viewing content. A short history
> might be useful:
> * pagecounts-raw: was originally maintained by Domas Mituzas and later
> taken over by the Analytics team. It was and still is the most used
> dataset, though it has some major problems: it does not count access to
> the mobile site, it does not filter out spider or bot traffic, and it
> suffers from unknown loss due to logging infrastructure limitations.
> * pagecounts-all-sites: uses the same pageview definition as
> pagecounts-raw, and so also does not filter out spider or bot traffic. But
> it does include access to mobile and zero sites, and is built on a more
> reliable logging infrastructure.
> * pagecounts-ez: is derived from the best data available at the time.
> So until December 2015, it was based on pagecounts-raw and
> pagecounts-all-sites, and now it's based on pageviews. This dataset is
> great because it compresses very large files without losing any
> information, still providing hourly page and project level statistics.
> So the new dataset, pageviews, is what's behind our pageview API and is
> now available in static files for bulk download back to May 2015. But the
> multiple ways to download pageview data are confusing for consumers, so
> we're keeping only pageviews and pagecounts-ez and deprecating the other
> two. If you'd like to read more about the current pageview definition,
> details are on the research page.
> *Deprecating:* We are deprecating the pagecounts-raw and
> pagecounts-all-sites datasets in May 2016 (discussion here:
> https://phabricator.wikimedia.org/T130656 ). This data suffers from many
> artifacts, lack of mobile data, and/or infrastructure problems, and so is
> not comparable to the new way we track pageviews. It will remain here
> because we have historical data that may be useful, but it will not be
> maintained or updated beyond May 2016.
> *Clean-up:* Analytics data on dumps was crammed into /other with
> unrelated datasets. We made a new page to receive current and future
> datasets  and linked to it from /other and /. Please let us know if
> anything there looks confusing or opaque and I'll be happy to clarify.
>  http://dumps.wikimedia.org/other/unique_devices
>  http://dumps.wikimedia.org/other/pageviews
>  http://dumps.wikimedia.org/analytics/
>  https://meta.wikimedia.org/wiki/ComScore/Announcement
>  https://meta.wikimedia.org/wiki/Research:Unique_Devices
>  https://meta.wikimedia.org/wiki/Research:Page_view
Thank you for your question:
>... What does the "reputation score" in the description refer to?
I've asked Priyanka to reply with her current design, but here is some
of the advice I gave her:
"Each reviewer needs, at a minimum, data indicating the number and
proportion of reviewers who have agreed with them. However, the third level
of tie-breaking review introduces an extra bit for each disagreement which
determines whether agreement or disagreement should be counted in their
favor. So, even if a given reviewer only agrees with 50% of the other
reviewers, the determination of the tie breaker in each case of
disagreement controls whether their reputation score ranges from 0% to
100%. (As too does the agreement proportion, which is unlikely to be
"Do you want the reviewers to know their agreement ratios and reputation
scores? How might their behavior change if they are and aren't told those?
Could there ever be a case when you might want to withhold them? Would
there ever be a benefit from distorting them? How about displaying them as
a range instead of distorting or withholding them? That last possibility
seems superior to me. You might want to do that when you are unsure that
the precision of the mathematical values is near the accuracy of the
knowledge they represent. Do you want to be able to tell each reviewer the
responses which have contributed to defects in their reputation scores,
i.e., do you want them to know which disagreements were tie-broken against
them?"
Her reply at the time was:
"In case of two reviewers agreeing, we add a +1 to the reputation. In case
of disagreement, we seek the opinion of a 3rd reviewer. If A says Yes, B
says No and C says Yes to an edit, A and C will have an agreement ratio of
50% and a reputation of 100%, whereas B will have an agreement ratio of 0%
and a reputation of 0%. This would of course change as more edits are
reviewed."
I believe that is still an accurate description of the current design.
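The scoring described in the quoted design can be sketched as follows. This is a minimal model, not the project's actual code: the agreement ratio is the share of pairwise agreements with other reviewers on the same edits, and the reputation is the share of one's votes that match the verdict after the third reviewer breaks the tie (all names here are hypothetical):

```python
from collections import defaultdict

def score_reviewers(edits):
    """edits: list of {reviewer: vote} dicts, one per reviewed edit.
    Returns {reviewer: (agreement_ratio, reputation)}."""
    agree = defaultdict(lambda: [0, 0])  # reviewer -> [agreed, compared]
    rep = defaultdict(lambda: [0, 0])    # reviewer -> [matched verdict, votes]
    for votes in edits:
        counts = list(votes.values())
        # The verdict is the majority once the tie-breaking (third)
        # reviewer has voted.
        verdict = max(set(counts), key=counts.count)
        for r in votes:
            rep[r][1] += 1
            if votes[r] == verdict:
                rep[r][0] += 1
            for other in votes:
                if other != r:
                    agree[r][1] += 1
                    if votes[r] == votes[other]:
                        agree[r][0] += 1
    return {r: (agree[r][0] / agree[r][1], rep[r][0] / rep[r][1])
            for r in agree}

# The example from the thread: A says Yes, B says No, and the
# tie-breaker C says Yes.
scores = score_reviewers([{"A": "Yes", "B": "No", "C": "Yes"}])
assert scores["A"] == (0.5, 1.0)   # agrees with 1 of 2, matched verdict
assert scores["B"] == (0.0, 0.0)   # agrees with 0 of 2, tie broken against
```

With more edits reviewed, both ratios shift toward each reviewer's long-run behavior, which is the "would of course change" point above.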
Finally, I regret that the GSoC program doesn't allow more than one student
Hi, I'm Li Linxuan, a second-year student from Peking University, China. I'm familiar with C/C++ and have experience using Python. I have also participated in some projects, including making games and game bots.
I am interested in the "Accuracy Review" project, but there is a note saying the estimated time for a senior contributor is 3 weeks. Other projects on the ideas list have similar estimated times. So should we complete more than one project during the three-month internship, or just one?
Hey, I have a new topic I'd like to discuss. It's about mbstring and
whether we really need to support running without it.
The RFC is at https://gerrit.wikimedia.org/r/#/c/267309/
Here's a copy:
MediaWiki currently relies heavily on Unicode support to serve 300+
languages, yet it does not require the mbstring PHP extension to
function. Instead, we fall back to PHP-only implementations when native
support is not available. This creates a few problems:
* These fallbacks are extremely slow. The script in P2734
<https://phabricator.wikimedia.org/P2734> demonstrates that fallbacks are
roughly an order of magnitude slower on PHP 5.6. In extreme cases, they
can be 100+ times slower (per a comment in Fallback.php).
* These fallbacks cover only a few functions. If there's no fallback,
either ad-hoc solutions are used in places, or, like in SwiftFileBackend,
we just say "mbstring is required".
* This also means that extensions can't expect any consistent Unicode
support.
* Won't somebody please think of the children!
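The real fallbacks and the P2734 benchmark are PHP; as a rough analogue only, the same effect is easy to reproduce in any interpreted language by comparing a per-character loop against the runtime's native (C-implemented) equivalent, here Python's str.upper():

```python
import timeit

def upper_fallback(s):
    """Pure-interpreter uppercase: walk the string one character at a
    time, analogous in spirit to a PHP-only mb_* fallback."""
    out = []
    for ch in s:
        out.append(ch.upper())
    return "".join(out)

text = "ünïcödé strïng " * 5000

# Same result as the native implementation...
assert upper_fallback(text) == text.upper()

# ...but dramatically slower, because every step runs in the
# interpreter instead of native code.
native = timeit.timeit(lambda: text.upper(), number=5)
fallback = timeit.timeit(lambda: upper_fallback(text), number=5)
assert fallback > native
```

The exact ratio varies by input and runtime, but the shape of the result matches the benchmark's point: correctness is preserved, performance is not.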
Now that we've dramatically increased PHP requirements, we've already cut
off a lot of crappy environments, so this change will likely not affect too
many users:
* On Debian-based systems, a simple apt-get install php5 gives you mbstring
* On RPM-based, a separate package is required
* On Windows, people tend to use *AMP all-in-one packages that have
mbstring bundled
Current mbstring usage in core (excluding fallbacks themselves):
mediawiki/includes$ grep -orEh '\bmb_\w+' . | sort | uniq -c
Some time ago, I committed https://gerrit.wikimedia.org/r/#/c/267309/ to
start a discussion, but it went largely unnoticed, so I'd like to restart
it here on the list.
Max Semenik ([[User:MaxSem]])