The English Wikipedia is showing a pattern that I don't see on several
other wikis. If I'm not mistaken, in April 2020 monthly active editors
passed 43k for the first time since 2011 (the year MobileFrontend was
created).
<https://stats.wikimedia.org/#/en.wikipedia.org/contributing/active-editors/…>
(As usual, the number will deflate somewhat in a few months, after the
deletions have run their course; the 43k threshold may still hold.)
The April peak looks like it continued and reinforced one of the
now-usual October/January/March peaks. Do we know how much of this
growth is organic or across the board, and how much is amplification of
already-known seasonal patterns (WikiEdu?)?
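For anyone who wants to pull the underlying numbers themselves, here is a
minimal sketch against the public AQS editors endpoint. The endpoint path,
parameter values, and response fields are my reading of the API docs
(verify against https://wikimedia.org/api/rest_v1/); "active editors" is
conventionally registered, non-bot editors with 5+ edits in the month,
hence the sum over the three activity buckets at or above that threshold.

    # Hedged sketch: monthly active editors via the public AQS editors API.
    # Endpoint path, parameters and the "editors" field are assumptions
    # from the docs -- verify before relying on them.
    import requests

    BASE = "https://wikimedia.org/api/rest_v1/metrics/editors/aggregate"
    BUCKETS = ["5..24-edits", "25..99-edits", "100..-edits"]  # 5+ edits/month

    def monthly_active_editors(project, start, end):
        totals = {}
        for bucket in BUCKETS:
            url = f"{BASE}/{project}/user/content/{bucket}/monthly/{start}/{end}"
            resp = requests.get(url, headers={"User-Agent": "active-editors-check"})
            resp.raise_for_status()
            for row in resp.json()["items"][0]["results"]:
                month = row["timestamp"][:7]  # e.g. "2020-04"
                totals[month] = totals.get(month, 0) + row["editors"]
        return totals

    print(monthly_active_editors("en.wikipedia.org", "20200101", "20200501"))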
Federico
Hi everybody,
Superset has been upgraded to 0.36 (latest upstream)!
<https://phabricator.wikimedia.org/T249495>
Please note: if I understood correctly, some charts now require an
explicit time window to work correctly; a -infinite to +infinite range no
longer seems to be supported. So if your charts look strange (e.g. all
datapoints collapsed), please check their time ranges :)
If you see any issues or regressions, please report them at
https://phabricator.wikimedia.org/T249495
Thanks!
Luca (on behalf of the Analytics team)
The WMF Research team has published a new pageview report of inbound
traffic coming from Facebook, Twitter, YouTube, and Reddit.[1]
The report contains a list of all articles that received at least 500 views
from one or more of these platforms (i.e. someone clicked a link on Twitter
that sent them directly to a Wikipedia article). The report is available
on-wiki and will be updated daily at around 14:00 UTC with traffic counts
from the previous calendar day.
We believe this report provides editors with a valuable new source of
information. Daily inbound social media traffic stats can help editors
monitor edits to articles that are going viral on social media and/or are
being linked to by the platforms themselves in order to fact-check
disinformation and other controversial content [2][3].
The social media traffic report also contains additional public article
metadata that may be useful when monitoring articles that are receiving
unexpected attention from social media sites (a query sketch follows the
list), such as:
- the total number of pageviews (from all sources) that article received
in the same period of time
- the number of pageviews the article received from the same platform
(e.g. Facebook) on the previous day, i.e. two days before the report date
- the number of editors who have the page on their watchlist
- the number of editors who have watchlisted the page AND recently
visited it
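As a rough illustration of where some of these numbers can be checked
independently, here is a sketch combining the public Pageviews REST API
with the MediaWiki Action API's watcher counts. The article title and date
are examples, and the watcher fields can be suppressed by the API for
lightly-watched pages.

    # Hedged sketch: cross-check an article's total pageviews plus its
    # watcher counts using public APIs. Article and date are examples.
    import requests

    HEADERS = {"User-Agent": "smt-report-crosscheck"}
    article = "Coronavirus"  # example article

    # Total pageviews ('user' agent, all access methods) for one day.
    pv_url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"en.wikipedia/all-access/user/{article}/daily/20200501/20200501"
    )
    views = requests.get(pv_url, headers=HEADERS).json()["items"][0]["views"]

    # Watchers and recently-visiting watchers via the Action API.
    params = {
        "action": "query", "format": "json", "titles": article,
        "prop": "info", "inprop": "watchers|visitingwatchers",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php",
                        params=params, headers=HEADERS).json()
    page = next(iter(resp["query"]["pages"].values()))
    print(views, page.get("watchers"), page.get("visitingwatchers"))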
We want your feedback! We have some ideas of our own for how to improve
the report, but we want to hear yours. If you have feature suggestions,
please add them here [4]. We intend to maintain this daily report for at
least the next two months. If we receive feedback that the report is
useful, we will consider making it available indefinitely.
If you have other questions about the report, please first check out our
(still growing) FAQ [5]. All questions, comments, concerns, ideas, etc.
are welcome on the project talk page on Meta [4].
1. https://en.wikipedia.org/wiki/User:HostBot/Social_media_traffic_report
2. https://www.engadget.com/2018/03/15/wikipedia-unaware-would-be-youtube-fact…
3. https://mashable.com/2017/10/05/facebook-wikipedia-context-articles-news-fe…
4. https://meta.wikimedia.org/wiki/Research_talk:Social_media_traffic_report_p…
5. https://meta.wikimedia.org/wiki/Research:Social_media_traffic_report_pilot/…
Cheers,
Jonathan
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
(Uses He/Him)
*Please note that I do not expect a response from you on evenings or
weekends*
Hi all,
The next Research Showcase will be live-streamed on Wednesday, May 20, at
9:30 AM PDT/16:30 UTC.
This month we will learn about recent research on machine learning systems
that rely on human supervision for their learning and optimization -- a
research area commonly referred to as Human-in-the-Loop ML. In the first
talk, Jie Yang will present a computational framework that relies on
crowdsourcing to identify influencers in social networks (Twitter) by
selectively obtaining labeled data. In the second talk, Estelle Smith will
discuss the role of the community in maintaining ORES, the machine
learning system that predicts content quality and is used in numerous
Wikipedia applications.
YouTube stream: https://www.youtube.com/watch?v=8nDiu2ebdOI
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
*OpenCrowd: A Human-AI Collaborative Approach for Finding Social
Influencers via Open-Ended Answers Aggregation*
By: Jie Yang, Amazon (current), Delft University of Technology (starting
soon)
Finding social influencers is a fundamental task in many online
applications ranging from brand marketing to opinion mining. Existing
methods heavily rely on the availability of expert labels, whose collection
is usually a laborious process even for domain experts. Using open-ended
questions, crowdsourcing provides a cost-effective way to find a large
number of social influencers in a short time. Individual crowd workers,
however, only possess fragmented knowledge that is often of low quality. To
tackle those issues, we present OpenCrowd, a unified Bayesian framework
that seamlessly incorporates machine learning and crowdsourcing for
effectively finding social influencers. To infer a set of influencers,
OpenCrowd bootstraps the learning process using a small number of expert
labels and then jointly learns a feature-based answer quality model and the
reliability of the workers. Model parameters and worker reliability are
updated iteratively, allowing their learning processes to benefit from each
other until an agreement on the quality of the answers is reached. We
derive a principled optimization algorithm based on variational inference
with efficient updating rules for learning OpenCrowd parameters.
Experimental results on finding social influencers in different domains
show that our approach substantially improves the state of the art by 11.5%
AUC. Moreover, we empirically show that our approach is particularly useful
in finding micro-influencers, who are very directly engaged with smaller
audiences.
Paper: https://dl.acm.org/doi/fullHtml/10.1145/3366423.3380254
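For intuition only, here is a toy sketch of the alternating-update idea
from the abstract (weight answers by worker reliability, then re-score
workers by agreement with the current quality estimates). The real
OpenCrowd is a feature-based Bayesian model fit with variational
inference, so treat this as a cartoon, not the paper's method.

    # Toy illustration (not OpenCrowd): jointly estimate answer quality
    # and worker reliability by iterating until they agree.
    import numpy as np

    def joint_estimate(answers, n_iters=50):
        """answers: (workers x candidates) binary matrix; answers[w, c] = 1
        if worker w nominated candidate c as an influencer."""
        n_workers, _ = answers.shape
        reliability = np.full(n_workers, 0.8)  # prior worker reliability
        for _ in range(n_iters):
            # Answer quality: reliability-weighted vote share per candidate.
            quality = reliability @ answers / reliability.sum()
            # Worker reliability: average agreement with current quality.
            agreement = answers * quality + (1 - answers) * (1 - quality)
            reliability = agreement.mean(axis=1)
        return quality, reliability

    votes = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]])  # 3 workers, 3 candidates
    print(joint_estimate(votes))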
*Keeping Community in the Machine-Learning Loop*
By: C. Estelle Smith, MS, PhD Candidate, GroupLens Research Lab at the
University of Minnesota
On Wikipedia, sophisticated algorithmic tools are used to assess the
quality of edits and take corrective actions. However, algorithms can fail
to solve the problems they were designed for if they conflict with the
values of communities who use them. In this study, we take a
Value-Sensitive Algorithm Design approach to understanding a
community-created and -maintained machine learning-based algorithm called
the Objective Revision Evaluation System (ORES)—a quality prediction system
used in numerous Wikipedia applications and contexts. Five major values
converged across stakeholder groups that ORES (and its dependent
applications) should: (1) reduce the effort of community maintenance, (2)
maintain human judgement as the final authority, (3) support differing
peoples’ differing workflows, (4) encourage positive engagement with
diverse editor groups, and (5) establish trustworthiness of people and
algorithms within the community. We reveal tensions between these values
and discuss implications for future research to improve algorithms like
ORES.
Paper:
https://commons.wikimedia.org/wiki/File:Keeping_Community_in_the_Loop-_Unde…
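For anyone who has not used ORES directly, scores can be fetched from its
public HTTP API; a minimal sketch follows (the model name and response
layout are as I recall them from the docs, so verify against
https://ores.wikimedia.org).

    # Hedged sketch: fetch an ORES 'damaging' prediction for a revision.
    # Response structure from memory of the v3 API docs -- verify first.
    import requests

    rev_id = 963110117  # example revision ID
    url = f"https://ores.wikimedia.org/v3/scores/enwiki/{rev_id}/damaging"
    resp = requests.get(url, headers={"User-Agent": "ores-example"})
    resp.raise_for_status()

    score = resp.json()["enwiki"]["scores"][str(rev_id)]["damaging"]["score"]
    print(score["prediction"], score["probability"])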
--
Janna Layton (she, her)
Administrative Assistant - Product & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Hello:
We have added an 'automated' marker to Wikimedia's pageview data. Up to
now, pageview agents were classified as either 'spider' (self-reported
bots such as Googlebot or Bingbot) or 'user'.
We have known for a while that some requests classified as 'user' were, in
fact, coming from automated agents not disclosed as such. This has been
well known in our community: for a couple of years now, editors have been
applying filtering rules to every "Top X" list they compile [1]. We have
incorporated some of these filters (and others) into our automated traffic
detection and, as of this week, traffic that meets the filtering criteria
is automatically excluded from the "top" lists reported by the pageview
API.
Removing pageviews marked as 'automated' from overall user traffic reduces
pageviews labeled as 'user' by about 5.6% [2] over the course of a month.
Not all projects are affected equally. The biggest effect is on the
English Wikipedia (8-10%), while projects like the Japanese Wikipedia are
only mildly affected (< 1%).
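These percentages can be reproduced from the aggregate pageview endpoint,
which now accepts 'automated' as an agent type; here is a sketch (the date
range is an example).

    # Hedged sketch: share of formerly-"user" traffic now reclassified as
    # 'automated', via the aggregate pageview API. Date range is an example.
    import requests

    BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"
    HEADERS = {"User-Agent": "automated-share-check"}

    def monthly_views(project, agent):
        url = f"{BASE}/{project}/all-access/{agent}/monthly/2020040100/2020043000"
        resp = requests.get(url, headers=HEADERS)
        resp.raise_for_status()
        return resp.json()["items"][0]["views"]

    for project in ("en.wikipedia.org", "ja.wikipedia.org"):
        user = monthly_views(project, "user")
        automated = monthly_views(project, "automated")
        print(project, f"{automated / (user + automated):.1%}")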
If you are curious as to what problems this type of traffic causes in the
data, this ticket about the Hungarian Wikipedia is a good example of the
issues inflicted by what we call "bot vandalism" or "bot spam":
https://phabricator.wikimedia.org/T237282
Given the delicate nature of this data, we have spent many months vetting
the algorithms we are using. We would appreciate reports, via Phabricator
ticket, of any issues you might find.
Thanks,
Nuria
[1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetectio…
Hello everybody,
I am pretty sure that your first thought on reading this email's subject
was "whatttttttttt!!!!???? Are you crazy, Luca?". There is a valid reason,
so please keep reading :D
In T243934 we worked a lot to reduce the complexity of our client nodes,
and as a result all the stat100x hosts now have the same configuration
(see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients), which
includes JupyterHub. SSH access to the nodes was also simplified a lot;
see https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups.
The notebook100x hosts have the downside of offering little disk space for
home directories, and it would be great to reduce the number of hosts
handled by Analytics. This is why I am announcing that at the beginning of
June I will start decommissioning the notebook100x nodes. I have added
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients#Rsync_between…
as a reference for a quick and easy way to transfer data between notebook
and stat boxes, but please feel free to reach out to us at
https://phabricator.wikimedia.org/T249752 with any issues or doubts.
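If it helps while migrating, the transfer boils down to an rsync from your
notebook home to a stat box. A hypothetical sketch follows (hostname and
paths are examples only; the wikitech page above describes the supported
procedure):

    # Hypothetical sketch of a notebook -> stat transfer. Hostname and
    # paths are examples; follow the wikitech page for the supported steps.
    import subprocess

    src = "/home/myuser/notebooks/"                        # example source
    dest = "stat1007.eqiad.wmnet:/home/myuser/notebooks/"  # example stat box

    subprocess.run(["rsync", "-av", "--progress", src, dest], check=True)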
Thanks in advance!
Luca (on behalf of the Analytics team)