Hello everyone - apologies for cross-posting! *TL;DR*: We would like your
feedback on our Metrics Kit project. Please have a look and comment on
Meta-Wiki:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The Wikimedia Foundation's Trust and Safety team, in collaboration with the
Community Health Initiative, is working on a Metrics Kit designed to
measure the relative "health"[1] of various communities that make up the
Wikimedia movement:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The ultimate outcome will be a public suite of statistics and data looking
at various aspects of Wikimedia project communities. It could be used both
by community members, to inform decisions about their community's direction,
and by Wikimedia Foundation staff, to point anti-harassment tool development
in the right direction.
We have a set of metrics we are considering for the kit, ranging from the
ratio of active users to active administrators, to administrator confidence
levels, to off-wiki factors such as freedom to participate. It's ambitious,
and our methods of collecting such data will vary.
Right now, we'd like to know:
* Which metrics make sense to collect? Which don't? What are we missing?
* Where would such a tool ideally be hosted? Where would you normally look
for statistics like these?
* We are aware of the overlap in scope between this and Wikistats <
https://stats.wikimedia.org/v2/#/all-projects> — how might these tools
coexist?
Your opinions will help to guide this project going forward. We'll be
reaching out at different stages of this project, so if you're interested
in direct messaging going forward, please feel free to indicate your
interest by signing up on the consultation page.
Looking forward to reading your thoughts.
best,
Joe
P.S.: Please feel free to CC me in conversations that might happen on this
list!
[1] What do we mean by "health"? There is no standard definition of what
makes a Wikimedia community "healthy", but there are many indicators that
highlight where a wiki is doing well and where it could improve. This
project aims to provide a variety of useful data points to inform community
decisions that would benefit from objective data.
--
*Joe Sutherland* (he/him or they/them)
Trust and Safety Specialist
Wikimedia Foundation
joesutherland.rocks
Hi everybody,
the Analytics team has been working with the SRE Data Persistence team
over the last few months to replace dbstore1002 with three brand-new nodes,
dbstore100[3-5]. We are moving from a single MySQL instance (multi-source)
to a multi-instance environment.
For more info please check:
* T210478 and related subtasks.
* https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas
We are planning to decommission the dbstore1002 host (namely, stopping
mysql and shutting down the server) on Monday, March 4th (EU morning). We
have recently been following up with many users to help them migrate to the
new environment, so we are reasonably confident that this move will not
heavily impact anybody; but if we have overlooked your use case, please let
us know in https://phabricator.wikimedia.org/T215589. If we don't hear
anything before the March 4th deadline, we'll proceed with the host
decommission.
Luca (on behalf of the Analytics team)
Hi everybody,
today we reconfigured Hadoop YARN to store its application/job data (the
so-called rmstore) in HDFS instead of Zookeeper. We are going to remove a
lot of data from our Zookeeper cluster in eqiad (several thousand znodes),
hopefully increasing its reliability (it is currently shared with all the
Kafka clusters). This change should be completely transparent to you, but
please let us know in https://phabricator.wikimedia.org/T216952 if anything
looks weird/different over the next hours/days.
Thanks!
Luca
Hello!
I'm hoping to get advice on how we should approach the following challenge...
I am building a public website that will provide information that is automatically harvested from online news articles about the work of scientists. The goal is to make it easier to create and maintain scientific content on Wikipedia.
Here's some news about the project: https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer
And here is the prototype of the site: https://quicksilver.primer.ai
What I am working on now is a self-updating version of this site.
The goal is to provide daily refreshed information for scientists most likely to be missing from Wikipedia.
For now I am focusing on English-language news and English-language Wikipedia. Eventually this will expand to other languages.
The ~100 scientists shown on any given day are selected from ~100k scientists that the system is tracking for news updates.
So here's the challenge:
To choose the 100 scientists most in need of an update on Wikipedia, we need to query Wikipedia each day for the 100k scientists to see if they have an article yet, and if so to get its content (to check if we have new information).
I am getting throttled by the Wikipedia servers. 100k is a lot of queries.
What is the most polite, sanctioned method for programmatic access to Wikipedia for a daily job on this scale?
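For concreteness, here is the sort of batched querying I have in mind: up
to 50 titles per request through the MediaWiki Action API, a descriptive
User-Agent, and the maxlag convention for backing off. This is only a rough
Python sketch, and the User-Agent contact address is a placeholder:

    # Rough sketch: batched existence checks via the MediaWiki Action API.
    import time
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    HEADERS = {"User-Agent": "QuicksilverBot/0.1 (contact: you@example.org)"}  # placeholder

    def existing_articles(titles):
        """Yield the titles that already have an English Wikipedia article."""
        for i in range(0, len(titles), 50):          # API allows up to 50 titles per query
            batch = titles[i:i + 50]
            while True:
                resp = requests.get(API, params={
                    "action": "query",
                    "titles": "|".join(batch),
                    "format": "json",
                    "maxlag": 5,                     # ask the API to refuse us when replicas lag
                }, headers=HEADERS, timeout=30)
                data = resp.json()
                if data.get("error", {}).get("code") == "maxlag":
                    time.sleep(5)                    # back off, then retry this batch
                    continue
                break
            for page in data["query"]["pages"].values():
                if "missing" not in page:            # missing pages carry a "missing" key
                    yield page["title"]
            time.sleep(1)                            # stay gentle between batches

At 50 titles per request, the daily 100k sweep is about 2,000 requests
rather than 100,000, but I'd still like to know if this is the sanctioned
approach.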
Many thanks for help/advice!
John Bohannon
http://johnbohannon.org
Hi friends!
Spark 1.x is pretty old. We only keep it around because it is a standard
part of the Cloudera distribution we use in the analytics Hadoop cluster.
The Analytics Engineering team uses Spark 2 for all of our jobs, and you
should too!
Spark 2 has been available in our cluster for over a year now. If you
don't yet use it, see
https://wikitech.wikimedia.org/w/index.php?title=Analytics/Systems/Cluster/…
for more info on how to.
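As a quick illustration, for most jobs the main code change is just the
entry point: Spark 2 replaces SQLContext/HiveContext with a single
SparkSession. A minimal sketch (the app name and query are placeholders):

    # Spark 2 entry point, replacing Spark 1's SQLContext/HiveContext.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("my-migrated-job")   # placeholder job name
             .enableHiveSupport()          # keeps Hive table access, as HiveContext did
             .getOrCreate())

    # Spark 1 style:  HiveContext(sc).sql("...")
    # Spark 2 style:
    spark.sql("SHOW DATABASES").show()
    spark.stop()

On the cluster the Spark 2 launchers are prefixed (e.g. spark2-submit,
spark2-shell); see the wikitech page above for details.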
We'd like to remove Spark 1 during the week of February 11. Please migrate
any Spark 1 jobs to Spark 2 by then (if there are any left!). (If this
timeline doesn't work for you just let us know and we'll adjust.)
Thanks!
- Andrew Otto & Analytics Engineering
https://phabricator.wikimedia.org/T212134
Hello everyone,
The next Research Showcase, “The_Tower_of_Babel.jpg” and “A Warm Welcome,
Not a Cold Start,” will be live-streamed next Wednesday, February 20, 2019,
at 11:30 AM PST/19:30 UTC. The first presentation is about how images are
used across language editions, and the second is about new editors.
YouTube stream: https://www.youtube.com/watch?v=_jpJIFXwlEg
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
The_Tower_of_Babel.jpg: Diversity of Visual Encyclopedic Knowledge Across
Wikipedia Language Editions
By Shiqing He (presenting, University of Michigan), Brent Hecht
(presenting, Northwestern University), Allen Yilun Lin (Northwestern
University), Eytan Adar (University of Michigan), ICWSM'18.
Across all Wikipedia language editions, millions of images augment text in
critical ways. This visual encyclopedic knowledge is an important form of
wikiwork for editors, a critical part of reader experience, an emerging
resource for machine learning, and a lens into cultural differences.
However, Wikipedia research--and cross-language edition Wikipedia research
in particular--has thus far been limited to text. In this paper, we assess
the diversity of visual encyclopedic knowledge across 25 language editions
and compare our findings to those reported for textual content. Unlike
text, translation in images is largely unnecessary. Additionally, the
Wikimedia Foundation, through Wikimedia Commons, has taken steps to
simplify cross-language image sharing. While we may expect that these
factors would reduce image diversity, we find that cross-language image
diversity rivals, and often exceeds, that found in text. We find that
diversity varies between language pairs and content types, but that many
images are unique to different language editions. Our findings have
implications for readers (in what imagery they see), for editors (in
deciding what images to use), for researchers (who study cultural
variations), and for machine learning developers (who use Wikipedia for
training models).
A Warm Welcome, Not a Cold Start: Eliciting New Editors' Interests via
Questionnaires
By Ramtin Yazdanian (presenting, École Polytechnique Fédérale de Lausanne)
Every day, thousands of users sign up as new Wikipedia contributors. Once
joined, these users have to decide which articles to contribute to, which
users to reach out to and learn from or collaborate with, etc. Any such
task is a hard and potentially frustrating one given the sheer size of
Wikipedia. Supporting newcomers in their first steps by recommending
articles they would enjoy editing or editors they would enjoy collaborating
with is thus a promising route toward converting them into long-term
contributors. Standard recommender systems, however, rely on users'
histories of previous interactions with the platform. As such, these
systems cannot make high-quality recommendations to newcomers without any
previous interactions -- the so-called cold-start problem. Our aim is to
address the cold-start problem on Wikipedia by developing a method for
automatically building short questionnaires that, when completed by a newly
registered Wikipedia user, can be used for a variety of purposes, including
article recommendations that can help new editors get started. Our
questionnaires are constructed based on the text of Wikipedia articles as
well as the history of contributions by the already onboarded Wikipedia
editors. We have assessed the quality of our questionnaire-based
recommendations in an offline evaluation using historical data, as well as
an online evaluation with hundreds of real Wikipedia newcomers, concluding
that our method provides cohesive, human-readable questions that perform
well against several baselines. By addressing the cold-start problem, this
work can help with the sustainable growth and maintenance of Wikipedia's
diverse editor community.
--
Janna Layton (she, her)
Administrative Assistant - Audiences & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
Hi everybody,
as described here (https://phabricator.wikimedia.org/T215589#4946535), I
am proposing a maintenance window to allow the Data Persistence and
Analytics teams to move the staging database from dbstore1002 to
dbstore1005 (its new home) on Monday, February 18th, during the EU morning.
This means that the staging database on dbstore1002 will become read-only
(permanently), but an up-to-date copy with read/write capabilities will be
present on dbstore1005.
Some notes:
- dbstore1002 will not be shut down/decommissioned yet; you'll still be
able to query tables as you are used to.
- as described in
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#MariaDB_replicas,
several DNS CNAME/SRV records have been created to ease the use of the new
systems, so please check them out :)
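For example, a Python connection to one of the new instances might look
like the sketch below. The hostname and port here are illustrative only;
please use the actual CNAME/SRV records documented on the wikitech page
above.

    # Illustrative only: real per-section CNAMEs/ports are on the wikitech page.
    import pymysql

    conn = pymysql.connect(
        host="s1-analytics-replica.eqiad.wmnet",  # example CNAME for section s1
        port=3311,                                # example per-instance port
        read_default_file="~/.my.cnf",            # your usual credentials file
        database="enwiki",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM page")
        print(cur.fetchone())
    conn.close()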
Let me know if this is feasible or if you need more time for a specific use
case (let's coordinate in the task).
Thanks!
Luca (on behalf of the Analytics team)
Dear Ms. Layton,
Thank you for your efforts. We are a WikiResearch group working in Sfax,
Tunisia. Our main project is to enrich medical information on Wikidata. I
would like to ask if we can participate in the Research Showcase next month.
Yours Sincerely,
Houcemeddine Turki
Medical Student, Faculty of Medicine of Sfax, University of Sfax, Tunisia
Undergraduate Researcher, UR12SP36
GLAM and Education Coordinator, Wikimedia TN User Group
Member, Wiki Project Med
Member, Wikimedia and Library User Group
Founder, WikiLingua Maghreb
Founder, TunSci
____________________
+21629499418
Just came across
https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-ten…
In it, the author discusses some of what he calls the 'impedance mismatch'
between data engineers and production engineers. The links to Uber's
Michelangelo <https://eng.uber.com/michelangelo/> (which as far as I can
tell has not been open sourced) and the Hidden Technical Debt in Machine
Learning Systems paper
<https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning…>
are also very interesting!
At All hands I've been hearing more and more about using ML in production,
so these things seem very relevant to us. I'd love it if we had a working
group (or whatever) that focused on how to standardize how we train and
deploy ML for production use.
:)
Hi,
in light of the current switch from Wikistats 1 to Wikistats 2, I would
like to express a strong desire for some additional features in the
statistics. The rationale for this request is described below:
1. How many articles account for 90 / 95 / 99 percent of all page views
over a given period? Which articles are they? (live articles, excluding
WP:xx and Spezial:xxx etc.)
2. What share of page views goes to the top 5% / top 10% of our pages?
3. A list of pages with fewer than x views per year (x = e.g. 12, 25, 100)
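To illustrate, report #1 boils down to a simple cumulative-share
computation over per-article view counts. A rough Python sketch, with the
input format assumed:

    # Rough sketch of report #1: how many top articles account for a given
    # share (e.g. 90%) of all page views? Input: a list of per-article counts.
    def articles_for_share(view_counts, share=0.90):
        counts = sorted(view_counts, reverse=True)
        total = sum(counts)
        running = 0
        for n, c in enumerate(counts, start=1):
            running += c
            if running >= share * total:
                return n
        return len(counts)

    # e.g. articles_for_share([5000, 1200, 300, 40, 12, 3], 0.90) -> 2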
*Reasoning / Rationale*:
I am limiting myself to the German Wikipedia, assuming that the situation
is similar in other language Wikipedias.
The German Wikipedia now has more than 2.2 million articles. Keeping up the
quality of that many articles produces an enormous maintenance workload,
and we clearly have too few people to do that maintenance work.
This means that we have to focus our maintenance efforts on the articles
that really matter, and those are many more than the 1,000 we get from the
Top-Views stats. The reports suggested above would give us exactly the
information required for an informed discussion about which articles to
focus on. And report #3 would give us a means to identify articles we could
either delete or clearly label as 'out of maintenance'.
We do know that certain articles are 'en vogue' for a short period because
they relate to current news topics. Therefore we must be able to run the
above reports over a longer period (at least a year, maybe two) to identify
the articles which are really long-term favourites.
Best regards
Peter
my user page: Wikipeter-HH
<https://de.wikipedia.org/wiki/Benutzer:Wikipeter-HH>