Hello everyone - apologies for cross-posting! *TL;DR*: We would like your
feedback on our Metrics Kit project. Please have a look and comment on
Meta-Wiki:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The Wikimedia Foundation's Trust and Safety team, in collaboration with the
Community Health Initiative, is working on a Metrics Kit designed to
measure the relative "health"[1] of various communities that make up the
Wikimedia movement:
https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
The ultimate outcome will be a public suite of statistics and data looking
at various aspects of Wikimedia project communities. It could be used both
by community members, to make decisions about their community's direction,
and by Wikimedia Foundation staff, to point anti-harassment tool development
in the right direction.
We have a set of metrics we are thinking about including in the kit,
ranging from the ratio of active users to active administrators, to
administrator confidence levels, to off-wiki factors such as freedom to
participate. It's ambitious, and our methods of collecting such data will
vary.
Right now, we'd like to know:
* Which metrics make sense to collect? Which don't? What are we missing?
* Where would such a tool ideally be hosted? Where would you normally look
for statistics like these?
* We are aware of the overlap in scope between this and Wikistats <
https://stats.wikimedia.org/v2/#/all-projects> — how might these tools
coexist?
Your opinions will help to guide this project going forward. We'll be
reaching out at different stages, so if you're interested in direct
messages, please feel free to indicate your interest by signing up on the
consultation page.
Looking forward to reading your thoughts.
best,
Joe
P.S.: Please feel free to CC me in conversations that might happen on this
list!
[1] What do we mean by "health"? There is no standard definition of what
makes a Wikimedia community "healthy", but there are many indicators that
highlight where a wiki is doing well and where it could improve. This
project aims to provide a variety of useful data points to inform community
decisions that would benefit from objective data.
--
*Joe Sutherland* (he/him or they/them)
Trust and Safety Specialist
Wikimedia Foundation
joesutherland.rocks
Hi everybody,
as part of https://phabricator.wikimedia.org/T201165 the Analytics team
wanted to reach out to everybody to make it clear that the home directories
on the stat/notebook nodes are not backed up. They run on a software RAID
configuration spanning multiple disks, of course, so we are resilient to a
single disk failure, but, even if unlikely, it might happen that a host
loses all its data. Please keep this in mind when working on important
projects and/or handling important data that you care about.
I just added a warning to
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
If you have really important data that is too big to back up elsewhere,
keep in mind that you can use your home directory (/user/your-username) on
HDFS (which replicates data three times across multiple nodes).
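For example, a minimal sketch of copying a file into your HDFS home
directory from a stat/notebook host (the file name and the "backups"
subdirectory are hypothetical, and we assume the hdfs CLI is on your PATH):

  import getpass
  import subprocess

  user = getpass.getuser()
  local_path = "/home/{}/important_dataset.tsv".format(user)  # hypothetical file
  hdfs_dir = "/user/{}/backups".format(user)                  # hypothetical subdir

  # Create the target directory on HDFS (a no-op if it already exists)...
  subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)

  # ...then copy the file; HDFS replicates its blocks across multiple nodes.
  subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)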
Please let us know if you have comments/suggestions/etc. in the
aforementioned task.
Thanks in advance!
Luca (on behalf of the Analytics team)
Hi WMF Analytics,
In my web searches over the past few months I have been seeing an
increasing number of websites that have republished Wikimedia content,
sometimes in ways that I suspect violate trademark and/or Creative Commons
licensing rules. (My guess is that these sites make money through
advertising that they place on their sites.) Has WMF observed any negative
effects on web traffic that can be attributed to other websites reusing
Wikimedia content and/or trademarks?
It might be interesting if WMF could obtain statistics from web search
providers regarding how many times users click on search engine links to
sites that reuse Wikimedia content and/or trademarks.
Pine
( https://meta.wikimedia.org/wiki/User:Pine )
Hi all!
The Analytics team is planning to upgrade OpenJDK in our Hadoop cluster (
https://phabricator.wikimedia.org/T229003) tomorrow Tuesday 30th of July at
10am CEST.
Hive and Oozie will be unavailable for 10 to 15 minutes, and any ongoing
Oozie jobs or Hive (beeline) queries will be interrupted (we'll let the
outstanding ones finish, if possible).
If this would break an important job that you have running, please let us
know in the Phabricator task above or via IRC (#wikimedia-analytics).
Cheers!
Marcel (on behalf of the Analytics team)
--
*Marcel Ruiz Forns* (he/him)
Analytics Developer @ Wikimedia Foundation
Hi all,
The next Research Showcase will be live-streamed next Wednesday, July 17,
at 11:30 AM PDT/18:30 UTC.
YouTube stream: https://www.youtube.com/watch?v=i9vvwV5KfW4
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Characterizing Incivility on Wikipedia
Elizabeth Whittaker, University of Michigan School of Information
In a society whose citizens have a variety of viewpoints, there is a
question of how citizens can govern themselves in ways that allow these
viewpoints to co-exist. Online deliberation has been posited as a
problem-solving mechanism in this context, and civility can be thought of as a
mechanism that facilitates this deliberation. Civility can thus be thought
of as a method of interaction that encourages collaboration, while
incivility disrupts collaboration. However, it is important to note that
the nature of online civility is shaped by its history and the technical
architecture scaffolding it. Civility as a concept has been used both to
promote equal deliberation and to exclude the marginalized from
deliberation, so we should be careful to ensure that our conceptualizations
of incivility reflect what we intend them to in order to avoid
unintentionally reinforcing inequality.
To this end, we examined Wikipedia editors’ perceptions of interactions
that disrupt collaboration through 15 semi-structured interviews. Wikipedia
is a highly deliberative platform, as editors need to reach consensus about
what will appear on the article page, a process that often involves
deliberation to coordinate, and any disruption to this process should be
apparent. We found that incivility on Wikipedia typically occurs in one of
three ways: through weaponization of Wikipedia’s policies, weaponization of
Wikipedia’s technical features, and through more typical vitriolic content.
These methods of incivility were gendered, and had the practical effect of
discouraging women from editing. We implicate this pattern as one of the
underlying causes of Wikipedia’s gender gap.
Hidden Gems in the Wikipedia Discussions: The Wikipedians’ Rationales
Lu Xiao, Syracuse University School of Information Studies
I will present a series of completed and ongoing studies that are aimed at
understanding the role of the Wikipedians’ rationales in Wikipedia
discussions. We define a rationale as one’s justification of her viewpoint
and suggestions. Our studies demonstrate the potential of leveraging the
Wikipedians’ rationales in discussions as resources for future
decision-making and as resources for eliciting knowledge about the
community’s norms, practices and policies. Viewed as rich digital traces in
these environments, we consider them to be beneficial for the community
members, such as helping newcomers familiarize themselves with the commonly
accepted justificatory reasoning styles. We call for more research
attention to the discussion content from this rationale study perspective.
--
Janna Layton (she, her)
Administrative Assistant - Audiences & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>
To whom it may concern,
I am writing regarding the project *Cultural Diversity Observatory* and
the data we are collecting. In short, this project aims at bridging the
content gaps between language editions that relate to cultural and
geographical aspects. For this we need to retrieve data from all language
editions and Wikidata, and run some scripts in order to crawl down the
category and the link graph, in order to create some datasets and
statistics.
The reason I am writing is that we are stuck: we cannot automate the
scripts that retrieve data from the Replicas. We could create the datasets
a few months ago, but during the past months it has become impossible.
We are concerned because creating the dataset once for research purposes is
one thing, and creating it on a monthly basis is another. The latter is what
we promised in the project grant
<https://meta.wikimedia.org/wiki/Grants:Project/WCDO/Culture_Gap_Monthly_Mon…>
details, and now we cannot do it because of the infrastructure. It is
important to do it on a monthly basis because the data visualizations and
statistics that Wikipedia communities will receive need to be updated.
Lately there have been some changes in the Replica databases, and queries
that used to take several hours now get stuck completely. We tried to code
them in multiple ways: a) using complex queries, b) doing the joins in code
logic and in-memory, c) downloading the parts of the tables that we require
and storing them in a local database. *None of these is an option now*
considering the current performance of the replicas.
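As an illustration of approach (c), the retrieval pattern we attempted
looks roughly like the sketch below (the replica host name follows the
standard Cloud VPS convention as we understand it, and store_locally is a
hypothetical helper that writes to our local database):

  import os
  import pymysql

  conn = pymysql.connect(
      host="enwiki.analytics.db.svc.eqiad.wmflabs",  # assumed replica host
      db="enwiki_p",
      read_default_file=os.path.expanduser("~/replica.my.cnf"),
  )

  chunk, last_id = 50000, 0
  with conn.cursor() as cur:
      while True:
          # Keyset pagination on the primary key avoids expensive OFFSET scans.
          cur.execute(
              "SELECT page_id, page_title FROM page"
              " WHERE page_id > %s ORDER BY page_id LIMIT %s",
              (last_id, chunk),
          )
          rows = cur.fetchall()
          if not rows:
              break
          store_locally(rows)  # hypothetical: append to a local database
          last_id = rows[-1][0]

Even a chunked scan like this now gets stuck against the replicas.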
Bryan Davis suggested that this might be a moment to consult the Analytics
team, considering that the Hadoop environment is designed to run long,
complex queries and has massively more compute power than the Wiki Replicas
cluster. We would certainly be relieved if you considered letting us connect
to these Analytics databases (Hadoop).
Let us know if you need more information on the specific queries or the
processes we are running. The server we are using is wcdo.eqiad.wmflabs. We
will be happy to explain in detail anything you require.
Thanks.
Best regards,
Marc Miquel
PS: You can read about the method we follow to retrieve data and create the
dataset here:
*Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity
Dataset: A Complete Cartography for 300 Language Editions. Proceedings of
the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM.
2334-0770 *
wvvw.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
Hi everybody,
due to https://phabricator.wikimedia.org/T227941 we need to take down
Oozie/Hive/etc. on an-coord1001. The maintenance should not last long, but
if you have any issues please reach out to us on IRC (#wikimedia-analytics
on Freenode).
Thanks!
Luca (on behalf of the Analytics team)
TL;DR: In https://phabricator.wikimedia.org/T170826 the Analytics team
wants to add base firewall rules to the stat100x and notebook100x hosts,
which will cause any traffic that is not from localhost or explicitly
allowed to be blocked by default. Please let us know in the task if this is
a problem for you.
Hi everybody,
the Analytics team has always left the stat100x and notebook100x hosts
without a set of base firewall rules, to avoid impacting any
research/test/etc. activity on those hosts. This choice has a lot of
downsides; one of the most problematic is that environments like Python
venvs can install potentially any package, and if the owner does not pay
attention to security upgrades, we may end up with a security problem if
the environment happens to bind to a network port and accept traffic from
anywhere.
One of the biggest problems was Spark: when somebody launches a shell using
Hadoop Yarn (--master yarn), a Driver component is created that needs to
bind to a random port to be able to communicate with the workers created on
the Hadoop cluster. We assumed that instructing Spark to use a predefined
range of ports was not possible, but in
https://phabricator.wikimedia.org/T170826 we discovered that there is a way
(which seems to work fine in our tests). The other big use case that we
know of, Jupyter notebooks, seems to require only unrestricted localhost
traffic.
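For the curious, constraining Spark to a known port range looks roughly
like the sketch below; the port numbers are purely illustrative, not the
values we picked. Spark retries upwards from the configured port, so
spark.port.maxRetries effectively sets the width of the range:

  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .master("yarn")
      .appName("port-range-example")
      # The driver tries 13000, 13001, ... up to maxRetries times, so the
      # firewall only needs to allow 13000-13031 (illustrative values).
      .config("spark.driver.port", 13000)
      .config("spark.driver.blockManager.port", 13100)
      .config("spark.port.maxRetries", 31)
      .getOrCreate()
  )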
Please let us know in the task if you have a use case that requires your
environment to bind to a network port on stat100x or notebook100x and
accept traffic from other hosts; for example, a Python app that binds to
port 33000 on stat1007 and listens for and accepts traffic from other stat
or notebook hosts.
If we don't hear anything, we'll start adding base firewall rules to one
host at a time during the upcoming weeks, tracking our work in the
aforementioned task.
Thanks!
Luca (on behalf of the Analytics team)
Forwarding a quick question from Peter so we can answer it publicly or take
advantage of work others have done:
[Can we] estimate how many visitors visit pages with equations (i.e.,
wikitext math tags)?
When we're talking about "how many visitors" we're talking about our Unique
Devices data
<https://wikitech.wikimedia.org/wiki/Analytics/AQS/Unique_Devices>. This
is an estimate, and the way it's computed restricts us to only knowing
high-level numbers at the project or project-family level (like
de.wikipedia or "all wikipedias"). So if you need an estimate of the number
of visitors to a specific subset of pages, we don't collect that data.
If what's needed is "how many visits", then we have Pageview data
<https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews>. This is
broken down per page, and available as a bulk download, through an API,
etc. So if you can compile a list of pages that have some feature (like
equations for example), then it's possible to cross-reference that with
pageview data and get the answer. To compile this list in the specific
case of equations, you may be able to use the templatelinks
<https://www.mediawiki.org/wiki/Manual:Templatelinks_table> table of the
project you're interested in. These are mirrored to the cloud db replicas
<https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database>. So if
equations generally use templates, then you can search for pages with links
to those templates in that table, and that would be the list of pages
you're interested in. With those page titles / page ids you can then query
the pageview data.
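As a concrete sketch of that last step, here's roughly how you could sum a
month of views for a list of page titles using the public per-article
Pageviews API (the titles and User-Agent below are placeholders; in
practice the list would come from your templatelinks query):

  import requests

  API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
         "{project}/all-access/all-agents/{title}/daily/{start}/{end}")

  def monthly_views(project, title, start="20190701", end="20190731"):
      url = API.format(project=project, title=title, start=start, end=end)
      r = requests.get(url, headers={"User-Agent": "math-pages-example"})
      r.raise_for_status()
      return sum(item["views"] for item in r.json()["items"])

  # Placeholder titles; the real list would come from templatelinks.
  pages = ["Quadratic_equation", "Pythagorean_theorem"]
  print(sum(monthly_views("en.wikipedia", t) for t in pages))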
Hope this helps explain our data a bit more, but feel free to follow up
with questions.