Hi everybody,
due to https://phabricator.wikimedia.org/T227941 we need to take down
Oozie/Hive/etc. on an-coord1001. The maintenance should not last long, but
if you have any issues, please reach out to us on IRC (#wikimedia-analytics
on Freenode).
Thanks!
Luca (on behalf of the Analytics team)
Hi all
Some time ago I remember reading about a study that looked at the link
between the number of times a reference was used on Wikipedia and the
number of times the source was cited in journal articles. Does anyone know
what I'm talking about, or of something similar?
Thanks very much
John
[You can safely skip this message if you have already seen it in the
Wikidata mailing list, and pardon for the spam]
Dear all,
-----------------------------------------------------------------------
TL;DR: soweego version 1 will be released soon. In the meantime, why not
consider endorsing the next steps?
https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1
-----------------------------------------------------------------------
This is a pre-release notification for early feedback.
Does the name *soweego* ring a bell?
It is a machine learning-based pipeline that links Wikidata to large
catalogs [1].
It is a close friend of Mix'n'match [2], which mainly caters for small
catalogs.
The first version is almost done, and it will start uploading results soon.
Confident links are going to feed Wikidata via a bot [3], while others
will get into Mix'n'match for curation.
The next short-term steps are detailed in a rapid grant proposal [4],
and I would be really grateful if you could consider an endorsement there.
The soweego team has also tried its best to address the following
community requests:
1. plan a sync mechanism between Wikidata and large catalogs / implement
checks against external catalogs to find mismatches in Wikidata;
2. enable users to add links to new catalogs in a reasonable time.
So, here is the most valuable contribution you can give to the project
right now: understand how to *import a new catalog* [5].
Can't wait for your reactions.
Cheers,
Marco
[1] https://soweego.readthedocs.io/
[2] https://tools.wmflabs.org/mix-n-match/
[3] see past contributions:
https://www.wikidata.org/w/index.php?title=Special:Contributions/Soweego_bo…
[4] https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1
[5] https://soweego.readthedocs.io/en/latest/new_catalog.html
Hi everybody,
as part of https://phabricator.wikimedia.org/T201165 the Analytics team
wants to make it clear that the home directories on the stat/notebook nodes
are not backed up periodically. They do run on a software RAID
configuration spanning multiple disks, so we are resilient to a single disk
failure, but, however unlikely, a host could still lose all its data.
Please keep this in mind when working on important projects and/or handling
important data that you care about.
I just added a warning to
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients.
If you have really important data that is too big to back up, keep in mind
that you can use your home directory (/user/your-username) on HDFS, which
replicates data three times across multiple nodes.
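As a sketch, copying a local directory into your HDFS home uses the
standard hdfs dfs commands (the directory names below are placeholders;
substitute your own username and data):

```shell
# Create a backup area under your HDFS home directory
# (replace "your-username" with your actual shell username).
hdfs dfs -mkdir -p /user/your-username/backups

# Recursively copy a local directory from the stat/notebook host into
# HDFS, where it is replicated three times across the cluster.
hdfs dfs -put ~/important-project /user/your-username/backups/

# Verify that the copy landed where expected.
hdfs dfs -ls /user/your-username/backups
```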
Please let us know if you have comments/suggestions/etc. in the
aforementioned task.
Thanks in advance!
Luca (on behalf of the Analytics team)
TL;DR: In https://phabricator.wikimedia.org/T170826 the Analytics team
wants to add base firewall rules to the stat100x and notebook100x hosts,
which will block by default any traffic that is not from localhost or
otherwise explicitly allowed. Please let us know in the task if this is a
problem for you.
Hi everybody,
the Analytics team has always left the stat100x and notebook100x hosts
without a set of base firewall rules, to avoid impacting any
research/testing activity on those hosts. This choice has several
downsides. One of the most problematic is that environments like Python
venvs can contain virtually any package, and if the owner does not keep up
with security upgrades, we may have a security problem whenever such an
environment binds to a network port and accepts traffic from anywhere.
One of the biggest problems was Spark: when somebody launches a shell using
Hadoop Yarn (--master yarn), a Driver component is created that needs to
bind to a random port to be able to communicate with the workers created on
the Hadoop cluster. We assumed that instructing Spark to use a predefined
range of ports was not possible, but in
https://phabricator.wikimedia.org/T170826 we discovered that there is a way
(which seems to work fine in our tests). The other big use case we know of,
Jupyter notebooks, seems to require only unrestricted localhost traffic.
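For reference, the approach relies on Spark's standard port-configuration
properties: pinning spark.driver.port (and spark.driver.blockManager.port)
to a base port, with spark.port.maxRetries defining how many consecutive
ports Spark may try. The port numbers below are purely illustrative, not
the ones we will actually deploy:

```shell
# Launch a Spark shell on YARN with the driver constrained to a known
# port range, so firewall rules only need to allow that range.
# With spark.port.maxRetries=100, Spark tries ports 12000-12100 for the
# driver and 13000-13100 for the block manager.
spark-shell --master yarn \
  --conf spark.driver.port=12000 \
  --conf spark.driver.blockManager.port=13000 \
  --conf spark.port.maxRetries=100
```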
Please let us know in the task if you have a use case that requires your
environment to bind to a network port on stat100x or notebook100x and
accept traffic from other hosts. For example, a Python app that binds to
port 33000 on stat1007 and accepts traffic from other stat or notebook
hosts.
If we don't hear anything, we'll start adding base firewall rules to one
host at a time during the upcoming weeks, tracking our work on the
aforementioned task.
Thanks!
Luca (on behalf of the Analytics team)
Given some of the recent under-performance noticed in toxicity-sniffing
tools, I thought I would ask what people here think
of http://chat.dbpedia.org
Have a look: https://i.imgur.com/jKqRRTw.png
Is anyone else working on an open source text chatbot based on Wikidata?
I can offer no promises about how much it will teach you about
Wikidata or Natural Language Processing, but a good starter task would
be e.g.
https://github.com/dbpedia/GSoC/issues/11
Hello,
For those of you who are interested in "small" Wikipedias and Indigenous
languages, here's a new academic paper co-signed by yours truly.
Published in an open access journal :)
Nathalie Casemajor (Seeris)
-
*Openness, Inclusion and Self-Affirmation: Indigenous knowledge in Open
Knowledge Projects
<http://peerproduction.net/editsuite/issues/issue-13-open/peer-reviewed-pape…>*
This paper is based on an action research project (Greenwood and Levin,
1998) conducted in 2016-2017 in partnership with the Atikamekw Nehirowisiw
Nation and Wikimedia Canada. Built into the educational curriculum of a
secondary school on the Manawan reserve, the project led to the launch of a
Wikipedia encyclopaedia in the Atikamekw Nehirowisiw language. We discuss
the results of the project by examining the challenges and opportunities
raised in the collaborative process of creating Wikimedia content in the
Atikamekw Nehirowisiw language. What are the conditions of inclusion of
Indigenous and traditional knowledge in open projects? What are the
cultural and political dimensions of empowerment in this relationship
between openness and inclusion? How do the processes of inclusion and
negotiation of openness affect Indigenous skills and worlding processes?
Drawing from media studies, indigenous studies and science and technology
studies, we adopt an ecological perspective (Star, 2010) to analyse the
complex relationships and interactions between knowledge practices,
ecosystems and infrastructures. The material presented in this paper is the
result of the group of participants’ collective reflection digested by one
Atikamekw Nehirowisiw and two settlers. Each co-writer then brings his/her
own expertise and speaks from what he or she knows and has been trained for.
Casemajor N., Gentelet K., Coocoo C. (2019), "Openness, Inclusion and
Self-Affirmation: Indigenous Knowledge in Open Knowledge Projects", *Journal
of Peer Production*, no. 13, pp. 1-20.
More info about the Atikamekw Wikipetcia project and the involvement
of Wikimedia Canada:
https://ca.wikimedia.org/…/Atikamekw_knowledge,_culture_and…
<https://ca.wikimedia.org/wiki/Atikamekw_knowledge,_culture_and_language_in_…>
Hello,
I’m with a group of researchers <https://grouplens.org/> working on using
Artificial Intelligence (AI) tools to promote gender diversity in Wikipedia
contents and thus to close the gender gap
<https://en.wikipedia.org/wiki/Gender_bias_on_Wikipedia>. We want to build
a recommender system that targets the gender gap in content, while creating
personalized article recommendations for editors. To ensure that our tool
addresses real community issues, we plan to design the recommender
algorithms by incorporating the feedback from stakeholders in the
community, such as members of the WikiProject Women in Red, related
WikiProjects, and others who are concerned with this issue. We want to
understand your concerns and values as we come up with effective
algorithmic designs.
For more details about our project, please refer to our Wikimedia project
meta page
<https://meta.wikimedia.org/wiki/Research:Closing_the_Gender_Content_Gap_in_…>
.
If you are interested or have any thoughts and suggestions, please feel
free to reach out to me at bowen-yu(a)umn.edu and we can plan a time to
connect. Thanks!
Hi all,
The next Research Showcase will be live-streamed this Wednesday, June 26,
at 11:30 AM PST/19:30 UTC. We will have three presentations this showcase,
all relating to Wikipedia blocks.
YouTube stream: https://www.youtube.com/watch?v=WiUfpmeJG7E
As usual, you can join the conversation on IRC at #wikimedia-research. You
can also watch our past research showcases here:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase
This month's presentations:
Trajectories of Blocked Community Members: Redemption, Recidivism and
Departure
By Jonathan Chang, Cornell University
Community norm violations can impair constructive communication and
collaboration online. As a defense mechanism, community moderators often
address such transgressions by temporarily blocking the perpetrator. Such
actions, however, come with the cost of potentially alienating community
members. Given this tradeoff, it is essential to understand to what extent,
and in which situations, this common moderation practice is effective in
reinforcing community rules. In this work, we introduce a computational
framework for studying the future behavior of blocked users on Wikipedia.
After their block expires, they can take several distinct paths: they can
reform and adhere to the rules, but they can also recidivate, or
straight-out abandon the community. We reveal that these trajectories are
tied to factors rooted both in the characteristics of the blocked
individual and in whether they perceived the block to be fair and
justified. Based on these insights, we formulate a series of prediction
tasks aiming to determine which of these paths a user is likely to take
after being blocked for their first offense, and demonstrate the
feasibility of these new tasks. Overall, this work builds towards a more
nuanced approach to moderation by highlighting the tradeoffs that are in
play.
Automatic Detection of Online Abuse in Wikipedia
By Lane Rasberry, University of Virginia
Researchers analyzed all English Wikipedia blocks prior to 2018 using
machine learning. With insights gained, the researchers examined all
English Wikipedia users who are not blocked against the identified
characteristics of blocked users. The results were a ranked set of
predictions of users who are not blocked, but who have a history of conduct
similar to that of blocked users. This research and process models a system
for the use of computing to aid human moderators in identifying conduct on
English Wikipedia which merits a block.
Project page:
https://meta.wikimedia.org/wiki/University_of_Virginia/Automatic_Detection_…
Video: https://www.youtube.com/watch?v=AIhdb4-hKBo
First Insights from Partial Blocks in Wikimedia Wikis
By Morten Warncke-Wang, Wikimedia Foundation
The Anti-Harassment Tools team at the Wikimedia Foundation released the
partial block feature in early 2019. Where previously blocks on Wikimedia
wikis were sitewide (users were blocked from editing an entire wiki),
partial blocks make it possible to block users from editing specific pages
and/or namespaces. The Italian Wikipedia was the first wiki to start using
this feature, and it has since been rolled out to other wikis as well. In
this presentation, we will look at how this feature has been used in the
first few months since release.
--
Janna Layton (she, her)
Administrative Assistant - Audiences & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>