Wiki-research-l July 2019

wiki-research-l@lists.wikimedia.org

23 participants
16 discussions

by song＠cs.umn.edu

Pursuant to prior discussions about the need for a research policy on Wikipedia, WikiProject Research is drafting a policy regarding the recruitment of Wikipedia users to participate in studies.

At this time, we have a proposed policy, and an accompanying group that would facilitate recruitment of subjects in much the same way that the Bot Approvals Group approves bots. The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research

The Subject Recruitment Approvals Group mentioned in the proposal is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group

Before we move forward with seeking approval from the Wikipedia community, we would like additional input about the proposal, and would welcome additional help improving it. Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research

--
Bryan Song
GroupLens Research
University of Minnesota

9 months, 2 weeks

soweego 1.0 release

by Marco Fossati

[Please disregard this message if you have already read it in the Wikidata mailing list, and apologies for the distraction]

Hi everyone,

--------------------------------
TL;DR: soweego version 1 is out!
https://soweego.readthedocs.io/
Like it? Star it!
--------------------------------

The soweego team is delighted to announce the release of *version 1* [1]! If you like it, why don't you click on the Star button?

*soweego* links Wikidata to large catalogs through machine learning. It partners with Mix'n'match [2], which mainly deals with small catalogs.

The soweego bot [3] is currently uploading *255 k confident* links to Wikidata: see it in action [4]! *126 k medium-confident* links are instead getting into Mix'n'match for curation: see the current catalogs [5-13].

The soweego team has also worked hard to address the following community requests:
1. sync Wikidata to external catalogs & check them to spot inconsistencies in Wikidata;
2. import new catalogs with reasonable effort.

Thinking of the best way to contribute? Try to *import a new catalog* [14].

Best,
Marco

[1] https://soweego.readthedocs.io/
[2] https://tools.wmflabs.org/mix-n-match/
[3] https://www.wikidata.org/wiki/User:Soweego_bot
[4] https://xtools.wmflabs.org/ec/wikidata.org/Soweego%20bot
[5] https://tools.wmflabs.org/mix-n-match/#/catalog/2694
[6] https://tools.wmflabs.org/mix-n-match/#/catalog/2695
[7] https://tools.wmflabs.org/mix-n-match/#/catalog/2696
[8] https://tools.wmflabs.org/mix-n-match/#/catalog/2709
[9] https://tools.wmflabs.org/mix-n-match/#/catalog/2710
[10] https://tools.wmflabs.org/mix-n-match/#/catalog/2711
[11] https://tools.wmflabs.org/mix-n-match/#/catalog/2478
[12] https://tools.wmflabs.org/mix-n-match/#/catalog/2712
[13] https://tools.wmflabs.org/mix-n-match/#/catalog/2713
[14] https://soweego.readthedocs.io/en/latest/new_catalog.html

4 years, 9 months

Upcoming Research Newsletter: New Papers Open For Review

by Mohammed Sadat Abdulai

Hi everyone,

We’re preparing for the July 2019 research newsletter and looking for contributors. Please take a look at https://etherpad.wikimedia.org/p/WRN201907 and add your name next to any paper you are interested in covering. In case you have time over this weekend, the writing deadline is on Monday, July 30 already. If you can't make this deadline but would like to cover a particular paper in the subsequent issue, leave a note next to the paper's entry below. As usual, short notes and one-paragraph reviews are most welcome.

Highlights from this month:

- Revealing the Role of User Moods in Struggling Search Tasks
- Building a Knowledge Graph for Recommending Experts
- Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability
- Anomaly Detection in the Dynamics of Web and Social Networks Using Associative Memory
- Openness, Inclusion And Self-Affirmation: Indigenous Knowledge In Open Knowledge Projects
- The Quality and Readability of English Wikipedia Anatomy Articles
- Different Topic, Different Traffic: How Search and Navigation Interplay on Wikipedia
- Uncovering the Semantics of Wikipedia Categories
- Adapting NMT to caption translation in Wikimedia Commons for low-resource languages
- Discovering Implicational Knowledge in Wikidata

Mohammed S. Abdulai and Tilman Bayer

[1] Research:Newsletter - Meta
[2] WikiResearch (@WikiResearch) on Twitter

4 years, 9 months

Second Call for Workshops and Tutorials, ECIR 2020 in Lisbon

by S. Nunes

Second Call for Workshops and Tutorials, ECIR 2020 in Lisbon
Deadlines: 1 September (Workshops) & 15 November (Tutorials)

====================
ECIR 2020 - 42nd European Conference on Information Retrieval
Call for Workshops and Tutorials
Lisbon, Portugal - April 14-17, 2020
http://www.ecir2020.org/
====================

Workshops

The purpose of workshops is to provide a platform for presenting novel ideas and research results in a focused and more interactive way. Workshops can be either half-day (3 hours plus breaks) or full-day (6 hours plus breaks). Workshops are encouraged to be as dynamic and interactive as possible and should lead to a concrete outcome, such as the publication of a summary paper and/or workshop proceedings.

The information required for a workshop proposal is on the conference website. Workshop proposals will be reviewed by the workshop committee. A summary paper of the workshop will be published in the conference proceedings.

Please find more information at http://www.ecir2020.org/call-for-workshops/

Workshops dates:
1 September 2019 – Workshop submission
1 October 2019 – Workshop notification
14 April 2020 – Workshops and Tutorials

Tutorials

Tutorials inform the community on recent advances in core IR research, related research, or on novel application areas related to IR. They may focus on specific problems or specific domains in which IR research may be applied. Tutorials can be either half-day (3 hours plus breaks) or full-day (6 hours plus breaks). Tutorials are encouraged to be as interactive as possible.

Please follow the tutorial proposal instructions on the conference website (link below). Tutorial proposals will be reviewed by the tutorial committee. A summary of the tutorial will be published in the conference proceedings.

Further information can be found at http://www.ecir2020.org/call-for-tutorials/

Tutorials dates:
15 November 2019 – Tutorial submission
15 December 2019 – Tutorial notification
14 April 2020 – Workshops and Tutorials

Hope to see you in Lisbon!

4 years, 9 months

[Wikimedia Research Showcase] July 17, 2019 at 11:30 AM PDT, 18:30 UTC

by Janna Layton

Hi all,

The next Research Showcase will be live-streamed next Wednesday, July 17, at 11:30 AM PDT/18:30 UTC.

YouTube stream: https://www.youtube.com/watch?v=i9vvwV5KfW4

As usual, you can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here: https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase

This month's presentations:

Characterizing Incivility on Wikipedia
Elizabeth Whittaker, University of Michigan School of Information

In a society whose citizens have a variety of viewpoints, there is a question of how citizens can govern themselves in ways that allow these viewpoints to co-exist. Online deliberation has been posited as a problem solving mechanism in this context, and civility can be thought of as a mechanism that facilitates this deliberation. Civility can thus be thought of as a method of interaction that encourages collaboration, while incivility disrupts collaboration. However, it is important to note that the nature of online civility is shaped by its history and the technical architecture scaffolding it. Civility as a concept has been used both to promote equal deliberation and to exclude the marginalized from deliberation, so we should be careful to ensure that our conceptualizations of incivility reflect what we intend them to in order to avoid unintentionally reinforcing inequality. To this end, we examined Wikipedia editors’ perceptions of interactions that disrupt collaboration through 15 semi-structured interviews. Wikipedia is a highly deliberative platform, as editors need to reach consensus about what will appear on the article page, a process that often involves deliberation to coordinate, and any disruption to this process should be apparent. We found that incivility on Wikipedia typically occurs in one of three ways: through weaponization of Wikipedia’s policies, weaponization of Wikipedia’s technical features, and through more typical vitriolic content. These methods of incivility were gendered, and had the practical effect of discouraging women from editing. We implicate this pattern as one of the underlying causes of Wikipedia’s gender gap.

Hidden Gems in the Wikipedia Discussions: The Wikipedians’ Rationales
Lu Xiao, Syracuse University School of Information Studies

I will present a series of completed and ongoing studies that are aimed at understanding the role of the Wikipedians’ rationales in Wikipedia discussions. We define a rationale as one’s justification of her viewpoint and suggestions. Our studies demonstrate the potential of leveraging the Wikipedians’ rationales in discussions as resources for future decision-making and as resources for eliciting knowledge about the community’s norms, practices and policies. Viewed as rich digital traces in these environments, we consider them to be beneficial for the community members, such as helping newcomers familiarize themselves with the commonly accepted justificatory reasoning styles. We call for more research attention to the discussion content from this rationale study perspective.

--
Janna Layton (she, her)
Administrative Assistant - Audiences & Technology
Wikimedia Foundation <https://wikimediafoundation.org/>

4 years, 9 months

Social Media Bots, Recommender Systems and Wiki Technology

by Adam Sobieski

Introduction

A recent research article of mine, Artificial Wisdom [1], pertains to search engines and recommender systems for: anecdotes, proverbs, quotations, lyrics, poetry, narratives (e.g. parables, allegories), and humor. I refer to these materials as wisdom materials. A variety of social media bot is described with an integrated recommender system for wisdom materials. Wiki technology is described as a reasonable option to consider for producing and maintaining dynamic corpora of wisdom materials to be utilized by such systems.

Social Media Bots with Integrated Recommender Systems

Wisdom materials could be provided to users daily, as with quote-of-the-day services, or on an as-needed basis, whenever a social media bot determines that a wisdom material could provide value to a user. Recommended wisdom materials could be sent to users via a number of possible channels (e.g. social media websites, instant messenger applications, e-mail).

To ensure that recommended wisdom materials provide value to users, recommender systems could utilize users’ social media posts and other content to obtain context data. Systems could process users’ posts and content, for instance to estimate users’ affect or complex mental states. Should a user indicate that they are sad, through a status update or by otherwise sharing that information in a post with their friends and a social media bot, the social media bot could send one or more wisdom materials to uplift the user.

Recommender systems for wisdom materials, operating at scale, could learn and improve over the course of time. Users could be provided with a number of options for providing feedback with respect to wisdom material recommendations. Systems could also process users’ social media posts and content before, during and after encounters with recommended wisdom materials. Did a recommended wisdom material provide value to, e.g. uplift, a user in a context?

Utilizing technology such as Solid, social media bots with integrated recommender systems can provide value to users while simultaneously protecting users’ data privacy.

Wiki Technology

Wiki technology is a reasonable option to consider for producing and maintaining dynamic corpora of wisdom materials. Such Wiki systems could be utilized by organizations stewarding social media bots, by larger numbers of users, or by the general public.

Some varieties of wisdom materials tend to make use of figurative language. Accordingly, the indexing of such materials shouldn’t necessarily utilize the literal text of the items. For instance, the proverb “a rolling stone gathers no moss” is not best indexed and searched for by its lexemes. One desires to search, somehow, for the contents of the interpretations of the items rather than for the literal text of the items. Contextual searching is also desirable.

Varieties of wisdom materials should be modeled; a wisdom material, for instance, could have multiple interpretations and each interpretation could have text content for indexing, keywords, categories, and so forth.

The modeling of wisdom materials, Wiki-based technology for producing and maintaining dynamic corpora of wisdom materials, and social media bots with integrated recommender systems for wisdom materials are indicated to be contemporary research topics.

Best regards,
Adam Sobieski

[1] http://www.phoster.com/artificial-wisdom/
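
The modeling idea in the message above (a material with several interpretations, each carrying its own indexable text, keywords, and categories) could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the class names, the `search` function, and the naive word-overlap scoring are all hypothetical and are not taken from the Artificial Wisdom article, and the example interpretation of the proverb is one illustrative reading.

```python
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """One reading of a wisdom material; this is what gets indexed."""
    text: str
    keywords: set[str] = field(default_factory=set)
    categories: set[str] = field(default_factory=set)


@dataclass
class WisdomMaterial:
    """A proverb, quotation, etc., with any number of interpretations."""
    literal_text: str
    kind: str  # e.g. "proverb", "quotation" (hypothetical labels)
    interpretations: list[Interpretation] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    """Lower-case the text and split it into punctuation-free words."""
    words = {w.strip(".,;:!?\"'").lower() for w in text.split()}
    return words - {""}


def search(corpus: list[WisdomMaterial], query: str) -> list[WisdomMaterial]:
    """Rank materials by word overlap between the query and their
    interpretations (text plus keywords). The literal text of the
    material is deliberately ignored."""
    query_terms = tokenize(query)
    scored = []
    for material in corpus:
        best = 0
        for interp in material.interpretations:
            terms = tokenize(interp.text) | {k.lower() for k in interp.keywords}
            best = max(best, len(query_terms & terms))
        if best > 0:
            scored.append((best, material))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [material for _, material in scored]


# Illustrative entry; the interpretation below is an assumption.
rolling_stone = WisdomMaterial(
    literal_text="A rolling stone gathers no moss",
    kind="proverb",
    interpretations=[
        Interpretation(
            text="Someone who never settles down avoids responsibilities and attachments.",
            keywords={"restlessness", "commitment"},
            categories={"life advice"},
        )
    ],
)
```

With this entry, a query such as "restlessness" finds the proverb through its interpretation, while a query on the literal words "stone moss" returns nothing, which is the behaviour the message argues for. A real system would presumably replace the word-overlap score with a proper text index or semantic search.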

4 years, 9 months

Re: [Wiki-research-l] [Analytics] Analytics clients (stat/notebook hosts) and backups of home directories

by Luca Toscano

Hi Leila and Kate,

adding a few words after Nuria's email to clarify my original intentions. My point was that any important and vital file that needs to be preserved may be stored in HDFS rather than on the stat/notebook hosts, because the home directories there are not backed up. My concern was that people had a different understanding about backups and I wanted to clarify.

We (as Analytics team) don't have any good way at the moment to periodically scan HDFS and home directories across hosts to find PII data that is retained longer than the allowed period of time. The main reason is that we'd need to find a way to check a huge number of files, with different names and formats, and figure out if the data contained in them is PII and retained more than X days. It is not an impossible task, but it is not easy or trivial either; we'd need a lot more staff, in my opinion, to create and maintain something like that :)

We recently started cleaning up old home directories (i.e. those belonging to users who are no longer active) and we established a process with SRE to get pinged when a user is offboarded, to verify what data should be kept and what not (I know that both of you are aware of this since you have been working with us on several tasks, I am writing it to allow other people to get the context :). This is only a starting point, I really hope to have something more robust and complete in the future. In the meantime, I'd say that every user is responsible for the data that he/she handles on the Analytics infrastructure, periodically reviewing it and deleting it when necessary. I don't have a specific guideline/process to suggest, but we can definitely have a chat together and decide something shared among our teams!

Let me know if this makes sense or not :)

Thanks,

Luca

On Wed, 10 Jul 2019 at 23:15, Nuria Ruiz <nuria(a)wikimedia.org> wrote:

> >I have one question for you: As you allow/encourage for more copies of the files to exist
>
> To be extra clear, we do not encourage keeping data on the notebook hosts at all; they have no capacity to either process or host large amounts of data. Data that you are working with is best placed in your /user/your-username directory in Hadoop, so far from encouraging multiple copies, we are rather encouraging you to keep the data outside the notebook machines.
>
> Thanks,
>
> Nuria
>
> On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman <kzimmerman(a)wikimedia.org> wrote:
>
>> I second Leila's question. The issue of how we flag PII data and ensure it's appropriately scrubbed came up in our team meeting yesterday. We're discussing team practices for data/project backups tomorrow and plan to come out with some proposals, at least for the short term.
>>
>> Are there any existing processes or guidelines I should be aware of?
>>
>> Thanks!
>> Kate
>>
>> --
>>
>> Kate Zimmerman (she/they)
>> Head of Product Analytics
>> Wikimedia Foundation
>>
>> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia <leila(a)wikimedia.org> wrote:
>>
>>> Hi Luca,
>>>
>>> Thanks for the heads up. Isaac is coordinating a response from the Research side.
>>>
>>> I have one question for you: As you allow/encourage for more copies of the files to exist, what is the mechanism you'd like to put in place for reducing the chances of PII to be copied in new folders that then will be even harder (for your team) to keep track of? Having an explicit process/understanding about this will be very helpful.
>>>
>>> Thanks,
>>> Leila
>>>
>>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano <ltoscano(a)wikimedia.org> wrote:
>>> >
>>> > Hi everybody,
>>> >
>>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team thought to reach out to everybody to make it clear that all the home directories on the stat/notebook nodes are not backed up periodically. They run on a software RAID configuration spanning multiple disks of course, so we are resilient to a disk failure, but even if unlikely it might happen that a host could lose all its data. Please keep this in mind when working on important projects and/or handling important data that you care about.
>>> >
>>> > I just added a warning to https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients. If you have really important data that is too big to back up, keep in mind that you can use your home directory (/user/your-username) on HDFS (that replicates data three times across multiple nodes).
>>> >
>>> > Please let us know if you have comments/suggestions/etc. in the aforementioned task.
>>> >
>>> > Thanks in advance!
>>> >
>>> > Luca (on behalf of the Analytics team)
>>> > _______________________________________________
>>> > Wiki-research-l mailing list
>>> > Wiki-research-l(a)lists.wikimedia.org
>>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>
>> _______________________________________________
>> Analytics mailing list
>> Analytics(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
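
Of the two checks discussed in this thread (is the data PII, and has it been retained more than X days), only the age check is simple to illustrate. The sketch below is a hypothetical helper, not an existing Analytics tool: `find_stale_files` and its parameters are made-up names, it works on a local directory tree such as a home directory rather than on HDFS, it uses the last-modification time as a stand-in for "retained since", and it makes no attempt to decide whether a file contains PII.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Yield (path, age_in_days) for every regular file under `root`
    whose last modification is older than `max_age_days`.

    `now` (seconds since the epoch) can be passed in for reproducible
    runs; it defaults to the current time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                yield path, (now - mtime) / 86400


if __name__ == "__main__":
    # Example: list files in the current user's home directory that
    # have not been modified for more than 90 days (90 is arbitrary).
    for path, age in find_stale_files(Path.home(), 90):
        print(f"{age:7.0f} days  {path}")
```

A listing like this could only support the periodic manual review that the reply asks each user to do; the hard part named in the thread, recognising PII across arbitrary file names and formats, is left untouched.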

4 years, 9 months

Urgent maintenance to an-coord1001 requires a brief stop of Oozie/Hive/Spark/etc..

by Luca Toscano

Hi everybody, due to https://phabricator.wikimedia.org/T227941 we'd need to take down Oozie/Hive/etc.. on an-coord1001. The maintenance should not last long, but if you have any issue please reach out to us on IRC (#wikimedia-analytics on Freenode). Thanks! Luca (on behalf of the Analytics team)

4 years, 9 months

[Wikimedia Research] link between Wikipedia references and citations in scientific journal articles

by john cummings

Hi all,

Some time ago I'm sure I remember reading something about a study that looked at the link between the number of times a reference was used on Wikipedia and the number of times the source was referenced in journal articles. Does anyone know what I'm talking about or have something similar?

Thanks very much,

John

4 years, 9 months

soweego: link Wikidata to large catalogs

by Marco Fossati

[You can safely skip this message if you have already seen it in the Wikidata mailing list, and apologies for the spam]

Dear all,

-----------------------------------------------------------------------
TL;DR: soweego version 1 will be released soon. In the meantime, why don't you consider endorsing the next steps?
https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1
-----------------------------------------------------------------------

This is a pre-release notification for early feedback.

Does the name *soweego* ring a bell? It is a machine learning-based pipeline that links Wikidata to large catalogs [1]. It is a close friend of Mix'n'match [2], which mainly caters for small catalogs.

The first version is almost done, and will start uploading results soon. Confident links are going to feed Wikidata via a bot [3], while others will get into Mix'n'match for curation.

The next short-term steps are detailed in a rapid grant proposal [4], and I would be really grateful if you could consider an endorsement there.

The soweego team has also tried its best to address the following community requests:
1. plan a sync mechanism between Wikidata and large catalogs / implement checks against external catalogs to find mismatches in Wikidata;
2. enable users to add links to new catalogs in a reasonable time.

So, here is the most valuable contribution you can give to the project right now: understand how to *import a new catalog* [5].

Can't wait for your reactions.

Cheers,
Marco

[1] https://soweego.readthedocs.io/
[2] https://tools.wmflabs.org/mix-n-match/
[3] see past contributions: https://www.wikidata.org/w/index.php?title=Special:Contributions/Soweego_bo…
[4] https://meta.wikimedia.org/wiki/Grants:Project/Rapid/Hjfocs/soweego_1.1
[5] https://soweego.readthedocs.io/en/latest/new_catalog.html

4 years, 9 months
