Wiki-research-l September 2016

wiki-research-l@lists.wikimedia.org

27 participants
22 discussions

Wikipedia Research policy
by song＠cs.umn.edu 14 Jul '23

14 Jul '23

Pursuant to prior discussions about the need for a research policy on Wikipedia, WikiProject Research is drafting a policy regarding the recruitment of Wikipedia users to participate in studies. At this time, we have a proposed policy, and an accompanying group that would facilitate recruitment of subjects in much the same way that the Bot Approvals Group approves bots.

The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research

The Subject Recruitment Approvals Group mentioned in the proposal is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group

Before we move forward with seeking approval from the Wikipedia community, we would like additional input about the proposal, and would welcome additional help improving it.

Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research

--
Bryan Song
GroupLens Research
University of Minnesota

8 10

[Analytics] Beeline as Hive client
by Madhumitha Viswanathan 03 Oct '18

03 Oct '18

Hi all,

For all Hive users using stat1002/1004: you might have seen a deprecation warning when you launch the hive client, which says it is being replaced with Beeline. The Beeline shell has always been available, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper script <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> set up to make this easier. The old Hive CLI will continue to exist, but we encourage moving over to Beeline.

You can use it by logging into the stat1002/1004 boxes as usual and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline

If you run into any issues using this interface, please ping us on the Analytics list or #wikimedia-analytics, or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>.

(If you are wondering stat1004 whaaat - there should be an announcement coming up about it soon!)

Best,
--Madhu :)

2 2

Wikipedia aggregate clickstream data released
by Dario Taraborelli 17 Jan '18

17 Jan '18

We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:

http://dx.doi.org/10.6084/m9.figshare.1305770

This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia

We created a page on Meta for feedback and discussion about this release:
https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario
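As a rough illustration of the first and last uses listed in the announcement, here is a minimal Python sketch that turns (referer, article, count) rows into per-referer transition probabilities and a ranked list of outgoing links. The function names and the sample rows are invented for this example; the actual file layout and column names of the released dataset are not described in this message, so loading the data is left out.

```python
from collections import defaultdict


def transition_probabilities(rows):
    """Normalize (referer, article, count) rows into P(article | referer)."""
    totals = defaultdict(int)
    counts = defaultdict(dict)
    for referer, article, count in rows:
        counts[referer][article] = counts[referer].get(article, 0) + count
        totals[referer] += count
    return {
        referer: {article: n / totals[referer] for article, n in targets.items()}
        for referer, targets in counts.items()
    }


def top_links(rows, referer, k=3):
    """Return the k most followed links out of `referer`, highest count first."""
    out = defaultdict(int)
    for src, article, count in rows:
        if src == referer:
            out[article] += count
    # Sort by descending count, then by title so ties are deterministic.
    return sorted(out.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy rows, invented for illustration only (not taken from the dataset).
SAMPLE = [
    ("London", "England", 30),
    ("London", "River_Thames", 10),
    ("England", "London", 20),
]

if __name__ == "__main__":
    print(transition_probabilities(SAMPLE))
    print(top_links(SAMPLE, "London"))
```

Note that these probabilities are normalized over observed clicks only. Estimating how much of an article's total traffic followed a link would additionally need the article's total request count, which this sketch does not model.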

5 5

I intend to survey my students about Wikipedia assignments - can you suggest any refinements?
by Piotr Konieczny 29 Sep '16

29 Sep '16

Dear colleagues,

My new research project, inspired by the following CfP (http://www.asanet.org/journals/TS/SpecialIssueCall.cfm), aims to judge how effective our teaching assignments on Wikipedia have been, in the context of my globalization lectures, in which students have created or expanded dozens of Wikipedia articles (you can see a partial list of articles created by my students at https://en.wikipedia.org/wiki/User:Piotrus/Educational_project_results to get an idea of what I had them do over the past few years). It is clear that Wikipedia benefits, but what about the students?

Here are my two questions to you.

First, my main source of data is going to be a survey of my former students (N<100). I wonder if anyone is familiar with literature on relevant metrics (i.e. how to design a survey to measure the effectiveness of a teaching instrument)? I have never surveyed students before, and while I am in the middle of a lit review, any suggestions would be appreciated. I am somewhat familiar with the literature on teaching with Wikipedia, but sadly few works have published the surveys they used. If anything comes to mind that you think would be good to use for comparative studies, that would also be helpful.

Second, here is my draft survey: http://tinyurl.com/hehckvs

I'd appreciate any comments: is it too long? Are some questions ambiguous? Unnecessary? Leading and creating bias in subsequent questions? Should I rephrase something? Should I ask something else?

Thank you for any comments, and do not hesitate to be critical - I'd much rather redo the survey now than after I send it out :)

--
Piotr Konieczny, PhD
http://hanyang.academia.edu/PiotrKonieczny
http://scholar.google.com/citations?user=gdV8_AEAAAAJ
http://en.wikipedia.org/wiki/User:Piotrus

2 1

Intro to "Committee" and "Diversity" section approved, "Conflict of interest" needs more work
by Matthew Flaschen 28 Sep '16

28 Sep '16

The community approved the introduction to the Committee section (https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Page:_Code_of_Conduct.…) (the part after "Committee" and before the "Diversity" section), as well as the "Diversity" section (https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Diversity). There was not consensus to approve the "Conflict of interest" (https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Conflict_of_interest) section. Work will continue on this section. See the top of https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Finalize_.22Confl… (including the sections linked from those bullet points). Thanks, Matt Flaschen

1 1

Google open source research on automatic image captioning
by Pine W 28 Sep '16

28 Sep '16

Perhaps of interest: "...We’re making the latest version of our image captioning system available as an open source model in TensorFlow." https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open… Pine

3 3

Re: [Wiki-research-l] Wiki-research-l Digest, Vol 133, Issue 18
by Kevin G Crowston 28 Sep '16

28 Sep '16

We are doing research on a similar situation combining machine vision processing with volunteer annotations in a citizen science project. It would be interesting to see how much translates across these settings, e.g., if our ideas about using the machine annotations are applicable here as well.

Kevin Crowston | Associate Dean for Research and Distinguished Professor of Information Science | School of Information Studies
Syracuse University
348 Hinds Hall
Syracuse, New York 13244
t (315) 443.1676 f 315.443.5806 e crowston(a)syr.edu
http://crowston.syr.edu/

From: Jan Dittrich <jan.dittrich(a)wikimedia.de>
Subject: Re: [Wiki-research-l] Google open source research on automatic image captioning

I find it interesting what impact this could have on the sense of achievement for volunteers, if captions are autogenerated or suggested and then possibly affirmed or corrected. On one hand one could assume a decreased sense of ownership; on the other hand, it might be easier to comment/correct than to write from scratch, and feel much more efficient.

Jan

2016-09-27 23:08 GMT+02:00 Dario Taraborelli <dtaraborelli(a)wikimedia.org>:
> I forwarded this separately internally at WMF a few days ago. Clearly –
> before thinking of building workflows for human contributors to generate
> captions or rich descriptors of media files in Commons – we should look at
> what's available in terms of off-the-shelf machine learning services and
> libraries.
>
> #1 rule of sane citizen science/crowdsourcing projects: don't ask humans
> to perform tedious tasks machines are pretty good at, get humans to curate
> inputs and outputs of machines instead.
>
> D

1 0

Fwd: [Analytics] Identifying Wikipedia stubs in various languages
by Giuseppe Profiti 23 Sep '16

23 Sep '16

[forwarding my answer from analytics ml, I forgot to subscribe to this list too]

Hi Robert,

one solution may be to use a query on Wikidata to retrieve the name of the stubs category in all the different languages. Then you could use a tool like PetScan to retrieve all the pages in such categories, or write your own tool by using either a query on the database or the MediaWiki API.

You can find a sample solution here:
http://paws-public.wmflabs.org/paws-public/3270/Stub%20categories.ipynb

I wrote that thing while on a train, so it may be messy and/or sub-optimal. I would like to thank Alex Monk and Yuvi Panda for their help with SQL on paws today.

Best,
Giuseppe

2016-09-20 11:26 GMT+02:00 Robert West <west(a)cs.stanford.edu>:
> Hi everyone,
>
> Does anyone know if there's a straightforward (ideally language-independent)
> way of identifying stub articles in Wikipedia?
>
> Whatever works is ok, whether it's publicly available data or data
> accessible only on the WMF cluster.
>
> I've found lists for various languages (e.g., Italian or English), but the
> lists are in different formats, so separate code is required for each
> language, which doesn't scale.
>
> I guess in the worst case, I'll have to grep for the respective stub
> templates in the respective wikitext dumps, but even this requires to know
> for each language what the respective template is. So if anyone could point
> me to a list of stub templates in different languages, that would also be
> appreciated.
>
> Thanks!
> Bob
>
> --
> Up for a little language game? -- http://www.unfun.me
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
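The two steps Giuseppe describes (look up the stub category's per-language titles on Wikidata, then list each category's members through the MediaWiki API) could be sketched roughly as below. This is not the code from the linked notebook. It only builds request URLs and parses a response, using the `wbgetentities` and `list=categorymembers` API modules; the item ID `Q000`, the sample response, and the helper names are placeholders invented for illustration, and the real Wikidata item for the stub category would have to be looked up.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def sitelinks_url(item_id):
    """URL asking Wikidata for the sitelinks (per-wiki page titles) of an item."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "sitelinks",
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def category_titles(response, item_id):
    """Map each site ID (e.g. 'enwiki') to its category title from a parsed response."""
    sitelinks = response["entities"][item_id].get("sitelinks", {})
    return {site: link["title"] for site, link in sitelinks.items()}


def category_members_url(host, category_title, cmcontinue=None):
    """URL listing pages in a category on one wiki; pass cmcontinue to page through."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category_title,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    return "https://" + host + "/w/api.php?" + urlencode(params)


# Placeholder item ID and invented response, shaped like wbgetentities output.
ITEM_ID = "Q000"
SAMPLE_RESPONSE = {
    "entities": {
        ITEM_ID: {
            "sitelinks": {
                "enwiki": {"site": "enwiki", "title": "Category:Example stubs"},
                "itwiki": {"site": "itwiki", "title": "Categoria:Esempio"},
            }
        }
    }
}

if __name__ == "__main__":
    print(sitelinks_url(ITEM_ID))
    for site, title in category_titles(SAMPLE_RESPONSE, ITEM_ID).items():
        print(site, title)
```

Fetching the URLs (for example with `urllib.request.urlopen`) and mapping site IDs such as `enwiki` to hostnames are left out. Large categories need repeated requests that pass along the `cmcontinue` value returned by the API.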

2 1

Research Showcase, September 21, 2016
by Sarah R 21 Sep '16

21 Sep '16

Hi Everyone,

The next Research Showcase will be live-streamed this Wednesday, September 21, 2016 at 11:30 AM (PST) 18:30 (UTC).

YouTube stream: https://www.youtube.com/watch?v=fTDkVeqjw80

As usual, you can join the conversation on IRC at #wikimedia-research. And, you can watch our past research showcases here <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2016>.

This month's showcase includes:

Finding News Citations for Wikipedia
By Besnik Fetahu <http://www.l3s.de/~fetahu/> (Leibniz University of Hannover)

An important editing policy in Wikipedia is to provide citations for added statements in Wikipedia pages, where statements can be arbitrary pieces of text, ranging from a sentence to a paragraph. In many cases citations are either outdated or missing altogether. In this work we address the problem of finding and updating news citations for statements in entity pages. We propose a two-stage supervised approach for this problem. In the first step, we construct a classifier to find out whether statements need a news citation or other kinds of citations (web, book, journal, etc.). In the second step, we develop a news citation algorithm for Wikipedia statements, which recommends appropriate citations from a given news collection. Apart from IR techniques that use the statement to query the news collection, we also formalize three properties of an appropriate citation, namely: (i) the citation should entail the Wikipedia statement, (ii) the statement should be central to the citation, and (iii) the citation should be from an authoritative source. We perform an extensive evaluation of both steps, using 20 million articles from a real-world news collection. Our results are quite promising, and show that we can perform this task with high precision and at scale.

Designing and Building Online Discussion Systems
By Amy X. Zhang <http://people.csail.mit.edu/axz/> (MIT)

Today, conversations are everywhere on the Internet and come in many different forms. However, there are still many problems with discussion interfaces today. In my talk, I will first give an overview of some of the problems with discussion systems, including difficulty dealing with large scales, which exacerbates additional problems with navigating deep threads containing lots of back-and-forth and getting an overall summary of a discussion. Other problems include dealing with moderation and harassment in discussion systems and gaining control over filtering, customization, and means of access. Then I will focus on a few projects I am working on in this space now. The first is Wikum, a system I developed to allow users to collaboratively generate a wiki-like summary from threaded discussion. The second, which I have just begun, is exploring the design space of presentation and navigation of threaded discussion. I will next discuss Murmur, a mailing list hybrid system we have built to implement and test ideas around filtering, customization, and flexibility of access, as well as combating harassment. Finally, I'll wrap up with what I am working on at Google Research this summer: developing a taxonomy to describe online forum discussion and using this information to extract meaningful content useful for search, summarization of discussions, and characterization of communities.

Hope to see you there!

Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org

1 1

Re: [Wiki-research-l] [Analytics] Upcoming reboots of stat and Hadoop hosts due to Kernel upgrades
by Dan Andreescu 21 Sep '16

21 Sep '16

+ research

(btw, if we cc analytics and research does that reach everyone using these boxes? Like discovery, fundraising, etc? Basically, everyone who doesn't see this message, raise your hand :))

On Wed, Sep 21, 2016 at 10:44 AM, Luca Toscano <ltoscano(a)wikimedia.org> wrote:
> Hi everybody,
>
> the Analytics team is going to reboot all the stat hosts (stat1002,
> stat1003 and stat1004) and the Hadoop cluster nodes to install new kernels
> (security upgrade required). The work will start tomorrow morning (Sep
> 22nd) at around 9:00 AM CEST.
> This task might interfere with ongoing Hadoop jobs or processes running on
> the stat* hosts, so please let me know if there is any motivation to
> postpone the maintenance.
>
> Please also feel free to reach out to the analytics IRC channel or to me
> directly if you have more questions :)
>
> Thanks!
>
> Regards,
>
> Luca
>
> _______________________________________________
> Analytics mailing list
> Analytics(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics

2 1
