Wiki-research-l July 2017

wiki-research-l@lists.wikimedia.org

26 participants
23 discussions

Wikipedia Research policy
by song＠cs.umn.edu 14 Jul '23

Pursuant to prior discussions about the need for a research policy on Wikipedia, WikiProject Research is drafting a policy regarding the recruitment of Wikipedia users to participate in studies. At this time, we have a proposed policy and an accompanying group that would facilitate recruitment of subjects in much the same way that the Bot Approvals Group approves bots.

The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research

The Subject Recruitment Approvals Group mentioned in the proposal is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group

Before we move forward with seeking approval from the Wikipedia community, we would like additional input about the proposal, and would welcome additional help improving it.

Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research

--
Bryan Song
GroupLens Research
University of Minnesota

8 10

[Analytics] Beeline as Hive client
by Madhumitha Viswanathan 03 Oct '18

Hi all,

If you use Hive on stat1002/1004, you might have seen a deprecation warning when you launch the hive client, saying that it is being replaced with Beeline. The Beeline shell has always been available, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper script <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> set up to make this easier.

The old Hive CLI will continue to exist, but we encourage moving over to Beeline. You can use it by logging into the stat1002/1004 boxes as usual and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.

If you run into any issues using this interface, please ping us on the Analytics list or #wikimedia-analytics, or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>.

(If you are wondering stat1004 whaaat - there should be an announcement coming up about it soon!)

Best,
--Madhu :)

2 2

Wikipedia aggregate clickstream data released
by Dario Taraborelli 17 Jan '18

We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770

This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia

We created a page on Meta for feedback and discussion about this release:
https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario
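As an illustration of the last use case, here is a minimal sketch of how a first-order Markov chain could be estimated from such counts. It assumes each row has already been reduced to a (referer, article, count) triple; the actual column layout of the released file is not described in this message, and the function name and toy rows below are invented for the example.

```python
from collections import defaultdict


def transition_probabilities(rows):
    """Estimate P(article | referer) from (referer, article, count) triples."""
    # Total number of observed clicks leaving each referer.
    totals = defaultdict(int)
    for referer, _article, count in rows:
        totals[referer] += count

    # Normalise each pair count by its referer's total.
    chain = defaultdict(dict)
    for referer, article, count in rows:
        previous = chain[referer].get(article, 0.0)
        chain[referer][article] = previous + count / totals[referer]
    return dict(chain)


# Toy rows invented for illustration; not taken from the released dataset.
rows = [
    ("London", "England", 30),
    ("London", "River_Thames", 10),
    ("England", "London", 5),
]
chain = transition_probabilities(rows)
```

Note that probabilities computed this way are conditional on a recorded click: requests that left the referer without following a link are not represented in the pair counts, so they would have to be modelled separately.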

5 5

stat1002 and stat1003 deprecated. Please use new stat boxes
by Andrew Otto 05 Sep '17

Hi all!

tl;dr: Stop using stat100[23] by September 1st.

We’re finally replacing stat1002 and stat1003. These boxes are out of warranty, and are running Ubuntu Trusty, while most of the production fleet is already on Debian Jessie or even Debian Stretch.

stat1005 is the new stat1002 replacement. If you have access to stat1002, you also have access to stat1005. I’ve copied over home directories from stat1002.

stat1006 is the new stat1003 replacement. If you have access to stat1003, you also have access to stat1006. I’ve copied over home directories from stat1003.

I have not migrated any personal cron jobs running on stat1002 or stat1003. I need your help for this!

Both of the new boxes are running Debian Stretch. As such, packages that your work depends on may have been upgraded. Please log into the new boxes and try stuff out! If you find anything that doesn’t work, please let me know by commenting on https://phabricator.wikimedia.org/T152712.

Please be fully migrated to the new nodes by September 1st. This will give us enough time to fully decommission stat1002 and stat1003 by the end of this quarter.

I’ve only done a single rsync of home directories. If there is new data on stat1002 or stat1003 that you want rsynced over, let me know on the ticket.

A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv. /a no longer exists.
- Home directories are now much larger. You no longer need to create personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs that generate temporary data, please have those jobs write into your home directory, rather than /tmp.
- We might implement user home directory quotas in the future.

Thanks all! I’ll send another email in about a month’s time to remind you of the impending deadline of Sept 1.

-Andrew Otto
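For jobs written in Python, the /tmp note above could be followed along these lines. This is only a sketch: the helper names and the `job-scratch-` prefix are made up for illustration, and it assumes the home directory is writable and has room for the job's intermediate data.

```python
import os
import tempfile


def make_scratch_dir(base=None):
    """Create a private scratch directory under `base` (default: the home directory)."""
    base = base if base is not None else os.path.expanduser("~")
    # mkdtemp creates a uniquely named directory readable only by the owner.
    return tempfile.mkdtemp(prefix="job-scratch-", dir=base)


def write_intermediate(lines, scratch_dir):
    """Write temporary job output into scratch_dir and return the file path."""
    with tempfile.NamedTemporaryFile(
        "w", dir=scratch_dir, suffix=".tsv", delete=False
    ) as handle:
        handle.writelines(line + "\n" for line in lines)
        return handle.name
```

Calling `make_scratch_dir()` with no argument places the directory under the home directory instead of /tmp. Files are created with `delete=False`, so the job is responsible for removing the scratch directory once it finishes.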

2 6

Research Showcase Wednesday, July 26, 2017 at 11:30 AM (PDT) 18:30 UTC
by Sarah R 02 Aug '17

Hi Everyone,

The next Research Showcase will be live-streamed this Wednesday, July 26, 2017 at 11:30 AM (PDT) 18:30 UTC.

YouTube stream: https://www.youtube.com/watch?v=yC1jgK8C8aQ

As usual, you can join the conversation on IRC at #wikimedia-research. You can also watch our past research showcases here <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#July_2017>.

This month's presentation:

Freedom versus Standardization: Structured Data Generation in a Peer Production Community
By *Andrew Hall*

In addition to encyclopedia articles and software, peer production communities produce *structured data*, e.g., Wikidata and OpenStreetMap’s metadata. Structured data from peer production communities has become increasingly important due to its use by computational applications, such as CartoCSS, MapBox, and Wikipedia infoboxes. However, this structured data is usable by applications only if it follows *standards.* We did an interview study focused on OpenStreetMap’s knowledge production processes to investigate how – and how successfully – this community creates and applies its data standards. Our study revealed a fundamental tension between the need to produce structured data in a standardized way and OpenStreetMap’s tradition of contributor freedom. We extracted six themes that manifested this tension and three overarching concepts, *correctness, community,* and *code,* which help make sense of and synthesize the themes. We also offer suggestions for improving OpenStreetMap’s knowledge production processes, including new data models, sociotechnical tools, and community practices.

Kindly,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org

5 7

Wikiscan statistics tool for Wikimedia projects
by Pine W 30 Jul '17

Wikiscan is an interesting tool for statistics fans. I suggest briefly reading this IEG page <https://meta.wikimedia.org/wiki/Grants:IEG/Wikiscan_multi-wiki>, then playing with the tool on https://wikiscan.org/ Pine

1 0

[Announcement] Voice and exit in a voluntary work environment
by Leila Zia 27 Jul '17

Hi all,

With the start of the new fiscal year at the Wikimedia Foundation on July 1, the Research team has officially started work on Program 12: Growing contributor diversity. [1] Here are a few announcements/pointers about this program and the research and work that will be going into it:

* We aim to keep the research documentation for this project on the corresponding research page on meta. [2]
* Research tasks are hard to break down and track in task-tracking systems. That said, any task that we can break down and track will be documented under the corresponding Epic task on Phabricator. [3]
* The goals for this Program for July-September 2017 (Quarter 1) are captured on MediaWiki. [4] (The Phabricator epic will be updated with corresponding tasks as we start working on them.)
* Our three formal collaborators (cc-ed) will contribute to this program: Jérôme Hergueux from ETH, Paul Seabright from TSE, and Bob West from EPFL. We are thankful to these people who have agreed to spend their time and expertise on this project in the coming year, and to those of you who have already worked with us as we were shaping the proposal for this project and are planning to continue your contributions to this program. :)
* I act as the point of contact for this research at the Wikimedia Foundation. Please feel free to reach out to me (directly, if it cannot be shared publicly) if you have comments/questions about the project in the coming year.

Best,
Leila

[1] https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/…
[2] https://meta.wikimedia.org/wiki/Research:Voice_and_exit_in_a_voluntary_work…
[3] https://phabricator.wikimedia.org/T166083
[4] https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2017-18_Q1#Resear…

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation

5 8

research trying to influence real-world outcomes by editing Wikipedia
by James Salsman 26 Jul '17

This was in the recent Research Newsletter:
https://www.econstor.eu/bitstream/10419/127472/1/847290360.pdf

They found a correlation between the length of articles about tourist destinations and the number of tourists visiting them. They tried to influence other destinations by adding content and did not find a correlation in the subsequent number of tourists, suggesting that the causation flows from tourism to article length instead.

But I was taken aback by the last line of their paper, "using the suggested research design to study other areas of information acquisition, such as medicine or school choices could be fruitful directions."

Are there any ethical guidelines concerning whether this is reasonable? Should there be?

4 5

The April 2017 issue of the Wikimedia Research Newsletter is out
by masssly＠ymail.com 25 Jul '17

The April 2017 issue of the Wikimedia Research Newsletter is out:
https://blog.wikimedia.org/2017/07/24/research-newsletter-april-2017/
https://meta.wikimedia.org/wiki/Research:Newsletter/2017/April

In this issue:
1 Chilling effects: The impact of surveillance awareness on Wikipedia pageviews

*** 13 recent publications were covered or listed in this issue ***

Contributors are still welcome for our next issue, which is planned to be completed on Thursday already - see https://etherpad.wikimedia.org/p/WRN201705

Masssly, Tilman Bayer and Dario Taraborelli

---
Wikimedia Research Newsletter
https://meta.wikimedia.org/wiki/Research:Newsletter/

* Follow us on Twitter: @WikiResearch
* Like us on Facebook: Facebook.com/WikiResearch/
* Receive this newsletter by mail: https://lists.wikimedia.org/mailman/listinfo/research-newsletter
* Subscribe to the RSS feed: http://blog.wikimedia.org/c/research-2/wikimedia-research-newsletter/feed/

1 0

category extraction question
by Leila Zia 25 Jul '17

Hi all,

[If you are not interested in discussions related to the category system (on English Wikipedia), you can stop here. :)]

We have run into a problem that some of you may have thought about or addressed before. We are trying to clean up the category system on English Wikipedia by turning the category structure into an IS-A hierarchy. (The output of this work can be useful for the research on template recommendation [1], for example, but the use-cases won't stop there.)

One issue that we are facing is the following: we are currently using SQL dumps to extract the categories associated with every article on English Wikipedia (main namespace). [2] Using this approach, we get 5 categories associated with the Flow cytometry bioinformatics article [3]:

Flow_cytometry
Bioinformatics
Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list

The problem is that only the first two categories are the ones we are interested in. We have one cleaning step through which we only keep categories that belong to category Article, and that step removes the last category above, but the other two Wikipedia_... categories remain. We need to somehow prune the data and clean it of those two categories.

One way to do this would be to parse wikitext instead of the SQL dumps and focus on extracting categories marked by the pattern [[Category:XX]], but in that case we would lose a good category such as Guided_missiles_of_Norway, because that one is generated by a template.

Any ideas on how we can start with a "cleaner" dataset of categories related to the topic of the articles, as opposed to maintenance-related or other types of categories?

Thanks,
Leila

[1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages

[2] The exact code we use is

    SELECT p.page_id id, p.page_title title, cl.cl_to category
    FROM categorylinks cl
    JOIN page p ON cl.cl_from = p.page_id
    WHERE cl_type = 'page' AND page_namespace = 0 AND page_is_redirect = 0

and the edges of the category graph are extracted with

    SELECT p.page_title category, cl.cl_to parent
    FROM categorylinks cl
    JOIN page p ON p.page_id = cl.cl_from
    WHERE p.page_namespace = 14

[3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics
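To make the problem concrete, the sketch below runs the article-category query from [2] against a tiny in-memory SQLite database holding only the Flow cytometry bioinformatics example, then applies a name-prefix filter. The prefix list, the helper names, and the toy page_id are assumptions introduced here for illustration; this is one possible heuristic, not the approach used by the author, and a prefix rule may well discard topical categories or miss other maintenance ones.

```python
import sqlite3

# Hypothetical prefix list; a heuristic for illustration, not an established rule.
MAINTENANCE_PREFIXES = ("Wikipedia_articles_", "CS1_")

# The article-category query from the message, with explicit aliases.
ARTICLE_CATEGORIES_SQL = """
SELECT p.page_id AS id, p.page_title AS title, cl.cl_to AS category
FROM categorylinks cl
JOIN page p ON cl.cl_from = p.page_id
WHERE cl.cl_type = 'page' AND p.page_namespace = 0 AND p.page_is_redirect = 0
"""


def build_toy_db():
    """Create an in-memory stand-in for the dump tables (only the columns used)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page (page_id INTEGER, page_title TEXT, "
        "page_namespace INTEGER, page_is_redirect INTEGER)"
    )
    conn.execute(
        "CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT)"
    )
    # page_id 1 is a made-up identifier for the example article.
    conn.execute("INSERT INTO page VALUES (1, 'Flow_cytometry_bioinformatics', 0, 0)")
    categories = [
        "Flow_cytometry",
        "Bioinformatics",
        "Wikipedia_articles_published_in_peer-reviewed_literature",
        "Wikipedia_articles_published_in_PLOS_Computational_Biology",
        "CS1_maint:_Multiple_names:_authors_list",
    ]
    conn.executemany(
        "INSERT INTO categorylinks VALUES (1, ?, 'page')",
        [(name,) for name in categories],
    )
    return conn


def topical_categories(conn):
    """Return (title, category) pairs whose category lacks a maintenance prefix."""
    rows = conn.execute(ARTICLE_CATEGORIES_SQL).fetchall()
    return [
        (title, category)
        for _page_id, title, category in rows
        if not category.startswith(MAINTENANCE_PREFIXES)
    ]
```

On the toy data, `topical_categories(build_toy_db())` keeps only Flow_cytometry and Bioinformatics. Whether a prefix rule generalises to the full category graph is exactly the open question raised in the message.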

4 7
