Wiki-research-l July 2016

wiki-research-l@lists.wikimedia.org

36 participants
26 discussions

by song＠cs.umn.edu

Pursuant to prior discussions about the need for a research policy on Wikipedia, WikiProject Research is drafting a policy regarding the recruitment of Wikipedia users to participate in studies. At this time, we have a proposed policy, and an accompanying group that would facilitate recruitment of subjects in much the same way that the Bot Approvals Group approves bots.

The policy proposal can be found at: http://en.wikipedia.org/wiki/Wikipedia:Research

The Subject Recruitment Approvals Group mentioned in the proposal is being described at: http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group

Before we move forward with seeking approval from the Wikipedia community, we would like additional input about the proposal, and would welcome additional help improving it. Also, please consider participating in WikiProject Research at: http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research

-- Bryan Song
GroupLens Research
University of Minnesota

9 months, 1 week

[Analytics] Beeline as Hive client

by Madhumitha Viswanathan

Hi all,

If you use Hive on stat1002/1004, you might have seen a deprecation warning when you launch the hive client, saying that it is being replaced with Beeline. The Beeline shell has always been available, but it required supplying a database connection string every time, which was pretty annoying. We now have a wrapper script <https://github.com/wikimedia/operations-puppet/blob/production/modules/role…> set up to make this easier. The old Hive CLI will continue to exist, but we encourage moving over to Beeline.

You can use it by logging into the stat1002/1004 boxes as usual and launching `beeline`. There is some documentation on this here: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.

If you run into any issues using this interface, please ping us on the Analytics list or #wikimedia-analytics, or file a bug on Phabricator <http://phabricator.wikimedia.org/tag/analytics>.

(If you are wondering stat1004 whaaat - there should be an announcement coming up about it soon!)

Best,
--Madhu :)

5 years, 6 months

Wikipedia aggregate clickstream data released

by Dario Taraborelli

We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:

http://dx.doi.org/10.6084/m9.figshare.1305770

This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.

This data can be used for various purposes:

• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia

We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream

Ellery and Dario
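As a rough illustration of the Markov chain use case in the announcement above, the (referer, article) counts can be normalized per referer to give transition probabilities. This is a minimal sketch only: the `(referer, article, count)` tuple layout and the function name `transition_probabilities` are assumptions made for the example, not the documented file format of the dataset.

```python
from collections import defaultdict


def transition_probabilities(rows):
    """Turn (referer, article, count) rows into per-referer probabilities.

    The row layout is an assumption for illustration; adapt the parsing
    to the real column order of the released files.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for referer, article, count in rows:
        # Accumulate in case the same pair appears on more than one row.
        counts[referer][article] += count

    probabilities = {}
    for referer, targets in counts.items():
        total = sum(targets.values())
        probabilities[referer] = {
            article: count / total for article, count in targets.items()
        }
    return probabilities


# Tiny hand-made sample, not real clickstream data.
sample_rows = [("A", "B", 3), ("A", "C", 1), ("B", "C", 2)]
sample_probs = transition_probabilities(sample_rows)
```

Each inner dictionary sums to 1, so it can be used directly as one row of a transition matrix. Note that these are probabilities conditional on a click being recorded for that pair; they say nothing about readers who left the article without following a link.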

6 years, 3 months

link trails in different languages

by Amir E. Aharoni

Hi,

Here's a fun, simple little idea: did anybody ever try to find the most common link trails in wikis in different languages?

In English, for example, the two most common ones will probably be "s" and "es", in links like [[bottle]]s and [[box]]es; these two possibly appear millions of times in the English Wikipedia. And there are certainly many other common trails in English. In other languages they will be different.

I can easily do it myself some time by running on a dump. Just wondering whether anybody already tried it.

-- Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces, I want to live in peace.” – T. Moore
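The counting idea in the message above could be sketched roughly as follows. This is only an illustration under stated assumptions: it treats a link trail as a run of lowercase ASCII letters immediately after the closing `]]`, which is a plausible approximation for English but not for other languages, where the set of trail characters differs. The names `LINK_WITH_TRAIL` and `count_link_trails` are made up for the example.

```python
import re
from collections import Counter

# Assumption: a trail is one or more lowercase ASCII letters directly
# after "]]". Other languages would need a different character class.
LINK_WITH_TRAIL = re.compile(r"\[\[[^\[\]]+\]\]([a-z]+)")


def count_link_trails(wikitext):
    """Count the trails that follow [[...]] links in a piece of wikitext."""
    return Counter(LINK_WITH_TRAIL.findall(wikitext))


sample_text = (
    "Two [[bottle]]s, three [[box]]es and a [[cat]]. "
    "More [[cup]]s and [[Main Page|home]]ward."
)
trail_counts = count_link_trails(sample_text)
```

To run this over a dump, the same function could be applied to the text of each revision and the resulting counters added together. Links without a trail, such as `[[cat]].` in the sample, are simply not counted.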

7 years, 8 months

Fwd: [Wikistats 2.0] [Regular Update] First update on Wikistats 2.0

by Dan Andreescu

Sorry, I'm bad at remembering to cross-post

---------- Forwarded message ----------
From: Dan Andreescu <dandreescu(a)wikimedia.org>
Date: Fri, Jul 29, 2016 at 11:22 PM
Subject: [Wikistats 2.0] [Regular Update] First update on Wikistats 2.0
To: Analytics List <analytics(a)lists.wikimedia.org>

Hi,

Welcome to the first of a series of semi-regular updates on our progress towards Wikistats 2.0. As you may have seen from the banners on stats.wikimedia.org, we're working on a replacement for Wikistats. Erik talked about this in his announcement [1]. To summarize it from our point of view:

* Wikistats has served the community very well so far, and we're looking to keep every bit of value in the upgrade
* Wikistats depends on the dumps generation process which is getting slower and slower due to its architecture. Because of this, most editing metrics are delayed by weeks through no fault of the Wikistats implementation
* Finding data on Wikistats is a bit hard for new users, so we're working on new ways to organize what's available and present it in a comprehensive way along with other data sources like dumps

This regular update is meant to keep interested people informed on the direction and progress of the project.

Of course, Wikistats 2.0 is not a new project. We've already replaced the data pipeline behind the pageview reports on stats.wikimedia.org. But the end goal is a new data pipeline for editing, reading, and beyond, plus a nice UI to help guide people to what they need. Since this is the first update, I'll lay out the high level milestones along with where we are, and then I'll give detail about the last few weeks of work.

1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality of stats.wikipedia.org
9. [ ] Officially Replace stats.wikipedia.org with *(maybe) analytics.wikipedia.org <http://analytics.wikipedia.org>*
***. [ ] Bonus: *replace dumps generation* based on the new data pipelines

Our focus last year was pageview data, and that's how we got 1 and 2 done. 3 is mostly done except deploying the logic and making the data available. So 4, 5, and 6 are what we're working on now. As we work on these pieces, we'll take vertical slices of different important metrics and take them from the data processing all the way to the dashboards that present the results. That means we'll make incremental progress on 8 and 9 as we go. But we won't be able to finish 7 and 9 until we have a cohesive design to wrap around it all. We don't want to introduce yet more dashboard hell; we want to save you, the consumers, from all that.

So the focus right now is on the editing data pipeline. What do I mean by this? Data is already available in Quarry and via the API. That's true, but here are some problems with that data:

* lack of historical change information. For example, we only have pageview data by the title of the page. If we wanted to get all the pageviews for a page that's now called C, but was called B two months ago and A three months before that, we have to manually parse PHP-serialized parameters in the logging table to trace back those page moves
* no easy way to look at data across wikis. If someone asks you to run a Quarry query to look at data from all wikipedias, you have to run hundreds of separate queries, one for each database
* no easy way to look at a lot of data. Quarry and other tools time out after a certain amount of time to protect themselves. Downloading dumps is a way to get access to more data but the files are huge and analysis is hard
* querying the API with complex multi-dimensional analytics questions isn't possible

These are the kinds of problems we're trying to solve. Our progress so far:

* Retraced history through the logging table to piece together what names each page has had throughout its life. Deleted pages were included in this reconstruction
* Found what names each user has had throughout their life, and what rights and blocks were applied to or removed from users
* Wrote event schemas for Event Bus, which will feed data into this pipeline in near real time (so metrics and dashboards can be updated in near real time)
* Came up with a single denormalized schema that holds every single kind of event possible in the editing world. This is a join of the Event Bus schemas mentioned above and is possible to feed either in batch from our reconstruction algorithm or in real time. If you're familiar with lambda architecture, this is the approach we're taking to make our editing data available

Right now we're testing the accuracy of our reconstruction against Wikistats data. If this works, we'll open up the schema to more people to play with so they can give feedback on this way of doing analytics. And if all that looks good, we'll be loading the data into Druid and Hive and running the most high priority metrics on this new platform. We hope to be done with this by the end of this quarter. To weigh in on what reports are important, make sure you visit Erik's page [2]. We'll also do a tech talk on our algorithm for historical reconstruction and lessons learned on MediaWiki analytics.

If you're still reading, congratulations, sorry for the wall of text. I look forward to keeping you all in the loop, and to making steady progress on this project that's very dear to our hearts. Feel free to ask questions and if you'd like to be involved, just let me know how.

Have a nice weekend :)

[1] http://infodisiac.com/blog/2016/05/wikistats-days-will-be-over-soon-long-li…
[2] https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_r…
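To make the page-move problem from the update above concrete (a page now called C that was called B and, before that, A), here is a small sketch of walking a list of move events backwards in time. It is not the team's reconstruction algorithm: the `(timestamp, old_title, new_title)` event layout and the function name `title_history` are assumptions for illustration, and it ignores the harder parts the update mentions, such as parsing PHP-serialized log parameters, deleted pages, and titles that are reused by different pages.

```python
def title_history(current_title, moves):
    """Return the titles a page has had, newest first.

    moves: list of (timestamp, old_title, new_title) tuples in any order.
    Timestamps are assumed to be sortable strings such as ISO 8601 dates.
    """
    names = [current_title]
    title = current_title
    # Walk from the most recent move to the oldest one.
    for _timestamp, old_title, new_title in sorted(moves, reverse=True):
        if new_title == title:
            names.append(old_title)
            title = old_title
    return names


# Hand-made sample events, not real logging-table data.
sample_moves = [
    ("2016-02-01", "A", "B"),
    ("2016-05-01", "B", "C"),
    ("2016-03-01", "X", "Y"),  # unrelated page, ignored for "C"
]
history_of_c = title_history("C", sample_moves)
```

With a list like `history_of_c`, per-title pageview counts could then be summed across every name the page has had.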

7 years, 8 months

pagecounts and stub-meta-history

by Bruno Goncalves

Hi,

I've been trying to match edit activity with pagecounts but I've encountered a couple of problems. The amazing pagecounts dumps (https://dumps.wikimedia.org/other/pagecounts-raw/) use the page URL to identify the individual page:

    fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624

while the stub-meta-history uses the "raw" title:

    <page>
      <title>Wikipedia:Community Portal</title>
      <ns>4</ns>
      <id>1270</id>

so I need an easy way to map title to URL. I imagine there are some rules on how this "translation" is done? My google-fu has failed to find them.

Also, is the timezone used in the meta-history files:

    <timestamp>2006-02-18T19:29:10Z</timestamp>

the same as the one used in the pagecount filenames?

    pagecounts-20140725-070000.gz

Best,
B

*******************************************
Bruno Miguel Tavares Gonçalves, PhD
Homepage: www.bgoncalves.com
Email: bgoncalves(a)gmail.com
*******************************************
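One possible way to approach the title matching asked about above is to normalize both sides to a common key, rather than trying to reproduce the exact URL form. The sketch below rests on two assumptions: that page URLs use underscores where titles use spaces, and that the remaining differences are percent-encoding. The function names and the choice of characters left unescaped are illustrative only, and since the pagecounts files record titles as they were requested, some entries may still not match.

```python
from urllib.parse import quote, unquote


def normalize_title(url_title):
    """Decode a URL-style title into a key comparable with <title> text."""
    # Percent-decode first, then turn underscores back into spaces.
    return unquote(url_title).replace("_", " ")


def title_to_url_form(title):
    """Approximate the URL form of a dump <title>.

    Assumption: spaces become underscores and ":" and "/" stay unescaped.
    The real encoding rules may leave more characters unescaped.
    """
    return quote(title.replace(" ", "_"), safe="/:")


portal_key = normalize_title("Wikipedia:Community_Portal")
person_key = normalize_title("Achille_Baraguey_d%27Hilliers")
portal_url = title_to_url_form("Wikipedia:Community Portal")
```

Matching on the decoded key tends to be more forgiving than matching on the encoded form, because different clients may encode the same title differently.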

7 years, 8 months

Q4-2016 (April-June) quarterly report for Wikimedia Research

by Dario Taraborelli

This is what we've been up to at Wikimedia Research this past quarter (April - June 2016): - Research and Data <https://commons.wikimedia.org/w/index.php?title=File:Technology_Quarterly_R…> - Design Research <https://commons.wikimedia.org/w/index.php?title=File%3ATechnology_Quarterly…> You might also be interested in the Analytics Engineering <https://commons.wikimedia.org/w/index.php?title=File:Technology_Quarterly_R…> team's quarterly report. Best, Dario *Dario Taraborelli *Head of Research, Wikimedia Foundation wikimediafoundation.org • nitens.org • @readermeter <http://twitter.com/readermeter>

7 years, 8 months

Discussion on Arc.heolo.gy: applications, volunteers, and stack design

by Ian Seyer

Full disclosure: I am the creator of the Project Grant application for Arc.heolo.gy, located here: https://meta.wikimedia.org/wiki/Grants:Project/Arc.heolo.gy

I hope for this to be a general discussion on potential applications, criticisms, questions, technological recommendations, and community discussion.

Currently, the project has a live Neo4j graph database built and parsed from a download of the English language Wikipedia from April. I have temporarily hosted the database instance both on my local machine and on a SoftLayer server provided under a temporary entrepreneur credit.

My goal is twofold.

On the backend: refine the parsing algorithm (I am getting some incorrect relationships in the database), automate the parsing so that it updates the database frequently, expand language support, and perform semantic parsing to weight individual relationships to strengthen the ability to filter out extraneous relationships.

On the frontend: I have done little to no work here beyond pure conceptualization. I would hope to use an asynchronous front-end JavaScript framework to build both a 2D (d3) and 3D (WebGL) interface to be able to explore the database with a high amount of control and ease.

If any of you would like to access the database for exploration, please contact me privately and I will give you credentials. Any recommendations on parsing, hosting, visualization, or otherwise are appreciated. Endorsements and volunteers are also highly appreciated!

p.s. I am new to directly engaging with the Wiki community, and if I committed some faux pas in starting this thread please let me know and I will do my best to correct it.

-- ╭╮ ╭╮┃┃ ╭╮ ╭╮┃┃┃┃╭╮ ┃┃ ╭╮ ┃╰╯╰╯┃┃╰ ╭╮┃┃╭╮┃┃╭╮┃ ╰╯ ╭╮ ┃┃┃┃┃╰╯┃┃╰╯ ┃┃╭╮┃╰╯┃┃ ╰╯ ╮┃╰╯┃┃ ╰╯ ╰╯ ┃┃ ╰╯

7 years, 8 months

"Cases" section approved, finishing "Committee" section of Code of Conduct

by Matthew Flaschen

The community has approved the "Cases" (https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Page:_Code_of_Conduct.…) section of the draft Code of Conduct. The next section is "Committee". * Section: https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Page:_Code_of_Conduct.… * Talk: https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft * Alternatively, you can provide anonymous feedback to conduct-discussion at wikimedia.org . This is the best time to make any final necessary changes to the Committee section (and explain why, in edit summaries and/or talk) and discuss it on the talk page. After this last call, I will send out emails seeking approval of this section, organized by sub-section. Thanks, Matt Flaschen

7 years, 8 months

Upcoming research newsletter (July 2016): new papers open for review

by Mohammed Sadat

Hi everybody,

We’re preparing for the July 2016 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201607 and add your name next to any paper you are interested in covering. Our target publication date is Monday August 1 UTC although actual publication might happen several days later. As usual, short notes and one-paragraph reviews are most welcome.

Highlights from this month:

• An Empirical Evaluation of Property Recommender Systems for Wikidata and Collaborative Knowledge Bases
• Breaking the glass ceiling on Wikipedia
• Centrality and Content Creation in Networks - The Case of Economic Topics on German Wikipedia
• Comparative assessment of three quality frameworks for statistics derived from big data: the cases of Wikipedia page views and Automatic Identification Systems
• Competencias informacionales básicas y uso de Wikipedia en entornos educativos
• Computational Science and Its Applications
• Controversy Detection in Wikipedia Using Collective Classification
• Discovery and efficient reuse of technology pictures using Wikimedia infrastructures.
• Dynamics and Biases of Online Attention: The Case of Aircraft Crashes
• Evaluating and Improving Navigability of Wikipedia: A Comparative Study of Eight Language Editions
• Extracting Scientists from Wikipedia
• From Digital Library Citation Parsing to Wikipedia Reference Analysis
• Monitoring the Gender Gap with Wikidata Human Gender Indicators
• Platform affordances and data practices: The value of dispute on Wikipedia
• Stationarity of the inter-event power-law distributions
• Using Wikipedia to Teach Discipline Specific Writing
• 日本の大学生のWikipediaに対する信憑性認知，学習における利用実態とそれらに影響を与える要因 (Google Translate: Factors that give Japan's credibility awareness of Wikipedia of college students, use in learning actual situation and the impact on them)

If you have any questions about the format or process, feel free to get in touch off-list.

Masssly, Tilman Bayer and Dario Taraborelli

[1] http://meta.wikimedia.org/wiki/Research:Newsletter

7 years, 8 months
