2016 files are uploaded to the Internet Archive. Identifier: "enwiki-pageviews2007-2016".
On Mon, Dec 12, 2016 at 1:00 PM, <
wiki-research-l-request(a)lists.wikimedia.org> wrote:
> Send Wiki-research-l mailing list submissions to
> wiki-research-l(a)lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> or, via email, send a message with subject or body 'help' to
> wiki-research-l-request(a)lists.wikimedia.org
>
> You can reach the person managing the list at
> wiki-research-l-owner(a)lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Wiki-research-l digest..."
>
>
> Today's Topics:
>
> 1. Upcoming research newsletter (November 2016): new papers open
> for review (masssly(a)ymail.com)
> 2. another pageview db to download (Alex Druk)
> 3. Re: another pageview db to download (Federico Leva (Nemo))
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 11 Dec 2016 22:57:34 +0000
> From: <masssly(a)ymail.com>
> To: Wikimedia Research Mailing List
> <wiki-research-l(a)lists.wikimedia.org>
> Subject: [Wiki-research-l] Upcoming research newsletter (November
> 2016): new papers open for review
> Message-ID: <726158.24370.bm(a)smtp108.mail.ir2.yahoo.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi everybody,
>
> We’re preparing for the November 2016 research newsletter and looking for
> contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201611
> and add your name next to any paper you are interested in
> covering. Reviews should be in before December 14. As usual, short notes
> and one-paragraph reviews are most welcome.
>
> Highlights from this month:
>
> • Black Lives Matter in Wikipedia: Collaboration and Collective Memory
> around Online Social Movements
> • DePP: A System for Detecting Pages to Protect in Wikipedia
> • Digital Heritage. Progress in Cultural Heritage: Documentation,
> Preservation, and Protection
> • Docforia: A Multilayer Document Model
> • Does astronomy research become too dated for the public? Wikipedia
> citations to astronomy and astrophysics journal articles 1996-2014
> • Election Prediction Based on Wikipedia Pageviews
> • Establishing and Evaluating Digital Ethos and Online Credibility
> • Finding and Expanding Hypernymic Relations in the Music Domain
> • Game with a Purpose for mappings verification
> • Hierarchical Question Answering for Long Documents
> • How Many People Constitute a Crowd and What Do They Do? Quantitative
> Analyses of Revisions in the English and German Wiktionary Editions
> • Measuring Quality of Collaboratively Edited Documents: the case of
> Wikipedia
> • On Emerging Entity Detection
> • Predicting Importance of Historical Persons Using Wikipedia
> • Relationship between personality and attitudes to Wikipedia
> • Social patterns and dynamics of creativity in Wikipedia
> • Travel Attractions Recommendation with Knowledge Graphs
> • What Makes a Link Successful on Wikipedia?
>
> If you have any questions about the format or process, feel free to get in
> touch off-list.
>
> Masssly, Tilman Bayer and Dario Taraborelli
>
> [1] http://meta.wikimedia.org/wiki/Research:Newsletter
>
Hi,
I’m looking for statistical information about template usage on Wikipedia. In particular, I’m interested in the number of uses per template (I need to know which templates are the most popular ones) and also in the number of template transclusions vs. substitutions.
Can somebody help me?
Thanks a lot,
Felix Engelmann
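On the replicas, per-template usage counts can in principle be read from the `templatelinks` table, which has one row per transclusion; substitutions copy the wikitext into the page and leave no row there, so they can't be counted the same way. The sketch below runs that aggregation against a toy in-memory SQLite copy of the table — the data is invented, and the column names follow the MediaWiki schema of the time (`tl_from`, `tl_namespace`, `tl_title`):

```python
# Toy sketch: rank templates by transclusion count, as one would on the
# Labs replicas with `templatelinks`. Data is made up; only the schema
# (tl_from, tl_namespace, tl_title) mirrors MediaWiki.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE templatelinks (tl_from INT, tl_namespace INT, tl_title TEXT)"
)
conn.executemany(
    "INSERT INTO templatelinks VALUES (?, ?, ?)",
    [
        (1, 10, "Citation_needed"),
        (2, 10, "Citation_needed"),
        (3, 10, "Infobox_person"),
        (2, 10, "Infobox_person"),
        (4, 10, "Citation_needed"),
    ],
)

# Most-transcluded templates first (namespace 10 = Template).
top = conn.execute(
    """SELECT tl_title, COUNT(*) AS uses
       FROM templatelinks
       WHERE tl_namespace = 10
       GROUP BY tl_title
       ORDER BY uses DESC"""
).fetchall()
# top == [('Citation_needed', 3), ('Infobox_person', 2)]
```

On a real replica the same GROUP BY would run against, e.g., `enwiki_p.templatelinks`; counting substitutions would instead require scanning revision text or edit summaries.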
Forwarding to Analytics, Research, and Wikimetrics in case this is of
interest to people who aren't subscribed to the Labs mailing list.
Pine
---------- Forwarded message ----------
From: Bryan Davis <bd808(a)wikimedia.org>
Date: Tue, Dec 6, 2016 at 9:28 AM
Subject: [Labs-l] Tell us about the SQL that you can't get to work
To: labs-l <labs-l(a)lists.wikimedia.org>
In early January there is going to be a Developer Summit in San
Francisco [0]. Chase and I are in charge of scheduling talks on the
topic "Building on Wikimedia services: APIs and Developer Resources".
One of the proposed talks I find most interesting is "Labsdbs for WMF
tools and contributors: get more data, faster" by Jamie Crespo [1].
I know that most of you won't be able to attend in person, but if we
can show that there is enough interest in this topic we can get the
talk scheduled in a main room and recorded so anyone can watch it
later.
An idea I just had for showing interest is to get Tool Labs
maintainers and other Labs users to describe questions that they have
tried and failed to answer using SQL queries. We can look at the kinds
of questions that come up and ask Jamie (and others) if there are some
general recommendations that can be made about how to improve
performance or understand how the bits and pieces of our data model
fit together.
To kick things off, here's an example I tried to help with over the
weekend. A Quarry user was adapting a query they had used before to
find non-redirect File namespace pages not paired with binary files on
Commons. The query they had come up with was:
SELECT DISTINCT page_title, img_name
FROM (
SELECT DISTINCT page_title
FROM page WHERE page_namespace = 6
AND page_is_redirect = 0
) AS page
LEFT JOIN (
SELECT DISTINCT img_name
FROM image
) AS image ON page_title=img_name
WHERE img_name IS NULL;
The performance of this is horrible for several reasons including the
excessive use of DISTINCT. The query was consistently killed by the 30
minute runtime limit. MaxSem and I both came up with about the same
optimization that eliminated the sub-queries and use of DISTINCT:
SELECT page_title, img_name
FROM page LEFT OUTER JOIN image ON page_title=img_name
WHERE page_namespace = 6
AND page_is_redirect = 0
AND img_name IS NULL;
This new query is not fast in any sense of the word, but it does
finish without timing out. There is still some debate about whether
the 906 rows it returned are correct or not [2].
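For what it's worth, the same anti-join can also be written with NOT EXISTS; whether MariaDB's optimizer handles that form any better on the Labsdbs would need testing. A toy check that the two forms return the same rows (invented data, MediaWiki-style column names):

```python
# Compare the LEFT JOIN ... IS NULL anti-join with an equivalent
# NOT EXISTS form on a toy page/image schema. Column names follow
# MediaWiki; the rows are made up for the sketch.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_title TEXT, page_namespace INT, page_is_redirect INT);
CREATE TABLE image (img_name TEXT);
INSERT INTO page VALUES ('A.jpg', 6, 0), ('B.jpg', 6, 0),
                        ('C.jpg', 6, 1), ('D', 0, 0);
INSERT INTO image VALUES ('A.jpg');
""")

left_join = conn.execute(
    """SELECT page_title FROM page
       LEFT OUTER JOIN image ON page_title = img_name
       WHERE page_namespace = 6 AND page_is_redirect = 0
         AND img_name IS NULL"""
).fetchall()

not_exists = conn.execute(
    """SELECT page_title FROM page
       WHERE page_namespace = 6 AND page_is_redirect = 0
         AND NOT EXISTS (SELECT 1 FROM image WHERE img_name = page_title)"""
).fetchall()

# Both flag only the non-redirect File page without a binary file.
assert left_join == not_exists == [('B.jpg',)]
```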
[0]: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit
[1]: https://phabricator.wikimedia.org/T149624
[2]: https://quarry.wmflabs.org/query/14501
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Sr Software Engineer Boise, ID USA
irc: bd808 v:415.839.6885 x6855
_______________________________________________
Labs-l mailing list
Labs-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/labs-l
As we start to wrap up the calendar year, here's what we've accomplished
so far with Wikistats. We're really excited to have some data in our
production Hive database for people to play with. We worked really hard to
clean up and present an intuitive interface to all of MediaWiki history.
The results are captured in the tables mentioned below, which we'll cover
more in an upcoming tech talk. Documentation for the project is here
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>.
Our goals so far and progress breakdown:
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public
consumption
4. [ beta] Build pipeline to process and analyze *editing* data
5. [ beta] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality
of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe)
analytics.wikipedia.org <http://analytics.wikipedia.org/>*
*. [ ] Bonus: *replace dumps generation* based on the new data
pipelines
4 & 5. Since our last update, we've finished the pipeline that imports
data from mediawiki databases, cleans it up as best as possible, reshapes
it in an analytics-friendly way, and makes it easily queryable. I'm marking
these goals as "beta" because we're still tweaking the algorithm for
performance and productionizing the jobs. This will be completed early
next quarter, but in the meantime we have data for people to play with
internally. Sadly we haven't sanitized it yet so we can't publish it. For
those with internal access:
* https://pivot.wikimedia.org/#edit-history-test is the full history across
all wikis. It's a bit hard to understand how to slice and dice, so we will
host a tech talk and present it at the January metrics meeting if we can.
* In hive, you can access this data in the wmf database, the tables are:
- wmf.mediawiki_history: denormalized full history with this schema
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_history>
- wmf.mediawiki_page_history: the sequence of states of each wiki page (
schema
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_page_hist…>
)
- wmf.mediawiki_user_history: the sequence of states of each user
account (schema
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Mediawiki_user_hist…>
)
6. Sanitizing has not moved forward, as we need DBA time and the DBAs have
been overloaded. We will attempt to restart this effort in Q3.
7. We have begun the design process; we'll share more about this as we go.
Our goals and planning for next quarter support finishing 4, 5, 7, and
8: basically, putting a UI on top of the data pipeline we have in place
and updating it weekly. We also hope to make good progress on 6, but that
depends on collaboration with the DBA team and is harder than we originally
imagined.
And remember, voice your opinions about important reports in the current
Wikistats here:
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_r…
(thank you so so much to the many people who already chimed in).
Thank you for your questions, Jan.
> Is this on questions on Wikipedia Articles which ask for an
> estimate of good, neutral or bad assertions (or generally
> sentiments) about a subject?
After the Signpost ran a blurb last month on research successfully
predicting company stock price changes using pageviews (confirming
similar work from 2013), I tried to find anyone using the textual
substance of edits to do the same thing. I found this:
http://community.wolfram.com/groups/-/m/t/882612
It finds small but consistently positive correlations between the
sentiment of companies' article edit summaries (classified by the text
sentiment model that ships with Wolfram Mathematica) and their daily
stock price changes. The significance is low, in part because using the
sentiment of edit summaries is a very naive approach. So I wonder if anyone has
tried to train a sentiment analysis model to address the task directly
with full diffs.
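As a minimal sketch of the naive summary-level approach (a stand-in for the Mathematica classifier, not a trained model; the word lists here are invented purely for illustration):

```python
# Toy lexicon-based sentiment scorer for edit summaries. A serious
# attempt would train a model on full diffs, as discussed above; this
# only illustrates the shape of the naive baseline.
POSITIVE = {"improve", "expand", "add", "fix", "update"}
NEGATIVE = {"revert", "remove", "vandalism", "spam", "delete"}

def summary_sentiment(summary: str) -> float:
    """Score in [-1, 1]: (positive tokens - negative tokens) / all tokens."""
    tokens = [t.strip(".,") for t in summary.lower().split()]
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

scores = [summary_sentiment(s) for s in [
    "add sourced revenue figures, fix infobox",  # positive
    "revert spam",                               # negative
    "copyedit",                                  # neutral
]]
```

The daily scores would then be averaged per company article and correlated against next-day price changes, which is where the small positive correlations in the Wolfram post come from.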
> Or are you more interested in the subject of lobbyism and
> company directed edits and the like?
I'm more interested in identifying organized advocacy, and I suspect
such models would help with that, too, especially if brand product
articles are included along with companies.
2016-12-01 4:12 GMT+01:00 James Salsman <jsalsman(a)gmail.com>:
>
> Who, if anyone, is examining crowdsource survey
> questions such as, "Look at the text added or
> removed in this edit to [Company]'s Wikipedia
> article. Was the editor saying [ ] good things, [ ]
> bad things, or [ ] was neutral about [Company]'s
> financial prospects?"?
Hi James,
Just to understand better what you are interested in –
Is this on questions on Wikipedia Articles which ask for an estimate of
good, neutral or bad assertions (or generally sentiments) about a subject?
Or are you more interested in the subject of lobbyism and company directed
edits and the like?
Jan
2016-12-01 4:12 GMT+01:00 James Salsman <jsalsman(a)gmail.com>:
> Who, if anyone, is examining crowdsource survey
> questions such as, "Look at the text added or
> removed in this edit to [Company]'s Wikipedia
> article. Was the editor saying [ ] good things, [ ]
> bad things, or [ ] was neutral about [Company]'s
> financial prospects?"?
>
> Best regards,
> Jim
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
--
Jan Dittrich
UX Design/ User Research
Wikimedia Deutschland e.V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
http://wikimedia.de
Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
Who, if anyone, is examining crowdsource survey
questions such as, "Look at the text added or
removed in this edit to [Company]'s Wikipedia
article. Was the editor saying [ ] good things, [ ]
bad things, or [ ] was neutral about [Company]'s
financial prospects?"?
Best regards,
Jim