Hi everyone,
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian
<https://it.wikipedia.org/wiki/Categoria:Stub> or English
<https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the lists
are in different formats, so separate code is required for each language,
which doesn't scale.
I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires knowing,
for each language, what the respective template is. So if anyone could point
me to a list of stub templates in different languages, that would also be
appreciated.
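For the record, here's the rough fallback I have in mind, as a minimal
Python sketch. It assumes the enwiki convention that stub template names
contain "stub", which is exactly the assumption that doesn't hold across
languages; the dump filename and XML namespace version are placeholders.

import bz2
import re
import xml.etree.ElementTree as ET

# Placeholder filename; any pages-articles XML dump works.
DUMP = "enwiki-latest-pages-articles.xml.bz2"

# Assumption: stub templates contain "stub" in their name, as on enwiki.
# Other languages often use entirely different names.
STUB_RE = re.compile(r"\{\{[^{}|]*stub[^{}]*\}\}", re.IGNORECASE)

# The export namespace version differs between dump generations.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            if STUB_RE.search(text):
                print(title)
            elem.clear()  # keep memory bounded while streaming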
Thanks!
Bob
--
Up for a little language game? -- http://www.unfun.me
I am pleased to announce that, thanks to Google Summer of Code student
Priyanka Mandikal, the Accuracy Review of Wikipedias project has delivered
a working demonstration, with open source code and data available here:
https://github.com/priyankamandikal/arowf/
Please try it out at:
http://tools.wmflabs.org/arowf/
We need your help to test it and send us comments. You
can read more about the project here:
https://priyankamandikal.github.io/posts/gsoc-2016-project-overview/
The formal project report, still in progress (Google Docs comments
from anyone are most welcome), is at:
https://docs.google.com/document/d/1_AiOyVn9Qf5ne1qCHIygUU3OTJcbpkb14N3rIty…
This allows experiments to measure, for example, how long it would
take to complete proofreading of the wikipedias with and without
paying editors to work alongside volunteers. I am sure everyone agrees
that is an interesting question which bears directly on budget
expectations. I hope multiple organizations use the published methods
and their Python implementations to make such measurements. I would
also like to suggest a proposal related to the questions in both of
the following reviews:
http://unotes.hartford.edu/announcements/images/2014_03_04_Cerasoli_and_Nic…
http://onlinelibrary.wiley.com/doi/10.1111/1748-8583.12080/abstract
The most recent solicitation of community input for the Foundation's
Public Policy team I've seen said that they would like suggestions for
specific issues as long as the suggestions did not involve
endorsements of or opposition to any specific candidates. My support
for adjusting copyright royalties on a sliding scale to transfer
wealth from larger to smaller artists has been made clear, and I do
not believe there are any concerns that I have not addressed
concerning alignment to mission or effectiveness. I would also like to
propose a related endorsement.
The Making Work Pay tax credit (MWPTC) is a negative payroll tax that
expired in 2010. It has all the advantages of an expanded Earned
Income Tax Credit (EITC) but would happen with every paycheck.
Reinstating the Making Work Pay tax credit would serve to reduce
economic inequality.
This proposal is within the scope of the Foundation's mission because
reducing economic inequality should empower people to develop educational
content for the projects: a broader set of potential editors would have
more support for artistic production and more discretionary free time
thanks to increased wealth. This proposal is needed because economic
inequality produces
more excess avoidable deaths and leads to fewer years of productive
life than global warming. This proposal would provide substantial
benefits to the movement, the community, the Foundation, the US and
the world if it were to be successfully adopted. For the reasons
stated above, this proposal will be seen as positive.
Here is some background and supporting information:
* MWPTC overview: https://en.wikipedia.org/wiki/Making_Work_Pay_tax_credit
* MWPTC details: http://tpcprod.urban.org/taxtopics/2011_work.cfm
* Problems with expanding the EITC:
http://www.taxpolicycenter.org/taxvox/eitc-expansion-backed-obama-and-ryan-…
* Educational advantages of expanding the EITC:
https://www.brookings.edu/opinions/this-policy-would-help-poor-kids-more-th…
* Financial advantages of expanding the EITC:
http://www.cbpp.org/research/federal-tax/strengthening-the-eitc-for-childle…
* The working class has lost half its wealth over the past two
decades: https://www.nerdwallet.com/blog/finance/why-people-are-angry/
* Health effects of addressing economic inequality:
http://talknicer.com/ehlr.pdf
* Economic growth effects of addressing economic inequality:
http://talknicer.com/egma.pdf
* Unemployment and underemployment effects of addressing economic
inequality: http://diposit.ub.edu/dspace/bitstream/2445/33140/1/617293.pdf
For an example of how a campaign on this issue could be conducted
based on the issues identified in the sources above, please see:
http://bit.ly/mwptc
Please share your thoughts on the wikipedias proofreading-time
measurement effort and this related public policy proposal.
I expect that some people will say that they do not understand how the
public policy proposal relates to the project to measure the amount of
time it would take to proofread the wikipedias. I am happy to explain
that in detail if and when needed. On a related note, I would like to
point out that the project report Google doc suggests future work
involving a peer learning system for speaking skills using the same
architecture as we derived from the constraints for successfully
performing simultaneous paid and volunteer proofreading. I would like
people to keep that in mind when evaluating the utility of these
proposals.
Sincerely,
Jim Salsman
We're starting to wrap up Q1, so it's time for another wikistats update.
First, a quick reminder:
-----
If you currently use the existing reports, PLEASE give feedback in the
section(s) at
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting what you use, how you use it, and explaining what
elements you most appreciate or might want added.
-----
Ok, so this is our list of high level goals, and as we were saying before,
we're focusing on taking a vertical slice through 4, 5, and 6 so we can
deliver functionality and iterate.
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public
consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality
of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe)
analytics.wikipedia.org <http://analytics.wikipedia.org/>*
*. [ ] Bonus: *replace dumps generation* based on the new data
pipelines
So here's the progress since last time by high level goal:
4. We can rebuild almost all page and user histories from the logging,
revision, page, archive, and user MediaWiki tables. The Scala/Spark
algorithm scales well and can process English Wikipedia in less than an
hour. Once history is rebuilt, we want to join it into a denormalized
schema. We have an algorithm that works on simplewiki rather quickly, but
we're *still working on scaling* it to work with English Wikipedia. For
that reason, our vertical slice this quarter may include *only simplewiki*.
In addition to denormalizing the data to make it very simple for analysts
and researchers to work with, we're also computing columns like "this edit
was reverted at X timestamp" or "this page was deleted at X timestamp"
(see the sketch after item 6 below). These will all be available in one
flat schema.
5. We loaded the simplewiki data into Druid and put Pivot on top of it.
It's fantastically fun; I had to close that tab or I would've lost a day
browsing around. For a small db like simplewiki, Druid should have no
problem maintaining an updated version of the computed columns mentioned
above. (I say updated because "this edit was reverted" is a fact that can
change from false to true at some point in the future.) We're still not
100% sure whether Druid can do that with the much larger enwiki data, but
we're testing that. And we're also testing ClickHouse, another
high-performance columnar OLAP store, just in case. In short, we can
update *once a week* already, and we're working on seeing how feasible it
is to update more often than that.
6. We ran into a *problem* when thinking about sanitizing the data. Our
initial idea was to filter out the same columns that are filtered out when
data is replicated to labsdb. But we found that rows are also filtered, and
the process for doing that filtering is in need of a lot of love and care.
So we may side-track to see if we can help out our fellow DBAs and Labs
ops, perhaps unifying the edit-data sanitization in the process (a toy
illustration of the filtering idea follows below).
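Regarding the "reverted at X timestamp" column from item 4, here is a
minimal sketch of identity-revert detection (a later revision restoring an
earlier revision's content hash). This is illustrative Python with made-up
field names, not our actual Scala/Spark implementation:

def mark_reverted(revisions):
    """revisions: dicts with 'id', 'timestamp', 'sha1', sorted by
    timestamp, all for a single page."""
    first_seen = {}   # sha1 -> index of first revision with that content
    reverted_at = {}  # revision id -> timestamp of the reverting edit
    for i, rev in enumerate(revisions):
        if rev["sha1"] in first_seen:
            # This edit restores earlier content: every revision strictly
            # between the restored one and this one was reverted here.
            for r in revisions[first_seen[rev["sha1"]] + 1 : i]:
                reverted_at.setdefault(r["id"], rev["timestamp"])
        first_seen.setdefault(rev["sha1"], i)
    return reverted_at

revs = [
    {"id": 1, "timestamp": "t1", "sha1": "aaa"},
    {"id": 2, "timestamp": "t2", "sha1": "bbb"},  # vandalism
    {"id": 3, "timestamp": "t3", "sha1": "aaa"},  # revert
]
print(mark_reverted(revs))  # -> {2: "t3"}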
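And for item 6, a toy illustration of allowlist-based column and row
filtering. The column names are loosely modeled on the MediaWiki revision
table, but the real labsdb replication filters are more involved, and the
"suppressed" flag here is hypothetical:

# Illustrative only: not the actual labsdb replication filters.
PUBLIC_COLUMNS = {
    "revision": ["rev_id", "rev_page", "rev_timestamp", "rev_minor_edit"],
}

def sanitize(table, rows):
    """Keep only allowlisted columns, and drop suppressed rows entirely."""
    allowed = PUBLIC_COLUMNS.get(table, [])
    for row in rows:
        if row.get("suppressed"):  # hypothetical row-level filter
            continue
        yield {col: row[col] for col in allowed if col in row}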
Steps remaining for having simplewiki data in Druid / Pivot by the end of
Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play
with it
The Wikimedia Foundation's Discovery and Research teams recently hosted an
introductory workshop on the SPARQL query language and the Wikidata Query
Service.
We made the video stream <https://www.youtube.com/watch?v=NaMdh4fXy18> and
materials <https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/2016_SPARQL_Wor…>
(demo queries, slidedecks) from this workshop publicly available.
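If you'd like to try the service right away, here is a small self-contained
Python example in the spirit of the workshop's demo queries (the query
itself is our own illustration, not one of the workshop's): it asks the
Wikidata Query Service for ten humans and their dates of birth.

import requests

QUERY = """
SELECT ?person ?personLabel ?dob WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P569 ?dob .        # date of birth
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "sparql-workshop-example/0.1"},  # be polite
)
for row in r.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["dob"]["value"])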
Guest speakers:
- Ruben Verborgh, *Ghent University* and *Linked Data Fragments*
- Benjamin Good, *Scripps Research Institute* and *Gene Wiki*
- Tim Putman, *Scripps Research Institute* and *Gene Wiki*
- Lucas, *@WikidataFacts*
Dario and Stas
*Dario Taraborelli *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
Ziko,
Thanks for your detailed email. I agree with all the comments.
Some earlier comments might have been harsh, but I understand that there
are valid reasons behind them, and I appreciate the dedication of the many
people who have helped Wikipedia reach where it is today.
We should have been more diligent in finding out about policies and rules
(including IRB) before entering content on Wikipedia. We promise not to
repeat anything of this sort in the future, and I am also trying to
summarize all that has been discussed here so that other researchers in
this area can avoid such unpleasant experiences.
-- Sidd
FYI
---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Sep 12, 2016 at 9:07 AM
Subject: [Research-Internal] Fwd: Dumps Rewrite getting underway (help
needed!)
To: research-internal(a)lists.wikimedia.org
---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Sep 5, 2016 at 2:35 PM
Subject: Dumps Rewrite getting underway (help needed!)
To: Wikipedia Xmldatadumps-l <Xmldatadumps-l(a)lists.wikimedia.org>
Hello folks,
I know a number of you have subscribed to the Dumps Rewrite project (
https://phabricator.wikimedia.org/tag/dumps-rewrite/) but I bet none of you
actually watch it or any of its tasks. So here's a heads up.
I'm getting started on work on the job scheduler/workflow manager piece;
this would accept lists of dump tasks (in the current setup, e.g. "dump
stubs for el wikipedia"), call a callback to turn each of them into small
jobs that can be completed in less than an hour, submit and monitor these
jobs with retries, dependencies, etc., call a callback to recombine the
outputs of the jobs, and notify some caller on success of the whole
operation.
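To make the intended shape of that interface concrete, here is a rough
Python sketch; every name in it is hypothetical and not drawn from any of
the packages under evaluation (job dependencies omitted for brevity):

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DumpTask:
    description: str                            # e.g. "dump stubs for elwiki"
    split: Callable[["DumpTask"], List[str]]    # task -> jobs completable in < 1h
    recombine: Callable[[List[str]], str]       # job outputs -> final output
    on_success: Callable[[str], None]           # notify the caller

def run(task: DumpTask, submit: Callable[[str], str], max_retries: int = 3):
    """Submit each small job with retries, then recombine and notify."""
    outputs = []
    for job in task.split(task):
        for attempt in range(max_retries):
            try:
                outputs.append(submit(job))  # submit + monitor one job
                break
            except RuntimeError:             # transient failure: retry
                if attempt == max_retries - 1:
                    raise
    task.on_success(task.recombine(outputs))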
First up is evaluating existing packages and choosing one to use as a
foundation. Please contribute! See the following tasks:
https://phabricator.wikimedia.org/T143205: Draft usage scenarios for
job/workflow manager <https://phabricator.wikimedia.org/T143205>
https://phabricator.wikimedia.org/T143206: List requirements needed for
task/job/workflow manager <https://phabricator.wikimedia.org/T143206>
https://phabricator.wikimedia.org/T143207: Evaluate software packages for
job/task/workflow management <https://phabricator.wikimedia.org/T143207>
Also, can someone please forward this on to analytics-l and research-l?
I'm not on those lists but they will no doubt have a lot of useful
expertise here.
Thanks!
Ariel
Forwarding.
---------- Forwarded message ----------
From: Marti Johnson <mjohnson(a)wikimedia.org>
Date: Mon, Sep 12, 2016 at 4:34 PM
Subject: [Wikimedia-l] Open call for Project Grant proposals (Sep
12-October 11)
To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
Hi everyone,
The Wikimedia Foundation Project Grants program launches its second open
call today, September 12. We will be accepting proposals through October
11 for new ideas to improve Wikimedia projects.
Funds are available to support individuals, groups and organizations to
implement new experiments and proven ideas, whether focused on building a
new tool or gadget, organizing a better process on your wiki, researching
an important issue, coordinating an editathon series or providing other
support for community-building.
Ideas from the current Inspire Campaign on addressing harassment are very
welcome. <https://meta.wikimedia.org/wiki/Grants:IdeaLab/Inspire>
Do you have a good idea, but would like some feedback before
applying? Put it into the IdeaLab, where volunteers and staff can give you
advice and guidance on how to bring it to life:
<https://meta.wikimedia.org/wiki/Grants:IdeaLab> Once your idea is ready,
it can be easily migrated into a grant request.
Marti Johnson and I will also be hosting weekly proposals clinics via
Hangouts for real-time discussions about the Project Grants Open Call.
We’ll answer questions and help you make your proposal better. Dates and
times are as follows:
* Fri, Sep 16, 1400-1500 UTC
* Tue, Sep 20, 0100-0200 UTC
* Wed, Sep 28, 1400-1500 UTC
* Tue, Oct 4, 2200-2300 UTC
* Tue, Oct 11, 0200-0300 UTC
* Tue, Oct 11, 1600-1700 UTC
Links for Hangouts are available here:
<https://meta.wikimedia.org/wiki/Grants:Project>
We are excited to see your grant ideas that will support our community and
make an impact on the future of Wikimedia projects. Put your idea into
motion, and submit your proposal between September 12 and October 11!
<https://meta.wikimedia.org/wiki/Grants:Project/Apply>
Please feel free to get in touch with me (mjohnson(a)wikimedia.org) or Alex
Wang (awang(a)wikimedia.org) with questions about getting started with your
project!
Warm regards,
Marti
*Marti Johnson*
*Program Officer*
*Individual Grants*
*Wikimedia Foundation <http://wikimediafoundation.org/wiki/Home>*
+1 415-839-6885
Skype: Mjohnson_WMF
Imagine a world in which every single human being can freely share
<http://youtu.be/ci0Pihl2zXY> in the sum of all knowledge. Help us make it
a reality!
Support Wikimedia <https://donate.wikimedia.org/>
(sorry for the cross-posting with cultural-partners-l)
I am working on a research project with my school and have a question.
I am looking for a research paper or other documentation (if it exists) on
whether, in 2016, potential GLAM partners still have the same
questions/objections about working with Wikipedia (time, resources,
releasing materials under free licenses) despite the major collaborations
we do have.
I have run into this in Mexico, but I don't know how much of it is a local
issue and how much is a global one.