Dear colleagues,
My new research project, inspired by the following CfP
(http://www.asanet.org/journals/TS/SpecialIssueCall.cfm), aims to assess
how effective our Wikipedia teaching assignments have been, in the
context of my globalization lectures, in which students have
created or expanded dozens of Wikipedia articles (you can see a partial
list of articles created by my students at
https://en.wikipedia.org/wiki/User:Piotrus/Educational_project_results
to get an idea of what I had them do over the past few years). It is
clear that Wikipedia benefits, but what about the students? Here are my
two questions to you.
First, my main source of data is going to be a survey of my former
students (N<100). I wonder if anyone is familiar with the literature on
relevant metrics (i.e., how to design a survey to measure the
effectiveness of a teaching instrument)? I have never surveyed students
before, and while I am in the middle of a lit review, any suggestions
would be appreciated. I am somewhat familiar with the literature on
teaching with Wikipedia, but sadly few works have published the surveys
they used. If anything comes to mind that you think would be good to use for
comparative studies, that would also be helpful.
Second, here is my draft survey: http://tinyurl.com/hehckvs
I'd appreciate any comments: is it too long? Are some questions
ambiguous? Unnecessary? Leading and creating bias in subsequent
questions? Should I rephrase something? Should I ask something else?
Thank you for any comments, and do not hesitate to be critical - I'd
much rather redo the survey now than after I send it out :)
--
Piotr Konieczny, PhD
http://hanyang.academia.edu/PiotrKonieczny
http://scholar.google.com/citations?user=gdV8_AEAAAAJ
http://en.wikipedia.org/wiki/User:Piotrus
We are doing research on a similar situation combining machine vision processing with volunteer annotations in a citizen science project. It would be interesting to see how much translates across these settings, e.g., if our ideas about using the machine annotations are applicable here as well.
Kevin Crowston | Associate Dean for Research and Distinguished Professor of Information Science | School of Information Studies
Syracuse University
348 Hinds Hall
Syracuse, New York 13244
t (315) 443.1676 f 315.443.5806 e crowston(a)syr.edu <mailto:crowston@syr.edu>
crowston.syr.edu <http://crowston.syr.edu/>
From: Jan Dittrich <jan.dittrich(a)wikimedia.de<mailto:jan.dittrich@wikimedia.de>>
Subject: Re: [Wiki-research-l] Google open source research on automatic image captioning
I find it interesting what impact this could have on volunteers' sense of
achievement if captions are auto-generated or suggested and then
affirmed or corrected by them.
On one hand, one could assume a decreased sense of ownership;
on the other hand, it might be easier to comment on or correct a caption
than to write one from scratch, and that could feel much more efficient.
Jan
2016-09-27 23:08 GMT+02:00 Dario Taraborelli <dtaraborelli(a)wikimedia.org<mailto:dtaraborelli@wikimedia.org>>:
> I forwarded this separately internally at WMF a few days ago. Clearly –
> before thinking of building workflows for human contributors to generate
> captions or rich descriptors of media files in Commons – we should look at
> what's available in terms of off-the-shelf machine learning services and
> libraries.
>
> #1 rule of sane citizen science/crowdsourcing projects: don't ask humans
> to perform tedious tasks machines are pretty good at, get humans to curate
> inputs and outputs of machines instead.
>
> D
[forwarding my answer from analytics ml, I forgot to subscribe to this list too]
Hi Robert,
One solution may be to use a query on Wikidata to retrieve the name
of the stub category in all the different languages. Then you could
use a tool like PetScan to retrieve all the pages in those categories,
or write your own tool using either a query on the database or the
MediaWiki API.
You can find a sample solution here:
http://paws-public.wmflabs.org/paws-public/3270/Stub%20categories.ipynb
I wrote that thing while on a train, so it may be messy and/or sub-optimal.
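For anyone who prefers a self-contained script, here is a rough sketch of the
same idea (illustrative only, not the notebook above; the seed title
"Category:All stub articles" and the sitelink-to-domain mapping are assumptions):

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def stub_category_titles(seed_site="enwiki", seed_title="Category:All stub articles"):
    """Map each wiki (e.g. 'itwiki') to its localized stub-tracking category title,
    via the Wikidata sitelinks of the seed category's item."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "sites": seed_site,
        "titles": seed_title,
        "props": "sitelinks",
        "format": "json",
    }).json()
    entity = next(iter(resp["entities"].values()))
    return {site: link["title"] for site, link in entity.get("sitelinks", {}).items()}

def category_members(site, title):
    """Yield page titles in a category; assumes 'xxwiki' maps to xx.wikipedia.org."""
    lang = site[:-4]  # strip the trailing "wiki"
    api = "https://%s.wikipedia.org/w/api.php" % lang
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": title, "cmlimit": "max", "format": "json"}
    while True:
        data = requests.get(api, params=params).json()
        for page in data["query"]["categorymembers"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

# Example: titles = stub_category_titles(); then iterate
# category_members("itwiki", titles["itwiki"]) to list Italian stub pages.

The tracking categories can be very large, so PetScan or a direct database query
on the cluster will likely be faster for the big wikis.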
I would like to thank Alex Monk and Yuvi Panda for their help with SQL
on PAWS today.
Best,
Giuseppe
2016-09-20 11:26 GMT+02:00 Robert West <west(a)cs.stanford.edu>:
> Hi everyone,
>
> Does anyone know if there's a straightforward (ideally language-independent)
> way of identifying stub articles in Wikipedia?
>
> Whatever works is ok, whether it's publicly available data or data
> accessible only on the WMF cluster.
>
> I've found lists for various languages (e.g., Italian or English), but the
> lists are in different formats, so separate code is required for each
> language, which doesn't scale.
>
> I guess in the worst case, I'll have to grep for the respective stub
> templates in the respective wikitext dumps, but even this requires to know
> for each language what the respective template is. So if anyone could point
> me to a list of stub templates in different languages, that would also be
> appreciated.
>
> Thanks!
> Bob
>
> --
> Up for a little language game? -- http://www.unfun.me
>
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, September
21, 2016, at 11:30 AM (PDT) / 18:30 (UTC).
YouTube stream: https://www.youtube.com/watch?v=fTDkVeqjw80
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2016>.
This month's showcase includes:
Finding News Citations for Wikipedia
By *Besnik Fetahu <http://www.l3s.de/~fetahu/> (Leibniz University of Hannover)*
An important
editing policy in Wikipedia is to provide citations for added statements in
Wikipedia pages, where statements can be arbitrary pieces of text, ranging
from a sentence to a paragraph. In many cases citations are either outdated
or missing altogether. In this work we address the problem of finding and
updating news citations for statements in entity pages. We propose a two-
stage supervised approach for this problem. In the first step, we construct
a classifier to find out whether statements need a news citation or other
kinds of citations (web, book, journal, etc.). In the second step, we
develop a news citation algorithm for Wikipedia statements, which
recommends appropriate citations from a given news collection. Apart from
IR techniques that use the statement to query the news collection, we also
formalize three properties of an appropriate citation, namely: (i) the
citation should entail the Wikipedia statement, (ii) the statement should
be central to the citation, and (iii) the citation should be from an
authoritative source. We perform an extensive evaluation of both steps,
using 20 million articles from a real-world news collection. Our results
are quite promising, and show that we can perform this task with high
precision and at scale.
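(To make the two-stage setup concrete, here is a purely illustrative sketch of
what the first stage could look like, i.e., a binary classifier deciding whether
a statement needs a news citation. This is not the authors' code; the toy
examples and bag-of-words features are invented for illustration.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: statement text -> 1 if it should cite a news source, else 0.
statements = [
    "The company announced layoffs of 5,000 employees in March 2016.",
    "A hurricane made landfall near the city on Tuesday.",
    "The species is native to the forests of Madagascar.",
    "The theorem was first proved in 1931.",
]
needs_news_citation = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(statements, needs_news_citation)

print(clf.predict(["The president signed the bill into law on Friday."]))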
Designing and Building Online Discussion Systems
By *Amy X. Zhang <http://people.csail.mit.edu/axz/> (MIT)*
Today, conversations are
everywhere on the Internet and come in many different forms. However, there
are still many problems with discussion interfaces today. In my talk, I
will first give an overview of some of the problems with discussion
systems, including difficulty dealing with large scales, which exacerbates
additional problems with navigating deep threads containing lots of
back-and-forth and getting an overall summary of a discussion. Other
problems include dealing with moderation and harassment in discussion
systems and gaining control over filtering, customization, and means of
access. Then I will focus on a few projects I am working on in this space
now. The first is Wikum, a system I developed to allow users to
collaboratively generate a wiki-like summary from threaded discussion. The
second, which I have just begun, is exploring the design space of
presentation and navigation of threaded discussion. I will next discuss
Murmur, a mailing list hybrid system we have built to implement and test
ideas around filtering, customization, and flexibility of access, as well
as combating harassment. Finally, I'll wrap up with what I am working on at
Google Research this summer: developing a taxonomy to describe online forum
discussion and using this information to extract meaningful content useful
for search, summarization of discussions, and characterization of
communities.
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org
+ research
(btw, if we cc analytics and research does that reach everyone using these
boxes? Like discovery, fundraising, etc? Basically, everyone who doesn't
see this message, raise your hand :))
On Wed, Sep 21, 2016 at 10:44 AM, Luca Toscano <ltoscano(a)wikimedia.org>
wrote:
> Hi everybody,
>
> the Analytics team is going to reboot all the stat hosts (stat1002,
> stat1003 and stat1004) and the Hadoop cluster nodes to install new kernels
> (security upgrade required). The work will start tomorrow morning (Sep
> 22nd) at around 9:00 AM CEST.
> This task might interfere with ongoing Hadoop jobs or processes running on
> the stat* hosts, so please let me know if there is any reason to
> postpone the maintenance.
>
> Please also feel free to reach out to the analytics IRC channel or to me
> directly if you have more questions :)
>
> Thanks!
>
> Regards,
>
> Luca
>
Hi everyone,
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian
<https://it.wikipedia.org/wiki/Categoria:Stub> or English
<https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the lists
are in different formats, so separate code is required for each language,
which doesn't scale.
I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires knowing,
for each language, what the respective template is. So if anyone could point
me to a list of stub templates in different languages, that would also be
appreciated.
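(For illustration, here is a rough sketch of that worst-case dump scan. It
assumes the English convention that stub template names contain the word "stub",
which is exactly the part that doesn't transfer across languages, so a
per-language template list would still be needed.)

import bz2
import re

# Matches a transcluded template whose name contains "stub", e.g. {{Asia-stub}}.
STUB_RE = re.compile(r"\{\{[^{}|]*stub[^{}]*\}\}", re.IGNORECASE)
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def stub_titles(dump_path):
    """Yield titles of pages transcluding a *stub template in a pages-articles dump."""
    current_title = None
    flagged = False
    with bz2.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = TITLE_RE.search(line)
            if m:
                current_title = m.group(1)
                flagged = False
                continue
            if not flagged and current_title and STUB_RE.search(line):
                flagged = True
                yield current_title

# for title in stub_titles("enwiki-latest-pages-articles.xml.bz2"):
#     print(title)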
Thanks!
Bob
--
Up for a little language game? -- http://www.unfun.me
I am pleased to announce that, thanks to Google Summer of Code student
Priyanka Mandikal, the Accuracy Review of Wikipedias project has
delivered a working demonstration of open source code and data,
available here:
https://github.com/priyankamandikal/arowf/
Please try it out at:
http://tools.wmflabs.org/arowf/
We need your help to test it and send us comments. You
can read more about the project here:
https://priyankamandikal.github.io/posts/gsoc-2016-project-overview/
The formal project report, still in progress (Google Docs comments
from anyone are most welcome), is at:
https://docs.google.com/document/d/1_AiOyVn9Qf5ne1qCHIygUU3OTJcbpkb14N3rIty…
This allows experiments to measure, for example, how long it would
take to complete proofreading of the wikipedias with and without
paying editors to work alongside volunteers. I am sure everyone agrees
that is an interesting question which bears directly on budget
expectations. I hope multiple organizations use the published methods
and their Python implementations to make such measurements. I would
also like to suggest a proposal related to the questions in both of
the following reviews:
http://unotes.hartford.edu/announcements/images/2014_03_04_Cerasoli_and_Nic…
http://onlinelibrary.wiley.com/doi/10.1111/1748-8583.12080/abstract
The most recent solicitation of community input for the Foundation's
Public Policy team I've seen said that they would like suggestions for
specific issues as long as the suggestions did not involve
endorsements of or opposition to any specific candidates. My support
for adjusting copyright royalties on a sliding scale to transfer
wealth from larger to smaller artists has been made clear, and I do
not believe there are any concerns that I have not addressed
concerning alignment to mission or effectiveness. I would also like to
propose a related endorsement.
The Making Work Pay tax credit (MWPTC) is a negative payroll tax that
expired in 2010. It has all the advantages of an expanded Earned
Income Tax Credit (EITC) but would happen with every paycheck.
Reinstating the Making Work Pay tax credit would serve to reduce
economic inequality.
This proposal is within the scope of the Foundation's mission because
reducing economic inequality should empower people to develop
educational content for the projects: it would increase support for
artistic production among a broader set of potential editors, who would
have additional discretionary free time thanks to increased
wealth. This proposal is needed because economic inequality produces
more excess avoidable deaths and leads to fewer years of productive
life than global warming. This proposal would provide substantial
benefits to the movement, the community, the Foundation, the US and
the world if it were to be successfully adopted. For the reasons
stated above, this proposal will be seen as positive.
Here is some background and supporting information:
* MWPTC overview: https://en.wikipedia.org/wiki/Making_Work_Pay_tax_credit
* MWPTC details: http://tpcprod.urban.org/taxtopics/2011_work.cfm
* Problems with expanding the EITC:
http://www.taxpolicycenter.org/taxvox/eitc-expansion-backed-obama-and-ryan-…
* Educational advantages of expanding the EITC:
https://www.brookings.edu/opinions/this-policy-would-help-poor-kids-more-th…
* Financial advantages of expanding the EITC:
http://www.cbpp.org/research/federal-tax/strengthening-the-eitc-for-childle…
* The working class has lost half their wealth over the past two
decades: https://www.nerdwallet.com/blog/finance/why-people-are-angry/
* Health effects of addressing economic inequality:
http://talknicer.com/ehlr.pdf
* Economic growth effects of addressing economic inequality:
http://talknicer.com/egma.pdf
* Unemployment and underemployment effects of addressing economic
inequality: http://diposit.ub.edu/dspace/bitstream/2445/33140/1/617293.pdf
For an example of how a campaign on this issue could be conducted
based on the issues identified in the sources above, please see:
http://bit.ly/mwptc
Please share your thoughts on the effort to measure the time needed to
proofread the wikipedias, and on this related public policy proposal.
I expect that some people will say that they do not understand how the
public policy proposal relates to the project to measure the amount of
time it would take to proofread the wikipedias. I am happy to explain
that in detail if and when needed. On a related note, I would like to
point out that the project report Google doc suggests future work
involving a peer learning system for speaking skills using the same
architecture as we derived from the constraints for successfully
performing simultaneous paid and volunteer proofreading. I would like
people to keep that in mind when evaluating the utility of these
proposals.
Sincerely,
Jim Salsman
We're starting to wrap up Q1, so it's time for another wikistats update.
First, a quick reminder:
-----
If you currently use the existing reports, PLEASE give feedback in the
section(s) at
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting what you use, how you use it, and explaining what
elements you most appreciate or might want added.
-----
Ok, so this is our list of high level goals, and as we were saying before,
we're focusing on taking a vertical slice through 4, 5, and 6 so we can
deliver functionality and iterate.
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public
consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality
of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe)
analytics.wikipedia.org
<http://analytics.wikipedia.org/>*
***. [ ] Bonus: *replace dumps generation* based on the new data
pipelines
So here's the progress since last time by high level goal:
4. We can rebuild nearly all page and user histories from the logging, revision,
page, archive, and user MediaWiki tables. The Scala/Spark algorithm
scales well and can process English Wikipedia in less than an hour. Once
history is rebuilt, we want to join it into a denormalized schema. We have
an algorithm that works on simplewiki rather quickly, but we're *still
working on scaling* it to work with English Wikipedia. For that reason, our
vertical slice this quarter may include *only simplewiki*. In addition to
denormalizing the data to make it very simple for analysts and researchers
to work with, we're also computing columns like "this edit was reverted at
X timestamp" or "this page was deleted at X timestamp". These will all be
available in one flat schema (see the illustrative sketch after this list).
5. We loaded the simplewiki data into Druid and put Pivot on top of it.
It's fantastically fun; I had to close that tab or I would've lost a day
browsing around. For a small DB like simplewiki, Druid should have no
problem maintaining an updated version of the computed columns mentioned
above. (I say updated because "this edit was reverted" is a fact that can
change from false to true at some point in the future). We're still not
100% sure whether Druid can do that with the much larger enwiki data, but
we're testing that. And we're also testing ClickHouse, another highly
performant OLAP big data columnar store, just in case. In short, we can
update *once a week* already, and we're working on seeing how feasible it
is to update more often than that.
6. We ran into a *problem* when thinking about sanitizing the data. Our
initial idea was to filter out the same columns that are filtered out when
data is replicated to labsdb. But we found that rows are also filtered, and the
process for doing that filtering is in need of a lot of love and care. So
we may side-track to see if we can help out our fellow DBAs and labs ops in
the process, maybe unifying the edit data sanitization.
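(For the curious, here is a minimal PySpark sketch of one way a "reverted at"
column could be computed, using sha1-based identity reverts. The column names
and the revert definition are assumptions for illustration, not necessarily the
team's actual algorithm; a production version would also need to bound the
self-join, e.g. by a time window.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reverted-at-sketch").getOrCreate()

# Assumed input: one row per revision with page_id, rev_id, rev_timestamp, rev_sha1.
rev = spark.read.parquet("/path/to/revisions")  # placeholder path

a, b = rev.alias("a"), rev.alias("b")

# An "identity revert" restores an earlier state: a later revision (b) has the
# same sha1 as an earlier revision (a) on the same page.
reverts = (a.join(b,
                  (F.col("a.page_id") == F.col("b.page_id")) &
                  (F.col("a.rev_sha1") == F.col("b.rev_sha1")) &
                  (F.col("a.rev_timestamp") < F.col("b.rev_timestamp")))
            .select(F.col("a.page_id").alias("page_id"),
                    F.col("a.rev_timestamp").alias("reverted_to_ts"),
                    F.col("b.rev_timestamp").alias("revert_ts")))

# A revision counts as reverted if it falls strictly between the restored revision
# and the revert; its reverted_at is the earliest such revert timestamp.
reverted_at = (rev.join(reverts, "page_id")
                  .where((F.col("rev_timestamp") > F.col("reverted_to_ts")) &
                         (F.col("rev_timestamp") < F.col("revert_ts")))
                  .groupBy("page_id", "rev_id")
                  .agg(F.min("revert_ts").alias("reverted_at")))

# Left-join back onto the revisions to get the flat, denormalized column.
denormalized = rev.join(reverted_at, ["page_id", "rev_id"], "left")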
Steps remaining for having simplewiki data in Druid / Pivot by the end of
Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play
with it