Dear colleagues,
My new research project, inspired by the following CfP
(http://www.asanet.org/journals/TS/SpecialIssueCall.cfm), aims to assess
how effective our Wikipedia teaching assignments have been, in the
context of my globalization lectures, in which students have
created or expanded dozens of Wikipedia articles (you can see a partial
list of articles created by my students at
https://en.wikipedia.org/wiki/User:Piotrus/Educational_project_results
to get an idea of what I had them do over the past few years). It is
clear that Wikipedia benefits, but what about the students? Here are my
two questions to you.
First, my main source of data is going to be a survey of my former
students (N<100). I wonder if anyone is familiar with the literature on
relevant metrics (i.e., how to design a survey to measure the
effectiveness of a teaching instrument)? I have never surveyed students
before, and while I am in the middle of a lit review, any suggestions
would be appreciated. I am somewhat familiar with the literature on
teaching with Wikipedia, but sadly few works have published the surveys
they used. If anything comes to mind that you think would be good to use for
comparative studies, that would also be helpful.
Second, here is my draft survey: http://tinyurl.com/hehckvs
I'd appreciate any comments: is it too long? Are some questions
ambiguous? Unnecessary? Leading and creating bias in subsequent
questions? Should I rephrase something? Should I ask something else?
Thank you for any comments, and do not hesitate to be critical - I'd
much rather redo the survey now than after I send it out :)
--
Piotr Konieczny, PhD
http://hanyang.academia.edu/PiotrKonieczny
http://scholar.google.com/citations?user=gdV8_AEAAAAJ
http://en.wikipedia.org/wiki/User:Piotrus
We are doing research on a similar situation combining machine vision processing with volunteer annotations in a citizen science project. It would be interesting to see how much translates across these settings, e.g., if our ideas about using the machine annotations are applicable here as well.
Kevin Crowston | Associate Dean for Research and Distinguished Professor of Information Science | School of Information Studies
Syracuse University
348 Hinds Hall
Syracuse, New York 13244
t (315) 443.1676 f 315.443.5806 e crowston(a)syr.edu <mailto:crowston@syr.edu>
crowston.syr.edu <http://crowston.syr.edu/>
From: Jan Dittrich <jan.dittrich(a)wikimedia.de<mailto:jan.dittrich@wikimedia.de>>
Subject: Re: [Wiki-research-l] Google open source research on automatic image captioning
I find it interesting what impact this could have on volunteers' sense of
achievement if captions are auto-generated or suggested and then
affirmed or corrected by them.
On one hand, one could assume a decreased sense of ownership;
on the other hand, it might be easier to comment on or correct a caption
than to write one from scratch, and that could feel much more efficient.
Jan
2016-09-27 23:08 GMT+02:00 Dario Taraborelli <dtaraborelli(a)wikimedia.org<mailto:dtaraborelli@wikimedia.org>>:
> I forwarded this separately internally at WMF a few days ago. Clearly –
> before thinking of building workflows for human contributors to generate
> captions or rich descriptors of media files in Commons – we should look at
> what's available in terms of off-the-shelf machine learning services and
> libraries.
>
> #1 rule of sane citizen science/crowdsourcing projects: don't ask humans
> to perform tedious tasks machines are pretty good at, get humans to curate
> inputs and outputs of machines instead.
>
> D
[forwarding my answer from analytics ml, I forgot to subscribe to this list too]
Hi Robert,
One solution may be to use a query on Wikidata to retrieve the name
of the stub category in all the different languages. Then you could
use a tool like PetScan to retrieve all the pages in those categories,
or write your own tool using either a query on the database or the
MediaWiki API.
You can find a sample solution here:
http://paws-public.wmflabs.org/paws-public/3270/Stub%20categories.ipynb
I wrote that thing while on a train, so it may be messy and/or sub-optimal.
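For anyone who prefers a self-contained script, here is a rough sketch of the
same idea (illustrative only, not the notebook above; the seed title
"Category:All stub articles" and the sitelink-to-domain mapping are assumptions):

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def stub_category_titles(seed_site="enwiki", seed_title="Category:All stub articles"):
    """Map each wiki (e.g. 'itwiki') to its localized stub-tracking category title,
    via the Wikidata sitelinks of the seed category's item."""
    resp = requests.get(WIKIDATA_API, params={
        "action": "wbgetentities",
        "sites": seed_site,
        "titles": seed_title,
        "props": "sitelinks",
        "format": "json",
    }).json()
    entity = next(iter(resp["entities"].values()))
    return {site: link["title"] for site, link in entity.get("sitelinks", {}).items()}

def category_members(site, title):
    """Yield page titles in a category; assumes 'xxwiki' maps to xx.wikipedia.org."""
    lang = site[:-4]  # strip the trailing "wiki"
    api = "https://%s.wikipedia.org/w/api.php" % lang
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": title, "cmlimit": "max", "format": "json"}
    while True:
        data = requests.get(api, params=params).json()
        for page in data["query"]["categorymembers"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])

# Example: titles = stub_category_titles(); then iterate
# category_members("itwiki", titles["itwiki"]) to list Italian stub pages.

The tracking categories can be very large, so PetScan or a direct database query
on the cluster will likely be faster for the big wikis.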
I would like to thank Alex Monk and Yuvi Panda for their help with SQL
on PAWS today.
Best,
Giuseppe
2016-09-20 11:26 GMT+02:00 Robert West <west(a)cs.stanford.edu>:
> Hi everyone,
>
> Does anyone know if there's a straightforward (ideally language-independent)
> way of identifying stub articles in Wikipedia?
>
> Whatever works is ok, whether it's publicly available data or data
> accessible only on the WMF cluster.
>
> I've found lists for various languages (e.g., Italian or English), but the
> lists are in different formats, so separate code is required for each
> language, which doesn't scale.
>
> I guess in the worst case, I'll have to grep for the respective stub
> templates in the respective wikitext dumps, but even this requires to know
> for each language what the respective template is. So if anyone could point
> me to a list of stub templates in different languages, that would also be
> appreciated.
>
> Thanks!
> Bob
>
> --
> Up for a little language game? -- http://www.unfun.me
>
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, September
21, 2016, at 11:30 AM (PDT) / 18:30 (UTC).
YouTube stream: https://www.youtube.com/watch?v=fTDkVeqjw80
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#September_2016>.
This month's showcase includes:
Finding News Citations for Wikipedia
By *Besnik Fetahu <http://www.l3s.de/~fetahu/> (Leibniz University of Hannover)*
An important
editing policy in Wikipedia is to provide citations for added statements in
Wikipedia pages, where statements can be arbitrary pieces of text, ranging
from a sentence to a paragraph. In many cases citations are either outdated
or missing altogether. In this work we address the problem of finding and
updating news citations for statements in entity pages. We propose a two-
stage supervised approach for this problem. In the first step, we construct
a classifier to find out whether statements need a news citation or other
kinds of citations (web, book, journal, etc.). In the second step, we
develop a news citation algorithm for Wikipedia statements, which
recommends appropriate citations from a given news collection. Apart from
IR techniques that use the statement to query the news collection, we also
formalize three properties of an appropriate citation, namely: (i) the
citation should entail the Wikipedia statement, (ii) the statement should
be central to the citation, and (iii) the citation should be from an
authoritative source. We perform an extensive evaluation of both steps,
using 20 million articles from a real-world news collection. Our results
are quite promising, and show that we can perform this task with high
precision and at scale.
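(To make the two-stage setup concrete, here is a purely illustrative sketch of
what the first stage could look like, i.e., a binary classifier deciding whether
a statement needs a news citation. This is not the authors' code; the toy
examples and bag-of-words features are invented for illustration.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: statement text -> 1 if it should cite a news source, else 0.
statements = [
    "The company announced layoffs of 5,000 employees in March 2016.",
    "A hurricane made landfall near the city on Tuesday.",
    "The species is native to the forests of Madagascar.",
    "The theorem was first proved in 1931.",
]
needs_news_citation = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(statements, needs_news_citation)

print(clf.predict(["The president signed the bill into law on Friday."]))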
Designing and Building Online Discussion Systems
By *Amy X. Zhang <http://people.csail.mit.edu/axz/> (MIT)*
Today, conversations are
everywhere on the Internet and come in many different forms. However, there
are still many problems with discussion interfaces today. In my talk, I
will first give an overview of some of the problems with discussion
systems, including difficulty dealing with large scales, which exacerbates
additional problems with navigating deep threads containing lots of
back-and-forth and getting an overall summary of a discussion. Other
problems include dealing with moderation and harassment in discussion
systems and gaining control over filtering, customization, and means of
access. Then I will focus on a few projects I am working on in this space
now. The first is Wikum, a system I developed to allow users to
collaboratively generate a wiki-like summary from threaded discussion. The
second, which I have just begun, is exploring the design space of
presentation and navigation of threaded discussion. I will next discuss
Murmur, a mailing list hybrid system we have built to implement and test
ideas around filtering, customization, and flexibility of access, as well
as combating harassment. Finally, I'll wrap up with what I am working on at
Google Research this summer: developing a taxonomy to describe online forum
discussion and using this information to extract meaningful content useful
for search, summarization of discussions, and characterization of
communities.
Hope to see you there!
Sarah R. Rodlund
Senior Project Coordinator-Engineering, Wikimedia Foundation
srodlund(a)wikimedia.org
+ research
(btw, if we cc analytics and research does that reach everyone using these
boxes? Like discovery, fundraising, etc? Basically, everyone who doesn't
see this message, raise your hand :))
On Wed, Sep 21, 2016 at 10:44 AM, Luca Toscano <ltoscano(a)wikimedia.org>
wrote:
> Hi everybody,
>
> the Analytics team is going to reboot all the stat hosts (stat1002,
> stat1003 and stat1004) and the Hadoop cluster nodes to install new kernels
> (security upgrade required). The work will start tomorrow morning (Sep
> 22nd) at around 9:00 AM CEST.
> This task might interfere with ongoing Hadoop jobs or processes running on
> the stat* hosts, so please let me know if there is any reason to
> postpone the maintenance.
>
> Please also feel free to reach out to the analytics IRC channel or to me
> directly if you have more questions :)
>
> Thanks!
>
> Regards,
>
> Luca
>
Hi everyone,
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian
<https://it.wikipedia.org/wiki/Categoria:Stub> or English
<https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the lists
are in different formats, so separate code is required for each language,
which doesn't scale.
I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires knowing,
for each language, what the respective template is. So if anyone could point
me to a list of stub templates in different languages, that would also be
appreciated.
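(For illustration, here is a rough sketch of that worst-case dump scan. It
assumes the English convention that stub template names contain the word "stub",
which is exactly the part that doesn't transfer across languages, so a
per-language template list would still be needed.)

import bz2
import re

# Matches a transcluded template whose name contains "stub", e.g. {{Asia-stub}}.
STUB_RE = re.compile(r"\{\{[^{}|]*stub[^{}]*\}\}", re.IGNORECASE)
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def stub_titles(dump_path):
    """Yield titles of pages transcluding a *stub template in a pages-articles dump."""
    current_title = None
    flagged = False
    with bz2.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            m = TITLE_RE.search(line)
            if m:
                current_title = m.group(1)
                flagged = False
                continue
            if not flagged and current_title and STUB_RE.search(line):
                flagged = True
                yield current_title

# for title in stub_titles("enwiki-latest-pages-articles.xml.bz2"):
#     print(title)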
Thanks!
Bob
--
Up for a little language game? -- http://www.unfun.me
I am pleased to announce that, thanks to Google Summer of Code student
Priyanka Mandikal, the Accuracy Review of Wikipedias project has
delivered a working demonstration of open source code and data,
available here:
https://github.com/priyankamandikal/arowf/
Please try it out at:
http://tools.wmflabs.org/arowf/
We need your help to test it and send us comments. You
can read more about the project here:
https://priyankamandikal.github.io/posts/gsoc-2016-project-overview/
The formal project report, still in progress (Google Docs comments
from anyone are most welcome), is at:
https://docs.google.com/document/d/1_AiOyVn9Qf5ne1qCHIygUU3OTJcbpkb14N3rIty…
This allows experiments to measure, for example, how long it would
take to complete proofreading of the wikipedias with and without
paying editors to work alongside volunteers. I am sure everyone agrees
that is an interesting question which bears directly on budget
expectations. I hope multiple organizations use the published methods
and their Python implementations to make such measurements. I would
also like to suggest a proposal related to the questions in both of
the following reviews:
http://unotes.hartford.edu/announcements/images/2014_03_04_Cerasoli_and_Nic…
http://onlinelibrary.wiley.com/doi/10.1111/1748-8583.12080/abstract
The most recent solicitation of community input for the Foundation's
Public Policy team I've seen said that they would like suggestions for
specific issues as long as the suggestions did not involve
endorsements of or opposition to any specific candidates. My support
for adjusting copyright royalties on a sliding scale to transfer
wealth from larger to smaller artists has been made clear, and I do
not believe there are any concerns that I have not addressed
concerning alignment to mission or effectiveness. I would also like to
propose a related endorsement.
The Making Work Pay tax credit (MWPTC) is a negative payroll tax that
expired in 2010. It has all the advantages of an expanded Earned
Income Tax Credit (EITC) but would happen with every paycheck.
Reinstating the Making Work Pay tax credit would serve to reduce
economic inequality.
This proposal is within the scope of the Foundation's mission because
reducing economic inequality should empower people to develop
educational content for the projects: it would increase support for
artistic production among a broader set of potential editors, who would
have additional discretionary free time thanks to increased
wealth. This proposal is needed because economic inequality produces
more excess avoidable deaths and leads to fewer years of productive
life than global warming. This proposal would provide substantial
benefits to the movement, the community, the Foundation, the US and
the world if it were to be successfully adopted. For the reasons
stated above, this proposal will be seen as positive.
Here is some background and supporting information:
* MWPTC overview: https://en.wikipedia.org/wiki/Making_Work_Pay_tax_credit
* MWPTC details: http://tpcprod.urban.org/taxtopics/2011_work.cfm
* Problems with expanding the EITC:
http://www.taxpolicycenter.org/taxvox/eitc-expansion-backed-obama-and-ryan-…
* Educational advantages of expanding the EITC:
https://www.brookings.edu/opinions/this-policy-would-help-poor-kids-more-th…
* Financial advantages of expanding the EITC:
http://www.cbpp.org/research/federal-tax/strengthening-the-eitc-for-childle…
* The working class has lost half their wealth over the past two
decades: https://www.nerdwallet.com/blog/finance/why-people-are-angry/
* Health effects of addressing economic inequality:
http://talknicer.com/ehlr.pdf
* Economic growth effects of addressing economic inequality:
http://talknicer.com/egma.pdf
* Unemployment and underemployment effects of addressing economic
inequality: http://diposit.ub.edu/dspace/bitstream/2445/33140/1/617293.pdf
For an example of how a campaign on this issue could be conducted
based on the issues identified in the sources above, please see:
http://bit.ly/mwptc
Please share your thoughts on the effort to measure the time needed to
proofread the wikipedias, and on this related public policy proposal.
I expect that some people will say that they do not understand how the
public policy proposal relates to the project to measure the amount of
time it would take to proofread the wikipedias. I am happy to explain
that in detail if and when needed. On a related note, I would like to
point out that the project report Google doc suggests future work
involving a peer learning system for speaking skills using the same
architecture as we derived from the constraints for successfully
performing simultaneous paid and volunteer proofreading. I would like
people to keep that in mind when evaluating the utility of these
proposals.
Sincerely,
Jim Salsman
We're starting to wrap up Q1, so it's time for another wikistats update.
First, a quick reminder:
-----
If you currently use the existing reports, PLEASE give feedback in the
section(s) at
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting what you use, how you use it, and explaining what
elements you most appreciate or might want added.
-----
Ok, so this is our list of high level goals, and as we were saying before,
we're focusing on taking a vertical slice through 4, 5, and 6 so we can
deliver functionality and iterate.
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public
consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality
of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe)
analytics.wikipedia.org
<http://analytics.wikipedia.org/>*
***. [ ] Bonus: *replace dumps generation* based on the new data
pipelines
So here's the progress since last time by high level goal:
4. We can rebuild nearly all page and user histories from the logging, revision,
page, archive, and user MediaWiki tables. The Scala/Spark algorithm
scales well and can process English Wikipedia in less than an hour. Once
history is rebuilt, we want to join it into a denormalized schema. We have
an algorithm that works on simplewiki rather quickly, but we're *still
working on scaling* it to work with English Wikipedia. For that reason, our
vertical slice this quarter may include *only simplewiki*. In addition to
denormalizing the data to make it very simple for analysts and researchers
to work with, we're also computing columns like "this edit was reverted at
X timestamp" or "this page was deleted at X timestamp". These will all be
available in one flat schema (see the illustrative sketch after this list).
5. We loaded the simplewiki data into Druid and put Pivot on top of it.
It's fantastically fun; I had to close that tab or I would've lost a day
browsing around. For a small DB like simplewiki, Druid should have no
problem maintaining an updated version of the computed columns mentioned
above. (I say updated because "this edit was reverted" is a fact that can
change from false to true at some point in the future). We're still not
100% sure whether Druid can do that with the much larger enwiki data, but
we're testing that. And we're also testing ClickHouse, another highly
performant OLAP big data columnar store, just in case. In short, we can
update *once a week* already, and we're working on seeing how feasible it
is to update more often than that.
6. We ran into a *problem* when thinking about sanitizing the data. Our
initial idea was to filter out the same columns that are filtered out when
data is replicated to labsdb. But we found that rows are also filtered, and the
process for doing that filtering is in need of a lot of love and care. So
we may side-track to see if we can help out our fellow DBAs and labs ops in
the process, maybe unifying the edit data sanitization.
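(For the curious, here is a minimal PySpark sketch of one way a "reverted at"
column could be computed, using sha1-based identity reverts. The column names
and the revert definition are assumptions for illustration, not necessarily the
team's actual algorithm; a production version would also need to bound the
self-join, e.g. by a time window.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reverted-at-sketch").getOrCreate()

# Assumed input: one row per revision with page_id, rev_id, rev_timestamp, rev_sha1.
rev = spark.read.parquet("/path/to/revisions")  # placeholder path

a, b = rev.alias("a"), rev.alias("b")

# An "identity revert" restores an earlier state: a later revision (b) has the
# same sha1 as an earlier revision (a) on the same page.
reverts = (a.join(b,
                  (F.col("a.page_id") == F.col("b.page_id")) &
                  (F.col("a.rev_sha1") == F.col("b.rev_sha1")) &
                  (F.col("a.rev_timestamp") < F.col("b.rev_timestamp")))
            .select(F.col("a.page_id").alias("page_id"),
                    F.col("a.rev_timestamp").alias("reverted_to_ts"),
                    F.col("b.rev_timestamp").alias("revert_ts")))

# A revision counts as reverted if it falls strictly between the restored revision
# and the revert; its reverted_at is the earliest such revert timestamp.
reverted_at = (rev.join(reverts, "page_id")
                  .where((F.col("rev_timestamp") > F.col("reverted_to_ts")) &
                         (F.col("rev_timestamp") < F.col("revert_ts")))
                  .groupBy("page_id", "rev_id")
                  .agg(F.min("revert_ts").alias("reverted_at")))

# Left-join back onto the revisions to get the flat, denormalized column.
denormalized = rev.join(reverted_at, ["page_id", "rev_id"], "left")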
Steps remaining for having simplewiki data in Druid / Pivot by the end of
Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play
with it