Hi everyone,
Does anyone know if there's a straightforward (ideally
language-independent) way of identifying stub articles in Wikipedia?
Whatever works is ok, whether it's publicly available data or data
accessible only on the WMF cluster.
I've found lists for various languages (e.g., Italian
<https://it.wikipedia.org/wiki/Categoria:Stub> or English
<https://en.wikipedia.org/wiki/Category:All_stub_articles>), but the lists
are in different formats, so separate code is required for each language,
which doesn't scale.
I guess in the worst case, I'll have to grep for the respective stub
templates in the respective wikitext dumps, but even this requires knowing,
for each language, what the respective template is. So if anyone could point
me to a list of stub templates in different languages, that would also be
appreciated.
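For the record, here's the rough fallback I have in mind, as a minimal
Python sketch. It assumes the enwiki convention that stub template names
contain "stub", which is exactly the assumption that doesn't hold across
languages; the dump filename and XML namespace version are placeholders.

import bz2
import re
import xml.etree.ElementTree as ET

# Placeholder filename; any pages-articles XML dump works.
DUMP = "enwiki-latest-pages-articles.xml.bz2"

# Assumption: stub templates contain "stub" in their name, as on enwiki.
# Other languages often use entirely different names.
STUB_RE = re.compile(r"\{\{[^{}|]*stub[^{}]*\}\}", re.IGNORECASE)

# The export namespace version differs between dump generations.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

with bz2.open(DUMP, "rb") as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            if STUB_RE.search(text):
                print(title)
            elem.clear()  # keep memory bounded while streaming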
Thanks!
Bob
--
Up for a little language game? -- http://www.unfun.me
I am pleased to announce that, thanks to Google Summer of Code student
Priyanka Mandikal, the Accuracy Review of Wikipedias project has delivered
a working demonstration, with open source code and data available here:
https://github.com/priyankamandikal/arowf/
Please try it out at:
http://tools.wmflabs.org/arowf/
We need your help to test it and send us comments. You
can read more about the project here:
https://priyankamandikal.github.io/posts/gsoc-2016-project-overview/
The formal project report, still in progress (Google Docs comments
from anyone are most welcome), is at:
https://docs.google.com/document/d/1_AiOyVn9Qf5ne1qCHIygUU3OTJcbpkb14N3rIty…
This allows experiments to measure, for example, how long it would
take to complete proofreading of the wikipedias with and without
paying editors to work alongside volunteers. I am sure everyone agrees
that is an interesting question which bears directly on budget
expectations. I hope multiple organizations use the published methods
and their Python implementations to make such measurements. I would
also like to suggest a proposal related to the questions in both of
the following reviews:
http://unotes.hartford.edu/announcements/images/2014_03_04_Cerasoli_and_Nic…
http://onlinelibrary.wiley.com/doi/10.1111/1748-8583.12080/abstract
The most recent solicitation of community input for the Foundation's
Public Policy team I've seen said that they would like suggestions for
specific issues as long as the suggestions did not involve
endorsements of or opposition to any specific candidates. My support
for adjusting copyright royalties on a sliding scale to transfer
wealth from larger to smaller artists has been made clear, and I do
not believe there are any concerns that I have not addressed
concerning alignment to mission or effectiveness. I would also like to
propose a related endorsement.
The Making Work Pay tax credit (MWPTC) is a negative payroll tax that
expired in 2010. It has all the advantages of an expanded Earned
Income Tax Credit (EITC) but would happen with every paycheck.
Reinstating the Making Work Pay tax credit would serve to reduce
economic inequality.
This proposal is within the scope of the Foundation's mission because
reducing economic inequality should empower people to develop educational
content for the projects: a broader set of potential editors would have
more support for artistic production and more discretionary free time
thanks to increased wealth. This proposal is needed because economic
inequality produces
more excess avoidable deaths and leads to fewer years of productive
life than global warming. This proposal would provide substantial
benefits to the movement, the community, the Foundation, the US and
the world if it were to be successfully adopted. For the reasons
stated above, this proposal will be seen as positive.
Here is some background and supporting information:
* MWPTC overview: https://en.wikipedia.org/wiki/Making_Work_Pay_tax_credit
* MWPTC details: http://tpcprod.urban.org/taxtopics/2011_work.cfm
* Problems with expanding the EITC:
http://www.taxpolicycenter.org/taxvox/eitc-expansion-backed-obama-and-ryan-…
* Educational advantages of expanding the EITC:
https://www.brookings.edu/opinions/this-policy-would-help-poor-kids-more-th…
* Financial advantages of expanding the EITC:
http://www.cbpp.org/research/federal-tax/strengthening-the-eitc-for-childle…
* The working class has lost half its wealth over the past two
decades: https://www.nerdwallet.com/blog/finance/why-people-are-angry/
* Health effects of addressing economic inequality:
http://talknicer.com/ehlr.pdf
* Economic growth effects of addressing economic inequality:
http://talknicer.com/egma.pdf
* Unemployment and underemployment effects of addressing economic
inequality: http://diposit.ub.edu/dspace/bitstream/2445/33140/1/617293.pdf
For an example of how a campaign on this issue could be conducted
based on the issues identified in the sources above, please see:
http://bit.ly/mwptc
Please share your thoughts on the wikipedias proofreading-time
measurement effort and this related public policy proposal.
I expect that some people will say that they do not understand how the
public policy proposal relates to the project to measure the amount of
time it would take to proofread the wikipedias. I am happy to explain
that in detail if and when needed. On a related note, I would like to
point out that the project report Google doc suggests future work
involving a peer learning system for speaking skills using the same
architecture as we derived from the constraints for successfully
performing simultaneous paid and volunteer proofreading. I would like
people to keep that in mind when evaluating the utility of these
proposals.
Sincerely,
Jim Salsman
We're starting to wrap up Q1, so it's time for another wikistats update.
First, a quick reminder:
-----
If you currently use the existing reports, PLEASE give feedback in the
section(s) at
https://www.mediawiki.org/wiki/Analytics/Wikistats/DumpReports/Future_per_report
Bonus points for noting what you use, how you use it, and explaining what
elements you most appreciate or might want added.
-----
Ok, so this is our list of high level goals, and as we were saying before,
we're focusing on taking a vertical slice through 4, 5, and 6 so we can
deliver functionality and iterate.
1. [done] Build pipeline to process and analyze *pageview* data
2. [done] Load pageview data into an *API*
3. [ ] *Sanitize* pageview data with more dimensions for public
consumption
4. [ ] Build pipeline to process and analyze *editing* data
5. [ ] Load editing data into an *API*
6. [ ] *Sanitize* editing data for public consumption
7. [ ] *Design* UI to organize dashboards built around new data
8. [ ] Build enough *dashboards* to replace the main functionality
of stats.wikipedia.org
9. [ ] Officially replace stats.wikipedia.org with *(maybe)
analytics.wikipedia.org <http://analytics.wikipedia.org/>*
*. [ ] Bonus: *replace dumps generation* based on the new data
pipelines
So here's the progress since last time by high level goal:
4. We can rebuild almost all page and user histories from the logging,
revision, page, archive, and user MediaWiki tables. The Scala/Spark
algorithm scales well and can process English Wikipedia in less than an
hour. Once history is rebuilt, we want to join it into a denormalized
schema. We have an algorithm that works on simplewiki rather quickly, but
we're *still working on scaling* it to work with English Wikipedia. For
that reason, our vertical slice this quarter may include *only simplewiki*.
In addition to denormalizing the data to make it very simple for analysts
and researchers to work with, we're also computing columns like "this edit
was reverted at X timestamp" or "this page was deleted at X timestamp"
(see the sketch after item 6 below). These will all be available in one
flat schema.
5. We loaded the simplewiki data into Druid and put Pivot on top of it.
It's fantastically fun; I had to close that tab or I would've lost a day
browsing around. For a small db like simplewiki, Druid should have no
problem maintaining an updated version of the computed columns mentioned
above. (I say updated because "this edit was reverted" is a fact that can
change from false to true at some point in the future.) We're still not
100% sure whether Druid can do that with the much larger enwiki data, but
we're testing that. And we're also testing ClickHouse, another
high-performance columnar OLAP store, just in case. In short, we can
update *once a week* already, and we're working on seeing how feasible it
is to update more often than that.
6. We ran into a *problem* when thinking about sanitizing the data. Our
initial idea was to filter out the same columns that are filtered out when
data is replicated to labsdb. But we found that rows are also filtered, and
the process for doing that filtering is in need of a lot of love and care.
So we may side-track to see if we can help out our fellow DBAs and Labs
ops, perhaps unifying the edit-data sanitization in the process (a toy
illustration of the filtering idea follows below).
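Regarding the "reverted at X timestamp" column from item 4, here is a
minimal sketch of identity-revert detection (a later revision restoring an
earlier revision's content hash). This is illustrative Python with made-up
field names, not our actual Scala/Spark implementation:

def mark_reverted(revisions):
    """revisions: dicts with 'id', 'timestamp', 'sha1', sorted by
    timestamp, all for a single page."""
    first_seen = {}   # sha1 -> index of first revision with that content
    reverted_at = {}  # revision id -> timestamp of the reverting edit
    for i, rev in enumerate(revisions):
        if rev["sha1"] in first_seen:
            # This edit restores earlier content: every revision strictly
            # between the restored one and this one was reverted here.
            for r in revisions[first_seen[rev["sha1"]] + 1 : i]:
                reverted_at.setdefault(r["id"], rev["timestamp"])
        first_seen.setdefault(rev["sha1"], i)
    return reverted_at

revs = [
    {"id": 1, "timestamp": "t1", "sha1": "aaa"},
    {"id": 2, "timestamp": "t2", "sha1": "bbb"},  # vandalism
    {"id": 3, "timestamp": "t3", "sha1": "aaa"},  # revert
]
print(mark_reverted(revs))  # -> {2: "t3"}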
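And for item 6, a toy illustration of allowlist-based column and row
filtering. The column names are loosely modeled on the MediaWiki revision
table, but the real labsdb replication filters are more involved, and the
"suppressed" flag here is hypothetical:

# Illustrative only: not the actual labsdb replication filters.
PUBLIC_COLUMNS = {
    "revision": ["rev_id", "rev_page", "rev_timestamp", "rev_minor_edit"],
}

def sanitize(table, rows):
    """Keep only allowlisted columns, and drop suppressed rows entirely."""
    allowed = PUBLIC_COLUMNS.get(table, [])
    for row in rows:
        if row.get("suppressed"):  # hypothetical row-level filter
            continue
        yield {col: row[col] for col in allowed if col in row}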
Steps remaining for having simplewiki data in Druid / Pivot by the end of
Q1:
* vet data with Erik
* finish productionizing our Pivot install so internal/NDA folks can play
with it
The Wikimedia Foundation's Discovery and Research teams recently hosted an
introductory workshop on the SPARQL query language and the Wikidata Query
Service.
We made the video stream <https://www.youtube.com/watch?v=NaMdh4fXy18> and
materials <https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/2016_SPARQL_Wor…>
(demo queries, slidedecks) from this workshop publicly available.
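If you'd like to try the service right away, here is a small self-contained
Python example in the spirit of the workshop's demo queries (the query
itself is our own illustration, not one of the workshop's): it asks the
Wikidata Query Service for ten humans and their dates of birth.

import requests

QUERY = """
SELECT ?person ?personLabel ?dob WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P569 ?dob .        # date of birth
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "sparql-workshop-example/0.1"},  # be polite
)
for row in r.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["dob"]["value"])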
Guest speakers:
- Ruben Verborgh, *Ghent University* and *Linked Data Fragments*
- Benjamin Good, *Scripps Research Institute* and *Gene Wiki*
- Tim Putman, *Scripps Research Institute* and *Gene Wiki*
- Lucas, *@WikidataFacts*
Dario and Stas
*Dario Taraborelli *Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
Ziko,
Thanks for your detailed email. I agree with all the comments.
Some earlier comments might have been harsh, but I understand that there
are valid reasons behind them, and I appreciate the dedication of the many
people who have helped Wikipedia reach where it is today.
We should have been more diligent in finding out about policies and rules
(including IRB) before entering content on Wikipedia. We promise not to
repeat anything of this sort in the future, and I am also trying to
summarize all that has been discussed here so that other researchers in
this area can avoid such unpleasant experiences.
-- Sidd
FYI
---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Sep 12, 2016 at 9:07 AM
Subject: [Research-Internal] Fwd: Dumps Rewrite getting underway (help
needed!)
To: research-internal(a)lists.wikimedia.org
---------- Forwarded message ----------
From: Ariel Glenn WMF <ariel(a)wikimedia.org>
Date: Mon, Sep 5, 2016 at 2:35 PM
Subject: Dumps Rewrite getting underway (help needed!)
To: Wikipedia Xmldatadumps-l <Xmldatadumps-l(a)lists.wikimedia.org>
Hello folks,
I know a number of you have subscribed to the Dumps Rewrite project (
https://phabricator.wikimedia.org/tag/dumps-rewrite/) but I bet none of you
actually watch it or any of its tasks. So here's a heads up.
I'm getting started on work on the job scheduler/workflow manager piece;
this would accept lists of dump tasks (in the current setup, e.g. "dump
stubs for el wikipedia"), call a callback to turn each of them into small
jobs that can be completed in less than an hour, submit and monitor these
jobs with retries, dependencies, etc., call a callback to recombine the
outputs of the jobs, and notify some caller on success of the whole
operation.
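To make the intended shape of that interface concrete, here is a rough
Python sketch; every name in it is hypothetical and not drawn from any of
the packages under evaluation (job dependencies omitted for brevity):

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DumpTask:
    description: str                            # e.g. "dump stubs for elwiki"
    split: Callable[["DumpTask"], List[str]]    # task -> jobs completable in < 1h
    recombine: Callable[[List[str]], str]       # job outputs -> final output
    on_success: Callable[[str], None]           # notify the caller

def run(task: DumpTask, submit: Callable[[str], str], max_retries: int = 3):
    """Submit each small job with retries, then recombine and notify."""
    outputs = []
    for job in task.split(task):
        for attempt in range(max_retries):
            try:
                outputs.append(submit(job))  # submit + monitor one job
                break
            except RuntimeError:             # transient failure: retry
                if attempt == max_retries - 1:
                    raise
    task.on_success(task.recombine(outputs))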
First up is evaluating existing packages and choosing one to use as a
foundation. Please contribute! See the following tasks:
https://phabricator.wikimedia.org/T143205: Draft usage scenarios for
job/workflow manager <https://phabricator.wikimedia.org/T143205>
https://phabricator.wikimedia.org/T143206: List requirements needed for
task/job/workflow manager <https://phabricator.wikimedia.org/T143206>
https://phabricator.wikimedia.org/T143207: Evaluate software packages for
job/task/workflow management <https://phabricator.wikimedia.org/T143207>
Also, can someone please forward this on to analytics-l and research-l?
I'm not on those lists but they will no doubt have a lot of useful
expertise here.
Thanks!
Ariel
Forwarding.
---------- Forwarded message ----------
From: Marti Johnson <mjohnson(a)wikimedia.org>
Date: Mon, Sep 12, 2016 at 4:34 PM
Subject: [Wikimedia-l] Open call for Project Grant proposals (Sep
12-October 11)
To: Wikimedia Mailing List <wikimedia-l(a)lists.wikimedia.org>
Hi everyone,
The Wikimedia Foundation Project Grants program launches its second open
call today, September 12. We will be accepting proposals through October
11 for new ideas to improve Wikimedia projects.
Funds are available to support individuals, groups and organizations to
implement new experiments and proven ideas, whether focused on building a
new tool or gadget, organizing a better process on your wiki, researching
an important issue, coordinating an editathon series or providing other
support for community-building.
Ideas from the current Inspire Campaign on addressing harassment are very
welcome. <https://meta.wikimedia.org/wiki/Grants:IdeaLab/Inspire>
Do you have a good idea, but would like some feedback before
applying? Put it into the IdeaLab, where volunteers and staff can give you
advice and guidance on how to bring it to life:
<https://meta.wikimedia.org/wiki/Grants:IdeaLab> Once your idea is ready,
it can be easily migrated into a grant request.
Marti Johnson and I will also be hosting weekly proposals clinics via
Hangouts for real-time discussions about the Project Grants Open Call.
We’ll answer questions and help you make your proposal better. Dates and
times are as follows:
* Fri, Sep 16, 1400-1500 UTC
* Tue, Sep 20, 0100-0200 UTC
* Wed, Sep 28, 1400-1500 UTC
* Tue, Oct 4, 2200-2300 UTC
* Tue, Oct 11, 0200-0300 UTC
* Tue, Oct 11, 1600-1700 UTC
Links for Hangouts are available here:
<https://meta.wikimedia.org/wiki/Grants:Project>
We are excited to see your grant ideas that will support our community and
make an impact on the future of Wikimedia projects. Put your idea into
motion, and submit your proposal between September 12 and October 11!
<https://meta.wikimedia.org/wiki/Grants:Project/Apply>
Please feel free to get in touch with me (mjohnson(a)wikimedia.org) or Alex
Wang (awang(a)wikimedia.org) with questions about getting started with your
project!
Warm regards,
Marti
*Marti Johnson*
*Program Officer*
*Individual Grants*
*Wikimedia Foundation <http://wikimediafoundation.org/wiki/Home>*
+1 415-839-6885
Skype: Mjohnson_WMF
Imagine a world in which every single human being can freely share
<http://youtu.be/ci0Pihl2zXY> in the sum of all knowledge. Help us make it
a reality!
Support Wikimedia <https://donate.wikimedia.org/>
(sorry for the cross-posting with cultural-partners-l)
I am working on a research project with my school and have a question.
I am looking for a research paper or other documentation (if it exists) on
whether, in 2016, potential GLAM partners still have the same
questions/objections about working with Wikipedia (time, resources,
releasing materials under free licenses) despite the major collaborations
we do have.
I have run into this in Mexico, but I don't know how much of it is a local
issue and how much is a global one.