Pursuant to prior discussions about the need for a research
policy on Wikipedia, WikiProject Research is drafting a
policy regarding the recruitment of Wikipedia users to
participate in studies.
At this time, we have a proposed policy, and an accompanying
group that would facilitate recruitment of subjects in much
the same way that the Bot Approvals Group approves bots.
The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research
The Subject Recruitment Approvals Group mentioned in the proposal is
described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group
Before we move forward with seeking approval from the Wikipedia
community, we would like additional input on the proposal and would
welcome help improving it.
Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research
--
Bryan Song
GroupLens Research
University of Minnesota
Hi all,
If you use Hive on stat1002/1004, you might have seen a deprecation
warning when launching the Hive CLI saying that it is being replaced
with Beeline. The Beeline shell has always been available, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
script set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
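If you want to run a query non-interactively (from a script, say), something
like the following should work. This is only a sketch, and it assumes the
wrapper forwards standard Beeline flags such as -e to the underlying client;
please check `beeline --help` on the stat boxes first.

# Minimal sketch: run a HiveQL statement through the beeline wrapper from a
# script. Assumes the wrapper accepts the standard Beeline -e flag.
import subprocess

query = "SHOW TABLES;"  # placeholder query
output = subprocess.check_output(["beeline", "-e", query])
print(output.decode("utf-8"))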
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004 whaaat?" - there should be an announcement
about it coming soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article (see the sketch after this list)
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
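As a rough illustration of how the file can be used (a minimal sketch, not
part of the release): assuming the snapshot is a tab-separated file with a
header row and columns named prev_title, curr_title and n - these names and
the file name below are assumptions, so check the header of the downloaded
file - the most common referers for a given article can be tallied like this:

# Minimal sketch: tally the most common referers for one article in the
# clickstream TSV. Column names (prev_title, curr_title, n) and the file
# name are assumptions; check the released file before relying on them.
import csv
from collections import Counter

def top_referers(path, article, k=10):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["curr_title"] == article:
                counts[row["prev_title"]] += int(row["n"])
    return counts.most_common(k)

print(top_referers("2015_01_clickstream.tsv", "London"))

The same loop, grouped by prev_title instead, gives the outbound click
counts you would need for the other use cases above, such as building a
Markov chain over articles.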
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi all!
tl;dr: Stop using stat100[23] by September 1st.
We’re finally replacing stat1002 and stat1003. These boxes are out of
warranty, and are running Ubuntu Trusty, while most of the production fleet
is already on Debian Jessie or even Debian Stretch.
stat1005 is the new stat1002 replacement. If you have access to stat1002,
you also have access to stat1005. I’ve copied over home directories from
stat1002.
stat1006 is the new stat1003 replacement. If you have access to stat1003,
you also have access to stat1006. I’ve copied over home directories from
stat1003.
I have not migrated any personal cron jobs running on stat1002 or
stat1003. I need your help for this!
Both of these boxes are running Debian Stretch. As such, packages that
your work depends on may have been upgraded. Please log into the new boxes and
try stuff out! If you find anything that doesn’t work, please let me know
by commenting on https://phabricator.wikimedia.org/T152712.
Please be fully migrated to the new nodes by September 1st. This will give
us enough time to fully decommission stat1002 and stat1003 by the end of
this quarter.
I’ve only done a single rsync of home directories. If there is new data on
stat1002 or stat1003 that you want rsynced over, let me know on the ticket.
A few notes:
- stat1002 used to have /a. This has been removed in favor of /srv. /a no
longer exists.
- Home directories are now much larger. You no longer need to create
personal directories in /srv.
- /tmp is still small, so please be careful. If you are running long jobs
that generate temporary data, please have those jobs write into your home
directory rather than /tmp (see the sketch after these notes).
- We might implement user home directory quotas in the future.
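For example, a job could point its scratch space at a directory under your
home directory instead of /tmp. A minimal Python sketch (the ~/tmp location
is just an illustration, not a required path):

# Minimal sketch: keep temporary job output under the (now much larger) home
# directory instead of /tmp. The ~/tmp location is an arbitrary choice.
import os
import tempfile

scratch_root = os.path.expanduser("~/tmp")
os.makedirs(scratch_root, exist_ok=True)

with tempfile.TemporaryDirectory(dir=scratch_root) as workdir:
    with open(os.path.join(workdir, "intermediate.tsv"), "w") as f:
        f.write("some\ttemporary\tdata\n")
    # ... do the heavy processing here ...
# workdir and its contents are removed automatically at the end of the block.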
Thanks all! I’ll send another email in about a month's time to remind you
of the impending deadline of Sept 1.
-Andrew Otto
Hi Everyone,
The next Research Showcase will be live-streamed this Wednesday, July 26,
2017 at 11:30 AM PDT (18:30 UTC).
YouTube stream: https://www.youtube.com/watch?v=yC1jgK8C8aQ
As usual, you can join the conversation on IRC at #wikimedia-research. And,
you can watch our past research showcases here
<https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#July_2017>.
This month's presentation:
Freedom versus Standardization: Structured Data Generation in a Peer
Production Community
By *Andrew Hall*

In addition to encyclopedia articles and software, peer production
communities produce *structured data*, e.g., Wikidata and OpenStreetMap’s
metadata. Structured data from peer production communities has become
increasingly important due to its use by computational applications, such
as CartoCSS, MapBox, and Wikipedia infoboxes. However, this structured data
is usable by applications only if it follows *standards*. We did an
interview study focused on OpenStreetMap’s knowledge production processes
to investigate how – and how successfully – this community creates and
applies its data standards. Our study revealed a fundamental tension
between the need to produce structured data in a standardized way and
OpenStreetMap’s tradition of contributor freedom. We extracted six themes
that manifested this tension and three overarching concepts, *correctness*,
*community*, and *code*, which help make sense of and synthesize the
themes. We also offer suggestions for improving OpenStreetMap’s knowledge
production processes, including new data models, sociotechnical tools, and
community practices.
Kindly,
Sarah R. Rodlund
Senior Project Coordinator-Product & Technology, Wikimedia Foundation
srodlund(a)wikimedia.org
Hi all,
With the start of the new fiscal year at the Wikimedia Foundation on
July 1, the Research team has officially started work on Program 12:
Growing contributor diversity. [1] Here are a few
announcements/pointers about this program and the research and work
that will go into it:
* We aim to keep the research documentation for this project on the
corresponding research page on meta. [2]
* Research tasks are hard to break down and track in task-tracking
systems. That said, any task that we can break down and track will be
documented under the corresponding Epic task on Phabricator. [3]
* The goals for this Program for July-September 2017 (Quarter 1) are
captured on MediaWiki. [4] (The Phabricator epic will be updated with
corresponding tasks as we start working on them.)
* Our three formal collaborators (cc-ed) will contribute to this
program: Jérôme Hergueux from ETH, Paul Seabright from TSE, and Bob
West from EPFL. We are thankful to them for agreeing to spend their
time and expertise on this project in the coming year, and to those of
you who have already worked with us in shaping the proposal and are
planning to continue contributing to this program. :)
* I act as the point of contact for this research at the Wikimedia
Foundation. Please feel free to reach out to me (directly, if your
comment cannot be shared publicly) if you have comments or questions
about the project in the coming year.
Best,
Leila
[1] https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2017-2018/…
[2] https://meta.wikimedia.org/wiki/Research:Voice_and_exit_in_a_voluntary_work…
[3] https://phabricator.wikimedia.org/T166083
[4] https://www.mediawiki.org/wiki/Wikimedia_Technology/Goals/2017-18_Q1#Resear…
--
Leila Zia
Senior Research Scientist
Wikimedia Foundation
This was in the recent Research Newsletter:
https://www.econstor.eu/bitstream/10419/127472/1/847290360.pdf
They found a correlation between the length of articles about tourist
destinations and the number of tourists visiting them. They then tried
to influence tourism to other destinations by adding content to their
articles and did not find an effect on the subsequent number of
tourists, suggesting that the causation flows from tourism to article
length rather than the other way around.
But I was taken aback by the last line of their paper, "using the
suggested research design to study other areas of information
acquisition, such as medicine or school choices could be fruitful
directions."
Are there any ethical guidelines concerning whether this is
reasonable? Should there be?
Hi all,
[If you are not interested in discussions related to the category system
on English Wikipedia, you can stop here. :)]
We have run into a problem that some of you may have thought about or
addressed before. We are trying to clean up the category system on English
Wikipedia by turning the category structure into an IS-A hierarchy. (The
output of this work can be useful for research on template recommendation
[1], for example, but the use cases don't stop there.) One issue that we
are facing is the following:
We are currently using SQL dumps to extract the categories associated with
every article on English Wikipedia (main namespace). [2] Using this
approach, we get 5 categories associated with the Flow cytometry
bioinformatics article [3]:
Flow_cytometry
Bioinformatics
Wikipedia_articles_published_in_peer-reviewed_literature
Wikipedia_articles_published_in_PLOS_Computational_Biology
CS1_maint:_Multiple_names:_authors_list
The problem is that only the first two categories are the ones we are
interested in. We have a cleaning step that keeps only categories belonging
to category Article; that step removes the last category above, but the two
Wikipedia_... categories remain. We need to somehow prune those categories
out of the data.
One way we could do the above would be to parse wikitext instead of the
SQL dumps and extract the categories marked by the pattern [[Category:XX]],
but in that case we would lose a good category such as
Guided_missiles_of_Norway, because that one is generated by a template.
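For concreteness, a minimal sketch of that wikitext-based extraction (the
regex and sample text are illustrative only, and, as noted above, this
approach cannot see categories that templates add):

# Minimal sketch: extract [[Category:...]] links written directly in wikitext.
# Template-generated categories (e.g. Guided_missiles_of_Norway) will not show up.
import re

CATEGORY_RE = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)

def wikitext_categories(wikitext):
    return [name.strip().replace(" ", "_") for name in CATEGORY_RE.findall(wikitext)]

print(wikitext_categories("... [[Category:Flow cytometry]] [[Category:Bioinformatics]] ..."))
# ['Flow_cytometry', 'Bioinformatics']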
Any ideas on how we can start with a "cleaner" dataset of categories
related to the topic of the articles, as opposed to maintenance-related or
other types of categories?
Thanks,
Leila
[1] https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages
[2] The exact code we use is:
SELECT p.page_id id, p.page_title title, cl.cl_to category
FROM categorylinks cl
JOIN page p
ON cl.cl_from = p.page_id
WHERE cl_type = 'page'
AND page_namespace = 0
AND page_is_redirect = 0
and the edges of the category graph are extracted with:
SELECT p.page_title category, cl.cl_to parent
FROM categorylinks cl
JOIN page p
ON p.page_id = cl.cl_from
WHERE p.page_namespace = 14
[3] https://en.wikipedia.org/wiki/Flow_cytometry_bioinformatics