Hi all,
For all Hive users using stat1002/1004, you might have seen a deprecation
warning when you launch the hive client - that claims it's being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
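As a rough illustration of the uses listed above, here is a minimal Python sketch. It assumes each row can be read as a (referer, article, count) triple. The function names and the toy rows are hypothetical and are not taken from the released files, so please check the dataset's own documentation for the actual column layout.

```python
from collections import defaultdict

# Toy rows in the assumed (referer, article, count) shape. The values are
# made up for illustration; "external" is a placeholder referer label.
ROWS = [
    ("Cat", "Dog", 30),
    ("Cat", "Lion", 10),
    ("Dog", "Cat", 20),
    ("external", "Cat", 60),
]


def top_links_from(rows, referer, k=5):
    """Return the k most-clicked destination articles for one referer."""
    pairs = [(article, count) for ref, article, count in rows if ref == referer]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


def top_referers_to(rows, article, k=5):
    """Return the k most common referers that led to one article."""
    pairs = [(ref, count) for ref, art, count in rows if art == article]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


def transition_probabilities(rows):
    """Normalize counts per referer into Markov transition probabilities."""
    totals = defaultdict(int)
    for referer, _article, count in rows:
        totals[referer] += count
    probs = defaultdict(dict)
    for referer, article, count in rows:
        probs[referer][article] = count / totals[referer]
    return dict(probs)


if __name__ == "__main__":
    print(top_links_from(ROWS, "Cat"))
    print(top_referers_to(ROWS, "Cat"))
    print(transition_probabilities(ROWS)["Cat"])
```

Each row of the resulting transition table sums to 1 over the destinations observed for that referer, which is what a first-order Markov chain over articles needs.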
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
One of our longest running sets of stats on Wikipedia is the time between
ten million edits - we now have stats for this over a fifteen year period.
After emailing User:Katalaveno and getting their agreement I have moved
https://en.wikipedia.org/w/index.php?title=User:Katalaveno/TBE&redirect=no
to https://en.wikipedia.org/wiki/Wikipedia:Time_Between_Edits and am making
a few changes.
One area that perhaps someone on this list can explain is the difference
between number of edits as measured by revisionID and as measured by
NUMBEROFEDITS - the difference is over a hundred million. That is too big a
number for it to be a measure of logged admin actions, unless when you
delete a page it increments number of edits for each revision deleted. It
might be in the right ballpark to give a measure of edit conflicts, if so
it would be very good to have a measure of something we had thought
unmeasurable.
So I'm wondering if anyone on this list can explain the difference between
{{NUMBEROFEDITS}} and revisionID.
WSC
Hey folks,
See below a message I originally sent to the AI mailing list. I thought
you might be interested. I've flagged a lot of the items that I'd filed
with the "Research_Ideas" tag[1].
1. https://phabricator.wikimedia.org/tag/research_ideas/
-Aaron
---------- Forwarded message ----------
From: Aaron Halfaker <aaron.halfaker(a)gmail.com>
Date: Tue, Jan 31, 2017 at 11:19 AM
Subject: [AI] AI Wishlist initialized and a new Phab Tag
To: Application of Artificial Intelligence and other advanced computing
strategies to Wikimedia Projects <ai(a)lists.wikimedia.org>
Hey folks,
I hosted the AI Wishlist session at the Developer Summit[1]. At that
session, we brainstormed a set of AIs that we think would be interesting to
implement. Generally I asked people to do their best to follow a template
that would help us remember why the AI was important, what it would help
with, and what resources might help get it implemented.
Well, I've taken all of the notes and filed a large set of phab tasks under
a new "artificial-intelligence" tag. Please review all of the fun, new
proposals that are listed there and make sure you subscribe to those that
you're interested in.
See https://phabricator.wikimedia.org/tag/artificial-intelligence/
1. https://phabricator.wikimedia.org/T147710
-Aaron
_______________________________________________
AI mailing list
AI(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai
Wikipedia has probably had some substantial external impacts. Are there
any studies quantifying them? Maybe increased scientific literacy? Or
maybe GDP rises with access to Wikipedia?
Are there any studies that have explored how Wikipedia has affected
economic or social issues?
I'm looking for any references you've got.
-Aaron
Hi everyone,
Following the successful experience of the last two years, we are
organizing another Wiki Workshop in 2017, this time as part of WWW2017
<http://www.www2017.com.au/> in Perth (Yes! We're coming to Australia.:),
on April 3 or 4 (exact date to be determined).
You can read more about the call for papers and the workshops at
http://snap.stanford.edu/wikiworkshop2017. Please note that the deadline
for the submissions to be considered for proceedings is January 24. All
other submissions should be received by February 26.
If you have questions about the workshop, please let us know on this list
or at wikiworkshop(a)googlegroups.com.
Looking forward to seeing you in Perth.
Best,
Leila, on behalf of the organizers
--
Leila Zia
Senior Research Scientist
Wikimedia Foundation
Hi folks,
Re: Wikipedia and scientific literacy/information diffusion, perhaps this
is relevant:
"Amplifying the impact of open access: Wikipedia and the diffusion of
science"
http://onlinelibrary.wiley.com/doi/10.1002/asi.23687/full
On Wed, Jan 25, 2017 at 7:00 AM, <
wiki-research-l-request(a)lists.wikimedia.org> wrote:
> Today's Topics:
>
> 1. Re: Request: Studies of external impacts of Wikipedia
> (Leigh Thelmadatter)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 25 Jan 2017 09:47:22 +0000
> From: Leigh Thelmadatter <osamadre(a)hotmail.com>
> To: Research into Wikimedia content and communities
> <wiki-research-l(a)lists.wikimedia.org>
> Subject: Re: [Wiki-research-l] Request: Studies of external impacts of
> Wikipedia
> Message-ID:
> <BY2PR05MB648575FA7328F35806506FBCD740@BY2PR05MB648.
> namprd05.prod.outlook.com>
>
> Content-Type: text/plain; charset="utf-8"
>
> This is an area I am interested in also. I run two groups of Mexican
> students who work with Wiki project for their "servicio social," a
> community service requirement for all Mexican undergrads. There was some
> question this semester as to whether the program should continue as they
> were looking for evidence of "social impact"... which they were defining as
> students having direct contact with beneficiaries (think reading to children
> or serving food at a soup kitchen). We did convince the powers-that-be that
> while there may not be face-to-face, we can provide numbers as to how many
> people access the materials that students create/improve (but cannot break
> it down as to how many of those are from Mexico).
>
> ________________________________
> From: Wiki-research-l <wiki-research-l-bounces(a)lists.wikimedia.org> on
> behalf of Pine W <wiki.pine(a)gmail.com>
> Sent: Tuesday, January 24, 2017 7:23:17 PM
> To: Research into Wikimedia content and communities
> Subject: Re: [Wiki-research-l] Request: Studies of external impacts of
> Wikipedia
>
> I have a few thoughts.
>
> Thinking financially here: while I'm not aware of studies, the rise of
> Wikipedia coincided with the demise of Encarta. Also, I think that you'd
> want to take into consideration the impacts that Wikipedia has had via its
> appearance in Google search results and in Google's information summary
> panels; I'm sure that Google has reaped substantial financial benefits from
> Wikipedia. (This is a mixed blessing.) You might consider making an
> estimate of how many millions of dollars university and school libraries
> have saved by not purchasing proprietary encyclopedias.
>
> You might consult with WikiProject Medicine and WPMF to learn about the
> public health impacts of their efforts in content development and
> translation efforts, which they seem to think have been substantial in the
> developing world.
>
> I believe that the education folks in WMF and WEF have done some analyses
> of how Wikipedia assignments may have yielded better student
> engagement with material than traditional course assignments.
>
> There are probably also financial benefits that others have reaped from
> using open source MediaWiki software. Perhaps the folks in WMF Tech would
> be able to provide some analysis of the benefits of MediaWiki to external
> organizations.
>
> HTH,
>
> Pine
>
>
> On Tue, Jan 24, 2017 at 2:19 PM, Aaron Halfaker <ahalfaker(a)wikimedia.org>
> wrote:
> Wikipedia has probably had some substantial external impacts. Are there
> any studies quantifying them? Maybe increased scientific literacy? Or
> maybe GDP rises with access to Wikipedia?
>
> Are there any studies that have explored how Wikipedia has affected
> economic or social issues?
>
> I'm looking for any references you've got.
>
> -Aaron
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
>
Regarding Kerry Raymond's "Patriotic editing hypothesis", I've done
some very simple informal investigation into the quality of
geographic articles (mostly on cities, towns, counties, etc.) in
en:Wikipedia. Geographic articles have much lower average quality
scores than other subjects (see
https://en.wikipedia.org/wiki/User:Smallbones/Quality4by4 )
With just a small bit of poking around it's obvious that the quality
difference between geo articles and the rest is due to geo articles
about countries where English is not the native language. A bit more
poking and something that should have been really obvious jumps out.
French geo articles on FR:Wiki are much better (at least longer) than
the corresponding EN:Wiki article; Russian geo articles are much
better on RU:Wiki than on EN:Wiki, etc.
This is certainly consistent with the "Patriotic editing hypothesis"
if we define patriotism by language rather than by borders. It could
be checked out with other language versions e.g. German vs. French;
(Finnish, Estonian, Polish, German, or Hungarian, etc.) vs. Russian;
Chinese vs. any language.
The hypothesis even has a very practical implication: we should
translate more geo articles from their native language Wikipedias.
Hope this helps,
Pete Ekman
====
Date: Tue, 24 Jan 2017 11:12:58 +1000
From: "Kerry Raymond" <kerry.raymond(a)gmail.com>
To: "'Research into Wikimedia content and communities'"
<wiki-research-l(a)lists.wikimedia.org>
Subject: [Wiki-research-l] regional KPIs
Message-ID: <006701d275df$02016b90$060442b0$(a)gmail.com>
Content-Type: text/plain; charset="utf-8"
As previously came up in discussion about chapters, it would be very useful
to have national data about Wikipedia activities, which can be determined
(generally) from IP addresses. Now I understand the privacy argument in
relation to logged-in users (not saying I agree with it though in relation
to aggregate data). However, can we find a proxy that does not have those
privacy considerations?
My hypothesis is that national content is predominantly written by users
resident in that nation. And that therefore activity on national content can
be used as a proxy for national user editing activity.
In the case of Australia, we could describe Australian national content in
either of two ways: articles within the closure of the
[[Category:Australia]] and/or those tagged as {{WikiProject Australia}}.
There are arguments for/against either (neither is perfect, in my experience
the category closure will tend to have false positives and the project will
tend to have false negatives).
I would like to know what correlation exists between national editor
activity (as determined from IP addresses mapped to location) and national
content edits and if/how it changes over time for various nations. This is
research that only WMF can do because WMF has the IP addresses and the rest
of us can't have them for privacy reasons.
If we could establish that a strong-enough correlation existed between them,
we could use national content activity (for which there is no privacy
consideration) as a proxy for national editing activity. And we might even
be able to come up with a multiplier for each nation to provide comparable
data for national editing activity.
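To make the proxy idea concrete, here is a minimal Python sketch under stated assumptions: two aligned series of monthly counts for one nation, one for edits to national content and one for edits by editors located in that nation. All numbers are invented for illustration. The "multiplier" is taken here to be a simple ratio of totals, which is an assumption rather than a definition given above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def multiplier(content_edits, editor_edits):
    """Ratio of same-nation editor activity to national content activity."""
    return sum(editor_edits) / sum(content_edits)


# Hypothetical monthly counts for a single nation (not real data).
content_edits = [100, 120, 90, 150]  # edits to national content
editor_edits = [50, 60, 45, 75]      # edits by editors geolocated to the nation

if __name__ == "__main__":
    print(pearson(content_edits, editor_edits))
    print(multiplier(content_edits, editor_edits))
```

If the correlation stayed high across months and nations, the ratio could serve as the per-nation multiplier; if it drifted over time, a single multiplier would not be enough.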
Now, it may be that we need to restrict the edits themselves in some way to
maximise the correlations between national content and same-nation editor
activity.
My second hypothesis is that "semantic" edits (e.g. edits that add large amounts
of content or citation) to national content will be more highly correlated
with same-nation editors than "syntactic" edits (e.g. fix spelling,
punctuation or Manual of Style issues) will be. I suspect most bots and
other automated/semi-automated edits are doing syntactic edits.
Now, some of you will probably be aware of
[https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2017-01-17/Recent_research
Female Wikipedians aren't more likely to edit women biographies].
So it may well be that my patriotic-editing hypothesis is also untrue. But
it would be nice to know one way or the other.
Kerry
I didn't find much for my review. Page 55 in:
http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6012/pdf/imm6012.p…
There is an impact on the encyclopaedia market. :) The downfall of
Encarta gets attributed to Wikipedia. There is a recent paper from Shane
Greenstein on the Encarta/Britannica story (not that much about Wikipedia).
http://www.hbs.edu/faculty/Publication%20Files/Reference%20Wars%20-%20Green…
- Finn Aarup Nielsen
On 01/24/2017 11:19 PM, Aaron Halfaker wrote:
> Wikipedia has probably had some substantial external impacts. Are there
> any studies quantifying them? Maybe increased scientific literacy? Or
> maybe GDP rises with access to Wikipedia?
>
> Are there any studies that have explored how Wikipedia has affected
> economic or social issues?
>
> I'm looking for any references you've got.
>
> -Aaron