Hi all,
If you use Hive on stat1002/1004, you might have seen a deprecation
warning when launching the hive client, saying that it is being replaced
with Beeline. The Beeline shell has always been available to use, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual, and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering stat1004 whaaat - there should be an announcement
coming up about it soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
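As a sketch of the last item, transition probabilities for such a Markov chain can be estimated directly from the aggregated pairs. The triples and names below are illustrative only, not the dataset's actual schema:

```python
from collections import defaultdict

def transition_probabilities(rows):
    """Estimate P(article | referer) from (referer, article, count) triples."""
    totals = defaultdict(int)  # total clicks leaving each referer
    for referer, article, count in rows:
        totals[referer] += count
    probs = defaultdict(dict)
    for referer, article, count in rows:
        probs[referer][article] = count / totals[referer]
    return probs

# Tiny made-up sample, not real clickstream data:
rows = [
    ("London", "Big_Ben", 30),
    ("London", "River_Thames", 10),
    ("Paris", "Eiffel_Tower", 25),
]
p = transition_probabilities(rows)
# p["London"]["Big_Ben"] -> 0.75
```

The same normalization per referer is all that is needed to turn the raw counts into a row-stochastic transition matrix.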
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
This was in the recent Research Newsletter:
https://www.econstor.eu/bitstream/10419/127472/1/847290360.pdf
They found a correlation between the length of articles about tourist
destinations and the number of tourists visiting them. They then added
content to articles about other destinations and found no corresponding
change in subsequent tourist numbers, suggesting that the causation
flows from tourism to article length rather than the reverse.
But I was taken aback by the last line of their paper, "using the
suggested research design to study other areas of information
acquisition, such as medicine or school choices could be fruitful
directions."
Are there any ethical guidelines concerning whether this is
reasonable? Should there be?
Hi all,
This is just a friendly reminder that we plan to turn off the RCStream
service after July 7th.
We’re tracking as best we can the progress of porting clients over at
https://phabricator.wikimedia.org/T156919. But, we can only help with what
we know about. If you’ve got something still running on RCStream that
hasn’t yet ported, let us know, and/or switch soon!
Thanks!
-Andrew Otto
On Wed, Feb 8, 2017 at 9:28 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
> Hi everyone!
>
> Wikimedia is releasing a new service today: EventStreams
> <https://wikitech.wikimedia.org/wiki/EventStreams>. This service allows
> us to publish arbitrary streams of JSON event data to the public.
> Initially, the only stream available will be good ol’ RecentChanges
> <https://www.mediawiki.org/wiki/Manual:RCFeed>. This event stream
> overlaps functionality already provided by irc.wikimedia.org and RCStream
> <https://wikitech.wikimedia.org/wiki/RCStream>. However, this new
> service has advantages over these (now deprecated) services.
>
>
> 1. We can expose more than just RecentChanges.
> 2. Events are delivered over streaming HTTP (chunked transfer) instead
> of IRC or socket.io. This requires less client-side code and fewer
> special routing cases on the server side.
> 3. Streams can be resumed from the past. By using EventSource, a
> disconnected client will automatically resume the stream from where it
> left off, as long as it resumes within one week. In the future, we
> would like to allow users to specify historical timestamps from which
> they would like to begin consuming, if this proves safe and tractable.
>
>
> I did say deprecated! Okay okay, we may never be able to fully deprecate
> irc.wikimedia.org. It’s used by too many (probably sentient by now) bots
> out there. We do plan to obsolete RCStream, and to turn it off in a
> reasonable amount of time. The deadline iiiiiis July 7th, 2017. All
> services that rely on RCStream should migrate to the HTTP based
> EventStreams service by this date. We are committed to assisting you in
> this transition, so let us know how we can help.
>
> Unfortunately, unlike RCStream, EventStreams does not have server side
> event filtering (e.g. by wiki) quite yet. How and if this should be done
> is still under discussion <https://phabricator.wikimedia.org/T152731>.
>
> The RecentChanges data you are used to remains the same, and is available
> at https://stream.wikimedia.org/v2/stream/recentchange. However, we may
> have something different for you, if you find it useful. We have been
> internally producing new MediaWiki-specific events
> <https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema…>
> for a while now, and could expose these via EventStreams as well.
>
> Take a look at these events, and tell us what you think. Would you find
> them useful? How would you like to subscribe to them? Individually as
> separate streams, or would you like to be able to compose multiple event
> types into a single stream via an API? These things are all possible.
>
> I asked for a lot of feedback in the above paragraphs. Let’s try and
> centralize this discussion over on the mediawiki.org EventStreams talk
> page <https://www.mediawiki.org/wiki/Talk:EventStreams>. In summary,
> the questions are:
>
>
> - What RCStream clients do you maintain, and how can we help you
> migrate to EventStreams?
> <https://www.mediawiki.org/wiki/Topic:Tkjkee2j684hkwc9>
> - Is server side filtering, by wiki or arbitrary event field, useful
> to you? <https://www.mediawiki.org/wiki/Topic:Tkjkabtyakpm967t>
> - Would you like to consume streams other than RecentChanges?
> <https://www.mediawiki.org/wiki/Topic:Tkjk4ezxb4u01a61> (Currently
> available events are described here
> <https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema…>.)
>
>
>
> Thanks!
> - Andrew Otto
>
>
>
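For those still porting clients: EventStreams delivers events as Server-Sent Events over streaming HTTP, so a consumer mostly needs an SSE parser plus reconnect logic. Below is a minimal parsing sketch in Python; the sample payload is made up for illustration, and a real client would typically use an EventSource library against https://stream.wikimedia.org/v2/stream/recentchange rather than parsing by hand:

```python
import json

def parse_sse(text):
    """Parse a Server-Sent Events payload into (event_id, data) pairs.

    Events are separated by blank lines; each line is "field: value".
    The "id" field is what lets a disconnected client resume where it
    left off (sent back to the server as the Last-Event-ID header).
    """
    events = []
    event_id, data_lines = None, []
    for line in text.splitlines():
        if line.startswith("id:"):
            event_id = line[len("id:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            events.append((event_id, json.loads("\n".join(data_lines))))
            event_id, data_lines = None, []
    return events

# Illustrative payload in the general style of the recentchange stream:
payload = (
    'event: message\n'
    'id: [{"topic":"example","offset":1}]\n'
    'data: {"type":"edit","wiki":"enwiki","title":"Example"}\n'
    '\n'
)
events = parse_sse(payload)
# events[0][1]["wiki"] -> "enwiki"
```

Because the resume position travels in the `id` field, a reconnecting client only has to remember the last id it saw and replay from there.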
Apologies for cross-posting
Call for Posters & Demos
SEMANTiCS 2017 - The Linked Data Conference
13th International Conference on Semantic Systems
Amsterdam, Netherlands
September 11 -14, 2017
http://2017.semantics.cc
For details please go to: https://2017.semantics.cc/calls
Important Dates (Posters & Demos Track):
*Submission Deadline: July 10, 2017 (11:59 pm, Hawaii time)
*Notification of Acceptance: August 10, 2017 (11:59 pm, Hawaii time)
*Camera-Ready Paper: August 18, 2017 (11:59 pm, Hawaii time)
As in previous years, the SEMANTiCS’17 proceedings will be published
via ACM ICPS (pending) and CEUR-WS.
This year, SEMANTiCS features a special Data Science track, which is an
opportunity to bring together researchers and practitioners interested
in data science and its intersection with Linked Data to present their
ideas and discuss the most important scientific, technical and
socio-economical challenges of this emerging field.
SEMANTiCS 2017 will especially welcome submissions for the following hot
topics:
*Metadata, Versioning and Data Quality Management
*Semantics for Safety, Security & Privacy
*Web Semantics, Linked (Open) Data & schema.org
*Corporate Knowledge Graphs
*Knowledge Integration and Language Technologies
*Economics of Data, Data Services and Data Ecosystems
Special Track (please check appropriate topic in submission system)
*Data Science
Following the success of previous years, we welcome any submissions
related but not limited to the following ‘horizontal’ (research) and
‘vertical’ (industries) topics:
Horizontals:
*Enterprise Linked Data & Data Integration
*Knowledge Discovery & Intelligent Search
*Business Models, Governance & Data Strategies
*Semantics in Big Data
*Text Analytics
*Data Portals & Knowledge Visualization
*Semantic Information Management
*Document Management & Content Management
*Terminology, Thesaurus & Ontology Management
*Smart Connectivity, Networking & Interlinking
*Smart Data & Semantics in IoT
*Semantics for IT Safety & Security
*Semantic Rules, Policies & Licensing
*Community, Social & Societal Aspects
Data Science Special Track Horizontals:
*Large-Scale Data Processing (stream processing, handling large-scale
graphs)
*Data Analytics (Machine Learning, Predictive Analytics, Network Analytics)
*Communicating Data (Data Visualization, UX & Interaction Design,
Crowdsourcing)
*Cross-cutting Issues (Ethics, Privacy, Security, Provenance)
Verticals:
*Industry & Engineering
*Life Sciences & Health Care
*Public Administration
*e-Science
*Digital Humanities
*Galleries, Libraries, Archives & Museums (GLAM)
*Education & eLearning
*Media & Data Journalism
*Publishing, Marketing & Advertising
*Tourism & Recreation
*Financial & Insurance Industry
*Telecommunication & Mobile Services
*Sustainable Development: Climate, Water, Air, Ecology
*Energy, Smart Homes & Smart Grids
*Food, Agriculture & Farming
*Safety, Security & Privacy
*Transport, Environment & Geospatial
Posters & Demos Track
The Posters & Demonstrations Track invites innovative work in progress,
late-breaking research and innovation results, and smaller contributions
in all fields related to the broadly understood Semantic Web. These
include submissions on innovative applications with impact on end users,
such as demos of solutions that users may test or that are still in the
conceptual phase but worth discussing, as well as applications, use
cases or pieces of code that may attract developers and potential
research or business partners. New data sets made publicly available
are also welcome.
The informal setting of the Posters & Demonstrations Track encourages
participants to present innovations to the research community and
business users, find new partners or clients, and engage in discussions
about the presented work. Such discussions can be invaluable input for
the future work of the presenters, while offering conference
participants an effective way to broaden their knowledge of emerging
research trends and to network with other researchers.
Poster and demo submissions should consist of a paper that describes
the work, its contribution to the field, or its novel aspects. Submissions must
be original and must not have been submitted for publication elsewhere.
Accepted papers will be published in HTML (RASH) in CEUR and, as such,
the camera-ready version of the papers will be required in HTML,
following the poster and demo guidelines (https://goo.gl/3BEpV7). Papers
should be submitted through EasyChair
(https://easychair.org/conferences/?conf=semantics2017) and should be
less than 2200 words in length (equivalent to 4 pages), including the
whole content of the paper.
For the initial reviewing phase, authors may submit a PDF version of the
paper following any layout. After acceptance, authors are required to
submit the camera-ready in HTML (RASH).
Submissions will be reviewed by experienced and knowledgeable
researchers and practitioners; each submission will receive detailed
feedback. For demos, we encourage authors to include links enabling the
reviewers to test the application or review the component.
For details please go to: https://2017.semantics.cc/calls
Hi everybody,
the Analytics team is running some ALTER TABLE changes on the
EventLogging 'log' database on analytics-store (dbstore1002) and
analytics-slave (db1047) as part of
https://phabricator.wikimedia.org/T167162.
The full list of ALTER TABLE statements is here:
https://phabricator.wikimedia.org/P5570
This should be a transparent change, but I thought it better to keep
all of you informed in case of unintended regressions or side-effects.
The context is in T167162, but the TL;DR is that we need nullable
attributes across all the EL tables (except fields like id, uuid and
timestamp) to be able to sanitize data with our new
eventlogging_cleaner script (https://phabricator.wikimedia.org/T156933).
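For illustration, the statements involved follow a mechanical pattern: make every column nullable except the preserved fields. A sketch of that pattern (the table and column names below are hypothetical; the real list is in P5570):

```python
# Sketch: build MySQL ALTER TABLE statements that make EventLogging
# columns nullable, skipping the fields that must stay as they are.
# Table/column names and types here are hypothetical examples.
PRESERVED = {"id", "uuid", "timestamp"}

def nullable_alters(table, columns):
    """columns: list of (name, sql_type) pairs for one table."""
    stmts = []
    for name, sql_type in columns:
        if name in PRESERVED:
            continue  # leave id/uuid/timestamp untouched
        stmts.append(f"ALTER TABLE {table} MODIFY `{name}` {sql_type} NULL;")
    return stmts

stmts = nullable_alters(
    "Popups_12345",  # hypothetical EventLogging table name
    [("id", "int"), ("uuid", "varchar(128)"),
     ("event_action", "varchar(255)")],
)
# -> ['ALTER TABLE Popups_12345 MODIFY `event_action` varchar(255) NULL;']
```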
Please let me know if you encounter any issues with this change.
Thanks in advance!
Luca
Is anyone studying the rate at which external links become unavailable
on Wikipedia projects?
I just did a quick tally and less than 40% of the external links cited
in the introductions of L1-vital enwiki health and social science
articles I sampled were good, and that's only counting those which
didn't already have a {{dead link}} tag.
I thought that the bots were doing a better job of replacing dead
links with archive copies than they apparently are. Do we need to fund
this as an official effort?
Dear all
I've been working as Wikimedian in Residence at UNESCO for the past two
years on a number of activities, including:
* Sharing UNESCO media content on Wikimedia projects
* Sharing UNESCO open license text on English language Wikipedia
* Promoting Wiki Loves competitions through UNESCO social media
* Encouraging other UN agencies to adopt open licenses.
More information is here:
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_United_Nations
I'm working with a researcher at UNESCO to understand the impact of what
I've been doing and would like some suggestions on where to start with a
research project. The researcher has a background in statistics and is
familiar with R, but is not very knowledgeable about Wikimedia projects. I'm
not familiar with much of the research done on Wikimedia projects other
than metrics tools like BaGLAMa, GLAMorgan etc. that I use for reporting.
What I'm looking for is a general overview and case studies of research
projects done on Wikimedia projects, plus any specific examples involving
the kind of work I'm doing.
Many thanks
John