Forwarding a reply from Joseph that somehow didn't go through.
---------- Forwarded message ----------
From: Joseph Allemandou <jallemandou(a)wikimedia.org>
To: Research into Wikimedia content and communities <wiki-research-l@lists.wikimedia.org>, gerard.meijssen(a)gmail.com
Hi Gerard,
Here are my two cents on your questions.
About redlinks, you are correct in saying that the 3% of "other" link-type
entries are jumps from one page to another (identified via the HTTP
referer) where the hyperlink from the origin to the target allowing for
such a jump doesn't exist in the origin page at the moment of computation.
From my exploration of the dataset, such "other" links happen with the
"manually-edited-with-error" URL class (the "-" article has a lot of such
incoming links, for instance), as well as with links that I think have been
edited out of the origin page (for instance, in the November 2017 dataset
there are "other" links from the page "Kevin Spacey" to "Dan Savage",
"hebephilia", "pedophilia" and "Harvey_Weinstein". Those links are
confirmed as existing at some point in the page during November, but no
longer at the beginning of December, when the pages' hyperlinks are
snapshotted).
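For readers who want to reproduce this kind of exploration, here is a
minimal sketch; the file name is hypothetical, and it assumes a locally
downloaded monthly dump with the documented tab-separated columns
prev, curr, type, n:

```python
# Count "other"-type clickstream pairs in a monthly dump.
# Assumptions: local file name, and the four-column TSV layout
# (prev, curr, type, n) used by the monthly releases.
from collections import Counter

other_counts = Counter()
with open("clickstream-enwiki-2017-11.tsv", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        prev, curr, link_type, n = fields
        if link_type == "other" and n.isdigit():
            other_counts[(prev, curr)] += int(n)

# The most frequent "other" pairs, e.g. the Kevin Spacey examples above.
for (prev, curr), n in other_counts.most_common(20):
    print(f"{n}\t{prev} -> {curr}")
```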
As for your question about what people are looking for and don't find, the
one way I can think of to get ideas is to use detailed session analysis
correlated with search results, in order to detect pages that are reached
from search but not visited for long. Even though I think we have data we
could use in that respect on the cluster, we obviously can't publish such
details externally, for privacy reasons.
Please let me know if what I say makes sense :)
Many thanks
Joseph Allemandou
> Hoi,
> Do I understand correctly that the 3% of "other" links are the ones that
> have articles at *this* time but did not exist at the time of the dump?
> So in effect they are not red links?
>
> Is there any way to find the articles people were seeking but could not
> find?
> Thanks,
> GerardM
>
> On 16 January 2018 at 20:21, Leila Zia <leila(a)wikimedia.org> wrote:
>
> > Hi all,
> >
> > For archive happiness:
> >
> > The clickstream dataset is now being generated on a monthly basis for 5
> > Wikipedia languages (English, Russian, German, Spanish, and Japanese).
> > You can access the data at https://dumps.wikimedia.org/other/clickstream/
> > and read more about the release and those who contributed to it at
> > https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/
> >
> > Best,
> > Leila
> >
>
Hoi,
I am a big fan of suggesting that people write articles / do work that
will be read and used. In a blog post [1], I suggest accumulating these
clickstreams and using the missing popular articles as suggestions for
new articles. Articles that people seek and that are truly missing are
obvious candidates for such suggestions.
My question: how hard is it to do this accumulation and analysis for
missing articles, and to combine it with suggestions to authors to write
something that is likely to prove popular? Does this idea have merit?
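A minimal sketch of the accumulation Gerard describes, assuming several
monthly dump files (hypothetical names) in the working directory with the
prev/curr/type/n layout; a real analysis would still need to check each
candidate title against existing page titles (e.g. via the MediaWiki API)
to keep only the truly missing articles:

```python
# Accumulate, across several monthly clickstream dumps, the traffic
# reaching each target of an "other"-type pair, then rank the targets
# as candidate suggestions for articles to write. File names and the
# prev/curr/type/n column layout are assumptions.
from collections import Counter
from glob import glob

demand = Counter()
for path in glob("clickstream-enwiki-*.tsv"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if (len(fields) == 4 and fields[2] == "other"
                    and fields[3].isdigit()):
                demand[fields[1]] += int(fields[3])

# Titles with the highest accumulated demand, to offer to authors.
for title, n in demand.most_common(50):
    print(n, title)
```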
Thanks,
GerardM
[1] https://ultimategerardm.blogspot.nl/2018/01/wikipedia-entering-rabbit-hole.…
Hi everyone,
We’re preparing for the January 2018 research newsletter and looking for contributors. Please take a look at https://etherpad.wikimedia.org/p/WRN201801 and add your name next to any paper you are interested in covering. Our target publication date is January 26 (UTC). As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
• Can conference papers have information value through Wikipedia? An investigation of four engineering fields
• Collaborative Approach to Developing a Multilingual Ontology: A Case Study of Wikidata
• Determining Quality of Articles in Polish Wikipedia Based on Linguistic Features
• Emo, Love, and God: Making Sense of Urban Dictionary, a Crowd-Sourced Online Dictionary
• Fostering Public Good Contributions with Symbolic Awards: A Large-Scale Natural Field Experiment at Wikipedia
• Knowledge categorization affects popularity and quality of Wikipedia articles
• The Conceptual Correspondence between the Encyclopaedia and Wikipedia
• The Wisdom of Polarized Crowds
• Use of Louisiana's Digital Cultural Heritage by Wikipedians
• What Makes Wikipedia's Volunteer Editors Volunteer?
• Wikipedia-integrated publishing: a comparison of successful models
If you have any questions about the format or process, feel free to get in touch off-list.
Masssly, Tilman Bayer and Dario Taraborelli
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
At the Dev Summit, Birgit Müller and I will run a session on Growing the
MediaWiki Technical Community. If you're attending, we hope you will
consider joining us.
Everyone (attending the Dev Summit or not) is welcome and encouraged to
participate at https://phabricator.wikimedia.org/T183318 (please comment
there, rather than by email).
We are discussing the following questions:
* What would allow you to develop and plan your software more efficiently?
* What would make software development more fun for you?
* What other Open Source communities do we share interests with?
* How can we change our processes to take technical debt more seriously?
"Develop" means any kind of work on a software system, including design,
documentation, etc.
Our topics are:
* Better processes and project management practices, integrating all
developers and allowing them to work more efficiently
* Building partnerships with other Open Source communities on shared
interests (e.g. translation, audio, video)
* Reducing technical debt
Matt Flaschen
Interesting/relevant research venue...
---------- Forwarded message ----------
From: Sandra Fauconnier <sandra.fauconnier(a)gmail.com>
Date: Thu, Jan 18, 2018 at 10:07 AM
Subject: [Wikidata] Fwd: Call for Papers: EuropeanaTech 2018 Conference
To: "Discussion list for the Wikidata project." <
wikidata(a)lists.wikimedia.org>
Hi all!
Here's a call for proposals for the EuropeanaTech conference, which will
take place in Rotterdam, May 15-16, 2018.
https://pro.europeana.eu/event/europeanatech-conference-2018
Some of the suggested topics are very Wikidata- and Wikimedia-related.
Best! Sandra (User:Spinster)
---------- Forwarded message ----------
From: Gregory Markus <gmarkus(a)beeldengeluid.nl>
Date: Thu, Jan 18, 2018 at 9:17 AM
Subject: Call for Papers: EuropeanaTech 2018 Conference
To: EUROPEANA-TECH(a)list.ecompass.nl
Dear EuropeanaTech community,
EuropeanaTech is about the practical application of research concepts and
the latest technologies to digital libraries. For this edition of
EuropeanaTech, we concentrate on *the three D’s: Data, Discovery and
Delivery*. Intertwined with these are the concepts of participation,
linked and big data, and language and tools. Across all the subjects we
are looking for the inclusion of rigorous evaluations of the outcomes.
The conference will be a mix of invited speakers and successful
presentations from this call. We are not expecting an academic paper but a
lively presentation of work that you have been doing under the subjects
below. We are as interested in the glorious failures as we are in the
gorgeous successes.
Submission Guidelines
Please submit your proposal *by February 7*. It should contain a title, an
abstract of 250 words, some keywords, and a two-sentence evaluation of its
practical benefits or learnings. The Programme Committee will evaluate all
the submitted proposals and will notify you before the end of February if
your proposal has been selected for presentation. *We have room for up to
15 presentations* in the conference as a result of this call. The
conference fee and your travelling costs will be covered if your
presentation is chosen.
Submissions are to be made via EasyChair:
https://easychair.org/conferences/?conf=eurtech18
List of Topics
*DATA*
1. *User generated content and metadata:* from crowdsourcing of
descriptive data and transcription projects, to Wikidata and structured
data on the Commons, to combining institutional and user-generated
metadata. We are looking for what has worked, or what hasn’t and can be
done better.
2. *Enhancing the results of digitisation:* various applications connect
the act of digitisation with the data processes required for the use of
the data. What are the latest techniques, have they been applied at scale,
and do they work in the more challenging audio-visual areas? We are
interested in everything from 3D capture, OCR, sound/video labelling,
named entity recognition and feature detection, to machine or deep
learning to help classify and categorise the digitised data.
3. *Decentralisation vs Centralisation:* we know that aggregation works as
a process to bring together disparate data, standardise and normalise it,
and make it available to other parties, but we also know that this is
labour intensive, very hierarchical, and does not distribute knowledge and
expertise. On the other hand, more decentralised ways of working have yet
to be really proven in practice. Presentations that give the latest
thinking on how we can best enable access to cultural heritage data and
reduce friction costs are welcome, particularly with evaluation of the
relative strengths and weaknesses.
4. *Multilingualism:* Google has more or less cracked full-text
translation of mainstream languages, but we are still struggling with
niche languages and metadata. Presentations that evaluate the current
thinking or give insights into the latest work on the creation and use of
multilingual data in Cultural Heritage would fit well in this section.
*DISCOVERY*
1. *User Interaction:* search is still the dominant means of gaining
access to the wealth of cultural heritage data now online, but does it
represent that wealth? Search is ungenerous: it withholds information and
demands a query. What are the alternatives? Papers on generous interfaces
and frictionless design are sought to shed new light on how Cultural
Heritage can show itself more deeply, evaluating the benefits and
weaknesses to the user in the process.
2. *Artificial Intelligence:* for this subject, topics ranging from
machine learning to neural-network-based approaches to Cultural Heritage
are welcome. This includes applications of AI from image feature
recognition to natural language processing, and from building search
interfaces on feature/colour similarity between images and discovery to
the use of human metadata and computer vision. We would also be interested
in the audio and moving-image equivalents. Anything dealing with the
combination of metadata tags, image similarity and machine learning based
on user input would be very relevant, as would Artificial Intelligence
technology for content curation.
*DELIVERY*
1. *Digital Innovation:* the corporate culture of our memory institutions,
set up to preserve and conserve our heritage, and the organisation of
digital innovation are not a marriage made in heaven. The Labs/Skunkworks
model is increasingly seen at best as an interim stage and at worst as a
dead end for organising innovation. So how should GLAMs go about
organising for digital innovation? How can governments and/or funders best
support the digital transformation of the GLAM sector?
2. *Evaluation techniques:* evaluation should be part of everything we do
in the publicly funded space of most of cultural heritage, but the "how"
is still struggling to gain a common language, one that we can apply so
that funders get a picture of a project within its broader context.
Evaluation is requested as part of all submitted papers, but the latest
techniques, and agreement on a framework for the sector, would constitute
useful insights.
3. *Open source community:* what is the health and standing of the open
source community within the cultural heritage sector? Does it thrive, or
is it a nice idea that is not a reality? How can projects with a limited
lifespan create and sustain products for the sector at large while
developing and engaging a thriving community around them? From
digitisation to search engine development, should there be more emphasis
on the need for vibrant open source communities and more resources for
realising them? Papers on barriers and successes are requested.
--
*Gregory Markus*
Project Leader
*Netherlands Institute for Sound and Vision*
*Media Parkboulevard 1, 1217 WE Hilversum | Postbus 1060, 1200 BB Hilversum*
*beeldengeluid.nl* <http://www.beeldengeluid.nl/>
*T* 0612350556
*Present:* Mon, Tue, Wed, Thu, Fri
--
Jonathan T. Morgan
Senior Design Researcher
Wikimedia Foundation
User:Jmorgan (WMF) <https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)>
Hi all,
Apologies for cross-posting.
*************** WEB SCIENCE SUMMER SCHOOL 2018, Hannover, Germany
***************
The Web Science Summer School 2018 (WWSSS’18) will be held in Hannover,
Germany. It is hosted by the L3S Research Center and will take place from
30 July to 4 August 2018.
Web Science is the emergent study of the people and technologies,
applications, processes and practices that shape and are shaped by the
World Wide Web. Web Science aims to draw together theories, methods and
findings from across academic disciplines, and to collaborate with
industry, business, government and civil society, to develop our
knowledge and understanding of the Web: the largest socio-technical
infrastructure in human history.
Web Science involves mining and understanding data from the Web, and
requires both technical skills for handling big (Web) data and a
fundamental understanding of the social, psychological or legal aspects
underpinning online activities.
The WWSSS’18 will address the inter-disciplinary field of Web Science by
focusing on lectures which tackle the aforementioned challenges in
topics such as data science and data mining, big data processing,
information retrieval, Web governance as well as the sociology and
psychology of online interactions.
Alongside lectures that will address major trends in Web Science, the
Summer School will provide hands-on training in data processing,
analysis and methods, team work, and opportunities to present current
research. Participants shall work on specific tasks linked to the
datasets provided, and will be mentored by local instructors. All teams
will present the results of their work on the last day of the school.
Speakers, tutors and the full program are currently being finalized.
Registration for the summer school will be open to everyone. In addition,
there will be a selection process for a few scholarships awarded to
students to cover participation costs.
Follow the updates at http://wwsss18.webscience.org/, and do not miss
the chance to be a part of this enriching experience! Please feel free
to contact the chairs of the summer school or the local organization
team if you have any queries.
*********************************************************************************
--
Ujwal Gadiraju
L3S Research Center
Leibniz Universität Hannover
30167 Hannover, Germany
Phone: +49. 511. 762-5772
Fax: +49. 511. 762-19712
E-Mail: gadiraju(a)l3s.de
Web: www.l3s.de/~gadiraju/
Hey all,
A reminder that the livestream of our monthly research showcase starts in
45 minutes (11:30 PT).
- Video: https://www.youtube.com/watch?v=L-1uzYYneUo
- IRC: #wikimedia-research
- Abstracts:
https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase#January_2018
Dario
On Tue, Jan 16, 2018 at 9:45 AM, Lani Goto <lgoto(a)wikimedia.org> wrote:
> Hi Everyone,
>
> The next Research Showcase will be live-streamed this Wednesday, January
> 17, 2018, at 11:30 AM PST (19:30 UTC).
>
> YouTube stream: https://www.youtube.com/watch?v=L-1uzYYneUo
>
> As usual, you can join the conversation on IRC at #wikimedia-research.
> And you can watch our past research showcases here.
>
> This month's presentation:
>
> *What motivates experts to contribute to public information goods? A field
> experiment at Wikipedia*
> By Yan Chen, University of Michigan
> Wikipedia is among the most important information sources for the general
> public. Motivating domain experts to contribute to Wikipedia can improve
> the accuracy and completeness of its content. In a field experiment, we
> examine the incentives which might motivate scholars to contribute their
> expertise to Wikipedia. We vary the mention of likely citation, public
> acknowledgement, and the number of views an article receives. We find that
> experts are significantly more interested in contributing when citation
> benefit is mentioned. Furthermore, cosine similarity between a Wikipedia
> article and the expert's paper abstract is the most significant factor
> leading to more and higher-quality contributions, indicating that better
> matching is a crucial factor in motivating contributions to public
> information goods. Other factors correlated with contribution include
> social distance and researcher reputation.
>
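To illustrate the cosine-similarity matching described in the abstract
above: the study's actual features and preprocessing are not specified
here, so the TF-IDF representation in this sketch is only an assumption,
with placeholder texts standing in for the real article and abstract.

```python
# Sketch: cosine similarity between a Wikipedia article and a paper
# abstract, using TF-IDF vectors. The representation is an assumption;
# the study's actual preprocessing is not described in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article_text = "full plain text of the Wikipedia article goes here"
paper_abstract = "text of the expert's paper abstract goes here"

vectors = TfidfVectorizer(stop_words="english").fit_transform(
    [article_text, paper_abstract])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {score:.3f}")
```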
> *Wikihounding on Wikipedia*
> By Caroline Sinders, WMF
> Wikihounding (a form of digital stalking on Wikipedia) is incredibly
> qualitative and quantitative. What makes wikihounding different than
> mentoring? The context of the action, or the intention. However, every
> interaction inside a digital space has a quantitative aspect to it: every
> comment, revert, etc. is a data point. By analyzing data points
> comparatively across wikihounding cases, and by reading some of the
> cases, we can create a baseline for the actual overlapping similarities
> among them, to study what makes up wikihounding. Wikihounding currently
> has a fairly loose definition. As defined by the Harassment policy on
> en:wp, it is: “the singling out of one or more editors, joining
> discussions on multiple pages or topics they may edit or multiple
> debates where they contribute, to repeatedly confront or inhibit their
> work. This is with an apparent aim of creating irritation, annoyance or
> distress to the other editor. Wikihounding usually involves following the
> target from place to place on Wikipedia.” This definition doesn't outline
> parameters around cases such as frequency of interaction, duration, or
> minimum number of reverts, nor is much known about what a standard or
> canonical case of wikihounding looks like. What is the average
> wikihounding case? This talk will cover the approaches that I and
> research team members Diego Saez-Trumper, Aaron Halfaker and Jonathan
> Morgan are taking in starting this research project.
>
> --
> Lani Goto
> Project Assistant, Engineering Admin
--
*Dario Taraborelli *Director, Head of Research, Wikimedia Foundation
wikimediafoundation.org • nitens.org • @readermeter
<http://twitter.com/readermeter>
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia (see the sketch below)
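A minimal sketch of that last use, turning (referer, article) counts into
row-normalised Markov transition probabilities; the local file name is
hypothetical, and the column indices are assumptions, since the 2015 file
and the later monthly dumps use different layouts:

```python
# Build P(article | referer) from tab-separated (referer, article, n)
# rows. Column indices are assumptions; check the file's header, as
# layouts differ between the 2015 release and the later monthly dumps.
from collections import defaultdict

PREV, CURR, N = 0, 1, 3  # adjust to the file at hand

counts = defaultdict(lambda: defaultdict(int))
with open("clickstream.tsv", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > N and fields[N].isdigit():
            counts[fields[PREV]][fields[CURR]] += int(fields[N])

# Normalise each referer's outgoing counts into probabilities.
transitions = {}
for prev, row in counts.items():
    total = sum(row.values())
    transitions[prev] = {curr: n / total for curr, n in row.items()}
```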
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario