We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia:
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
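The uses listed above can be sketched in a few lines of Python. This is only an illustration, assuming the data has already been loaded as (referer, article, count) rows; the function names and the sample rows are made up for the example and are not part of the release. Note that normalizing per referer only covers observed clicks, so the resulting chain ignores readers who left without clicking a link.

```python
from collections import defaultdict


def transition_probabilities(rows):
    """Turn (referer, article, count) rows into per-referer click probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for referer, article, n in rows:
        counts[referer][article] += n
    probs = {}
    for referer, targets in counts.items():
        total = sum(targets.values())
        # Each row of the Markov chain sums to 1 over the observed targets.
        probs[referer] = {a: n / total for a, n in targets.items()}
    return probs


def top_links(rows, referer, k=3):
    """Most frequently clicked targets from one referer, highest count first."""
    totals = defaultdict(int)
    for ref, article, n in rows:
        if ref == referer:
            totals[article] += n
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up sample rows, not taken from the dataset.
sample = [
    ("Page_A", "Page_B", 30),
    ("Page_A", "Page_C", 10),
    ("Page_B", "Page_C", 5),
]
chain = transition_probabilities(sample)
print(chain["Page_A"])
print(top_links(sample, "Page_A", k=1))
```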
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
*** CALL FOR PAPERS ***
http://sdiwc.net/conferences/bigdata2015/
The Second International Conference on Data Mining, Internet Computing, and
Big Data (BigData2015)
University of Mauritius, Le Reduit, Moka, Mauritius
June 29 – July 01, 2015
All registered papers will be included in SDIWC Digital Library.
==============================================================
The conference aims to enable researchers to build connections between
different digital applications.
The event will be held over three days, with presentations delivered by
researchers from the international community, including presentations from
keynote speakers and state-of-the-art lectures.
RESEARCH TOPICS ARE NOT LIMITED TO:
* Data Mining Tasks & Algorithms
Explorative and visual data mining
Mining text and semi-structured data
Multimedia mining (audio/video)
Segmentation/Clustering/Association
Web mining
Artificial neural networks
Link and sequence analysis
Evolutionary computation/meta heuristics
* Data Mining Integration & Process
Distributed and grid based data mining
Metadata and ontologies
Mining large scale data
Attribute discretization and encoding
Feature selection and transformation
Model interpretation
Data cleaning and preparation
* Data Mining Applications
Bioinformatics
Business/Corporate/Industrial Data Mining
Credit Scoring
Data Mining in Logistics
Database Marketing
Direct Marketing
Engineering Mining
Medicine Data Mining
Military Data Mining
Security Data Mining
Social Science Mining
Time series analysis and visualization
Anomaly detection
Association rule learning
Classification
Cloud based infrastructure (applications, storage and resources)
Cluster analysis
Crowd-sourcing
Data fusion and integration
Data-mining grids
Distributed databases
Distributed file systems
Ensemble learning
Genetic algorithms
Machine learning
Massively parallel-processing (MPP) databases
Natural language processing
Neural networks
Pattern recognition
Predictive modelling
* Internet Computing
Design and analysis of internet protocols and engineering
Digital libraries/digital image collections
Electronic commerce and internet
Grid based computing and internet tools
Internet and emerging technologies
Internet and video technologies
Internet applications and appliances
Internet banking systems
Internet based decision support systems
Internet law and compliance
Internet security and trust
Markup Languages
Metacomputing
Mobile computing and the internet
Network architectures and network computing
Novel Java applications on internet
Quality of service
Search engines
Social networks
The WWW and intranets
The internet and Cloud computing
Web based computing
Web interfaces to databases
Web site design and coordination
Search-based applications
Sentiment analysis
Signal processing
Simulation
Supervised and unsupervised learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
IMPORTANT DATES:
Submission Deadline : May 29, 2015
Notification of Acceptance : 2 - 4 weeks from the submission date
Camera Ready Submission : Open from now until June 09, 2015
Registration Date : Open from now until June 09, 2015
Conference Dates : June 29 - July 01, 2015
Researchers are encouraged to submit their work electronically.
All papers will be fully refereed by a minimum of two specialized referees.
Before final acceptance, all referees' comments must be considered.
Paper Submission: http://sdiwc.net/conferences/bigdata2015/paper-submission/
Write us for more details: bigdata15(a)sdiwc.net
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Hi everybody,
we’re preparing for the May 2015 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201505 and add your name next to any paper you are interested in covering. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
Wikidata through the Eyes of DBpedia
Predicting elections from online information flows: towards theoretically informed models
Understanding Graph Structure of Wikipedia for Query Expansion
A New Epistemic Culture Wikipedia as an Arena for the Production of Knowledge in Late Modernity
The EU Public Interest Clinic and Wikimedia Present: Extending Freedom of Panorama in Europe
Utilizing the Wikidata System to Improve the Quality of Medical Content in Wikipedia in Diverse Languages: A Pilot Study
Eliciting Disease Data from Wikipedia Articles
Centre Stage: How Social Network Position Shapes Linguistic Coordination
Synthesizing knowledge from disagreement
Aligning Sentences from Standard Wikipedia to Simple Wikipedia
Debating reliable sources: writing the history of the Vietnam War on Wikipedia
Turning Introductory Comparative Politics and Elections Courses into Social Science Research Communities Using Wikipedia: Improving Both Teaching and Research
If you have any questions about the format or process, feel free to get in touch off-list.
Masssly, Tilman Bayer and Dario Taraborelli
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
Hi everyone,
The next research showcase will be live-streamed this Wednesday, May 13 at
11.30 PT. The streaming link will be posted on the lists a few minutes
before the showcase starts and as usual, you can join the conversation on
IRC at #wikimedia-research.
We look forward to seeing you!
Leila
This month
*The people's classifier: Towards an open model for algorithmic
infrastructure*
By Aaron Halfaker <https://www.mediawiki.org/wiki/User:Halfak_(WMF)>
Recent research has suggested that Wikipedia's algorithmic infrastructure
is perpetuating social issues. However, these same algorithmic tools are
critical to maintaining efficiency of open projects like Wikipedia at
scale. But rather than simply critiquing algorithmic wiki-tools and calling
for less algorithmic infrastructure, I'll propose a different strategy --
an open approach to building this algorithmic infrastructure. In this
presentation, I'll demo a set of services that are designed to open a
critical part of Wikipedia's quality control infrastructure -- machine
classifiers. I'll also discuss how this strategy unites critical/feminist
HCI with more dominant narratives about efficiency and productivity.
*Social transparency online*
By Jennifer Marlow <http://www.aboutjmarlow.com/> and Laura Dabbish
<http://www.lauradabbish.com/>
An emerging Internet trend is greater social transparency, such as the use
of real names in social networking sites, feeds of friends' activities,
traces of others' re-use of content, and visualizations of team
interactions. There is a potential for this transparency to radically
improve coordination, particularly in open collaboration settings like
Wikipedia. In this talk, we will describe some of our research identifying
how transparency influences collaborative performance in online work
environments. First, we have been studying professional social networking
communities. Social media allows individuals in these communities to create
an interest network of people and digital artifacts, and get
moment-by-moment updates about actions by those people or changes to those
artifacts. It affords an unprecedented level of transparency about the
actions of others over time. We will describe qualitative work examining
how members of these communities use transparency to accomplish their
goals. Second, we have been looking at the impact of making workflows
transparent. In a series of field experiments we are investigating how
socially transparent interfaces, and activity trace information in
particular, influence perceptions and behavior towards others and
evaluations of their work.
Thank you, Federico!
Your link to Phabricator explains this. However, it would be nice if such
changes were described in a Read.me file.
Alex
On Tue, May 12, 2015 at 2:00 PM, <
wiki-research-l-request(a)lists.wikimedia.org> wrote:
> Today's Topics:
>
> 1. Re: How to explain drop in random searches (Daniel Moyer)
> 2. Re: How to explain drop in random searches (Federico Leva (Nemo))
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 11 May 2015 23:13:15 -0700
> From: Daniel Moyer <moyerd(a)usc.edu>
> To: Research into Wikimedia content and communities
> <wiki-research-l(a)lists.wikimedia.org>
> Subject: Re: [Wiki-research-l] How to explain drop in random searches
> Message-ID:
> <CAKvQcvXcMXSc2SkDVJTTbs2MXuCSpeHcHeSd=
> gWkg6bwY8DqjQ(a)mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> A lot of thanks and credit to the analytics team for keeping these counts
> running.
>
> That being said, it might be a good idea not to draw too many conclusions
> from the pageview counts on user behaviour without a closer analysis,
> especially for the Special:* pages. As demonstrated by the October 16th
> drop, these are strongly affected by instrument bias.
>
Because similar patterns are observed for many other languages (but not
all), it looks like R. Stuart Geiger's explanation is correct: since October
16, 2014, the Special:Random page is simply no longer counted (with some
unexplained exceptions).
That’s a pity, because we lost a valuable source of information about how
Wikipedia users look for information. Random search was (and is?) a major way
users explore Wikipedias. In many languages the Special:Random count was
significantly higher than the Main_Page count and certainly higher than the
count for search via index.php.
(I do not want to point fingers, but maybe somebody at WMF considered this
emotionally.)
IMHO, logs should be logs and record actual activity. At the very least, such
dramatic changes in the logging of user activity should be documented
somewhere, preferably in a Read.me file that accompanies the raw logs.
>Date: Mon, 11 May 2015 20:08:40 -0700
>From: "R.Stuart Geiger" <sgeiger(a)gmail.com>
>To: Research into Wikimedia content and communities
> <wiki-research-l(a)lists.wikimedia.org>
>Subject: Re: [Wiki-research-l] Wiki-research-l Digest, Vol 117, Issue
> 14
>Message-ID:
> <CAKt0Q=e-_=0=aepKeSVnT0Ce2FmJZu5bNtpNnYwZV7x21A3tXQ(a)mail.gmail.com>
>Content-Type: text/plain; charset="utf-8"
>
>Going from 86,000,000 a month to 31,000 a month is quite a drop, and the
>shift is pretty dramatic. It goes from 1.7 million one day to 715 the next
>and stays flat (http://stats.grok.se/en/201410/Special:Random).
>
>I was also thinking there could be a bot or something that is scraping
>Special:Random, but the drop also happens for Special:Random/Talk -- which
>hardly anybody uses, but it still drops flat the same day (
>http://stats.grok.se/en/201410/Special:Random/Talk). It doesn't happen for
>Special:Upload or Special:Log though.
>
>October 16th, 2014 is the day it changes. Anybody know of something that
>might have changed that day with logging? Also, there have to be way more
>than ~1,000 hits a day to Special:Random. Perhaps pageviews started to be
>counted for the page that it got redirected to, rather than the
>Special:Random page itself. But then why wouldn't it go to 0? What are
>those ~1,000 hits a day?
>
>👻 ~~ it is a mystery ~~ 👻
--
Thank you.
Alex Druk
alex.druk(a)gmail.com
www.wikipediatrends.com
(775) 237-8550 Google voice
I just grep the monthly totals from Erik Zachte's merged pagecount files at
http://dumps.wikimedia.org/other/pagecounts-ez/merged/
(grep "^en.z Special:Random ")
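For anyone who wants to repeat this without grep, here is a rough Python equivalent. It is a sketch under assumptions: it takes the third space-separated field of a matching line to be the monthly total, which should be checked against the documentation for the merged files, and the sample lines are invented for illustration. The trailing space in the prefix matters, as in the grep pattern, because it keeps Special:Random/Talk from matching.

```python
def monthly_total(lines, project="en.z", title="Special:Random"):
    """Return the monthly total for one page, or None if the page is absent.

    Assumes each line looks like '<project> <title> <monthly_total> ...'.
    """
    prefix = f"{project} {title} "  # trailing space excludes e.g. Special:Random/Talk
    for line in lines:
        if line.startswith(prefix):
            return int(line.split()[2])
    return None


# Invented sample lines; real files are large and may be compressed.
sample_lines = [
    "en.z Main_Page 1000 placeholder",
    "en.z Special:Random/Talk 5 placeholder",
    "en.z Special:Random 31000 placeholder",
]
print(monthly_total(sample_lines))
```

If a downloaded file is compressed, it can be opened in text mode with the matching standard-library module (for example bz2.open for a .bz2 file) and passed to the function line by line.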
On Mon, May 11, 2015 at 2:00 PM, <
wiki-research-l-request(a)lists.wikimedia.org> wrote:
> Today's Topics:
>
> 1. Re: How to explain drop in random searches (Oliver Keyes)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 10 May 2015 08:30:37 -0400
> From: Oliver Keyes <okeyes(a)wikimedia.org>
> To: Research into Wikimedia content and communities
> <wiki-research-l(a)lists.wikimedia.org>
> Subject: Re: [Wiki-research-l] How to explain drop in random searches
> Message-ID:
> <
> CAAUQgdA6jVzgs3QQXVVgsh7MFthWxpD97uisDTJmjNUgZZXH5A(a)mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> Using what data?
>
> On 10 May 2015 at 05:29, Alex Druk <alex.druk(a)gmail.com> wrote:
> > Hi everyone,
> >
> >
> >
> > I try to learn dynamic of random searches (Special:Random) on English
> > Wikipedia.
> >
> > From 01/2012 to 10/2014 average number of random searches per month was
> > about 86 millions or about 30% of Main_Page pageviews, but from November
> > 2014 it drop to 31,000 per month (or 0.008% of Main_page).
> >
> > How to explain such a dramatic drop? Any ideas?
> >
> >
> > --
> > Thank you.
> >
> > Alex Druk, PhD
> > wikipediatrends.com
> > alex.druk(a)gmail.com
> > (775) 237-8550 Google voice
> >
> > _______________________________________________
> > Wiki-research-l mailing list
> > Wiki-research-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>
>
>
--
Thank you.
Alex Druk
alex.druk(a)gmail.com
(775) 237-8550 Google voice
Hi everyone,
I am trying to understand the dynamics of random searches (Special:Random) on
English Wikipedia.
From 01/2012 to 10/2014 the average number of random searches per month was
about 86 million, or about 30% of Main_Page pageviews, but from November
2014 it dropped to 31,000 per month (or 0.008% of Main_Page).
How can such a dramatic drop be explained? Any ideas?
--
Thank you.
Alex Druk, PhD
wikipediatrends.com
alex.druk(a)gmail.com
(775) 237-8550 Google voice
Cross-posting to research and analytics, too!
---------- Forwarded message ----------
From: Oliver Keyes <okeyes(a)wikimedia.org>
Date: 6 May 2015 at 13:11
Subject: Traffic to the portal from Zero providers
To: wikimedia-search(a)lists.wikimedia.org
Hey all,
(Throwing this to the public list, because transparency is Good)
I recently did a presentation on a traffic analysis to the Wikipedia
"home page" - www.wikipedia.org.[1]
One of the biggest visualisations, in impact terms, showed that a lot
of portal traffic - far more, proportionately, than traffic to
Wikipedia overall - is coming from India and Brazil.[2] One of the
hypotheses was that this could be Zero traffic.
I've done a basic analysis of the traffic, looking specifically at the
zero headers,[3] and this hypothesis turns out to be incorrect -
almost no zero traffic is hitting the portal. The traffic we're seeing
from Brazil and India is not zero-based.
This makes a lot of sense (the reason mobile traffic redirects to the
enwiki home page from the portal is the Zero extension, so presumably
this happens specifically to Zero traffic) but it does mean that our
null hypothesis - that this traffic is down to ISP-level or
device-level design choices and links - is more likely to be correct.
[1] http://ironholds.org/misc/homepage_presentation.html
[2] http://ironholds.org/misc/homepage_presentation.html#/11
[3] https://phabricator.wikimedia.org/T98076
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
hi all,
to announce it here as well, just FYI: we made a little userscript (whoCOLOR) available that you can use with tamper-/greasemonkey in order to show original authors of text in any (english so far) wikipedia article. still far from perfect, but it works and it's already kind of useful I would say. you can download the prototype version here for trying out yourself: http://f-squared.org/whovisual/
code under MIT should be up on github in the next few days.
any volunteers/collaborators/re-users for this are welcome of course. i have some ideas for further extension listed on the website and aaron also suggested making it into a gadget, which is another great idea.
cheers,
fabian
--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.floeck(a)gesis.org
www.gesis.org
www.facebook.com/gesis.org