We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes (see the example query after this list):
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
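For example, to answer the first item above, once the TSV has been loaded into a table, the most-clicked links from a given article can be pulled with a query along these lines (a sketch only: the table name clickstream and the columns prev_title, curr_title, and n are assumptions; check the dataset's documentation for the actual schema):

SELECT curr_title, SUM(n) AS clicks
FROM clickstream             -- assumed name for the loaded TSV
WHERE prev_title = 'London'  -- the referring article
GROUP BY curr_title
ORDER BY clicks DESC
LIMIT 20;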
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hello,
I work for a consulting firm called Strategy&. We have been engaged by Facebook on behalf of Internet.org to conduct a study assessing the state of connectivity globally. One key area of focus is the availability of relevant online content. We are using the availability of encyclopedic knowledge in one's primary language as a proxy for relevant content, which we define as 100K+ Wikipedia articles in one's primary language. We have a few questions related to this analysis prior to publishing it:
* We are currently using the article count by language from the Wikimedia Foundation's public page http://meta.wikimedia.org/wiki/List_of_Wikipedias. Is this a reliable source for article counts, and does it include stubs?
* Is it possible to get historical data for article counts? It would be great to monitor the evolution of the metric we have defined over time.
* What are the biggest drivers you've seen for step changes in the number of articles (e.g., number of active admins, machine translation, etc.)?
* We had to map Wikipedia language codes to ISO 639-3 language codes in Ethnologue (the source we are using for primary-language data). The two-letter code for a Wikipedia language in the "List of Wikipedias" sometimes matches the ISO 639-1 code, but not always. Is there an easy way to do the mapping?
Many Thanks,
Rawia
Formerly Booz & Company
Rawia Abdel Samad
Direct: +9611985655 | Mobile: +97455153807
Email: Rawia.AbdelSamad(a)strategyand.pwc.com
www.strategyand.com
Hi Analytics,
I have a query help question. I'm trying to run the following to see how many
clicks are generated by ~3000 pages on English Wikipedia. I get 0 results
when I run this query, but that seems unlikely, as it would mean nobody clicked
on the mobile web's Barack Obama page (or any of the other pages I am
looking at) during the hour in question. I am almost sure I have a stupid
'where' clause in here that is blocking me. Any thoughts?
Thanks,
J
The table of pages I am looking at is in Hive under
jkatz.browse_prototype_page_list4.
Here is a sample:
category title url
Category:20th-century American politicians Al Gore Al_Gore
Category:20th-century American politicians Barbara Boxer Barbara_Boxer
Category:20th-century American politicians Ben Cardin Ben_Cardin
Here is my query:
SELECT * from jkatz.browse_prototype_page_list4 L
INNER JOIN (
SELECT count(*) AS num_pvs, referer, uri_path
FROM wmf.webrequest
WHERE TRUE = TRUE
AND webrequest_source = 'mobile'
AND year = 2015
AND month = 5
AND day = 25
AND hour = 1
AND agent_type <> 'spider'
AND is_pageview = TRUE
AND x_analytics_map['mf-m'] IS NULL
AND access_method = 'mobile web'
AND uri_host like '%en.wikipedia.org%'
GROUP BY
referer, uri_path
) R
on R.referer = concat('http://en.m.wikipedia.org/wiki/', L.url)
ORDER BY category
LIMIT 50;
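To help narrow it down, here is the inner subquery on its own with the host filter loosened (a sketch only, not a verified fix): note that 'en.m.wikipedia.org' does not contain the substring 'en.wikipedia.org', and that referers may arrive as https rather than http, so both of those conditions seem worth double-checking.

SELECT count(*) AS num_pvs, referer, uri_path
FROM wmf.webrequest
WHERE webrequest_source = 'mobile'
  AND year = 2015 AND month = 5 AND day = 25 AND hour = 1
  AND agent_type <> 'spider'
  AND is_pageview = TRUE
  AND x_analytics_map['mf-m'] IS NULL
  AND access_method = 'mobile web'
  -- loosened from '%en.wikipedia.org%', which does not match en.m.wikipedia.org
  AND uri_host LIKE '%wikipedia.org%'
GROUP BY referer, uri_path
LIMIT 20;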
Hi,
For the article recommendation test, we queried the user table to get editors'
email addresses. We then excluded the emails that were not verified. We've
received a comment here
<https://meta.wikimedia.org/wiki/Research_talk:Increasing_article_coverage#U…>
that suggests the user has changed his/her email address and we have
somehow retained the old one. I'd like to get to the bottom of
this problem. Can someone help with this, on the Talk page or here? Are we
looking at the wrong table? And, in general, how can old information be in
the user table?
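For context, the selection was roughly of this shape (a minimal sketch; column names follow MediaWiki's user table, where user_email_authenticated is NULL for unverified addresses, and the exact statement may have differed):

SELECT user_id, user_name, user_email
FROM user
WHERE user_email IS NOT NULL
  AND user_email != ''
  -- only addresses the user has verified
  AND user_email_authenticated IS NOT NULL;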
Sorry for sending an email over the weekend. A response on Monday would be
great. :-)
Thank you!
Leila
Switching to the public list with Bob's permission.
On Wed, Jun 24, 2015 at 1:58 PM, Robert West <robert.bob.west(a)gmail.com>
wrote:
> Hi everyone,
>
> I'd like to find all enwiki articles that were ever marked with the
> {{hoax}} template. Pages with that template mostly end up being deleted, so
> they're not available in the public revision dumps
> <https://dumps.wikimedia.org/enwiki/20150602/>.
>
> Hence my question:
> Is there a way of getting access to the full enwiki revision dump
> including all deleted pages?
> I don't know yet which deleted articles I'm interested in, but will only
> know that after having done a pass over the full revision history.
>
> I know that viewing deleted content is problematic
> <https://en.wikipedia.org/wiki/Wikipedia:Viewing_deleted_content> (hence
> I'm sending this request to this internal research list), but I signed an
> NDA and have access to data on HDFS via stat1002, so there might be a way
> for me to access that data?
>
> I'm also aware of a list of archived hoaxes
> <https://en.wikipedia.org/wiki/Wikipedia:List_of_hoaxes_on_Wikipedia>,
> but many shorter-lived hoaxes that got deleted fast are not included there.
>
> Thanks -- any pointers welcome!
> Bob
>
>
> --
> Up for a little language game? -- http://www.unfun.me
>
Hey all,
We're building a lot of tools out on Labs. From a RESTful API to a
Wikidata Query Service, we're making neat things, and Labs is proving
the perfect place to prototype them - in all but one respect.
A crucial part of these tools being not just useful but measurably
useful is having their logs available to parse. When you combine that
with the constraints on getting things onto the main cluster, what
you have is a situation where much of our beta or alpha software has
no integration with our existing data storage systems but absolutely
/needs/ it, both to verify that it's worth keeping and to provide data
about usage.
So I'm asking, I guess, two things. The first is: can we have a firm
commitment that we'll get this kind of stuff into Hadoop? Right now we
have a RESTful API that is not (to my knowledge) throwing data into
the request logs. We have a WDQS that isn't either. Undoubtedly we
have other tools I haven't encountered. It's paramount that the first
question we ask about new services or systems is "so when does new
traffic data start hitting the analytics cluster?"
Second: what are the best practices for this? What resources are available?
If I'm starting a service on Labs that provides data to third parties,
what would Analytics recommend as my easiest path to getting request
logs into Hadoop?
--
Oliver Keyes
Research Analyst
Wikimedia Foundation
Hello,
I'm inquiring about the delay in publishing the January compressed Wikistats files that are maintained by Erik Zachte. I'm guessing those processes are given a low priority compared to the content backups that need to run.
More generally, I'm interested in finding new ways that I can help out. I'm an ex-Microsoftie who is now on the fraud analytics team at TD Bank. I've been involved with the Wikimedia group in Atlanta: I organize the picnic each summer, and I helped get the rest of the historic buildings photographed. I've dabbled in reverting vandalism, and I contribute to articles when I actually have something to contribute. I don't feel like I've settled into a contributor role that really fits me yet, though.
I enjoy using a variety of the traffic data sets that Wikimedia publishes. It seems the traffic servers get bogged down sometimes though. Can I help? Should I try to get the Atlanta group to pool our donations this year for an extra computer?
Thanks,
Michael
I just posted a comment on the famous task:
https://phabricator.wikimedia.org/T44259#1341010 :)
Here it is for those who would rather discuss on this list:
We have finished analyzing the intermediate hourly aggregate with all the
columns that we think are interesting. The data is too large to query and
anonymize in real time. We'd rather get an API out faster than deal with
that problem, so we decided to produce smaller "cubes" [1] of data for
specific purposes. We have two cubes in mind and I'll explain those here.
For each cube, we're aiming to have:
* Direct access to a postgresql database in labs with the data
* API access through RESTBase
* Mondrian / Saiku access in labs for dimensional analysis
* Data will be pre-aggregated so that any single data point has k-anonymity
(we have not determined a good k yet)
* Higher level aggregations will be pre-computed so they use all data
And, the cubes are:
**stats.grok.se Cube: basic pageview data**
Hourly resolution. Will serve the same purpose as stats.grok.se has served
for so many years. The dimensions available will be:
* project - 'Project name from the request's host name'
* dialect - 'Dialect from the request's path (not set if present in the project name)'
* page_title - 'Page title from the request's path and query'
* access_method - 'Method used to access the pages; can be desktop, mobile web, or mobile app'
* is_zero - 'Whether the pages were accessed through a zero provider'
* agent_type - 'Agent accessing the pages; can be spider or user'
* referer_class - 'Can be internal, external, or unknown'
**Geo Cube: geo-coded pageview data**
Daily resolution. Will allow researchers to track the flu, breaking news,
etc. (an example query is sketched after the dimension list). Dimensions will be:
* project - 'Project name from the request's hostname'
* page_title - 'Page title from the request's path and query'
* country_code - 'ISO country code of the accessing agents (computed using the MaxMind GeoIP database)'
* province - 'State / province of the accessing agents (computed using the MaxMind GeoIP database)'
* city - 'Metro area of the accessing agents (computed using the MaxMind GeoIP database)'
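As an illustration of the kind of question the Geo Cube should answer (a sketch only: the table name, the day column, and the views measure are placeholders; the dimension names mirror the list above):

-- Daily views of 'Influenza' on English Wikipedia, broken down by country
SELECT country_code, SUM(views) AS views
FROM geo_pageviews_daily    -- placeholder table name
WHERE project = 'en.wikipedia'
  AND page_title = 'Influenza'
  AND day BETWEEN '2015-01-01' AND '2015-01-31'
GROUP BY country_code
ORDER BY views DESC;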
So, if anyone wants another cube, **now** is the time to speak up. We'll
probably add cubes later, but it may be a while.
[1] OLAP cubes: https://en.wikipedia.org/wiki/OLAP_cube
Below are a couple of examples of recent slow queries running on
analytics-store. They both use subqueries and UNION. Neither pushes the
WHERE conditions down into the subqueries, which means indexes are not
used. If you include the WHERE clauses in both the inner and outer queries
you will get index usage, especially for these sorts of eventlogging
queries on event_action or wiki.
The MySQL / MariaDB query optimizer does do constant propagation, but
UNION tends to trip it up in all sorts of exciting ways. Be verbose,
abandon the DRY principle, and make your life easier :-)
Current query:
SELECT Month.Date, COALESCE(Web.Web, 0) AS Web
FROM (
    SELECT DATE_FORMAT( ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1), '%Y-%m-%d' ) AS Date
    FROM seq_1_to_100, (SELECT @num:=-1) num
    LIMIT 30
) AS Month
LEFT JOIN (
    SELECT DATE(timestamp) AS Date, SUM(1) AS Web
    FROM (
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_5644223
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6077315
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6637866
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_7675117
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_8599025
    ) AS MobileWebEditing
    WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
    GROUP BY Date
) AS Web ON Month.Date = Web.Date;
Version with WHERE clauses pushed down to subqueries:
SELECT Month.Date, COALESCE(Web.Web, 0) AS Web
FROM (
    SELECT DATE_FORMAT( ADDDATE(CURDATE() - INTERVAL 30 - 1 DAY, @num:=@num+1), '%Y-%m-%d' ) AS Date
    FROM seq_1_to_100, (SELECT @num:=-1) num
    LIMIT 30
) AS Month
LEFT JOIN (
    SELECT DATE(timestamp) AS Date, SUM(1) AS Web
    FROM (
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_5644223
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6077315
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_6637866
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_7675117
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
        UNION
        SELECT timestamp, wiki, event_username, event_action,
               event_namespace, event_userEditCount
        FROM MobileWebEditing_8599025
        WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
    ) AS MobileWebEditing
    WHERE event_namespace != 0 AND event_action = 'success' AND wiki != 'testwiki'
    GROUP BY Date
) AS Web ON Month.Date = Web.Date;
Current query (hours):
SELECT day,
    `Attempted saves` / `Loads` AS "All wikis",
    `Attempted saves - dewiki` / `Loads - dewiki` AS "dewiki",
    `Attempted saves - enwiki` / `Loads - enwiki` AS "enwiki",
    `Attempted saves - eswiki` / `Loads - eswiki` AS "eswiki",
    `Attempted saves - frwiki` / `Loads - frwiki` AS "frwiki",
    `Attempted saves - ptwiki` / `Loads - plwiki` AS "plwiki",
    `Attempted saves - plwiki` / `Loads - ptwiki` AS "ptwiki"
FROM (
    SELECT DATE( timestamp ) AS day,
        SUM( IF( event_action = 'ready', 1, 0 ) ) AS "Loads",
        SUM( IF( event_action = 'ready' AND wiki = 'dewiki', 1, 0 ) ) AS "Loads - dewiki",
        SUM( IF( event_action = 'ready' AND wiki = 'enwiki', 1, 0 ) ) AS "Loads - enwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'eswiki', 1, 0 ) ) AS "Loads - eswiki",
        SUM( IF( event_action = 'ready' AND wiki = 'frwiki', 1, 0 ) ) AS "Loads - frwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'plwiki', 1, 0 ) ) AS "Loads - plwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'ptwiki', 1, 0 ) ) AS "Loads - ptwiki",
        SUM( IF( event_action = 'saveAttempt', 1, 0 ) ) AS "Attempted saves",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'dewiki', 1, 0 ) ) AS "Attempted saves - dewiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'enwiki', 1, 0 ) ) AS "Attempted saves - enwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'eswiki', 1, 0 ) ) AS "Attempted saves - eswiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'frwiki', 1, 0 ) ) AS "Attempted saves - frwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'plwiki', 1, 0 ) ) AS "Attempted saves - plwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'ptwiki', 1, 0 ) ) AS "Attempted saves - ptwiki"
    FROM (
        SELECT event_action, wiki, timestamp FROM Edit_10604157
        UNION ALL
        SELECT event_action, wiki, timestamp FROM Edit_10676603
    ) AS valid_edit_schema_revisions
    GROUP BY day
    ORDER BY day
) AS internalQuery;
Version with WHERE clauses pushed down to subqueries:
SELECT day,
    `Attempted saves` / `Loads` AS "All wikis",
    `Attempted saves - dewiki` / `Loads - dewiki` AS "dewiki",
    `Attempted saves - enwiki` / `Loads - enwiki` AS "enwiki",
    `Attempted saves - eswiki` / `Loads - eswiki` AS "eswiki",
    `Attempted saves - frwiki` / `Loads - frwiki` AS "frwiki",
    `Attempted saves - ptwiki` / `Loads - plwiki` AS "plwiki",
    `Attempted saves - plwiki` / `Loads - ptwiki` AS "ptwiki"
FROM (
    SELECT DATE( timestamp ) AS day,
        SUM( IF( event_action = 'ready', 1, 0 ) ) AS "Loads",
        SUM( IF( event_action = 'ready' AND wiki = 'dewiki', 1, 0 ) ) AS "Loads - dewiki",
        SUM( IF( event_action = 'ready' AND wiki = 'enwiki', 1, 0 ) ) AS "Loads - enwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'eswiki', 1, 0 ) ) AS "Loads - eswiki",
        SUM( IF( event_action = 'ready' AND wiki = 'frwiki', 1, 0 ) ) AS "Loads - frwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'plwiki', 1, 0 ) ) AS "Loads - plwiki",
        SUM( IF( event_action = 'ready' AND wiki = 'ptwiki', 1, 0 ) ) AS "Loads - ptwiki",
        SUM( IF( event_action = 'saveAttempt', 1, 0 ) ) AS "Attempted saves",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'dewiki', 1, 0 ) ) AS "Attempted saves - dewiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'enwiki', 1, 0 ) ) AS "Attempted saves - enwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'eswiki', 1, 0 ) ) AS "Attempted saves - eswiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'frwiki', 1, 0 ) ) AS "Attempted saves - frwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'plwiki', 1, 0 ) ) AS "Attempted saves - plwiki",
        SUM( IF( event_action = 'saveAttempt' AND wiki = 'ptwiki', 1, 0 ) ) AS "Attempted saves - ptwiki"
    FROM (
        SELECT event_action, wiki, timestamp
        FROM Edit_10604157
        WHERE event_action IN ('saveAttempt', 'ready')
        UNION ALL
        SELECT event_action, wiki, timestamp
        FROM Edit_10676603
        WHERE event_action IN ('saveAttempt', 'ready')
    ) AS valid_edit_schema_revisions
    GROUP BY day
    ORDER BY day
) AS internalQuery;
--
DBA @ WMF
File modifications on stat1002 haven't been reflected in
http://datasets.wikimedia.org/aggregate-datasets/search/ for ~4 hours;
the awful timing of this means it has effectively broken the search
dashboards (we were in the middle of a format switchover when it
happened).
This needs fixing, hyper-urgently; we cannot display any data without
the ability to /retrieve/ that data. For example, compare
stat1002::/a/aggregate-datasets/search/app_event_counts.tsv to
http://datasets.wikimedia.org/aggregate-datasets/search/app_event_counts.tsv
--
Oliver Keyes
Research Analyst
Wikimedia Foundation