We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770 <http://dx.doi.org/10.6084/m9.figshare.1305770>
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
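As a quick illustration of the first two use cases, here is a sketch of tallying the top referers for one article from the dataset's tab-separated dump. The column names (`prev`, `curr`, `n`) and file layout are assumptions; check the figshare description for the exact schema.

```python
from collections import Counter
import csv
import io

def top_referers(tsv_lines, article, k=5):
    """Tally the most common referers for one article.

    Assumes tab-separated rows with 'prev', 'curr' and an integer 'n'
    (pair count) column; see the figshare README for the real schema.
    """
    counts = Counter()
    for row in csv.DictReader(tsv_lines, delimiter="\t"):
        if row["curr"] == article:
            counts[row["prev"]] += int(row["n"])
    return counts.most_common(k)

# Tiny inline sample standing in for the real dump file:
sample = io.StringIO("prev\tcurr\tn\nother-google\tCat\t120\nDog\tCat\t30\n")
print(top_referers(sample, "Cat"))  # [('other-google', 120), ('Dog', 30)]
```

Summing `n` per `(prev, curr)` pair in the other direction (grouping by `prev`) gives the most frequently clicked links *from* a given article instead.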
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream <https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream>
Ellery and Dario
Hi,
I am interested to know whether Wikipedia makes public how many backlinks each page gets.
I am working on a search engine for Wikipedia, and, as you would expect, it sucks.
So I went and tested the same searches directly on Wikipedia, and no offence, they suck even more.
So I went on Google and performed the same searches with site:wikipedia.org added, and Google was a little bit better (although not much, compared with my one-day-development search engine).
I want to make my Wikipedia search better, and a table telling me how many non-Wikipedia pages point to a given Wikipedia page might improve my algorithm.
Does anyone know if Wikipedia publishes such data?
Thank you!
Edison Nica
Http://www.0pii.com
Edisonn(a)0pii.com
Sent from my T-Mobile 4G LTE Device
Hey guys,
Good afternoon. I am working on a data-mining project on Wikipedia data.
Our main question is whether the number of page views is correlated with the
number of reverted edits. In order to have a fair comparison between
different portals (People, Technology, Math, etc.), I would like to get
the 250 most viewed pages for each portal (12 in total).
I noticed that on Wikipedia there is a page where weekly page view data are
aggregated and the 5000 most viewed pages of the week are listed.
However, in order to see the relationship between the number of page views
and the number of reverted edits, I would need data covering a longer
duration (say, a month or more).
I wonder if there is any (hopefully easy) way that I can query the data
I need.
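One option for per-article counts over arbitrary date ranges may be the Wikimedia REST pageviews API (launched in late 2015). A sketch of building a request URL for its per-article endpoint; the path shape follows the public API documentation, and daily granularity is assumed:

```python
from urllib.parse import quote

def pageview_url(project, article, start, end, access="all-access"):
    """Build a per-article pageviews URL for the Wikimedia REST API.

    start/end are YYYYMMDD00 timestamps; 'user' filters out spiders/bots.
    """
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    return (f"{base}/{project}/{access}/user/"
            f"{quote(article, safe='')}/daily/{start}/{end}")

print(pageview_url("en.wikipedia", "Albert Einstein",
                   "2015100100", "2015103100"))
```

Fetching that URL returns JSON with one entry per day, which could then be summed per article and compared against revert counts from the edit history.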
Your help is highly appreciated.
Happy Thanksgiving!
Anson
Hi all,
here is the weekly look at our most important readership metrics (apologies
for the delay). Apart from the usual data, this time there is an additional
chart to illuminate how our mobile readership ratio has developed since
this spring, the iOS app retention stats are back after Apple fixed their
data, and we conclude with some inspiring quotes about climate change
awareness ;)
As laid out earlier
<https://lists.wikimedia.org/pipermail/mobile-l/2015-September/009773.html>,
the main purpose of this report is to raise awareness about how these are
developing, call out the impact of any unusual events in the preceding
week, and facilitate thinking about core metrics in general. We are still
iterating on the presentation and eventually want to create dashboards for
those that are not already available in that form. Feedback and
discussion welcome.
Now to the usual data. (All numbers below are averages for November 16-22,
2015 unless otherwise noted.)
Pageviews
Total: 540 million/day (-0.0% from the previous week)
Context (April 2015-November 2015):
( see also the Vital Signs dashboard
<https://vital-signs.wmflabs.org/#projects=all/metrics=Pageviews>)
The Analytics team improved web crawler detection further last week
<https://meta.wikimedia.org/w/index.php?title=Dashiki%3APageviewsAnnotations…>,
meaning an apparent (rather than a real) drop in recorded human pageviews
from November 19 on - presumably smaller, though, than the one for
September that we reported in the preceding report.
Desktop: 57.2% (previous week: 57.5%)
Mobile web: 41.6% (previous week: 41.3%)
Apps: 1.2% (previous week: 1.2%)
Context (April 2015-November 2015):
These percentages usually don’t change rapidly from week to week. For a
wider perspective, I’m including a chart of the (aggregate) mobile
percentage this time, too. Technically this information is already
contained in the usual chart above, but here we can see even clearer
indications of an impact of the HTTPS-only switchover during June (it
appears to have taken out mainly desktop traffic), as well as the strong
weekly periodicity (higher mobile ratio on weekends). It looks like mobile
won’t overtake desktop anytime soon.
Global North ratio: 77.3% of total pageviews (previous week: 77.6%)
Context (April 2015-November 2015):
New app installations
Android: 30.9k/day (-44.2% from the previous week)
Daily installs per device, from Google Play
Context (last month):
As described in the previous report, the Android Wikipedia app was featured
in the "New
and Updated Apps" section of the Google Play store from November 5-12, and
while the huge overall positive impact on download numbers is obvious,
downloads also decreased markedly afterwards. They seem to be coming back
up a bit now, but we are still waiting for more data before making a final
estimate of the overall effect, and have also contacted Google to see if
they can help us understand the mechanism behind it.
iOS: 4.69k/day (+2.2% from the previous week)
Download numbers from App Annie
Context (last three months):
No news here.
App user retention
Android: 14.8% (previous week: 15.2%)
(Ratio of app installs opened again 7 days after installation, among all
installed during the previous week. 1:100 sample)
Context (last three months):
iOS: 12.0% (previous week: 11.9%)
(Ratio of app installs opened again 7 days after installation, among all
installed during the previous week. From iTunes Connect, opt-in only = ca.
20-30% of all users)
Context (installation dates from October 18-November 15, 2015):
This metric was left out of last week’s report because of inconsistencies.
Indeed, Apple has since issued a correction notice
<http://www.talkingnewmedia.com/2015/11/24/apple-issues-corrected-itunes-con…>.
Unfortunately it looks like the data underlying the report for the week
until November 8 was affected too, so please disregard the iOS retention
figure given in that report.
Unique app users
Android: 1.190 million / day (-2.2% from the previous week)
Context (last three months):
This too will need another look.
iOS: 281k / day (+0.1% from the previous week)
Context (last three months):
No news here.
After publishing this report regularly for a bit over two months, we may be
rethinking the weekly publication schedule a little - also to balance
newsworthiness against maintaining general awareness of long-term
developments. In that vein, some inspiring quotes about a weekly
climate change newsletter
<http://www.niemanlab.org/2015/11/climate-change-is-depressing-and-horrible-…>
that begins every issue by reciting the current CO2 ratio in the atmosphere
as a KPI ;)
Ultimately, Meyer said, the newsletter comes out of the idea that “if
you’re worried about something, you should pay regular attention to it.”
“By paying attention to it over time, and watching its texture change over
time, you will come to have ideas about it,” he said. “You will come to
understand it in a new way, and you will contribute in a very small way to
how society addresses this big problem.”
[...]
So it seemed as if a newsletter might be a good way to cover the issue.
[...] “You can get a continuity of storyline,” Meyer said. “You can’t cover
all of everything that’s happening every week in the climate, but you can
watch certain parts develop, and hopefully bring people in over time.” He
leads off the “Macro Trends” section of each issue with the molecules per
million of carbon dioxide in the atmosphere:
The atmosphere is filling with greenhouse gases. The Mauna Loa Observatory
measured an average of 398.51 CO2 molecules per million in the atmosphere
this week. A year ago, it measured 395.84 ppm. Ten years ago, it measured
376.93 ppm.
“What we’re doing now won’t show up in that number for a decade or so,” he
said. “But by reminding myself of it every week, and thinking about its
contours and its direction, that’s a way to stay focused on what matters.”
----
For reference, the queries and source links used are listed below (access
is needed for each). Most of the above charts are also available on Commons
<https://commons.wikimedia.org/w/index.php?title=Special:ListFiles&offset=20…>.
hive (wmf)> SELECT SUM(view_count)/7000000 AS avg_daily_views_millions
    FROM wmf.projectview_hourly
    WHERE agent_type = 'user'
      AND CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
          BETWEEN "2015-11-16" AND "2015-11-22";
hive (wmf)> SELECT year, month, day,
      CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")) AS date,
      SUM(IF(access_method <> 'desktop', view_count, null)) AS mobileviews,
      SUM(view_count) AS allviews
    FROM wmf.projectview_hourly
    WHERE year = 2015 AND agent_type = 'user'
    GROUP BY year, month, day
    ORDER BY year, month, day
    LIMIT 1000;
hive (wmf)> SELECT access_method, SUM(view_count)/7
    FROM wmf.projectview_hourly
    WHERE agent_type = 'user'
      AND CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
          BETWEEN "2015-11-16" AND "2015-11-22"
    GROUP BY access_method;
hive (wmf)> SELECT SUM(IF(FIND_IN_SET(country_code,
      'AD,AL,AT,AX,BA,BE,BG,CH,CY,CZ,DE,DK,EE,ES,FI,FO,FR,FX,GB,GG,GI,GL,GR,HR,HU,IE,IL,IM,IS,IT,JE,LI,LU,LV,MC,MD,ME,MK,MT,NL,NO,PL,PT,RO,RS,RU,SE,SI,SJ,SK,SM,TR,VA,AU,CA,HK,MO,NZ,JP,SG,KR,TW,US')
      > 0, view_count, 0))/SUM(view_count)
    FROM wmf.projectview_hourly
    WHERE agent_type = 'user'
      AND CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
          BETWEEN "2015-11-16" AND "2015-11-22";
hive (wmf)> SELECT year, month, day,
      CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0")),
      SUM(view_count) AS `all`,
      SUM(IF(FIND_IN_SET(country_code,
      'AD,AL,AT,AX,BA,BE,BG,CH,CY,CZ,DE,DK,EE,ES,FI,FO,FR,FX,GB,GG,GI,GL,GR,HR,HU,IE,IL,IM,IS,IT,JE,LI,LU,LV,MC,MD,ME,MK,MT,NL,NO,PL,PT,RO,RS,RU,SE,SI,SJ,SK,SM,TR,VA,AU,CA,HK,MO,NZ,JP,SG,KR,TW,US')
      > 0, view_count, 0)) AS Global_North_views
    FROM wmf.projectview_hourly
    WHERE year = 2015 AND agent_type = 'user'
    GROUP BY year, month, day
    ORDER BY year, month, day
    LIMIT 1000;
https://console.developers.google.com/storage/browser/pubsite_prod_rev_0281…
(“overview”)
https://www.appannie.com/dashboard/252257/item/324715238/downloads/?breakdo…
(select “Total”)
SELECT LEFT(timestamp, 8) AS date,
    SUM(IF(event_appInstallAgeDays = 0, 1, 0)) AS day0_active,
    SUM(IF(event_appInstallAgeDays = 7, 1, 0)) AS day7_active
FROM log.MobileWikiAppDailyStats_12637385
WHERE timestamp LIKE '201511%'
  AND userAgent LIKE '%-r-%'
  AND userAgent NOT LIKE '%Googlebot%'
GROUP BY date
ORDER BY date;
(with the retention rate calculated as day7_active divided by day0_active
from seven days earlier, of course)
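The retention calculation described in that parenthesis can be sketched as follows; the per-date dictionaries stand in for the query's output, and the helper name is hypothetical:

```python
def retention(day0_by_date, day7_by_date, date, installed_date):
    """7-day retention: installs from `installed_date` that were opened
    again on `date` (seven days later), as a fraction of day-0 actives."""
    return day7_by_date[date] / day0_by_date[installed_date]

# Illustrative numbers only, not real data:
day0 = {"20151109": 2000}
day7 = {"20151116": 300}
print(retention(day0, day7, "20151116", "20151109"))  # 0.15
```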
https://analytics.itunes.apple.com/#/retention?app=324715238
hive (wmf)> SELECT SUM(IF(platform = 'Android',unique_count,0))/7 AS
avg_Android_DAU_last_week, SUM(IF(platform = 'iOS',unique_count,0))/7 AS
avg_iOS_DAU_last_week FROM wmf.mobile_apps_uniques_daily WHERE
CONCAT(year,LPAD(month,2,"0"),LPAD(day,2,"0")) BETWEEN 20151116 AND
20151122;
hive (wmf)> SELECT CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
as date, unique_count AS Android_DAU FROM wmf.mobile_apps_uniques_daily
WHERE platform = 'Android';
hive (wmf)> SELECT CONCAT(year,"-",LPAD(month,2,"0"),"-",LPAD(day,2,"0"))
as date, unique_count AS iOS_DAU FROM wmf.mobile_apps_uniques_daily WHERE
platform = 'iOS';
--
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB
Always worried about running queries on a lagged replica on Labs? Not anymore!
While Betacommand's tool [0] was very useful, it was also very inaccurate,
as it tried to estimate the lag by looking at the last rows updated, which
can be a long time ago on the least popular wikis.
What I offer now is sub-second-accurate lag measurement: the current time,
in microseconds, is written on the production masters every 0.5 seconds and
made available on all hosts (using this tool [1]). This is more accurate
than SHOW SLAVE STATUS because it compares against the original master, and
it will work even if replication is broken.
To read it, just do SELECT * FROM heartbeat_p.heartbeat;
And you will get:
+-------+----------------------------+------+
| shard | last_updated | lag |
+-------+----------------------------+------+
| s6 | 2015-11-25T20:20:32.000980 | 0 |
| s2 | 2015-11-25T20:20:32.001030 | 0 |
| s7 | 2015-11-25T20:20:32.001070 | 0 |
| s3 | 2015-11-25T20:20:32.001000 | 0 |
| s4 | 2015-11-25T20:20:32.000920 | 0 |
| s1 | 2015-11-25T20:20:32.000740 | 0 |
| s5 | 2015-11-25T20:20:32.000830 | 0 |
+-------+----------------------------+------+
Read the detailed documentation on: [2]
Use it, or create a web page if you want to make it public! File a ticket if
the lag gets too high! File a ticket if you need more info (a record per
wiki?). But I wanted to give you the essentials; you can build on top of
that yourselves.
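For example, here is a minimal sketch of building on that table from a tool, e.g. to warn when lag crosses a threshold. The row shape follows the example output above; the connection code that would produce the rows is omitted, and the function name is hypothetical:

```python
def check_lag(rows, threshold_seconds=5):
    """Given (shard, last_updated, lag) rows as returned by
    SELECT * FROM heartbeat_p.heartbeat, return the shards
    whose replication lag exceeds the threshold."""
    return [shard for shard, _, lag in rows if lag > threshold_seconds]

# Illustrative rows, mimicking the table above:
rows = [("s1", "2015-11-25T20:20:32.000740", 0),
        ("s2", "2015-11-25T20:20:32.001030", 12)]
print(check_lag(rows))  # ['s2']
```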
Only 2 known bugs:
- There is microsecond accuracy, but it cannot be used until a bug in
MariaDB is fixed [3]
- enwiki will only report s1 lag until that server is restarted due to some
existing filters. We will schedule that at some time in the future.
[0]<http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag>
[1]<https://www.percona.com/doc/percona-toolkit/2.2/pt-heartbeat.html>
[2]<
https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag>
[3]<https://mariadb.atlassian.net/browse/MDEV-9175>
--
Jaime Crespo
<http://wikimedia.org>
We are now working on the "Cases" page of the draft Code of conduct.
This will become a separate page (for readability of the final CoC), but
is being drafted on the same page as the rest.
This includes both the intro section and all the sub-sections, i.e.
everything that starts with "2." in the ToC. Currently these are
"Handling reports", "Responses and resolutions", and "Appealing a
resolution". However, the sections within "Cases" may change:
* Section:
https://www.mediawiki.org/wiki/Code_of_Conduct/Draft#Page:_Code_of_conduct_…
* Talk:
https://www.mediawiki.org/wiki/Talk:Code_of_Conduct/Draft#Finishing_the_Cas…
* Alternatively, you can provide anonymous feedback to
conduct-discussion(a)wikimedia.org .
This is the best time to make any necessary changes to this page (and
explain why, in edit summaries and/or talk) and discuss it on the talk page.
Other updates:
* The text of the "Report a problem" section has been frozen. Thanks to
everyone who helped discuss and edit these sections. Participation
(including both named and anonymous) helped us improve the
confidentiality line.
Thanks,
Matt Flaschen
db1046 - probably better known to you as m4-master, the master of
analytics-slave, and the place where the eventlogging database gets
written - is soon going to run out of disk space.
It has 140 GB free and consumes around 40 GB per week right now. I have
observed that this rate has accelerated in recent weeks, having been a lot
slower previously. Maybe it is time to evaluate whether keeping older data
live is worth it, or whether some older tables could be archived or deleted.
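A back-of-the-envelope estimate of the remaining headroom, using the 140 GB free and ~40 GB/week figures above (and assuming the growth rate stays roughly constant):

```python
free_gb = 140            # free disk space reported above
growth_gb_per_week = 40  # current weekly consumption

weeks_left = free_gb / growth_gb_per_week
print(f"~{weeks_left:.1f} weeks until the disk fills")  # ~3.5 weeks
```

In other words, on the order of a month, which is why archiving or deleting older tables is worth discussing now.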
Please coordinate actionables here: <
https://phabricator.wikimedia.org/T119380>
Regards,
--
Jaime Crespo
<http://wikimedia.org>