The Wikimedia Developer Summit starts this Monday, Jan. 4!
There will be an information and discussion session about the
in-progress Code of Conduct for technical spaces
(https://www.mediawiki.org/wiki/Code_of_Conduct/Draft) on Monday.
Thanks,
Matt Flaschen
Dear wiki analytics team
I am looking at your pagecounts as archived on
https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-12/
Can you tell me what timezone the timestamps are in?
The filename pagecounts-20151224-070000 indicates pagecounts from 7 AM to
8 AM. In what geographical timezone is that 7 AM to 8 AM period:
GMT, or UTC plus or minus how many hours?
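For concreteness, this is roughly how I am parsing these filenames at the
moment, assuming (as I suspect, but would like confirmed) that the
timestamps are UTC; the helper is my own, not part of any Wikimedia tooling:

from datetime import datetime, timezone

def parse_pagecounts_timestamp(filename):
    # e.g. "pagecounts-20151224-070000" -> 2015-12-24 07:00:00+00:00
    stamp = filename.split('pagecounts-')[1]
    dt = datetime.strptime(stamp, '%Y%m%d-%H%M%S')
    # Assumption to be confirmed: dump timestamps are UTC.
    return dt.replace(tzinfo=timezone.utc)

print(parse_pagecounts_timestamp('pagecounts-20151224-070000'))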
thanks very much
Maurice Vergeer
--
________________________________________________
Maurice Vergeer
To contact me, see http://mauricevergeer.nl/node/5
To see my publications, see http://mauricevergeer.nl/node/1
________________________________________________
Team:
As part of our effort to convert the EventLogging MySQL database to the
TokuDB engine, we need to stop EventLogging events from flowing into the
MobileWikiAppShareAFact table. We are using this one table to gauge how
long the conversion will take, in order to plan for a larger outage window.
Let us know if the data should be backfilled, as it can be; we anticipate
events will not flow into the table for the better part of a day.
Thanks,
Nuria
Dear All:
Maybe this question is a bit too simple, but I did not immediately
find the answer in the docs.
How does the API differentiate between the two agent types, spider and bot?
I'm asking because for some articles there seems to be no bot traffic at
all, including the Main Page in August:
https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedi…
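For reference, this is roughly how I am querying each agent type (the
article and dates are just an example; the endpoint layout follows the
API docs):

import requests

URL = ('https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
       'en.wikipedia/all-access/{agent}/Main_Page/daily/20150801/20150831')

for agent in ('spider', 'bot'):
    resp = requests.get(URL.format(agent=agent))
    # A 200 with an "items" list means data exists for that agent type;
    # a 404 appears to mean no rows at all for the period.
    print(agent, resp.status_code)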
---
Another, unrelated question:
As I recall, I read somewhere that the data available via the API
dates back to May 2015. However, when running queries today,
the API only returned data starting on August 1, 2015. Is that correct?
Best,
Felix
Hi Dan,
The aim of our project is to determine whether we can establish a prediction
technique for high-impact (not high-risk) species before they enter Australia
and New Zealand. We are using data for species across 18 industries that have
already entered and are classified as high- or low-impact (at this stage
excluding moderate-impact species). Monthly pageviews going back further than
May 2015 would be useful (or even daily pageviews, but monthly would suffice).
I should be able to use this response for my analysis. Pageview data may only
show us high-risk pest species, but it is well worth investigating for us.
Unfortunately I'm not familiar with the methods used to access the older data,
but the links you have all sent me will be useful, and if I decide I need more
data, I can try those methods myself.
Thank you all for your help! I should be okay from here.
Cheers,
Caitlin
-----Original Message-----
From: Analytics [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of analytics-request(a)lists.wikimedia.org
Sent: Saturday, 19 December 2015 5:46 AM
To: analytics(a)lists.wikimedia.org
Subject: Analytics Digest, Vol 46, Issue 38
Send Analytics mailing list submissions to
analytics(a)lists.wikimedia.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.wikimedia.org/mailman/listinfo/analytics
or, via email, send a message with subject or body 'help' to
analytics-request(a)lists.wikimedia.org
You can reach the person managing the list at
analytics-owner(a)lists.wikimedia.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of Analytics digest..."
Today's Topics:
1. Re: Data collection (Dan Andreescu)
----------------------------------------------------------------------
Message: 1
Date: Tue, 15 Dec 2015 08:50:11 -0500
From: Dan Andreescu <dandreescu(a)wikimedia.org>
To: "A mailing list for the Analytics Team at WMF and everybody who
has an interest in Wikipedia and analytics."
<analytics(a)lists.wikimedia.org>
Subject: Re: [Analytics] Data collection
Message-ID:
<CA+aepCRqZ9YwHPCdo-1F-2uF-Vy3OiBZj=PjCD-MturFy0qyVA(a)mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Caitlin,
Using the Python client for the Pageview API
(https://github.com/mediawiki-utilities/python-mwviews), you could do:

from mwviews.api import PageviewsClient

p = PageviewsClient()
p.article_views(
    'en.wikipedia',
    ['Abacarus_hystrix', 'Acarus_siro', 'Aceria_tosichella',
     'Acyrthosiphon_pisum', 'Ahasverus_advena', 'Anthrenus_flavipes',
     'Aphis_craccivora', 'Arhopalus', 'Balaustium_medicagoense',
     'Bemisia_tabaci', 'Brevicoryne_brassicae', 'Bruchus',
     'Ceratitis_capitata', 'Cicadulina', 'Cryptolestes',
     'Daktulosphaira_vitifoliae', 'Delia', 'Ephestia_elutella',
     'Ephestia_kuehniella', 'Etiella_behrii', 'Frankliniella_occidentalis',
     'Frankliniella', 'Henosepilachna_vigintioctopunctata',
     'Heteronychus_arator', 'Lachesilla_quercus', 'Lasioderma_serricorne',
     'Liposcelis_bostrychophila', 'Macrosiphum_euphorbiae',
     'Marchalina_hellenica', 'Myzus_persicae', 'Naupactus',
     'Nezara_viridula', 'Oligonychus_ununguis', 'Oryzaephilus_surinamensis',
     'Panonychus_ulmi', 'Penthaleus', 'Pieris_rapae', 'Piezodorus',
     'Plodia_interpunctella', 'Plutella_xylostella', 'Rhopalosiphon',
     'rhopalosiphum_maidis', 'Rhopalosiphum_padi', 'Rhyzopertha_dominica',
     'Sirex_noctilio', 'Sitophilus_granarius', 'Sitophilus_oryzae',
     'Sitotroga_cerealella', 'Sminthurus_viridis', 'Spodoptera_exempta',
     'Stegobium_paniceum', 'Tetranychus', 'Thrips_palmi', 'Thrips',
     'Tribolium_castaneum', 'Tribolium_confusum', 'Trogoderma_granarium',
     'Trogoderma'],
    start='20150501')
Some of the articles in your list don't exist on en.wikipedia (like
'Frankliniella'), but for the ones that do, this returns the views as far
back as we have them. When we're done backfilling the API we'll have data
back to May 2015, but right now it only goes back to August. If you need
data from further back, you have to parse the dumps, as others have said.
I'm curious why you need the older data; it's interesting to us as we try
to figure out what else to expose through the API. Would monthly pageviews
be just as good?
I attached the result of that query here in JSON format.
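If monthly granularity would work, here is a quick sketch of rolling the
daily output up into months; it assumes article_views returns a mapping of
date to per-article counts (as the client's README describes), and the
helper itself is just an illustration, not part of the library:

from collections import defaultdict

def monthly_totals(daily):
    # daily: {datetime: {article: views-or-None}} from article_views
    totals = defaultdict(lambda: defaultdict(int))
    for day, counts in daily.items():
        for article, views in counts.items():
            totals[day.strftime('%Y-%m')][article] += views or 0
    return totals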
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pageviews.json
Type: application/json
Size: 460113 bytes
Desc: not available
URL: <https://lists.wikimedia.org/pipermail/analytics/attachments/20151215/3ee3dd…>
------------------------------
Subject: Digest Footer
_______________________________________________
Analytics mailing list
Analytics(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
------------------------------
End of Analytics Digest, Vol 46, Issue 38
*****************************************
Hi Analytics,
Yesterday, Dec 15, during the course of one hour (17:00 to 18:00 UTC), there
was an irrecoverable raw_webrequest data loss of ~30%: 25.6% (misc), 19.5%
(mobile), 19.1% (text), and 39.1% (upload). This represents around 1% of the
data for that day.
The loss was due to the enabling of IPsec, which encrypts varnishkafka
traffic between caches in remote datacenters and the Kafka brokers in
eqiad. For a period of about 40 minutes, no webrequest logs from
remote datacenters were successfully produced to Kafka.
Here's the outage note:
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest#Changes_and_k…
Sorry for the inconvenience.
--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
Good afternoon,
I am a student at Berkeley and I am using the raw Wikipedia pagecounts data for a project.
I had one quick question: in the names of the files available for download, are you using London time?
Thank you
Pierre Alexis Panel
Sent from my iPhone
Dear all,
First and foremost, thanks for making the Wikimedia Pageviews API
available; your work is highly appreciated and super useful! As a
modest "thank you", I am happy to release the JavaScript client
library pageviews.js for Node.js and the browser to make working with
this API easy for JavaScript developers. Please find the code and all
instructions at [1]. The library adds some convenience functions
(getting batch pageviews and limiting the number of results) that were
inspired by Dan Andreescu's Python library [2] and is Promise-based:
===
var pageviews = require('pageviews');

// Getting pageviews for a single article
pageviews.getPerArticlePageviews({
  article: 'Berlin',
  project: 'en.wikipedia',
  start: '20151201',
  end: '20151202'
}).then(function(result) {
  console.log(JSON.stringify(result, null, 2));
}).catch(function(error) {
  console.log(error);
});

// Getting top-n items ranked by pageviews for multiple projects
pageviews.getTopPageviews({
  projects: ['en.wikipedia', 'de.wikipedia'], // Plural
  year: '2015',
  month: '12',
  day: '01',
  limit: 2 // Limit to the first n results
}).then(function(result) {
  console.log(JSON.stringify(result, null, 2));
}).catch(function(error) {
  console.log(error);
});
===
On a more technical note: trying to be a good citizen [3], the client
library sets an identifying User-Agent header in Node.js mode.
However, setting the corresponding X-User-Agent header (note the "X-")
from a browser context fails (XMLHttpRequest cannot override the
browser's intrinsic User-Agent header for security reasons) with the
error message "Request header field X-User-Agent is not allowed by
Access-Control-Allow-Headers in preflight response". Maybe you could
change your CORS settings to include X-User-Agent in your
Access-Control-Allow-Headers?
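For what it's worth, the preflight can be reproduced outside the browser;
this sketch (in Python just to keep it self-contained; the endpoint is only
an example from the docs) prints which request headers the API currently
allows:

import requests

resp = requests.options(
    'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/'
    'en.wikipedia/all-access/all-agents/Berlin/daily/20151201/20151202',
    headers={
        'Origin': 'https://example.org',
        'Access-Control-Request-Method': 'GET',
        'Access-Control-Request-Headers': 'X-User-Agent',
    })
# The CORS preflight response advertises the allowed request headers.
print(resp.headers.get('Access-Control-Allow-Headers'))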
Hope this is useful.
Thanks,
Tom
--
[1] pageviews.js: https://github.com/tomayac/pageviews.js
[2] python-mwviews: https://github.com/mediawiki-utilities/python-mwviews
[3] User-Agent requirement: https://wikimedia.org/api/rest_v1/?doc
--
Dr. Thomas Steiner, Employee (blog.tomayac.com, twitter.com/tomayac)
Google Germany GmbH, ABC-Str. 19, 20354 Hamburg
Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.29 (GNU/Linux)
iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DCharaCTersAttH3b0ttom.hTtP5://xKcd.c0m/1181/
-----END PGP SIGNATURE-----