After exchanging some emails with Erik Zachte, I learned that there is a
new platform in progress that could enable Wikimedia to share referrer
strings from browser requests, provided that they originate from within
Wikipedia. I think this shouldn't raise any privacy concerns. I wonder
when this platform will be available, and whether there is any plan to
release such data to outside researchers once it is ready.
I also think that Wikipedia itself could use this data to suggest
articles automatically, without editor involvement, which would make
navigation a bit easier, especially for articles the reader is not
very familiar with.
Thanks
Greetings,
I am looking to do some year-end statistical summaries. I am aware of
the over-reporting incident involving CentralAuth between August and
December 2013:
https://docs.google.com/document/d/1kpJrfataS5KAxGXFoygQVhMlzFftjsvX9HktSAA…
I know that Erik fixed/re-generated the files and fixed the numbers on
the "wikistats" report card, but were the corrected "projectcount" files
ever dumped anywhere? I'd like to re-run these through my system.
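For re-running year-end summaries of this kind, a minimal sketch of summing hourly projectcount-style files into per-project totals. The whitespace-separated `project count` field layout assumed here is an illustration, not a documented guarantee; check the actual files before relying on it:

```python
from collections import defaultdict

def aggregate_projectcounts(lines):
    """Sum view counts per project from projectcount-style lines.

    Assumes each line is whitespace-separated, with the project
    identifier in the first field and the view count in the second
    (an assumption about the layout, not a documented format).
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed or short lines
        project, count = fields[0], fields[1]
        if count.isdigit():
            totals[project] += int(count)
    return dict(totals)

# Example with synthetic data (not real traffic numbers):
hourly = ["en 1200 0", "de 300 0", "en 800 0"]
print(aggregate_projectcounts(hourly))  # {'en': 2000, 'de': 300}
```

Feeding every hourly file for a year through this would give the per-project annual totals, once the corrected files are located.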
--
Andrew G. West, PhD
Research Scientist
Verisign Labs - Reston, VA
Hi all!
I just finished doing a few very rough unscientific comparisons of data sizes and hive query times between uncompressed and snappy compressed webrequest data stored in HDFS.
Check it!
https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Hive
Quick summary: Snappy compresses webrequest JSON data to about 25% of its original size. Query times on small datasets (~1 hour) are doubled, but query times on larger datasets are only slightly increased.
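To put the quoted ratio in concrete terms, a quick back-of-the-envelope sketch. Only the ~25% ratio comes from the measurements above; the 4 TB dataset size is a made-up example:

```python
def snappy_storage_estimate(uncompressed_bytes, ratio=0.25):
    """Estimate compressed size and savings, given the ~25%
    ratio observed for Snappy on webrequest JSON data."""
    compressed = uncompressed_bytes * ratio
    saved = uncompressed_bytes - compressed
    return compressed, saved

# e.g. a hypothetical 4 TB of raw webrequest JSON:
compressed, saved = snappy_storage_estimate(4 * 1024**4)
print(f"compressed: {compressed / 1024**4:.1f} TB, "
      f"saved: {saved / 1024**4:.1f} TB")
# compressed: 1.0 TB, saved: 3.0 TB
```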
The Camus Snappy compression (which was merged upstream at LinkedIn this week) is working great. I'll start using it exclusively for webrequest imports soon.
-Ao
Hi,
on 2014-01-05 ~03:39 UTC gadolinium went down.
Gadolinium (https://wikitech.wikimedia.org/wiki/Gadolinium) acts as
* udp2log relay for anything but emery [1].
* eventlogging relay to vanadium.
As gadolinium is a SPOF for those services, they stopped receiving
data when the machine went down.
Gadolinium has been brought up again, and since 2014-01-06 ~17:45 UTC,
services should be getting data again.
Webstatscollector is producing good hourly files again on
dumps.wikimedia.org.
Data for the tsvs is being collected again.
The gzipped tsvs should show up again tomorrow in the usual places.
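Given the outage window above, a small sketch of which hourly slots to treat as suspect when re-checking the data. The timestamps come from this report; how the hours map to filenames is left out, since the naming scheme is not specified here:

```python
from datetime import datetime, timedelta

def hours_in_window(start, end):
    """Yield each hour boundary from start (rounded down) to end."""
    t = start.replace(minute=0, second=0, microsecond=0)
    while t <= end:
        yield t
        t += timedelta(hours=1)

# Outage window reported for gadolinium (UTC):
down = datetime(2014, 1, 5, 3, 39)
up = datetime(2014, 1, 6, 17, 45)

suspect = list(hours_in_window(down, up))
print(len(suspect), "hourly slots to double-check,")
print("from", suspect[0], "to", suspect[-1])
```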
For eventlogging, I am still awaiting confirmation that it was
affected and that it is seeing good data again.
Best regards,
Christian
[1] So the
* 5xx tsvs
* mobile-sampled-100 tsvs
* zero tsvs
* edits tsvs
* [...]
are affected, while only the
* sampled-1000 tsvs
* api-usage tsvs
* glam-nara tsvs
* teahouse tsvs
* arabic-banner tsvs
* missing-wiki tsvs
are not affected.
--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
Homepage: http://quelltextlich.at/
---------------------------------------------------------------
OpenPGP key transition from 0xEF78CCDE to 0x13C1072F:
http://quelltextlich.at/openpgp-transition-0xEF78CCDE-to-0x13C1072F.txt