After exchanging some emails with Erik Zachte, I learned that a new
platform is in progress that would enable Wikimedia to share referral
strings from browser requests, provided that they originate from within
Wikipedia. I think this shouldn't raise any privacy concerns. I wonder when
this platform will be available, and whether there is any plan to release
such data to outside researchers once it is ready.
I also think that Wikipedia itself could use this data to suggest articles
automatically, without editors' involvement, which could make
navigation a bit easier, especially for articles on topics the reader is
not very familiar with.
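To make the suggestion idea concrete, here is a minimal sketch in Python. It assumes the referral data can be reduced to (source article, target article) pairs. That format, the function name build_suggestions, and the sample titles are all hypothetical, since I don't know what the new platform will actually expose.

```python
from collections import Counter, defaultdict


def build_suggestions(referrals, top_n=3):
    """Map each source article to the targets readers most often follow.

    `referrals` is an iterable of (source, target) title pairs. This pair
    format is an assumption, not the real schema of the referral data.
    """
    counts = defaultdict(Counter)
    for source, target in referrals:
        if source != target:  # ignore reloads / self-referrals
            counts[source][target] += 1
    return {
        source: [title for title, _ in targets.most_common(top_n)]
        for source, targets in counts.items()
    }


# Hypothetical sample data, for illustration only.
sample_referrals = [
    ("Hadoop", "Apache Hive"),
    ("Hadoop", "Apache Hive"),
    ("Hadoop", "MapReduce"),
    ("Apache Hive", "Hadoop"),
    ("Hadoop", "Hadoop"),
]

suggestions = build_suggestions(sample_referrals, top_n=2)
print(suggestions["Hadoop"])
```

Raw counts favour links that are already prominent, so a real system would probably need to normalise by page views or by the link's position on the page.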
I am looking to do some year-end statistical summaries. I am aware of
the over-reporting incident involving CentralAuth between August and
I know that Erik fixed/re-generated the files and fixed the numbers on
the "wikistats" report card, but were the corrected "projectcount" files
ever dumped anywhere? I'd like to re-run these through my system.
Andrew G. West, PhD
Verisign Labs - Reston, VA
I just finished doing a few very rough, unscientific comparisons of data sizes and Hive query times between uncompressed and Snappy-compressed webrequest data stored in HDFS.
Quick summary: Snappy compresses webrequest JSON data to about 25% of its original size. Query times on small datasets (~1 hour) are doubled, but query times on larger datasets are only slightly increased.
The Camus Snappy compression (which was merged upstream at LinkedIn this week) is working great. I'll start using it exclusively for webrequest imports soon.
On 2014-01-05 ~03:39 UTC gadolinium went down.
Gadolinium acts as
* udp2log relay for anything but emery.
* eventlogging relay to vanadium.
As gadolinium is a SPOF for those services, they stopped receiving
data when the machine went down.
Gadolinium has been brought up again, and since 2014-01-06 ~17:45 UTC,
services should be getting data again.
Webstatscollector is producing good hourly files again on
Data for the tsvs gets collected again.
The gzipped tsvs should show up again tomorrow in the usual places.
For eventlogging, I am still awaiting confirmation that it was
affected and that it is seeing good data again.
are affected, while only the
are not affected.
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Gruendbergstrasze 65a Email: christian(a)quelltextlich.at
4040 Linz, Austria Phone: +43 732 / 26 95 63
Fax: +43 732 / 26 95 63
OpenPGP key transition from 0xEF78CCDE to 0x13C1072F: