Hi all,
I'm a researcher at the Berkman Center for Internet & Society at Harvard
doing some work on anomaly detection against Wikipedia article request
volumes.
I'd like to create time series of request volumes for as many
article-country pairs as is possible. I'm using a number of different
data sets, but the most useful for our purposes is the pageview_hourly
table. I understand that this is a tremendous amount of data, and we are
in the process of prioritizing the article-country pairs, but my
question is: what is the best/fastest way to query this data from hive?
Writing a query that gets at the data is not a problem, but I'm curious
about possible strategies that could speed up the process.
Here is a reference query that shows the kind of data I'm looking for:
SELECT view_count FROM pageview_hourly WHERE
year = 2015 AND
month = 1 AND
page_title = 'World_War_II' AND
country_code = 'CA' AND
agent_type = 'user'
ORDER BY day, hour;
A couple options that come to mind:
* a year > 0 query vs many yearly, monthly, daily, or hourly queries
* batching articles with page_title IN (...)
* dropping country_code to get all countries at once (or batch like
above)
* ordering posthoc to avoid the map-reduce overhead
Because there's so much data and a query like the above takes ~10
minutes, experimenting with these is a long process. I was hoping
someone more familiar could share any magic that might speed things up
(or tell me there's no magic bullet and everything will take about as
long as everything else). If no one can say quickly off the top of their
head, I can just do that experimentation, but more options to try are
totally welcome.
Thanks,
Justin
Hi,
Do we have a more detailed view of % of clients accessing Wikimedia sites?
What I can currently find is:
https://browser-reports.wmflabs.org/#all-sites-by-browser
It doesn't give me a detailed enough answer to the question: "Which mobile
browsers do people use to read Wikipedia". The built-in browser of an
Android-based phone from 2012 is quite different from the one 2016, but if
I read that chart correctly, it shows both as just "Android".
--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
“We're living in pieces,
I want to live in peace.” – T. Moore
Hi,
Admin stuff: We are very aware of the sensitivity of the data. We've got
a group working on improving privacy protection in data sharing
(http://privacytools.seas.harvard.edu/), which is a tacit
acknowledgement that data sharing as it exists often doesn't protect
privacy. With that said, we've designed our workflow so that all the
analysis happens on Wikimedia's servers, and everyone that needs access
to that analysis right now has access to stat1002, so for the
foreseeable future everything is staying at Wikimedia. I think it's a
bit premature to talk about formats for publishing because we don't know
what our findings are going to look like and many of these sensitivities
depend on the social, political, numerical, etc. contexts. Once we reach
that stage we'll absolutely make sure everyone is comfortable with what
we'd be publishing.
Tech stuff: This is incredibly helpful - thank you both very much. This
is exactly what I was looking for, and way easier than starting from
scratch.
- Justin
Hi all!
As part of https://phabricator.wikimedia.org/T130840, we need to schedule a
short downtime for Hive and Oozie. I would like to proceed with this
tomorrow if there are no objections.
I’d like to schedule this downtime for an hour starting at 14:45 UTC (10:45
EST, 07:45 PST) Wednesday April 20th. The downtime will likely be less
than an hour, but I’m blocking out an hour just in case.
Sorry for the short notice! If this is going to cause anyone trouble let
us know and we will reschedule with more notice.
Thanks all!
-Andrew
Hi, thanks for the great work.
I have a question about data files available at
https://dumps.wikimedia.org/other/pageviews/
As I understood, in each line, the third column is number of pageviews
and the fourth one is its total traffic. But the fourth column is always
zero! Did I make a mistake in understanding the meaning of these
numbers?
For example in line 346 of file
https://dumps.wikimedia.org/other/pageviews/2016/2016-04/projectviews-20160…
we see "en - 4271631 0" which means there were 4271631 views for that
project, with zero traffic!
Thanks in advance,
Hassan
Hello!
The analytics team is happy to announce that the Unique Devices data is now
available to be queried programmatically via an API.
This means that getting the daily number of unique devices [1] for English
Wikipedia for the month of February 2016, for all sites (desktop and
mobile) is as easy as launching this query:
https://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia.org/a…
You can get started by taking a look at our docs:
https://wikitech.wikimedia.org/wiki/Analytics/Unique_Devices#Quick_Start
If you are not familiar with the Unique Devices data the main thing you
need to know is that
is a good proxy metric to measure Unique Users, more info below.
Since 2009, the Wikimedia Foundation used comScore to report data about
unique web visitors. In January 2016, however, we decided to stop
reporting comScore numbers [2] because of certain limitations in the
methodology, these limitations translated into misreported mobile usage. We
are now ready to replace comscore numbers with the Unique Devices Dataset .
While unique devices does not equal unique visitors, it is a good proxy for
that metric, meaning that a major increase in the number of unique devices
is likely to come from an increase in distinct users. We understand that
counting uniques raises fairly big privacy concerns and we use a very
private conscious way to count unique devices, it does not include any
cookie by which your browser history can be tracked [3].
[1] https://meta.wikimedia.org/wiki/Research:Unique_Devices
[2] [https://meta.wikimedia.org/wiki/ComScore/Announcement
[3]
https://meta.wikimedia.org/wiki/Research:Unique_Devices#How_do_we_count_uni…
devices.3F
Hey Folks,
I am suddenly having trouble logging into stat1002 or stat1003 using my
normal settings. Here is what I get:
*Jon-Katzs-Air:~ jkatz$ ssh jkatz(a)stat1003.eqiad.wmnet*
*@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@*
*@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @*
*@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@*
*IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!*
*Someone could be eavesdropping on you right now (man-in-the-middle
attack)!*
*It is also possible that a host key has just been changed.*
*The fingerprint for the RSA key sent by the remote host is*
*[REDACTED BY JK].*
*Please contact your system administrator.*
*Add correct host key in /Users/jkatz/.ssh/known_hosts to get rid of this
message.*
*Offending RSA key in /Users/jkatz/.ssh/known_hosts:1*
*RSA host key for bast1001.wikimedia.org <http://bast1001.wikimedia.org>
has changed and you have requested strict checking.*
*Host key verification failed.*
*ssh_exchange_identification: Connection closed by remote host*
Here is my host.config file:
*ForwardAgent no*
*Host !bast1001.wikimedia.org <http://bast1001.wikimedia.org>
*.wikimedia.org <http://wikimedia.org> *.wmnet*
* ProxyCommand ssh -a -W %h:%p bast1001.wikimedia.org
<http://bast1001.wikimedia.org>*
Is someone doing something nasty?
Any ideas?
Thanks,
J
Hi there,
Hope my email finds you well. I need to know how can I access to
coordinates of specific geographical places within this log:
https://dumps.wikimedia.org/other/pagecounts-raw/
Thanks in advance,
Nima Dashtban
Hi,
Jimbo, on his talk page, said I should ask somebody at the WMF about
what's been going on and what might be happening soon re: Wikipedia
demographic surveys
Please see, https://en.wikipedia.org/wiki/User_talk:Jimbo_Wales#Updated_demographic_sta…
My feelings are:
*This is an important problem that should be easily dealt with and can
be solved by a commitment from the top brass, and
*we need to take a spectacularly simple approach to just get it done.
Any feedback would be appreciated via my e-mail or even on Jimbo's talk.
Sincerely,
Pete Ekman
aka User:Smallbones