Jörg, take a look at https://dumps.wikimedia.org/other/pagecounts-ez/ which holds the data in a more compact form without losing granularity. You can get monthly files there and download a lot less data.
On Mon, Mar 6, 2017 at 5:40 AM, Jörg Jung <joerg.jung@retevastum.de> wrote:
Marcel,
thanks for your quick answer. My main issue with the dumps (unless I am missing something) is:
I need to download them first to be able to aggregate and filter. For the year 2016 that would be: 40 MB (average) * 24 h * 30 d * 12 months = about 350 GB
As I am not sitting directly at DE-CIX but in my private office, I will have a pretty hard time with that :-)
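For reference, the download estimate can be worked through in a few lines of Python. This is only a sketch using the figures quoted in this message (an average of 40 MB per hourly file, 30 days per month); the variable names are illustrative and the real file sizes vary.

```python
# Rough download-size estimate for one year of hourly pageview files.
# All inputs are the approximate figures quoted in the message, not measurements.
MB_PER_HOURLY_FILE = 40   # average size of one hourly file
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30       # rough figure, ignores 31-day months and February
MONTHS = 12

total_mb = MB_PER_HOURLY_FILE * HOURS_PER_DAY * DAYS_PER_MONTH * MONTHS
total_gb = total_mb / 1000  # decimal units: 1 GB = 1000 MB

print(f"{total_mb} MB is about {total_gb:.0f} GB")
# prints: 345600 MB is about 346 GB
```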
So my idea is that somebody "closer" to the raw data would basically do the aggregation and filtering for me...
Will somebody (please)?
Thanx, JJ
On 06.03.2017 at 11:14, Marcel Ruiz Forns wrote:
Hi Jörg, :]
Do you mean top 250K most viewed *articles* in de.wikipedia.org?
If so, I think you can get that from the dumps indeed. You can find 2016 hourly pageview stats by article for all wikis here: https://dumps.wikimedia.org/other/pageviews/2016/
Note that the wiki codes (first column) you're interested in are: de, de.m and de.zero. The third column holds the number of pageviews you're after. Also, this data set does not include bot traffic as recognized by the pageview definition <https://meta.wikimedia.org/wiki/Research:Page_view>. As files are hourly and contain data for all wikis, you'll need some aggregation and filtering.
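The filtering and aggregation could look roughly like the sketch below. It assumes each line is whitespace-separated with the wiki code in the first column, the article title in the second and the pageview count in the third, as described above. The function names `top_articles` and `read_hourly_file` and the sample lines are made up for illustration, and the wiki codes are matched without any surrounding slashes.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Wiki codes for German Wikipedia as listed in the message
# (desktop, mobile web, Wikipedia Zero).
DE_CODES = {"de", "de.m", "de.zero"}


def read_hourly_file(path: str) -> Iterator[str]:
    """Yield the lines of one gzip-compressed hourly file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from fh


def top_articles(lines: Iterable[str], n: int = 250_000) -> List[Tuple[str, int]]:
    """Sum the third column per title for the German wiki codes, return the top n."""
    totals = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] not in DE_CODES:
            continue  # other wiki, or a line too short to parse
        try:
            totals[fields[1]] += int(fields[2])
        except ValueError:
            continue  # skip lines whose count is not an integer
    return totals.most_common(n)


# Made-up sample lines, only to show the expected input shape.
sample = [
    "de Berlin 10 0",
    "de.m Berlin 5 0",
    "en Berlin 99 0",
    "de.zero Hamburg 2 0",
    "de Hamburg 1 0",
]
print(top_articles(sample, n=2))
# prints: [('Berlin', 15), ('Hamburg', 3)]
```

For a full year, the lines of all hourly files would be chained into one iterable (for example with `itertools.chain.from_iterable`) so that only the per-title totals are kept in memory, not the raw files.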
Cheers!
On Mon, Mar 6, 2017 at 2:59 AM, Jörg Jung <joerg.jung@retevastum.de> wrote:
Ladies, gents,
for a project I am planning I need the following data:
Top 250K sites for 2016 in project de.wikipedia.org, user-access.
I only need the name of the site and the corresponding number of user accesses (all channels) for 2016 (sum over the year).
As far as I can see, I can't get that data via REST or by aggregating dumps.
So I'd like to ask here if someone would like to help out.
Thanx, cheers, JJ
-- Jörg Jung, Dipl. Inf. (FH) Hasendriesch 2 D-53639 Königswinter E-Mail: joerg.jung@retevastum.de Web: www.retevastum.de www.datengraphie.de www.digitaletat.de www.olfaktum.de
_______________________________________________
Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics
-- *Marcel Ruiz Forns* Analytics Developer Wikimedia Foundation