Ladies, gents,
for a project I'm planning, I'd need the following data:
Top 250K sites for 2016 in the project de.wikipedia.org, user access.
I only need the name of the site and the corresponding number of user accesses (all channels) for 2016 (summed over the year).
As far as I can see, I can't get that data via REST or by aggregating dumps.
So I'd like to ask here whether someone would like to help out.
Thanks, cheers, JJ
Hi Jörg, :]
Do you mean the top 250K most viewed *articles* on de.wikipedia.org?
If so, I think you can get that from the dumps indeed. You can find 2016 hourly pageview stats by article for all wikis here: https://dumps.wikimedia.org/other/pageviews/2016/
Note that the wiki codes (first column) you're interested in are: *de*, *de.m* and *de.zero*. The third column holds the number of pageviews you're after. Also, this data set does not include bot traffic as recognized by the pageview definition https://meta.wikimedia.org/wiki/Research:Page_view. As files are hourly and contain data for all wikis, you'll need some aggregation and filtering.
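As an illustration, a minimal sketch of that aggregation step could look like this (assumptions: the hourly files for 2016 have already been downloaded and decompressed locally, each line follows the pageviews format "wiki_code page_title view_count response_bytes", and the local file pattern below is hypothetical):

    # Sum 2016 pageviews per article for de.wikipedia.org (desktop, mobile web, zero).
    from collections import Counter
    import glob

    DE_CODES = {"de", "de.m", "de.zero"}
    totals = Counter()

    for path in glob.glob("pageviews-2016*"):        # hypothetical local file names
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) >= 3 and parts[0] in DE_CODES:
                    try:
                        totals[parts[1]] += int(parts[2])   # third column = pageviews
                    except ValueError:
                        continue                            # skip malformed lines

    # Top 250K articles by summed 2016 pageviews; print a few as a spot check.
    top_250k = totals.most_common(250_000)
    for title, views in top_250k[:10]:
        print(title, views)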
Cheers!
Marcel,
thanks for your quick answer. My main issue with the dumps (or maybe I'm missing something) is:
I need to download them first to be able to aggregate and filter. For the year 2016 that would be roughly: 40 MB (on average) * 24 h * 30 d * 12 m = about 350 GB.
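A quick sanity check of that back-of-envelope estimate (just multiplying the same numbers; the ~40 MB per hourly file is taken from the estimate above):

    # 12 months of hourly pageview files at roughly 40 MB each
    mb = 40 * 24 * 30 * 12
    print(mb, "MB, i.e. about", round(mb / 1000), "GB")   # 345600 MB, i.e. about 346 GB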
As I am not sitting directly at DE-CIX but in my private office, I will have a pretty hard time with that :-)
So my idea is that somebody "closer" to the raw data would basically do the aggregation and filtering for me...
Will somebody (please)?
Thanks, JJ
Jörg, take a look at https://dumps.wikimedia.org/other/pagecounts-ez/, which has the same data in compressed form without losing granularity. You can get monthly files there and download a lot less data.
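For example, fetching the twelve monthly "totals" files for 2016 could look roughly like this (a sketch only: the merged/ path and the file-name pattern are assumptions based on the file mentioned later in this thread; check the directory listing and README under pagecounts-ez/ before relying on them):

    # Download the twelve 2016 monthly pagecounts-ez "totals" files.
    import urllib.request

    BASE = "https://dumps.wikimedia.org/other/pagecounts-ez/merged"
    for month in range(1, 13):
        name = f"pagecounts-2016-{month:02d}-views-ge-5-totals.bz2"
        urllib.request.urlretrieve(f"{BASE}/{name}", name)
        print("downloaded", name)

The per-article aggregation would then work the same way as sketched above, just reading the bz2-compressed monthly files (e.g. with Python's bz2.open) instead of thousands of hourly ones.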
Yeah, Dan, that will work, thanks.
Just out of curiosity: why are there three projects for "de", and what is the difference between them? /de/, /de.m/ and /de.zero/
Cheers, JJ
Jörg, the project abbreviations are explained in depth here: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews
Ok, guys, thanks a lot!
Dan, guys, me again.
I cross-checked the numbers from, for example,
pagecounts-2017-02-views-ge-5-totals.bz2
with the tools at tools.wmflabs.org (here for the page "Falco").
It seems that the dump only has "Desktop" numbers, not "Mobile Web" or "Mobile App", when it comes to the platform.
Is that correct?
Is there a way to get the sum over all three platforms?
Thanks, cheers, JJ
The dump files add .m to the project short name, as specified in the documentation. So en is English Wikipedia and en.m is mobile-web English Wikipedia. You're right that the numbers for app access aren't in there, but those are relatively small.
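For spot checks of single pages across all platforms, the per-article Pageviews REST API (the same data the tools at tools.wmflabs.org draw on) can serve as a reference. A hedged sketch; please verify the endpoint parameters against the API documentation at wikimedia.org/api/rest_v1/ before relying on it:

    # Monthly "all-access" user pageviews for one article, e.g. "Falco" on de.wikipedia.
    import json
    import urllib.request

    article = "Falco"
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           f"de.wikipedia/all-access/user/{article}/monthly/20170201/20170228")
    req = urllib.request.Request(url, headers={"User-Agent": "pageview-crosscheck"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for item in data["items"]:
        print(item["timestamp"], item["views"])

Summing the de and de.m rows from the dump files should then land close to the desktop plus mobile-web share of that figure.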