Yes, we have a connection limit and a bandwidth cap. Two or three simultaneous
connections should be fine.
On Fri, 19 Jun 2015 at 13:27 -0400, Ashok Rao wrote:
> Dario, thanks much.
> Ariel, Kevin – I think the 503 errors were a problem because I was
> running the grab process on parallel threads. It seems to work fine
> in serial – the only trouble with this being, of course, the time it
> takes to download all the available data.
> On Fri, Jun 19, 2015 at 12:17 PM, Kevin Leduc <kevin(a)wikimedia.org> wrote:
> > + Ariel
> > Hi Ariel, can you comment on the 503 errors happening sometimes
> > while trying to download data from the dumps?
> > On Fri, Jun 19, 2015 at 1:01 AM, Dario Taraborelli <
> > dtaraborelli(a)wikimedia.org> wrote:
> > > Forwarding a note from Ashok Rao (cc’ed), can anyone comment on
> > > the dumps server returning 503s?
> > >
> > > Ashok – we don’t yet have an in-house API to retrieve pageview
> > > data, but the Analytics team is working on one: see this thread:
> > > https://phabricator.wikimedia.org/T44259#1341010
> > > Depending on what you’re doing, http://stats.grok.se/ may also
> > > come in handy.
> > >
> > > Best,
> > > Dario
> > >
> > > > Begin forwarded message:
> > > >
> > > > From: Ashok Rao <raoashok(a)seas.upenn.edu>
> > > > Subject: Wikipedia Page views access
> > > > Date: June 18, 2015 at 5:53:12 PM GMT+2
> > > > To: dario(a)wikimedia.org
> > > >
> > > > Hi Dario,
> > > >
> > > > Good morning. I'm a student at the University of Pennsylvania
> > > > and I've been trying to perform a few analyses based on
> > > > Wikipedia page views data. I've written a script that grabs
> > > > data from the main dump site –
> > > > https://dumps.wikimedia.org/other/pagecounts-raw/ – but have run
> > > > into many sporadic 503 errors (sometimes with the download
> > > > link, other times with the main page itself). I noticed some of
> > > > this data might be available directly on Wikimedia servers that
> > > > can be utilized for research purposes.
> > > >
> > > > I was hoping I could get access to this and would appreciate your
> > > > help.
> > > >
> > > > Best,
> > > > Ashok
> > > >
> > > > --
> > > > Ashok M. Rao
> > > > The Rajendra and Neera Singh Program in Market and Social
> > > > Systems Engineering
> > > > School of Engineering and Applied Sciences
> > > > University of Pennsylvania | Class of '17
> > >
My username is rbaasland and I would like to contribute to the analytics
project. Could I have access to the project, and how should I go about
contributing?
Thank you very much,
*This discussion is intended to be a branch of the thread: "[Analytics]
Pageview API Status update".*
We in the Analytics team are trying to *choose a storage technology to keep
the pageview data* for analysis.
We don't want to get to a final system that covers all our needs yet (there
are still things to discuss), but to have something *that implements the
current stats.grok.se functionalities* as a first step. This way we can get
a better grasp of what our difficulties and limitations will be regarding
performance and privacy.
The objective of this thread is to *choose 3 storage technologies*. We will
later set up and fill each of them with 1 day of test data, evaluate them
and decide which one we will go for.
There are 2 blocks of data to be stored:
1. *Cube that represents the number of pageviews broken down by the
following dimensions*:
- day/hour (size: 24)
- project (size: 800)
- agent type (size: 2)
To test with an initial level of anonymity, all cube cells whose value is
less than k=100 have an undefined value. However, to be able to retrieve
aggregated values without losing those undefined counts, all combinations
of slices and dices are precomputed before anonymization and belong to the
cube, too. Like this:
dim1, dim2, dim3, ..., dimN, val
a, null, null, ..., null, 15 // pv for dim1=a
a, x, null, ..., null, 34 // pv for dim1=a & dim2=x
a, x, 1, ..., null, 27 // pv for dim1=a & dim2=x & dim3=1
a, x, 1, ..., true, undef // pv for dim1=a & dim2=x & ... & dimN=true
So the size of this dataset would be something between 100M and 200M
records per year, I think.
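In case it helps to make that concrete, here is a rough sketch of the idea
(just an illustration under my own assumptions, with made-up field names,
not a decided implementation): every combination of kept/nulled dimensions
gets its own aggregate row, and only after that aggregation do we blank out
cells below k=100.

type Cell = { dims: (string | null)[]; val: number | 'undef' };

// rows: raw per-hour records with all N dimensions filled in.
function buildCube(rows: { dims: string[]; views: number }[], k = 100): Cell[] {
  const agg = new Map<string, number>();
  for (const row of rows) {
    const n = row.dims.length;
    // Every subset of kept dimensions (the rest set to null) gets its own
    // aggregate, i.e. all slices and dices are precomputed.
    for (let mask = 0; mask < (1 << n); mask++) {
      const dims = row.dims.map((d, i) => (mask & (1 << i) ? d : null));
      const key = JSON.stringify(dims);
      agg.set(key, (agg.get(key) ?? 0) + row.views);
    }
  }
  // Anonymize last, so coarser aggregates still include the counts that
  // get blanked out at finer granularity.
  return [...agg.entries()].map(([key, val]) => ({
    dims: JSON.parse(key),
    val: val < k ? 'undef' : val,
  }));
}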
2. *Timeseries dataset that stores the number of pageviews per article
in time with*:
- maximum resolution: hourly
- diminishing resolution over time is accepted if there are storage
constraints
article (dialect.project/article), day/hour, value
en.wikipedia/Main_page, 2015-01-01 17, 123456
en.wiktionary/Bazinga, 2015-01-02 13, 23456
It's difficult to calculate the size of that. How many articles do we have?
But not all of them will have pageviews every hour...
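If it helps, here is a rough sketch of what "diminishing resolution" could
look like (purely illustrative assumptions on my side, e.g. keeping 90 days
of hourly data and rolling older rows up to daily; nothing is decided):

type Row = { article: string; ts: Date; views: number };

// Keep recent rows hourly; sum rows older than `keepHourlyDays` into daily buckets.
function diminishResolution(rows: Row[], keepHourlyDays = 90): Row[] {
  const cutoff = Date.now() - keepHourlyDays * 24 * 3600 * 1000;
  const daily = new Map<string, Row>();
  const recent: Row[] = [];
  for (const r of rows) {
    if (r.ts.getTime() >= cutoff) {
      recent.push(r); // recent enough: keep at hourly resolution
      continue;
    }
    const day = r.ts.toISOString().slice(0, 10); // e.g. "2015-01-01"
    const key = `${r.article}|${day}`;
    const acc = daily.get(key);
    if (acc) acc.views += r.views;
    else daily.set(key, { article: r.article, ts: new Date(day), views: r.views });
  }
  return recent.concat([...daily.values()]);
}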
*Note*: I guess we should consider that the storage system will presumably
have high volume batch inserts every hour or so, and queries that will be a
lot more frequent but also a lot lighter in data size.
And that is that.
*So please, feel free to suggest storage technologies, comment, etc!*
And if there is any assumption I made that you do not agree with, please
say so.
I will start the thread with 2 suggestions:
1) *PostgreSQL*: Seems to be able to handle the volume of the data, and
diminishing resolution for timeseries can be implemented on top of it.
2) *Project Voldemort*: As we are denormalizing the cube entirely for
anonymity, the db doesn't need to compute aggregations, so it may well be a
simple key-value store.
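To illustrate why a key-value store could be enough (again just a sketch
with a made-up key layout, not an actual Voldemort client call): since every
slice/dice combination is precomputed, a query is a single get on the
concatenated dimension values.

// Hypothetical key layout: one key per precomputed cube cell; null dimensions
// are encoded explicitly so aggregates like "all projects" stay addressable.
function cubeKey(dims: (string | null)[]): string {
  return dims.map(d => d ?? '*').join('|'); // e.g. "2015-01-01T17|en.wikipedia|*"
}

// With any client exposing get(key), a lookup is then a single read, e.g.:
// const views = await store.get(cubeKey(['2015-01-01T17', 'en.wikipedia', null]));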
(adding Analytics, as a relevant group for this discussion.)
I think this is next to meaningless, because the differing bot policies and
practices on different wikis skew the data into incoherence.
The (already existing) metric of active-editors-per-million-speakers is, it
seems to me, a far more robust metric. Erik Z.'s stats.wikimedia.org is
offering that metric.
On Sun, Jun 7, 2015 at 3:23 PM, Milos Rancic <millosh(a)gmail.com> wrote:
> When you get data, at some point you start thinking about
> quite fringe comparisons. But that could actually give some useful
> conclusions, as it did this time.
> We did the following:
> * Used the number of primary speakers from Ethnologue. (Erik Zachte is
> using the approximate number of primary + secondary speakers; that could
> be useful for correcting this data.)
> * Categorized languages according to the logarithmic number of
> speakers: >=1k, >=10k, >=100k, >=1M, >=10M, >=100M.
> * Took the number of articles of the Wikipedia in a particular language
> and created a ratio (number of articles / number of speakers).
> * This list consists just of languages with Ethnologue status 1
> (national), 2 (provincial) or 3 (wider communication). In fact, we
> have a lot of projects (more than 100) with worse language status; a
> number of them are actually threatened or even on the edge of
> extinction.
> Those are the preliminary results and I will definitely have to go
> through all the numbers. I manually fixed some serious errors, like
> English Wikipedia itself not being inside the data :D
> Putting the languages into the logarithmic categories proved to be
> useful, as we are now able to compare the Wikipedias according to
> their gross capacity (numbers of speakers). I suppose somebody well
> versed in statistics could even create a function which could be used
> to check how well a project is doing, regardless of those strict
> categories.
> It's obvious that the more speakers a language has, the harder it is
> for the community to keep up the ratio.
> So, the winners per category are:
> 1) >= 1k: Hawaiian, ratio 0.96900
> 2) >= 10k: Mirandese, ratio 0.18073
> 3) >= 100k: Basque, ratio 0.38061
> 4) >= 1M: Swedish, ratio 0.21381
> 5) >= 10M: Dutch, ratio 0.08305
> 6) >= 100M: English, ratio 0.01447
> However, keep in mind that we removed languages not inside categories
> 1, 2 or 3. That affected the >=10k languages, as, for example, Upper
> Sorbian does much better than Mirandese (0.67). (Will fix it while
> creating the full report. Obviously, in this case logarithmic
> categories of numbers of speakers matter much more than the status
> of the language.)
> It's obvious that we could draw a line from 1:1 for 1-10k
> speakers to 10:1 for >=100M speakers. But, again, I would like to get
> input from somebody more competent.
> One very important category is missing here and it's about the level
> of development of the speakers. That could be added: GDP (PPP) per
> capita for the country or countries where the language is spoken would
> be a useful measure. And I suppose somebody with statistical knowledge
> would be able to give us a number which would mean "ability to create
> Wikipedia article".
> Completed in such a way, we'd be able to measure the success of
> particular Wikimedia groups and organizations. OK, articles per
> speaker are not the only way to do so; we could use other
> parameters as well: number of new/active/very active editors etc. And
> we could put it on a time scale.
> I'll produce some other results. And, as a reminder: I'd like to have a
> formula to compute "ability to create Wikipedia article" and then to
> produce "level of particular community success in creating Wikipedia
> articles". And, of course, to implement it for editors.
Wikimedia Foundation <http://www.wikimediafoundation.org>
Imagine a world in which every single human being can freely share in the
sum of all knowledge. Help us make it a reality!
I have been asking this question informally for too long, so here goes the
formal request:
Metrics about the external use of the Wikimedia APIs
We need them and, in fact, an outsider would be very surprised by the fact
that we don't have them today and are not looking at them regularly, the
way we check page views and edits.
It is a vague goal on a bumpy road, but I'm happy contributing at least
questions about the metrics we need. The Engineering Community team wants
to have this metric as a main measure of the success of our performance (the
more Wikimedia knowledge being spread and improved via our API, the better
we are doing working with developers).
Engineering Community Manager @ Wikimedia Foundation
Cross-posting to analytics. Props to Vibha for asking for the data.
---------- Forwarded message ----------
From: *Adam Baso* <abaso(a)wikimedia.org>
Date: Wednesday, June 10, 2015
Subject: Some data on apps and web
To: mobile-l <mobile-l(a)lists.wikimedia.org>
Hi all, thought I'd share some data from a few queries around apps uniques
and apps + web pageviews, etc. from recent history:
We're at the beginning of our FY2015-2016 Q1 planning, and are also
readying our thinking on Reading strategy for the longer haul, and I was
hoping this might be of some use.
Within the discovery team we are now looking into tracking dwell time and
bounce rate for the pages linked from the SERPs. To accurately track dwell
time we need to fire an event in the unload handler of article pages.
Poking around in the EventLogging code I see we are now using sendBeacon if
it is available, and that's great. The problem is that all of the browsers
that do not have sendBeacon (many) will not send this event. They will
inject an img tag that will not be processed as the page is being unloaded.
Searching around I saw some discussion about this almost a year ago, in May
2014, before sendBeacon support was added (in Nov 2014),
titled "[Analytics] Using EventLogging for funnel analysis". There it was
proposed to push the events into localStorage to be sent during a future
page view. I don't see any other viable options, so I'm wondering if there
is any reason I shouldn't look into implementing this now (with the jStorage
wrapper of localStorage available in core)?
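Roughly what I have in mind, as a sketch (plain localStorage here instead of
the jStorage wrapper, and the endpoint/key names are made up, not
EventLogging's actual API):

const QUEUE_KEY = 'el-deferred-events'; // hypothetical storage key

function logEvent(beaconUrl: string, payload: object): void {
  const data = JSON.stringify(payload);
  if (typeof navigator.sendBeacon === 'function') {
    // sendBeacon queues the request so it survives page unload.
    navigator.sendBeacon(beaconUrl, data);
    return;
  }
  // No sendBeacon: stash the event and send it on the next page view,
  // since an img request fired during unload is likely to be dropped.
  const queue: string[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
  queue.push(data);
  localStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
}

// On the next page load, flush anything that was deferred.
function flushDeferredEvents(beaconUrl: string): void {
  const queue: string[] = JSON.parse(localStorage.getItem(QUEUE_KEY) ?? '[]');
  localStorage.removeItem(QUEUE_KEY);
  for (const data of queue) {
    new Image().src = beaconUrl + '?' + encodeURIComponent(data); // page is alive now
  }
}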