Analytics March 2015

analytics@lists.wikimedia.org

52 participants
49 discussions

Re: [Analytics] Limits/pagination for /page/title
by Gabriel Wicke 11 Mar '15

11 Mar '15

Thank you Dario for the heads-up. I just subscribed to the analytics list, so don't have the original mail to respond to. > I am trying to use this. Documentation for /page/title/ <https://rest.wikimedia.org/en.wikipedia.org/v1/?doc#!/Page_content/page_tit…> say "List all pages." but the request produce only <1000 titles. There should be other parameters like "limit=max" or something, shouldn't it? How I can get ALL titles in one or multiple requests? Where I can find documentation for this? This end point is experimental at this point. There are several things we might want to change about it, so (as the documentation says), don't rely on it just yet. We have also consciously chosen not to expose paging yet, but have internal support for that already. See https://phabricator.wikimedia.org/T85640 for the hardening work on the paging token. In the meantime, you can also get this information from the PHP API: http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gapfilter… Gabriel On Wed, Mar 11, 2015 at 9:48 AM, Dario Taraborelli < dtaraborelli(a)wikimedia.org> wrote: > Hey Gabriel, see here: > https://lists.wikimedia.org/pipermail/analytics/2015-March/003587.html > can you chime in? > > Thanks, > Dario

1 0

Fwd: [Engineering] Wikimedia REST content API is now available in beta
by Dario Taraborelli 11 Mar '15

11 Mar '15

Cross-posting from wikitech-l, this will definitely be of interest to those of you on this list who work with our APIs. Begin forwarded message: > From: Gabriel Wicke <gwicke(a)wikimedia.org> > Date: March 10, 2015 at 15:23:03 PDT > To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>, wikitech-ambassdors(a)lists.wikimedia.org, Development and Operations Engineers <engineering(a)lists.wikimedia.org>, mediawiki-api(a)lists.wikimedia.org > Subject: [Engineering] Wikimedia REST content API is now available in beta > > Hello all, > I am happy to announce the beta release of the Wikimedia REST Content API at > https://rest.wikimedia.org/ > Each domain has its own API documentation, which is auto-generated from Swagger API specs. For example, here is the link for the English Wikipedia: > https://rest.wikimedia.org/en.wikipedia.org/v1/?doc > At present, this API provides convenient and low-latency access to article HTML, page metadata and content conversions between HTML and wikitext. After extensive testing we are confident that these endpoints are ready for production use, but have marked them as 'unstable' until we have also validated this with production users. You can start writing applications that depend on it now, if you aren't afraid of possible minor changes before transitioning to 'stable' status. For the definition of the terms ‘stable’ and ‘unstable’ see https://www.mediawiki.org/wiki/API_versioning . > While general and not specific to VisualEditor, the selection of endpoints reflects this release's focus on speeding up VisualEditor. By storing private Parsoid round-trip information separately, we were able to reduce the HTML size by about 40%. This in turn reduces network transfer and processing times, which will make loading and saving with VisualEditor faster. We are also switching from a cache to actual storage, which will eliminate slow VisualEditor loads caused by cache misses. Other users of Parsoid HTML like Flow, HTML dumps, the OCG PDF renderer or Content translation will benefit similarly. > But, we are not done yet. In the medium term, we plan to further reduce the HTML size by separating out all read-write metadata. This should allow us to use Parsoid HTML with its semantic markup directly for both views and editing without increasing the HTML size over the current output. Combined with performance work in VisualEditor, this has the potential to make switching to visual editing instantaneous and free of any scrolling. > We are also investigating a sub-page-level edit API for micro-contributions and very fast VisualEditor saves. HTML saves don't necessarily have to wait for the page to re-render from wikitext, which means that we can potentially make them faster than wikitext saves. For this to work we'll need to minimize network transfer and processing time on both client and server. > More generally, this API is intended to be the beginning of a multi-purpose content API. Its implementation (RESTBase) is driven by a declarative Swagger API specification, which helps to make it straightforward to extend the API with new entry points. The same API spec is also used to auto-generate the aforementioned sandbox environment, complete with handy "try it" buttons. So, please give it a try and let us know what you think! > This API is currently unmetered; we recommend that users not perform more than 200 requests per second and may implement limitations if necessary. > I also want to use this opportunity to thank all contributors who made this possible: > - Marko Obrovac, Eric Evans, James Douglas and Hardik Juneja on the Services team worked hard to build RESTBase, and to make it as extensible and clean as it is now. > - Filippo Giunchedi, Alex Kosiaris, Andrew Otto, Faidon Liambotis, Rob Halsell and Mark Bergsma helped to procure and set up the Cassandra storage cluster backing this API. > - The Parsoid team with Subbu Sastry, Arlo Breault, C. Scott Ananian and Marc Ordinas i Llopis is solving the extremely difficult task of converting between wikitext and HTML, and built a new API that lets us retrieve and pass in metadata separately. > - On the MediaWiki core team, Brad Jorsch quickly created a minimal authorization API that will let us support private wikis, and Aaron Schulz, Alex Monk and Ori Livneh built and extended the VirtualRestService that lets VisualEditor and MediaWiki in general easily access external services. > > We welcome your feedback here: https://www.mediawiki.org/wiki/Talk:RESTBase - and in Phabricator. > > Sincerely -- > Gabriel Wicke > Principal Software Engineer, Wikimedia Foundation > _______________________________________________ > Engineering mailing list > Engineering(a)lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/engineering

2 1

[Technical] parsing error in the pageview dumps
by Oliver Keyes 11 Mar '15

11 Mar '15

Hey, This may be a known, but just in case it isn't; the pageview dumps at http://dumps.wikimedia.org/other/pagecounts-all-sites/ are meant to follow the spec set out at http://dumps.wikimedia.org/other/pagecounts-all-sites/README.txt Instead, it appears that for (presumably, zero-rated) requests, we're ending up with lang_code.zero instead of lang_code.project_variant. Presumably it's a missed use case in the C/Perl...thing, we were using, that got ported to Hive? Check out pagecounts-20150301-000000 for an example. I've opened a phabricator ticket at https://phabricator.wikimedia.org/T92361 - this is just an advisory to analytics engineers (there is a bug) and to reusers (there is a bug. We're aware of the bug). Have fun, -- Oliver Keyes Research Analyst Wikimedia Foundation

1 1

[Technical] inaccuracy in our pageview dump documentation
by Oliver Keyes 10 Mar '15

10 Mar '15

I think. Well, I hope. The whitelist at http://dumps.wikimedia.org/other/pagecounts-all-sites/README.txt claims that meta.mediawiki.org is whitelisted. As is usability.mediawiki.org. As is...you get the picture ;) Unless I've had a stroke and am hallucinating the *.mediawiki.org, we mean wikimedia. At least it validates the people who grumble the two names are too similar? ;) Have fun, -- Oliver Keyes Research Analyst Wikimedia Foundation

3 5

[Cluster] Monitoring the impact Hive jobs have on the Analytics cluster
by Christian Aistleitner 10 Mar '15

10 Mar '15

Hi, around running jobs on the Analytics cluster, I've sometime seen people say in IRC: “Let's run this heavy job. I'll keep an eye on it”. But more often than not, this seems to have meant: “Let's just run this heavy job and wait. If QChris joins IRC, let's hope he doesn't ping us about having overloaded the cluster.” That's not nice^Wscalable ;-) So just in case someone is vague on how to “keep an eye on it”, I did a short write-up at: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Load which details on detecting how the cluster is doing on a very high level. Especially, it allows you to detect if the cluster got stalled, and if it did, it tells you what to do. Have fun, Christian P.S.: The above URL has diagrams! Click the URL! -- ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ---- Companies' registry: 360296y in Linz Christian Aistleitner Kefermarkterstrasze 6a/3 Email: christian(a)quelltextlich.at 4293 Gutau, Austria Phone: +43 7946 / 20 5 81 Fax: +43 7946 / 20 5 81 Homepage: http://quelltextlich.at/ ---------------------------------------------------------------

7 12

[Release] explore the (parsed) (common) Wikimedia user agents
by Oliver Keyes 09 Mar '15

09 Mar '15

Hey all, A perennial request from WMF engineers/product people, as well as third-party developers, is an idea of what browsers people are using so we know what we have to support on the frontend side of things. With Legal/Analytics signoff and +2ing, I've built an exploratory tool at http://datavis.wmflabs.org/agents/ which allows people to look at the most prominently used user agents on our projects - editors, readers, mobile, desktop, whatever you want, we've got it! (unless you want a pony or something. I can't help with that, I'm afraid.) To answer the most obvious FAQ questions (read: the ones that have already come up ;p): "Will this be run regularly?" Not as of this moment. At least, not by me. This is an ad-hoc report in response to an ad-hoc request. "Who do I go to if I want that to change?" Analytics Engineering has this task on their backlog already. "Can I have it divided up by [country/operating system/what colour socks the users use/etc]?" An ad-hoc report in response to an ad-hoc request; adding additional dimensions/granularity would require additional legal review and further runs. -- Oliver Keyes Research Analyst Wikimedia Foundation

5 8

Re: [Analytics] Anomalies in pagecounts files?
by Roni Wiener 09 Mar '15

09 Mar '15

Thanks for the info, both your points can explain the anomalies I saw. The mirroring issue can explain the reason why I see many *.mp3 and .*_ep titles in the pagecounts files that do not correlate to any Wikipedia page, probably spammers monetizing music. How can I help resolving these issues? > On Mar 9, 2015, at 23:30, analytics-request(a)lists.wikimedia.org wrote: > > Re: Anomalies in pagecounts files?

2 1

Eventlogging backfilling for outage 02/04 02/10 done.
by Nuria Ruiz 09 Mar '15

09 Mar '15

Team: Eventlogging backfilling for outage 02/04 to 02/10 is done. Some events were filled from raw logs, some from processed logs. Because most of the "droppage" happened intermittently the backfilling just re-run the events from 02/04 to 02/10 one by one. Here are the descriptions of the two incident reports: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150205-EventLo… https://wikitech.wikimedia.org/wiki/Incident_documentation/20150206-EventLo… PS. If someone versed on 'the way of the wiki' could explain why this page https://wikitech.wikimedia.org/wiki/Incident_documentation/20150206-EventLo… does not appear in the incident report listing page despite having the right category it will be awesome: https://wikitech.wikimedia.org/wiki/Incident_documentation

1 0

[Technical] missing dialect subdomains in the new pageviews definition
by Oliver Keyes 09 Mar '15

09 Mar '15

Hey all, One of the big improvements of the new definition over the old one is that the old one is not limited to /wiki/. It includes all of the chinese and serbian dialects that have their own folder names and were not appearing, as a result, in the old pageview counts. James F (thanks James!) recently pointed out to me that there are other wikis that do this - see the list at https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems#With… . These need to be factored into the new pageviews definition to avoid culturally and nationally biased undercounting. Have fun, -- Oliver Keyes Research Analyst Wikimedia Foundation

2 3

Anomalies in pagecounts files?
by Roni Wiener 09 Mar '15

09 Mar '15

Hi I was goofing around with the Wikipedia page counts dumps and noticed some strange anomalies. For example: The page "Double-entry_bookkeeping_system" had 55921 page views on pagecounts-20150306-070000.gz Where it only had 54 views on pagecounts-20150306-100000.gz (3 hours later). Is there a bug in the page counting system? How likely is it to have a sharp peak of interest in Double-entry_bookkeeping_system? Best regards Roni Wiener, Keotic

2 2

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Analytics March 2015