Trying again, adding analytics@ (public e-mail list)
On Fri, Jan 15, 2016 at 5:22 AM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
wrote:
> I also think we should start with exposing the 3 API endpoints in a GUI,
> which - as Dan says - we know respond to community interests. And then ask
> the community for more input; that could mean improvements to the tool, new
> endpoints, or completely new ideas...
>
> On Thu, Jan 14, 2016 at 10:45 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
> wrote:
>
>> I'm ok if people want to take an iterative approach, I just want to point
>> out that the usage information is not very indicative of value at this
>> point. The API is not widely used and the per-article endpoint is expected
>> to be hit much much more than per-project or top simply because the queries
>> are many orders of magnitude more granular. So we can't really judge
>> importance from that comparison.
>>
>> On Thu, Jan 14, 2016 at 4:43 PM, Leila Zia <leila(a)wikimedia.org> wrote:
>>
>>>
>>> On Thu, Jan 14, 2016 at 1:09 PM, Dan Andreescu <dandreescu(a)wikimedia.org
>>> > wrote:
>>>
>>>> My question is: How are we going to define the requirements for the
>>>>> tool? I was planning to get some community input on defining which stats
>>>>> would help contributors the most. What do you think?
>>>>>
>>>>
>>>> My opinion here is that we should just expose everything the pageview
>>>> API is capable of. It's only 3 different endpoints and they were chosen
>>>> based on what the community found useful. As we add more endpoints we can
>>>> keep checking if visualization is important. But of course if others have
>>>> other more specific plans, we can wait for those tools to be built and
>>>> iterate.
>>>>
>>>
>>> Building on Dan's suggestion: I'd go with communicating and/or
>>> discussing the following with the community:
>>>
>>> * the 3 different metrics we can offer a UI for
>>> * what other metrics they find useful for their work. This will help us
>>> collect information about what other kinds of metrics we should offer as
>>> endpoints if we decide to add more (pageviews per article by country has
>>> come up many times, for example)
>>> * whether they consider the wish satisfied if we offer a UI for the
>>> 3 different metrics, and perhaps over time add more metrics to the UI
>>> as they become available (not necessarily in 2016).
>>>
>>> Leila
>>>
>>>
>>>
>>
>
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation
>
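[Editor's note: the three Pageview API endpoints discussed above (per-article, per-project aggregate, and top) can be sketched as URL templates. The paths below follow the public REST API's structure, but the example project, article, and dates are purely illustrative.]

```python
# Sketch of the three Pageview API endpoints under discussion.
# Paths follow the public REST API; example values are illustrative.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def per_article(project, article, start, end,
                access="all-access", agent="all-agents", granularity="daily"):
    """Views over time for a single article."""
    return (f"{BASE}/per-article/{project}/{access}/{agent}/"
            f"{article}/{granularity}/{start}/{end}")

def per_project(project, start, end,
                access="all-access", agent="all-agents", granularity="daily"):
    """Aggregate views for a whole project."""
    return (f"{BASE}/aggregate/{project}/{access}/{agent}/"
            f"{granularity}/{start}/{end}")

def top(project, year, month, day, access="all-access"):
    """Most-viewed articles for a given day."""
    return f"{BASE}/top/{project}/{access}/{year}/{month}/{day}"

print(per_article("en.wikipedia", "Albert_Einstein", "20160101", "20160131"))
```

As Dan notes above, the per-article endpoint naturally gets hit far more often than the other two, since its queries are orders of magnitude more granular.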
Adding Analytics list and Neil P Quinn,
FYI, Nuria filed this ticket to track this issue:
https://phabricator.wikimedia.org/T123634 Moving discussion there.
On Thu, Jan 14, 2016 at 1:10 AM, Faidon Liambotis <faidon(a)wikimedia.org>
wrote:
> Well, this was a snapshot of the situation then. It doesn't preclude
> other issues (possibly caused by other, similarly-sized queries) in the
> previous hours/days. That said, Tendril definitely shows a correlation
> between all kinds of metrics (page/disk I/O, write traffic, etc.) and the
> aforementioned timeline of the past 1.5 days.
>
> Plus, replag for s1 was at the time ~133,900 seconds and rising, which
> matches the timeline of that large query too. Since I killed it it has
> been steadily dropping, albeit slowly (currently at 132,445). It will
> probably take a couple of days to recover. Since the server is both
> backlogged and I/O-saturated, recovery will depend a lot on how much load
> the server gets from other queries (currently it's being hammered by
> two other large queries that have been running for over 27,000 and 4,000
> seconds respectively, for example).
>
> Faidon
>
> On Wed, Jan 13, 2016 at 09:44:50PM -0800, Oliver Keyes wrote:
> > Indeed, but 1.5 days is <half the time the problem has been occurring
> for.
> >
> > On 13 January 2016 at 21:01, Faidon Liambotis <faidon(a)wikimedia.org>
> wrote:
> > > "SELECT * FROM information_schema.processlist ORDER BY time DESC"
> > > informs us of this:
> > >
> > > | 5599890 | research | 10.64.36.103:53669 | enwiki
> | Query | 133527 | Queried about 890000 rows
> | CREATE TEMPORARY TA
> > > SELECT
> > > page_id,
> > > views
> > > FROM (
> > > SELECT
> > > page_namespace,
> > > page_title,
> > > SUM(views) AS views
> > > FROM staging.page_name_views_dupes
> > > WHERE page_namespace = 0
> > > GROUP BY 1,2
> > > ) AS group_page_name_views
> > > INNER JOIN enwiki.page USING (page_namespace, page_title)
> > >
> > > Column 6 is "time", i.e. this query was running for 133527 seconds at
> > > the time (i.e. ~1.5 days!), which is obviously Not Good™. I just ran
> > > "KILL QUERY 5599890;", hopefully this will help.
> > >
> > > The second-longest-standing query has been running for over 6 hours
> > > now and is way too long to paste (87 lines, starts with "INSERT INTO
> > > editor_month_global", inline comments, all kinds of subselects in inner
> > > joins etc., queried "about 2470000 rows"). I left it be for now, we'll
> > > see how that goes and I may eventually kill it too, as I/O is still
> > > pegged at 100%.
> > >
> > > I believe long-running queries targeted at the research slaves aren't
> > > particularly new, but they are often the source of such problems, so
> > > they're a good place to start when investigating such issues. There is
> > > only so much a poor database server (and software) can do :)
> > >
> > > Regards,
> > > Faidon
> > >
> > > On Wed, Jan 13, 2016 at 06:55:26PM -0500, Andrew Otto wrote:
> > >> Hi all,
> > >>
> > >> Replication to dbstore1002 is having a lot of trouble. From
> > >> https://tendril.wikimedia.org/host/view/dbstore1002.eqiad.wmnet/3306,
> we
> > >> see that normal replication is about 9 hours behind at the moment.
> > >> However, the EventLogging `log` database is not replicated with usual
> MySQL
> > >> replication. Instead, a custom bash script[1] periodically uses
> mysqldump
> > >> to copy data from m4-master (dbproxy1004) into dbstore1002. (I just
> > >> recently found out that this wasn’t regular replication, and I’m not
> > >> familiar with the reasoning behind using a custom script.)
> > >>
> > >> The EventLogging `log` custom replication has been lagging for days
> now.
> > >> Also, at about 18:00 UTC today (Jan 13), we can see a huge increase in
> > >> write traffic on dbstore1002. I looked at each of the normal
> replication
> > >> masters, and don’t see this write traffic there. EventLogging
> traffic also
> > >> seems to be about the same over the last week or so (although there
> was an
> > >> increase in events being produced by the MobileWebSectionUsage
> schema
> > >> starting Dec 18, but I don’t think this is the problem).
> > >>
> > >> Disk util is around 100%, but this has been the case for a while now.
> > >> Today I filed https://phabricator.wikimedia.org/T123546 for a bad
> mem chip
> > >> or slot on dbproxy1004, but this also seems to have been the status
> quo for
> > >> quite a while, and doesn’t correlate with this lag.
> > >>
> > >> I’m not sure where else to look at the moment, and I need to run for
> the
> > >> day. I’ll try to look at this more in my morning tomorrow.
> > >>
> > >> -AO
> > >>
> > >> [1]
> > >>
> https://github.com/wikimedia/operations-puppet/blob/f0df1ec45b3f70a5c041cef…
> > >
> > >> _______________________________________________
> > >> Ops mailing list
> > >> Ops(a)lists.wikimedia.org
> > >> https://lists.wikimedia.org/mailman/listinfo/ops
> > >
> > >
> >
> >
> >
> > --
> > Oliver Keyes
> > Count Logula
> > Wikimedia Foundation
>
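[Editor's note: the processlist check Faidon describes can be sketched in a few lines. This is an illustrative filter over rows shaped like `information_schema.processlist` output, mirroring "SELECT * FROM information_schema.processlist ORDER BY time DESC" followed by a manual KILL QUERY; the sample rows and threshold are made up.]

```python
# Hypothetical sketch: find queries that have been running longer than a
# threshold, longest-running first, so an operator can decide what to kill.
def long_running(processlist_rows, threshold_seconds=3600):
    """Return processlist rows for active queries exceeding the threshold,
    sorted by running time descending."""
    offenders = [r for r in processlist_rows
                 if r["command"] == "Query" and r["time"] > threshold_seconds]
    return sorted(offenders, key=lambda r: r["time"], reverse=True)

rows = [
    {"id": 5599890, "command": "Query", "time": 133527,
     "info": "CREATE TEMPORARY TA..."},
    {"id": 5601000, "command": "Sleep", "time": 200000, "info": None},
    {"id": 5601234, "command": "Query", "time": 42, "info": "SELECT 1"},
]
for r in long_running(rows):
    print(f"KILL QUERY {r['id']};  -- running {r['time']}s")
```

The 133,527-second query from the thread would be the only row flagged here; idle connections (command `Sleep`) are ignored no matter how old.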
Hi all,
Due to a recent security report, I’ve decided to disable public access to
the Yarn HTTP UI. There was no security breach, but I was made aware of
the ability to do more with the HTTP interface than I had previously known
about, and I wasn’t comfortable with it being public anymore. The YARN
ResourceManager HTTP interface is still accessible from within the
analytics cluster. I’ve just disabled the public proxying at
yarn.wikimedia.org.
If you want to access the ResourceManager job browser, you’ll have to ssh
tunnel into the cluster first. Instructions are here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Access#ssh_tunnel.28s…
-Andrew Otto
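[Editor's note: the ssh tunnel Andrew mentions can be sketched as below. The hostnames and ports are placeholders, not the real cluster hosts -- see the wikitech page linked above for the actual instructions.]

```python
# Illustrative sketch of the ssh tunnel needed to reach the YARN
# ResourceManager UI now that public proxying is disabled.
# Hostnames here are hypothetical examples.
import subprocess

def tunnel_command(local_port, rm_host, rm_port, bastion):
    """Build an ssh command that forwards local_port to rm_host:rm_port
    through the bastion, without running a remote shell (-N)."""
    return ["ssh", "-N",
            "-L", f"{local_port}:{rm_host}:{rm_port}",
            bastion]

cmd = tunnel_command(8088, "resourcemanager.example.internal", 8088,
                     "bastion.example.org")
print(" ".join(cmd))
# While the tunnel is up (e.g. via subprocess.run(cmd)), the job browser
# would be reachable at http://localhost:8088/
```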
Team:
We had to take some desperate measures to unblock replication on
EventLogging in the absence of our DBA.
We had to drop the MobileWebSectionUsage table. The table is blacklisted, as
the stream of events was too great for the system to sustain; we will
whitelist it again once the new sampling rate takes effect.
The data is not available in MySQL, but it is present in Hadoop, so, to be
clear, no data has been lost.
We will report on replication issues as we have more knowledge. Sorry again
about the late notice.
Thanks,
Nuria
Hey all,
I'm happy to announce the release of a robust, tested client for Yuvi
and Aaron's "ORES" system for the R statistical programming language.
It can be obtained from https://github.com/Ironholds/ores or CRAN and
contains a long-form vignette to explain the ORES system and use of
the client, along with the standard documentation for individual
pieces of code.
Thanks,
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Hi,
I am interested to know whether Wikipedia makes public how many backlinks each page gets.
I am working on a search engine for Wikipedia, and, as you would expect, it sucks.
So I went and tested the same searches directly on Wikipedia, and, no offence, they suck even more.
So I went on Google and performed the same searches with site:wikipedia.org added, and Google was a little bit better (although not much compared with my 1-day-development search engine).
I want to make my Wikipedia search better, and having a table that tells me how many non-Wikipedia pages point to a certain Wikipedia page might improve my algorithm.
Does anyone know if Wikipedia publishes such data?
Thank you!
Edison Nica
Http://www.0pii.com
Edisonn(a)0pii.com
Sent from my T-Mobile 4G LTE Device
Hi all,
My colleagues and I are interested in getting the statistics/data for
edits per month for all articles that form part of the WikiProject
Mathematics, as well as some other WikiProjects for comparison such as
WikiProject Computer Science and WikiProject Statistics. I was
wondering if that's possible?
We want to write and publish a non-technical article in a mathematical
gazette encouraging mathematical inclined people to contribute more to
Wikipedia.
Thank you for your time.
Best,
Paul
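[Editor's note: the aggregation Paul asks about can be sketched as below. A real pipeline would first pull revision timestamps for the WikiProject's articles (from the MediaWiki API, the database replicas, or the dumps); this shows only the per-month bucketing step, on made-up data.]

```python
# Sketch: count edits per month from a list of revision timestamps.
# Input timestamps are ISO-8601 strings, as the MediaWiki API returns them.
from collections import Counter

def edits_per_month(revision_timestamps):
    """Map 'YYYY-MM' -> number of edits."""
    return Counter(ts[:7] for ts in revision_timestamps)

timestamps = [
    "2015-11-03T12:00:00Z",
    "2015-11-17T08:30:00Z",
    "2015-12-01T23:59:59Z",
]
print(dict(edits_per_month(timestamps)))  # → {'2015-11': 2, '2015-12': 1}
```

Summing these counters across all articles tagged by a WikiProject would give the per-project monthly totals for the comparison Paul describes.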
Hey y'all
I'm working on a piece of research (largely recreational) on the old
problem of fingerprinting users with minimal information - namely the
combination of a user agent and an IP address. Basically I'm looking
to put together a piece of work showing:
1. How sub-standard it is;
2. How fast it decays;
3. How the sub-standardness varies by (platform|location)
This would be pretty doable with internal data; basically I'd need a
schema with IP, user agent and a per-user UUID that's got a decent
(>=24 hours) expiry time. My question: does anyone know of a table
with recent data that meets these requirements? And, if not, anyone
with EventLogging experience interested in working on the problem with
me?
--
Oliver Keyes
Count Logula
Wikimedia Foundation
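[Editor's note: the measurement Oliver describes can be sketched as below: given events of (IP, user agent, per-user UUID), how often does a (UA, IP) fingerprint collide, i.e. map to more than one distinct user? The data here is toy data for illustration only.]

```python
# Illustrative sketch: fraction of (ip, user_agent) fingerprints that are
# shared by more than one distinct user -- one way to quantify how
# "sub-standard" UA+IP fingerprinting is.
from collections import defaultdict

def collision_rate(events):
    """events: iterable of (ip, user_agent, uuid) tuples.
    Returns the fraction of fingerprints seen for >1 distinct uuid."""
    users_per_fp = defaultdict(set)
    for ip, ua, uuid in events:
        users_per_fp[(ip, ua)].add(uuid)
    shared = sum(1 for users in users_per_fp.values() if len(users) > 1)
    return shared / len(users_per_fp)

events = [
    ("10.0.0.1", "Mozilla/5.0 (X11; Linux)", "u1"),
    ("10.0.0.1", "Mozilla/5.0 (X11; Linux)", "u2"),  # NAT: two users, one fingerprint
    ("10.0.0.2", "Mozilla/5.0 (Windows NT)", "u3"),
]
print(collision_rate(events))  # 0.5: one of two fingerprints is shared
```

Recomputing this rate over successive time windows, or split by platform and geography, would give the decay and variation curves in points 2 and 3 above.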
Team:
The schema MobileWikiAppShareAFact is sending a lot of events; it may be
worth thinking about whether we need that many. It is again a case where
tables are becoming huge and hard to query quickly.
cc-ing Jon as schema owner.
Can this data be sampled more aggressively? I have filed a ticket to this
effect:
https://phabricator.wikimedia.org/T122224
Thanks,
Nuria
On Tue, Dec 22, 2015 at 8:35 AM, Adam Baso <abaso(a)wikimedia.org> wrote:
> Replacing mobile-tech with mobile-l (internal mobile-tech list
> discontinued).
>
>
> On Tuesday, December 22, 2015, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
>> Team:
>>
>> As part of our effort to convert the EventLogging MySQL database to the
>> TokuDB engine, we need to stop EventLogging events from flowing into the
>> MobileWikiAppShareAFact table. We are using this one table to see how long
>> the conversion will take, in order to plan for a larger outage window.
>>
>>
>> Let us know if data should be backfilled; it can be. We anticipate
>> events will not flow into the table for the better part of one day.
>>
>>
>> Thanks,
>>
>> Nuria
>>
>>
>>
> _______________________________________________
> Mobile-l mailing list
> Mobile-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>
>