Trying again, adding analytics@ (public e-mail list)
On Fri, Jan 15, 2016 at 5:22 AM, Marcel Ruiz Forns <mforns(a)wikimedia.org>
wrote:
> I also think we should start with exposing the 3 API endpoints in a GUI,
> which - as Dan says - we know respond to community interests. And then ask
> the community for more input; that could mean improvements to the tool, new
> endpoints, or completely new ideas...
>
> On Thu, Jan 14, 2016 at 10:45 PM, Dan Andreescu <dandreescu(a)wikimedia.org>
> wrote:
>
>> I'm ok if people want to take an iterative approach, I just want to point
>> out that the usage information is not very indicative of value at this
>> point. The API is not widely used and the per-article endpoint is expected
>> to be hit much much more than per-project or top simply because the queries
>> are many orders of magnitude more granular. So we can't really judge
>> importance from that comparison.
>>
>> On Thu, Jan 14, 2016 at 4:43 PM, Leila Zia <leila(a)wikimedia.org> wrote:
>>
>>>
>>> On Thu, Jan 14, 2016 at 1:09 PM, Dan Andreescu <dandreescu(a)wikimedia.org
>>> > wrote:
>>>
>>>> My question is: How are we going to define the requirements for the
>>>>> tool? I was planning to get some community input on defining which stats
>>>>> would help contributors the most. What do you think?
>>>>>
>>>>
>>>> My opinion here is that we should just expose everything the pageview
>>>> API is capable of. It's only 3 different endpoints and they were chosen
>>>> based on what the community found useful. As we add more endpoints we can
>>>> keep checking if visualization is important. But of course if others have
>>>> other more specific plans, we can wait for those tools to be built and
>>>> iterate.
>>>>
>>>
>>> Building on Dan's suggestion: I'd go with communicating and/or
>>> discussing the following with the community:
>>>
>>> * the 3 different metrics we can offer a UI for
>>> * what other metrics they find useful for their work. This will help us
>>> collect information about what other kinds of metrics we should offer as
>>> endpoints if we decide to add more (pageviews per article by country has
>>> come up many times, for example)
>>> * whether they consider the wish satisfied if we offer a UI for the
>>> 3 different metrics, and perhaps over time add more metrics to the UI
>>> as they become available (not necessarily in 2016).
>>>
>>> Leila
>>>
>>>
>>>
>>
>
>
> --
> *Marcel Ruiz Forns*
> Analytics Developer
> Wikimedia Foundation
>
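[Editor's note: the three Pageview API endpoints discussed above (per-article, per-project aggregate, and top) can be sketched as URL templates. The paths below follow the public REST API's structure, but the example project, article, and dates are purely illustrative.]

```python
# Sketch of the three Pageview API endpoints under discussion.
# Paths follow the public REST API; example values are illustrative.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def per_article(project, article, start, end,
                access="all-access", agent="all-agents", granularity="daily"):
    """Views over time for a single article."""
    return (f"{BASE}/per-article/{project}/{access}/{agent}/"
            f"{article}/{granularity}/{start}/{end}")

def per_project(project, start, end,
                access="all-access", agent="all-agents", granularity="daily"):
    """Aggregate views for a whole project."""
    return (f"{BASE}/aggregate/{project}/{access}/{agent}/"
            f"{granularity}/{start}/{end}")

def top(project, year, month, day, access="all-access"):
    """Most-viewed articles for a given day."""
    return f"{BASE}/top/{project}/{access}/{year}/{month}/{day}"

print(per_article("en.wikipedia", "Albert_Einstein", "20160101", "20160131"))
```

As Dan notes above, the per-article endpoint naturally gets hit far more often than the other two, since its queries are orders of magnitude more granular.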
Adding Analytics list and Neil P Quinn,
FYI, Nuria filed this ticket to track this issue:
https://phabricator.wikimedia.org/T123634 Moving discussion there.
On Thu, Jan 14, 2016 at 1:10 AM, Faidon Liambotis <faidon(a)wikimedia.org>
wrote:
> Well, this was a snapshot of the situation then. It doesn't preclude
> other issues (possibly caused by other, similarly-sized queries) in the
> previous hours/days. That said, Tendril definitely shows a correlation
> between all kinds of metrics (page/disk I/O, write traffic, etc.) and the
> aforementioned timeline of the past 1.5 days.
>
> Plus, replag for s1 was at the time ~133,900 seconds and rising, which
> matches the timeline of that large query too. Since I killed it it has
> been steadily dropping, albeit slowly (currently at 132,445). It will
> probably take a couple of days to recover. Since the server is both
> backlogged and I/O-saturated, recovery will depend a lot on how much load
> the server gets from other queries (currently it's being hammered by
> two other large queries that have been running for over 27,000 and 4,000
> seconds respectively, for example).
>
> Faidon
>
> On Wed, Jan 13, 2016 at 09:44:50PM -0800, Oliver Keyes wrote:
> > Indeed, but 1.5 days is <half the time the problem has been occurring
> for.
> >
> > On 13 January 2016 at 21:01, Faidon Liambotis <faidon(a)wikimedia.org>
> wrote:
> > > "SELECT * FROM information_schema.processlist ORDER BY time DESC"
> > > informs us of this:
> > >
> > > | 5599890 | research | 10.64.36.103:53669 | enwiki
> | Query | 133527 | Queried about 890000 rows
> | CREATE TEMPORARY TA
> > > SELECT
> > > page_id,
> > > views
> > > FROM (
> > > SELECT
> > > page_namespace,
> > > page_title,
> > > SUM(views) AS views
> > > FROM staging.page_name_views_dupes
> > > WHERE page_namespace = 0
> > > GROUP BY 1,2
> > > ) AS group_page_name_views
> > > INNER JOIN enwiki.page USING (page_namespace, page_title)
> > >
> > > Column 6 is "time", i.e. this query was running for 133527 seconds at
> > > the time (i.e. ~1.5 days!), which is obviously Not Good™. I just ran
> > > "KILL QUERY 5599890;", hopefully this will help.
> > >
> > > The second-longest-standing query has been running for over 6 hours
> > > now and is way too long to paste (87 lines, starts with "INSERT INTO
> > > editor_month_global", inline comments, all kinds of subselects in inner
> > > joins etc., queried "about 2470000 rows"). I left it be for now, we'll
> > > see how that goes and I may eventually kill it too, as I/O is still
> > > pegged at 100%.
> > >
> > > I believe long-running queries targeted at the research slaves aren't
> > > particularly new, but they are often the source of such problems, so
> > > they're a good place to start when investigating such issues. There is
> > > only so much a poor database server (and software) can do :)
> > >
> > > Regards,
> > > Faidon
> > >
> > > On Wed, Jan 13, 2016 at 06:55:26PM -0500, Andrew Otto wrote:
> > >> Hi all,
> > >>
> > >> Replication to dbstore1002 is having a lot of trouble. From
> > >> https://tendril.wikimedia.org/host/view/dbstore1002.eqiad.wmnet/3306,
> we
> > >> see that normal replication is about 9 hours behind at the moment.
> > >> However, the EventLogging `log` database is not replicated with usual
> MySQL
> > >> replication. Instead, a custom bash script[1] periodically uses
> mysqldump
> > >> to copy data from m4-master (dbproxy1004) into dbstore1002. (I just
> > >> recently found out that this wasn’t regular replication, and I’m not
> > >> familiar with the reasoning behind using a custom script.)
> > >>
> > >> The EventLogging `log` custom replication has been lagging for days
> now.
> > >> Also, at about 18:00 UTC today (Jan 13), we can see a huge increase in
> > >> write traffic on dbstore1002. I looked at each of the normal
> replication
> > >> masters, and don’t see this write traffic there. EventLogging
> traffic also
> > >> seems to be about the same over the last week or so (although there
> was an
> > >> increase in events being produced by the MobileWebSectionUsage
> schema
> > >> starting Dec 18, but I don’t think this is the problem).
> > >>
> > >> Disk util is around 100%, but this has been the case for a while now.
> > >> Today I filed https://phabricator.wikimedia.org/T123546 for a bad
> mem chip
> > >> or slot on dbproxy1004, but this also seems to have been the status
> quo for
> > >> quite a while, and doesn’t correlate with this lag.
> > >>
> > >> I’m not sure where else to look at the moment, and I need to run for
> the
> > >> day. I’ll try to look at this more in my morning tomorrow.
> > >>
> > >> -AO
> > >>
> > >> [1]
> > >>
> https://github.com/wikimedia/operations-puppet/blob/f0df1ec45b3f70a5c041cef…
> > >
> > >> _______________________________________________
> > >> Ops mailing list
> > >> Ops(a)lists.wikimedia.org
> > >> https://lists.wikimedia.org/mailman/listinfo/ops
> > >
> > >
> >
> >
> >
> > --
> > Oliver Keyes
> > Count Logula
> > Wikimedia Foundation
>
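[Editor's note: the processlist check Faidon describes can be sketched in a few lines. This is an illustrative filter over rows shaped like `information_schema.processlist` output, mirroring "SELECT * FROM information_schema.processlist ORDER BY time DESC" followed by a manual KILL QUERY; the sample rows and threshold are made up.]

```python
# Hypothetical sketch: find queries that have been running longer than a
# threshold, longest-running first, so an operator can decide what to kill.
def long_running(processlist_rows, threshold_seconds=3600):
    """Return processlist rows for active queries exceeding the threshold,
    sorted by running time descending."""
    offenders = [r for r in processlist_rows
                 if r["command"] == "Query" and r["time"] > threshold_seconds]
    return sorted(offenders, key=lambda r: r["time"], reverse=True)

rows = [
    {"id": 5599890, "command": "Query", "time": 133527,
     "info": "CREATE TEMPORARY TA..."},
    {"id": 5601000, "command": "Sleep", "time": 200000, "info": None},
    {"id": 5601234, "command": "Query", "time": 42, "info": "SELECT 1"},
]
for r in long_running(rows):
    print(f"KILL QUERY {r['id']};  -- running {r['time']}s")
```

The 133,527-second query from the thread would be the only row flagged here; idle connections (command `Sleep`) are ignored no matter how old.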
Hi all,
Due to a recent security report, I’ve decided to disable public access to
the Yarn HTTP UI. There was no security breach, but I was made aware of
the ability to do more with the HTTP interface than I had previously known
about, and I wasn’t comfortable with it being public anymore. The YARN
ResourceManager HTTP interface is still accessible from within the
analytics cluster. I’ve just disabled the public proxying at
yarn.wikimedia.org.
If you want to access the ResourceManager job browser, you’ll have to ssh
tunnel into the cluster first. Instructions are here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Access#ssh_tunnel.28s…
-Andrew Otto
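[Editor's note: the ssh tunnel Andrew mentions can be sketched as below. The hostnames and ports are placeholders, not the real cluster hosts -- see the wikitech page linked above for the actual instructions.]

```python
# Illustrative sketch of the ssh tunnel needed to reach the YARN
# ResourceManager UI now that public proxying is disabled.
# Hostnames here are hypothetical examples.
import subprocess

def tunnel_command(local_port, rm_host, rm_port, bastion):
    """Build an ssh command that forwards local_port to rm_host:rm_port
    through the bastion, without running a remote shell (-N)."""
    return ["ssh", "-N",
            "-L", f"{local_port}:{rm_host}:{rm_port}",
            bastion]

cmd = tunnel_command(8088, "resourcemanager.example.internal", 8088,
                     "bastion.example.org")
print(" ".join(cmd))
# While the tunnel is up (e.g. via subprocess.run(cmd)), the job browser
# would be reachable at http://localhost:8088/
```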
Team:
We had to take some desperate measures to unblock replication on
EventLogging in the absence of our DBA.
We had to drop the MobileWebSectionUsage table. The table is blacklisted, as
the stream of events was too great for the system to sustain; we will
whitelist it again once the new sampling rate takes effect.
The data is not available in MySQL, but it is present in Hadoop, so, to be
clear, no data has been lost.
We will report on replication issues as we have more knowledge. Sorry again
about the late notice.
Thanks,
Nuria
Hey all,
I'm happy to announce the release of a robust, tested client for Yuvi
and Aaron's "ORES" system for the R statistical programming language.
It can be obtained from https://github.com/Ironholds/ores or CRAN and
contains a long-form vignette to explain the ORES system and use of
the client, along with the standard documentation for individual
pieces of code.
Thanks,
--
Oliver Keyes
Count Logula
Wikimedia Foundation
Hi,
I am interested to know whether Wikipedia makes public how many backlinks each page gets.
I am working on a search engine for Wikipedia, and, as you would expect, it sucks.
So I went and tested the same searches directly on Wikipedia, and, no offence, they suck even more.
So I went on Google and performed the same searches with site:wikipedia.org added, and Google was a little bit better (although not much compared with my 1-day-development search engine).
I want to make my Wikipedia search better, and having a table that tells me how many non-Wikipedia pages point to a certain Wikipedia page might improve my algorithm.
Does anyone know if Wikipedia publishes such data?
Thank you!
Edison Nica
Http://www.0pii.com
Edisonn(a)0pii.com
Sent from my T-Mobile 4G LTE Device
Hi all,
My colleagues and I are interested in getting the statistics/data for
edits per month for all articles that form part of the WikiProject
Mathematics, as well as some other WikiProjects for comparison such as
WikiProject Computer Science and WikiProject Statistics. I was
wondering if that's possible?
We want to write and publish a non-technical article in a mathematical
gazette encouraging mathematical inclined people to contribute more to
Wikipedia.
Thank you for your time.
Best,
Paul
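[Editor's note: the aggregation Paul asks about can be sketched as below. A real pipeline would first pull revision timestamps for the WikiProject's articles (from the MediaWiki API, the database replicas, or the dumps); this shows only the per-month bucketing step, on made-up data.]

```python
# Sketch: count edits per month from a list of revision timestamps.
# Input timestamps are ISO-8601 strings, as the MediaWiki API returns them.
from collections import Counter

def edits_per_month(revision_timestamps):
    """Map 'YYYY-MM' -> number of edits."""
    return Counter(ts[:7] for ts in revision_timestamps)

timestamps = [
    "2015-11-03T12:00:00Z",
    "2015-11-17T08:30:00Z",
    "2015-12-01T23:59:59Z",
]
print(dict(edits_per_month(timestamps)))  # → {'2015-11': 2, '2015-12': 1}
```

Summing these counters across all articles tagged by a WikiProject would give the per-project monthly totals for the comparison Paul describes.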
Hey y'all
I'm working on a piece of research (largely recreational) on the old
problem of fingerprinting users with minimal information - namely the
combination of a user agent and an IP address. Basically I'm looking
to put together a piece of work showing:
1. How sub-standard it is;
2. How fast it decays;
3. How the sub-standardness varies by (platform|location)
This would be pretty doable with internal data; basically I'd need a
schema with IP, user agent and a per-user UUID that's got a decent
(>=24 hours) expiry time. My question: does anyone know of a table
with recent data that meets these requirements? And, if not, anyone
with EventLogging experience interested in working on the problem with
me?
--
Oliver Keyes
Count Logula
Wikimedia Foundation
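[Editor's note: the measurement Oliver describes can be sketched as below: given events of (IP, user agent, per-user UUID), how often does a (UA, IP) fingerprint collide, i.e. map to more than one distinct user? The data here is toy data for illustration only.]

```python
# Illustrative sketch: fraction of (ip, user_agent) fingerprints that are
# shared by more than one distinct user -- one way to quantify how
# "sub-standard" UA+IP fingerprinting is.
from collections import defaultdict

def collision_rate(events):
    """events: iterable of (ip, user_agent, uuid) tuples.
    Returns the fraction of fingerprints seen for >1 distinct uuid."""
    users_per_fp = defaultdict(set)
    for ip, ua, uuid in events:
        users_per_fp[(ip, ua)].add(uuid)
    shared = sum(1 for users in users_per_fp.values() if len(users) > 1)
    return shared / len(users_per_fp)

events = [
    ("10.0.0.1", "Mozilla/5.0 (X11; Linux)", "u1"),
    ("10.0.0.1", "Mozilla/5.0 (X11; Linux)", "u2"),  # NAT: two users, one fingerprint
    ("10.0.0.2", "Mozilla/5.0 (Windows NT)", "u3"),
]
print(collision_rate(events))  # 0.5: one of two fingerprints is shared
```

Recomputing this rate over successive time windows, or split by platform and geography, would give the decay and variation curves in points 2 and 3 above.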
Team:
The schema MobileWikiAppShareAFact is sending a lot of events; it may be
worth thinking about whether we need that many. It is again a case where
tables are becoming huge and hard to query quickly.
cc-ing Jon as schema owner.
Can this data be sampled more aggressively? I have filed a ticket to this
effect:
https://phabricator.wikimedia.org/T122224
Thanks,
Nuria
On Tue, Dec 22, 2015 at 8:35 AM, Adam Baso <abaso(a)wikimedia.org> wrote:
> Replacing mobile-tech with mobile-l (internal mobile-tech list
> discontinued).
>
>
> On Tuesday, December 22, 2015, Nuria Ruiz <nuria(a)wikimedia.org> wrote:
>
>> Team:
>>
>> As part of our effort to convert the EventLogging MySQL database to the
>> TokuDB engine, we need to stop EventLogging events from flowing into the
>> MobileWikiAppShareAFact table. We are using this one table to see how long
>> the conversion will take, in order to plan for a larger outage window.
>>
>>
>> Let us know if data should be backfilled; it can be. We anticipate
>> events will not flow into the table for the better part of one day.
>>
>>
>> Thanks,
>>
>> Nuria
>>
>>
>>
> _______________________________________________
> Mobile-l mailing list
> Mobile-l(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/mobile-l
>
>