Hey guys,
I reached out to this guy yesterday about the bug I ran into in Scribe. He had posted on the scribe-server Google group that he had fixed this bug, and I also wanted to let him know about our potential efforts to standardize Scribe packaging.
Here's his opinion on Scribe:
> I personally gave up on Scribe, I'd recommend that you
> consider Flume as a better replacement, that is more supported and
> developed. Scribe has never been really well written or maintained,
> it's just one of the many hacks that Facebook released.
In general, Scribe does seem to be pretty much abandoned. A couple of pull requests have been merged in the last year, but beyond that there isn't much activity: https://github.com/facebook/scribe/commits/master
It would be really interesting to know whether (and how) Facebook still uses Scribe internally. I'm pretty sure they've done a lot more with Hadoop since 2008-2010, when Scribe was being more actively promoted. Maybe they're using Flume instead now? We need a Facebook insider; anyone know one?
-Ao
Begin forwarded message:
> From: tsuna <tsunanet(a)gmail.com>
> Subject: Re: Scribe Packaging Effort
> Date: July 26, 2012 12:45:30 AM EDT
> To: Andrew Otto <otto(a)wikimedia.org>
>
> On Wed, Jul 25, 2012 at 9:05 AM, Andrew Otto <otto(a)wikimedia.org> wrote:
>> Hi Benoît,
>
> Hi Andrew,
>
>> In the meantime, I have another question! I just ran into a problem that
>> you say you fixed in this thread:
>> https://groups.google.com/group/scribe-server/tree/browse_frm/month/2010-01…
>
> Are you referring to this?
>> [Thu Nov 19 18:29:59 2009] "[hdfs] Connecting to HDFS"
>> *** glibc detected *** ./scribed: munmap_chunk(): invalid pointer: 0x0000000001ea19c3 ***
>
>> However, the commit you link to 404s. I'm willing to rebuild scribe with
>> whatever fix or release version is necessary. Can you point me in the right
>> direction? What source should I use to build scribe to fix this bug?
>
> If you're referring to the bug above, it's a very old bug, it must be
> fixed upstream already. I can't believe you're running into the same
> bug almost 3 years later, it must be a different issue.
>
> Either way, I personally gave up on Scribe, I'd recommend that you
> consider Flume as a better replacement, that is more supported and
> developed. Scribe has never been really well written or maintained,
> it's just one of the many hacks that Facebook released.
>
> Good luck.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
Belated cross-post for those who are interested in Git analytics.
-Sumana
-------- Original Message --------
Subject: [Wikitech-l] Git code review metrics
Date: Fri, 20 Apr 2012 16:42:23 -0700
From: Erik Moeller <erik(a)wikimedia.org>
Reply-To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
To: Wikimedia developers <wikitech-l(a)lists.wikimedia.org>
Following up on the earlier thread by Rob [1], Rob and I kicked around
the question of what metrics/targets for code review we want to surface
on an ongoing basis. We're not going to invest in a huge dashboard
project right now, but we'll try to get at least some of the key
metrics generated and visualized automatically. Help is appreciated,
starting with deciding which metrics we should look at.
Here's what we came up with, by priority:
1) Most important: Time series graph of # of open changesets
Target: Number of open changesets should not exceed 200.
Optional breakdown:
- mediawiki/core
- mediawiki/extensions
- WMF-deployed extensions
- specific repos
2) Important: Aging trends.
- Time series graph of # open changesets older than a, b, c days
(to indicate troubling aging trends, e.g. a=3, b=5, c=7)
- Target: There should be 0 changes that haven't been looked at
at all for more than 7 days.
- Including only: changes which have not received a -1 review, -1
verification, or -2
- Optional breakdown as above
- Rationale: We're looking for tendencies of complete neglect of
submissions here, which is why we have to exclude -1s and -2s
(a minimal computation sketch follows below).
3) Possibly useful:
- Per-reviewer or reviewee(?) statistics regarding merge activity,
number of -1s, neglected code, etc.
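To make the aging metric concrete: a minimal sketch of how the neglect
counts in (2) could be computed. The input format here is hypothetical;
a real version would pull open changesets and their review scores from
Gerrit.

    from datetime import datetime, timedelta

    def neglected_counts(changesets, now, thresholds=(3, 5, 7)):
        """Count open changesets older than each threshold (in days)
        that have received no -1 review, -1 verification, or -2."""
        counts = {t: 0 for t in thresholds}
        for change in changesets:
            # Changes with negative feedback were looked at, so they
            # don't count as neglected.
            if any(score < 0 for score in change["scores"]):
                continue
            age_days = (now - change["created"]).days
            for t in thresholds:
                if age_days > t:
                    counts[t] += 1
        return counts

    # Example: two untouched changes and one that already got a -1.
    now = datetime(2012, 4, 20)
    sample = [
        {"created": now - timedelta(days=8), "scores": []},
        {"created": now - timedelta(days=4), "scores": []},
        {"created": now - timedelta(days=10), "scores": [-1]},
    ]
    print(neglected_counts(sample, now))  # {3: 2, 5: 1, 7: 1}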
Any obvious thinking errors in the above / do the targets make sense /
should we look at other metrics or approaches?
Erik
[1] http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/059940.html
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Support Free Knowledge: https://wikimediafoundation.org/wiki/Donate
Hey Jan. Yes, this tool looks like almost exactly what I need. The issue
is that it only does 7 languages, and we are currently translating content
into nearly 40. Would it be possible to expand it to all the languages of
Wikipedia?
James Heilman
On Tue, Jul 24, 2012 at 6:00 AM, <analytics-request(a)lists.wikimedia.org> wrote:
> Date: Mon, 23 Jul 2012 15:02:53 +0200
> From: Jan Ainali <jan.ainali(a)wikimedia.se>
> To: "A mailinglist for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics."
> <analytics(a)lists.wikimedia.org>
> Subject: Re: [Analytics] Calculating page views for projects in other
> languages
>
> I just wanted to let you know of a tool that Holger Motzkau
> (User:Prolineserver) made. It does not completely solve your problem,
> but it comes close: it lists the page views for a category on one
> Wikipedia (plus all the interwiki links and the QRpedia statistics).
> It shouldn't be too hard to feed it a list of articles that carry a
> certain template, I guess.
>
> http://toolserver.org/~prolineserver/glamorous/glamorous_cats.php
>
> --
> Best,
> Jan Ainali
> Chairman, Wikimedia Sverige <http://se.wikimedia.org/wiki/Huvudsida>
>
>
> 2012/7/23 Erik Zachte <ezachte(a)wikimedia.org>
>
>> James seeks a one-page overview of the most-read articles for any
>> wiki/project.
>>
>> A list of articles per project could be retrieved from the MediaWiki API.
>> I did something similar with a list of articles per category (incl.
>> subcategories, x levels deep). Perl script available on request.
>>
>> Then the machine-readable version of grok could be used to retrieve
>> article counts (see Dario's comment). However, this might not scale well
>> to 1,000s (or 10,000s) of projects and 100,000s of pages.
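A minimal sketch of that two-step pipeline (not part of the original mail;
API continuation is omitted, and the grok JSON field name is an assumption),
using the MediaWiki categorymembers API and the stats.grok.se JSON endpoint
that Dario mentions below:

    import json
    import urllib.parse
    import urllib.request

    def category_members(lang, category):
        """Yield article titles in a category (subcategory recursion
        and API continuation omitted for brevity)."""
        url = ("https://%s.wikipedia.org/w/api.php?action=query"
               "&list=categorymembers&cmtitle=Category:%s"
               "&cmlimit=500&format=json"
               % (lang, urllib.parse.quote(category)))
        data = json.load(urllib.request.urlopen(url))
        for member in data["query"]["categorymembers"]:
            yield member["title"]

    def monthly_views(lang, month, title):
        """Sum one month of per-article daily view counts from the
        stats.grok.se JSON (assumes a "daily_views" mapping)."""
        url = "http://stats.grok.se/json/%s/%s/%s" % (
            lang, month, urllib.parse.quote(title))
        data = json.load(urllib.request.urlopen(url))
        return sum(data["daily_views"].values())

    for title in category_members("fr", "Paris"):
        print(title, monthly_views("fr", "201207", title))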
>>
>> In the somewhat longer run I see two developments that might warrant
>> putting
>> this on hold:
>>
>> 1)
>> The new analytics cluster will be used to aggregate page and image views.
>> (Another use case would be aggregating image views per donating GLAM
>> institute.)
>> Exactly which aggregations are needed is better determined once the
>> infrastructure is available and capacity is known.
>>
>> 2)
>> There are scripts to aggregate Domas' hourly page view feeds into
>> monthly files. These aggregates are much smaller (after cruft removal,
>> only 2 GB per month) without losing hourly resolution, making them easy
>> to download and archive/process somewhere else.
>> http://lists.wikimedia.org/pipermail/wikitech-l/2011-August/054644.html
>> These scripts need final work; I spoke to long-time dev wikimedian EMW
>> at Wikimania, and he might be interested in taking this on, starting
>> October. From these aggregates, the overviews for 1,000s (or 10,000s) of
>> projects could be generated in a batch process, though only after each
>> month completes.
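A toy sketch of the aggregation idea (the filenames are made up, and
unlike the real scripts this simplification sums straight to monthly
totals instead of keeping hourly resolution). Each line in Domas' hourly
files is "project page count bytes":

    import glob
    from collections import Counter

    totals = Counter()
    for path in sorted(glob.glob("pagecounts-201207*")):  # one file/hour
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.split()
                if len(fields) != 4:
                    continue  # crude cruft removal
                project, page, count, _size = fields
                try:
                    totals[(project, page)] += int(count)
                except ValueError:
                    continue

    with open("pagecounts-2012-07-monthly", "w", encoding="utf-8") as out:
        for (project, page), count in sorted(totals.items()):
            out.write("%s %s %d\n" % (project, page, count))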
>>
>> Erik Zachte
>>
>>
>> -----Original Message-----
>> From: analytics-bounces(a)lists.wikimedia.org
>> [mailto:analytics-bounces@lists.wikimedia.org] On Behalf Of Dario
>> Taraborelli
>> Sent: Friday, July 20, 2012 6:56 PM
>> To: A mailinglist for the Analytics Team at WMF and everybody who has an
>> interest in Wikipedia and analytics.
>> Subject: Re: [Analytics] Calculating page views for projects in other
>> languages
>>
>> James,
>>
>> can you expand on this request? If you are interested in per-article
>> pageview stats you can use: http://stats.grok.se/
>>
>> For example: http://stats.grok.se/fr/201207/Paris
>> A machine readable version: http://stats.grok.se/json/fr/201207/Paris
>>
>> Dario
>>
>> On Jul 20, 2012, at 9:38 AM, Sumana Harihareswara wrote:
>>
>> > James: Good places to add your requests to:
>> >
>> > https://lists.wikimedia.org/mailman/listinfo/toolserver-l
>> >
>> > https://www.mediawiki.org/wiki/Annoying_large_bugs
>> >
>> >
>> > --
>> > Sumana Harihareswara
>> > Engineering Community Manager
>> > Wikimedia Foundation
>> >
>> >
>> >
>> > On 07/20/2012 12:38 PM, Diederik van Liere wrote:
>> >> It is always tricky to convince someone to start working on a
>> >> request for yourself. Given the fact that there is an existing code
>> >> base, I would say that your best bet is to study that and tweak it
>> >> to your own requirements. If you have specific technical questions,
>> >> there are enough people within the different Wikimedia communities
>> >> who can help you.
>> >>
>> >> Good luck!
>> >>
>> >> Diederik
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On 2012-07-20, at 11:47, James Heilman <jmh649(a)gmail.com> wrote:
>> >>
>> >>> This is something I was hoping to convince someone with programming
>> >>> skills to take on. What prevents me from doing it is my complete
>> >>> lack of programming skills, thus the request here.
>> >>>
>> >>> --
>> >>> James Heilman
>> >>> MD, CCFP-EM, Wikipedian
>> >>>
>> >>> The Wikipedia Open Textbook of Medicine
>> >>> www.opentextbookofmedicine.com
>> >>>
FYI
---------- Forwarded message ----------
From: Douglas Moore <douglas.moore(a)thinkbiganalytics.com>
Date: Tue, Jul 3, 2012 at 12:05 PM
Subject: SECURITY exposure issue - HADOOP
To: noc(a)wikimedia.org
Hello,
While searching for Hadoop-related material on Google, I found an
administrative page on your Hadoop server that is exposed to the
Internet and indexed by Google.
Hadoop is not intended to run directly on the Internet, so we believe
this situation represents a potential security risk to your fine
organization, and we think you should investigate further (and close
public access to this research cluster).
Here is one of the open URLs: http://analytics1001.wikimedia.org:50070
Please kindly acknowledge the receipt of this email.
Thanks,
--
Douglas Moore
781-454-5971
@Douglas_MA
skype: dmoore247
Douglas.Moore(a)thinkbiganalytics.com
http://www.thinkbiganalytics.com
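A quick way to verify afterwards that such a port is closed off (the host
and port are the ones from the report above; run this from an outside
network):

    import socket

    def is_open(host, port, timeout=5):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Should print False once the NameNode web UI is no longer exposed.
    print(is_open("analytics1001.wikimedia.org", 50070))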
This is a reminder that you're invited to the pre-Wikimania hackathon,
10-11 July in Washington, DC, USA:
https://wikimania2012.wikimedia.org/wiki/Hackathon
In order to come, you have to register for the Wikimania conference:
https://wikimania2012.wikimedia.org/wiki/Registration
(Unfortunately, the period for requesting scholarships is now over.)
At the hackathon, we'll have trainings and projects for novices, and we
welcome creators of all Wikimedia technologies -- MediaWiki, gadgets,
bots, mobile apps, you name it -- to hack on stuff together and teach
each other.
Hope to see you!
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation
This weekend, TechWeek Chicago starts: http://techweek.com/
The Foundation's Peter Gehres is co-presenting the analytics presentation
"How Wikipedia Doubled its Online Fundraising" this Saturday. If you're
at TechWeek, he and other Wikimedians want to meet with you and talk shop!
http://schedule.techweek.com/event/003fc017e0530c08eb34f08033c50f86
Saturday June 23, 2012 4:00pm - 4:45pm @ 1 - Main Stage (222 Merchandise
Mart Plaza, Chicago, IL)
"In 2010, online donations to Wikipedia more than doubled, from $7.5
million to $16 million and, in 2011, increased another 33%. Much of this
increase was driven by user research conducted in Chicago. Design
researcher Billy Belchev from Webitects will get into the nitty-gritty
of form design, testing, and user interviews. Do one-step forms work
better than multi-step? Does PayPal help or hurt your numbers? What is
the effect of “Jimmy” banners? The answers are based on data from the
fifth most trafficked website in the world."
--
Sumana Harihareswara
Engineering Community Manager
Wikimedia Foundation