I discovered a presentation from last year that has answered most of my initial questions. Here are my notes:
Notes from "Hadoop and Beyond", June 25, 2015
- Video: https://www.youtube.com/watch?v=tx1pagZOsiM
- Slides: https://docs.google.com/presentation/d/1ZPmfN-kmfqWEJUMIRg2feSstFPY45js4AnYaf3NbLNE/
webrequest logs
This is a log of every HTTP request to WMF servers. Traffic can peak at more than 200k requests per second, which is *a lot*.

udp2log
Doesn't scale, because every instance must process every message, every packet.
Because it uses UDP, delivery isn't guaranteed, so data can be dropped.
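To make the drop risk concrete, here is a minimal, generic Python sketch of fire-and-forget UDP logging (this is not the actual udp2log code; the collector host, port, and log line are made up):

    import socket

    # Fire-and-forget UDP: sendto() returns once the datagram is handed to the
    # kernel, with no acknowledgement, so a dropped packet is silently lost.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send_log_line(line):
        # Hypothetical collector address.
        sock.sendto(line.encode("utf-8"), ("log-collector.example.org", 8420))

    send_log_line("203.0.113.5 GET /wiki/London 200")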
Wikimedia Statistics

http://stats.wikimedia.org
Most of the data here is generated from udp2log-collected data.
It's sampled because there's too much traffic for our storage/processing capacity.

Analytics cluster
Uses Hadoop for batch processing of logs, and (mostly) uses Hive to expose the data to analysts.
This diagram is a useful (and frequent) reference: [image: Analytics cluster diagram]
Note the loop-back lines in the diagram -- these are batch jobs that do various things (geocoding, anonymizing, etc.).

Hadoop
Hadoop = a distributed file system + a framework for distributed computation
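To make "framework for distributed computation" concrete, here is a rough Hadoop Streaming sketch -- a Python mapper and reducer that count requests per URI path over log files in HDFS. The whitespace-separated field layout and the invocation are illustrative assumptions, not the real webrequest schema:

    #!/usr/bin/env python
    # count_uris.py -- illustrative Hadoop Streaming job (assumed log format).
    # Roughly: hadoop jar hadoop-streaming.jar -files count_uris.py \
    #   -mapper 'count_uris.py map' -reducer 'count_uris.py reduce' \
    #   -input /logs -output /uri_counts
    import sys

    def map_phase():
        for line in sys.stdin:
            fields = line.split()
            if len(fields) > 2:
                print("%s\t1" % fields[2])  # emit (uri_path, 1)

    def reduce_phase():
        # Hadoop sorts map output by key, so equal keys arrive adjacent.
        current, total = None, 0
        for line in sys.stdin:
            key, count = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = key, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        map_phase() if sys.argv[1] == "map" else reduce_phase()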
Hive

*Hypothetical analyst question: How do we get the top referrers for an article?*
Hive maps a SQL-like language onto Hadoop MapReduce jobs.
*Example Hive query to answer the above question:*
    SELECT SUBSTR(referer, 30) AS source, COUNT(DISTINCT ip) AS hits
    FROM webrequest
    WHERE uri_path = "/wiki/London"
      AND uri_host = "en.wikipedia.org"
      AND referer LIKE "http://en.wikipedia.org/wiki/%"
      AND http_status = 200
      AND webrequest_source = 'text'
      AND year = 2014 AND month = 07 AND day = 14 AND hour = 14
    GROUP BY SUBSTR(referer, 30)
    ORDER BY hits DESC
    LIMIT 50;
This is nice because it lets you run SQL on top of text data.

Kafka cluster
This serves as a replacement for udp2log.
Kafka is a reliable, horizontally scalable, distributed pub/sub buffer.
Processes up to 200k messages per second at 30 MB per second.
Data is consumed into Hadoop every ten minutes.
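As a sketch of what consuming from Kafka looks like, here is a minimal reader using the kafka-python library; the broker address and topic name are assumptions, not the production values:

    from kafka import KafkaConsumer

    # Subscribe to one topic and stream messages as they are published.
    consumer = KafkaConsumer(
        "webrequest_text",                     # hypothetical topic name
        bootstrap_servers=["kafka1001:9092"],  # hypothetical broker
        auto_offset_reset="latest",
    )

    for message in consumer:
        # Each message is one log record published by the frontend caches.
        print(message.value.decode("utf-8", "replace"))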
Camus

A job that runs on Hadoop to consume from Kafka and write the data into HDFS.
Launches a MapReduce job every hour(?).
Can inspect the data as it's coming in -- lets it handle data based on time/content.

Oozie
Oozie is a Hadoop job scheduler that allows the composition of complex workflows.
Jobs are launched based on the existence of new data sets, rather than simply on time or periodic intervals. This lets us trigger Oozie whenever a Camus job completes.
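A toy illustration of the dataset-based trigger idea (not Oozie itself): wait for an hourly partition's done-flag to appear in HDFS before kicking off downstream work. The path layout and flag name here are assumptions:

    import subprocess
    import time

    def partition_ready(year, month, day, hour):
        # Hypothetical layout for hourly imports; `hdfs dfs -test -e` exits 0
        # if the path exists.
        path = "/data/raw/webrequest/hourly/%d/%02d/%02d/%02d/_SUCCESS" % (
            year, month, day, hour)
        return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0

    while not partition_ready(2015, 6, 25, 14):
        time.sleep(60)
    print("Hourly data set is present; launch the downstream job here.")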
Hue

Web GUI for interacting with Hadoop, Hive, Oozie, etc.
Provides a Hive query interface, a Pig script interface, a way to launch jobs, browse the file system, and install add-ons.
A command-line interface is also available for all of the above.

MediaWiki Vagrant
To play with this in Vagrant:
- Edit your *.settings.yaml* file and add: vagrant_ram: 2048
- Comment out include role::mediawiki in puppet/manifests/site.pp (unless you really need this on your VM)
- Run: vagrant enable-role analytics
- Run: vagrant up
Hive and Hadoop should now be available in Vagrant!
On Thu, Jun 25, 2015 at 10:41 AM, James Douglas jdouglas@wikimedia.org wrote:
Ooh!
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
On Thu, Jun 25, 2015 at 10:28 AM, James Douglas jdouglas@wikimedia.org wrote:
This looks possibly relevant: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Overview
On Thu, Jun 25, 2015 at 10:03 AM, James Douglas jdouglas@wikimedia.org wrote:
> The varnish logs == request logs == also in HDFS.
Ah ha, thanks!
> To get access you'll need a phabricator ticket asking for stat1002 and analytics cluster access, with Ottomata CCd to make the patch and Dan CCd to confirm you need it.
Cool, I'll get on that. In the meantime, where can I learn about the infrastructure?
On Thu, Jun 25, 2015 at 10:01 AM, Oliver Keyes okeyes@wikimedia.org wrote:
The varnish logs == request logs == also in HDFS. To get access you'll need a phabricator ticket asking for stat1002 and analytics cluster access, with Ottomata CCd to make the patch and Dan CCd to confirm you need it.
On 25 June 2015 at 12:53, James Douglas jdouglas@wikimedia.org wrote:
From IRC, it sounds like this information ought to be available in the Varnish logs. What's the story there?
On Thu, Jun 25, 2015 at 9:52 AM, James Douglas <jdouglas@wikimedia.org> wrote:
I misspoke: we're looking for HTTP requests coming from users who are leaving the Portal, not retrieving the portal.
e.g. Clicking on enwiki, using one of the search forms, etc.
On Thu, Jun 25, 2015 at 9:50 AM, Oliver Keyes okeyes@wikimedia.org wrote:
> * Nope :(
> * It's in HDFS!
>
> On 25 June 2015 at 12:05, James Douglas jdouglas@wikimedia.org wrote:
> > Let's say, hypothetically, that I wanted to measure information about
> > HTTP requests coming into the Wikipedia Portal (www.wikipedia.org).
> >
> > * Do we record this information?
> > * If so, is it accessible via analytical tools?
> > * If so, how do I get my mitts on it?
> > * If not, is it accessible from a database or similar?
> >
> > Context: https://phabricator.wikimedia.org/T100673
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation