Let's say, hypothetically, that I wanted to measure information about HTTP requests coming into the Wikipedia Portal (www.wikipedia.org).

- Do we record this information?
- If so, is it accessible via analytical tools?
- If so, how do I get my mitts on it?
- If not, is it accessible from a database or similar?

Context: https://phabricator.wikimedia.org/T100673
- Nope :(
- It's in HDFS!
On 25 June 2015 at 12:05, James Douglas jdouglas@wikimedia.org wrote:
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
I misspoke: we're looking for HTTP requests coming from users who are leaving the Portal, not retrieving the portal.
e.g. Clicking on enwiki, using one of the search forms, etc.
On Thu, Jun 25, 2015 at 9:50 AM, Oliver Keyes okeyes@wikimedia.org wrote:
-- Oliver Keyes Research Analyst Wikimedia Foundation
From IRC, it sounds like this information ought to be available in the Varnish logs. What's the story there?
On Thu, Jun 25, 2015 at 9:52 AM, James Douglas jdouglas@wikimedia.org wrote:
Where can I learn about the production Varnish configuration, with respect to request logging?
This seems to be a bit more general: https://www.mediawiki.org/wiki/Manual:Varnish_caching
On Thu, Jun 25, 2015 at 9:53 AM, James Douglas jdouglas@wikimedia.org wrote:
The varnish logs == request logs == also in HDFS. To get access you'll need a phabricator ticket asking for stat1002 and analytics cluster access, with Ottomata CCd to make the patch and Dan CCd to confirm you need it.
On 25 June 2015 at 12:53, James Douglas jdouglas@wikimedia.org wrote:
> The varnish logs == request logs == also in HDFS.

Ah ha, thanks!

> To get access you'll need a phabricator ticket asking for stat1002 and analytics cluster access, with Ottomata CCd to make the patch and Dan CCd to confirm you need it.

Cool, I'll get on that. In the meantime, where can I learn about the infrastructure?
On Thu, Jun 25, 2015 at 10:01 AM, Oliver Keyes okeyes@wikimedia.org wrote:
This looks possibly relevant: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Overview
On Thu, Jun 25, 2015 at 10:03 AM, James Douglas jdouglas@wikimedia.org wrote:
Ooh!
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
On Thu, Jun 25, 2015 at 10:28 AM, James Douglas jdouglas@wikimedia.org wrote:
That's the sampled logs on stat1002; you do not, under any circumstances, want to deal with those. I've been doing this for 2+ years - if I'm pointing yinz to the HDFS-stored logs there's a reason for it ;)
On 25 June 2015 at 13:41, James Douglas jdouglas@wikimedia.org wrote:
I discovered a presentation from last year that has answered most of my initial questions. Here are my notes:

Notes from "Hadoop and Beyond", June 25, 2015

- Video: https://www.youtube.com/watch?v=tx1pagZOsiM
- Slides: https://docs.google.com/presentation/d/1ZPmfN-kmfqWEJUMIRg2feSstFPY45js4AnYaf3NbLNE/
webrequest logs

This is a log of every WMF HTTP request. It can max out beyond 200k requests per second, which is *a lot*.
udp2log

Doesn't scale, because every instance must process every message and every packet. And because it uses UDP, it's not guaranteed not to drop data.
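To make the scaling point concrete, here's a toy comparison (with made-up numbers, not measurements of the real systems) of fan-out delivery, where every consumer instance processes the full stream as udp2log does, versus partitioned delivery, where each message is handled by exactly one consumer:

```python
# Illustrative numbers only; not measurements of the real systems.
messages = 1_000_000
consumers = 10

fanout_work = messages * consumers  # every instance processes every message
partitioned_work = messages         # each message is handled exactly once

print(fanout_work // partitioned_work)  # 10: fan-out costs 10x the total work
```

Under fan-out the total work grows linearly with the number of consumers, which is why adding udp2log instances doesn't buy any headroom.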
Wikimedia Statistics

*http://stats.wikimedia.org*

Most of the data here is generated from udp2log-collected data. It's sampled because there's too much traffic for our storage/processing capacity.
Analytics cluster

Uses Hadoop for batch processing of logs, and (mostly) uses Hive to expose the data to analysts.

This diagram is a useful (and frequently referenced) overview: [image: Analytics cluster diagram]

Note the loop-back lines in the diagram -- these are batch jobs that do various things (geocoding, anonymizing, etc.).
Hadoop

Hadoop = a distributed file system + a framework for distributed computation.
Hive

*Hypothetical analyst question: how do we get the top referrers for an article?*

Hive maps a SQL-like language onto Hadoop MapReduce jobs.

*Example Hive query to answer the above question:*

SELECT SUBSTR(referer, 30) AS source, COUNT(DISTINCT ip) AS hits
FROM webrequest
WHERE uri_path = "/wiki/London"
  AND uri_host = "en.wikipedia.org"
  AND referer LIKE "http://en.wikipedia.org/wiki/%"
  AND http_status = 200
  AND webrequest_source = 'text'
  AND year = 2014 AND month = 07 AND day = 14 AND hour = 14
GROUP BY SUBSTR(referer, 30)
ORDER BY hits DESC
LIMIT 50;
This is nice because it lets you run SQL on top of text data.
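One detail worth calling out in the query above: Hive's SUBSTR is 1-indexed, and the referer prefix matched by the LIKE pattern is 29 characters long, so SUBSTR(referer, 30) yields the title of the linking article. A quick Python sanity check (the article name is invented for the example):

```python
# The prefix matched by: referer LIKE "http://en.wikipedia.org/wiki/%"
prefix = "http://en.wikipedia.org/wiki/"
print(len(prefix))  # 29

def source_article(referer):
    # Python equivalent of Hive's SUBSTR(referer, 30): drop the 29-char prefix.
    return referer[29:]

print(source_article("http://en.wikipedia.org/wiki/Big_Ben"))  # Big_Ben
```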
Kafka cluster

This serves as a replacement for udp2log. Kafka is a reliable, horizontally scalable, distributed pub/sub buffer. It processes up to 200k messages per second, at around 30 MB per second. Data is consumed into Hadoop every ten minutes.
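Those two peak figures together imply that the individual records are small; reading "30 MB" as 30 million bytes:

```python
# Back-of-the-envelope average record size from the quoted peak rates.
msgs_per_sec = 200_000
bytes_per_sec = 30 * 10**6

print(bytes_per_sec / msgs_per_sec)  # 150.0 bytes per message on average
```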
Camus

A job that runs on Hadoop to consume from Kafka and write the data into HDFS. It launches a MapReduce job every hour(?). It can inspect the data as it comes in, which lets it handle data based on time/content.
Oozie

Oozie is a Hadoop job scheduler that allows the composition of complex workflows. Jobs are launched based on the existence of new data sets, rather than simply on time or periodic intervals. This lets us trigger Oozie whenever a Camus job completes.
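Not Oozie itself, but the data-availability idea can be sketched in a few lines of Python: a workflow becomes runnable only once its input partition exists, signalled here by a Hadoop-style _SUCCESS flag (the partition layout is invented for the example):

```python
import os
import tempfile

def partition_ready(base, year, month, day, hour):
    # The dataset "exists" once its done-flag has been written.
    part = os.path.join(base, f"year={year}", f"month={month:02d}",
                        f"day={day:02d}", f"hour={hour:02d}")
    return os.path.exists(os.path.join(part, "_SUCCESS"))

base = tempfile.mkdtemp()
print(partition_ready(base, 2014, 7, 14, 14))  # False: nothing has landed yet

# Simulate a consumer job finishing an hour of webrequest data:
part = os.path.join(base, "year=2014", "month=07", "day=14", "hour=14")
os.makedirs(part)
open(os.path.join(part, "_SUCCESS"), "w").close()
print(partition_ready(base, 2014, 7, 14, 14))  # True: the workflow can launch
```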
Hue

A web GUI for interacting with Hadoop, Hive, Oozie, etc. It provides a Hive query interface, a Pig script interface, and ways to launch jobs, browse the file system, and install add-ons. A command-line interface is also available for all of the above.
MediaWiki Vagrant

To play with this in Vagrant:
- Edit your *.settings.yaml* file and add: vagrant_ram: 2048
- Comment out include role::mediawiki in puppet/manifests/site.pp (unless you really need this on your VM)
- Run: vagrant enable-role analytics
- Run: vagrant up
Hive and Hadoop should now be available in Vagrant!
On Thu, Jun 25, 2015 at 10:41 AM, James Douglas jdouglas@wikimedia.org wrote:
What exactly are you looking for or trying to do? There is, as you've seen, lotsa stuff to learn ;p
On 25 June 2015 at 14:54, James Douglas jdouglas@wikimedia.org wrote:
We're looking for HTTP requests coming from users who are leaving the Portal, heading to some other WMF site.
e.g. Clicking on enwiki, using one of the search forms, etc.
See also: https://phabricator.wikimedia.org/T100673
On Thu, Jun 25, 2015 at 12:31 PM, Oliver Keyes okeyes@wikimedia.org wrote:
What exactly are you looking for or trying to do? There is, as you've seen, lotsa stuff to learn ;p
On 25 June 2015 at 14:54, James Douglas jdouglas@wikimedia.org wrote:
I discovered a presentation from last year that has answered most of my initial questions. Here are my notes:
Notes from Hadoop and Beyond https://www.youtube.com/watch?v=tx1pagZOsiM June 25, 2015
- Video https://www.youtube.com/watch?v=tx1pagZOsiM
- Slides
https://docs.google.com/presentation/d/1ZPmfN-kmfqWEJUMIRg2feSstFPY45js4AnYaf3NbLNE/
webrequest logs
This is a log for every WMF HTTP request. It can max out beyond 200k requests per second, which is *a lot*. udp2log
Doesn't scale, because every instance must process every message, every packet.
Because it uses UDP, it's not guaranteed not to drop data. Wikimedia Statistics
*http://stats.wikimedia.org http://stats.wikimedia.org*
Most data here is generated by udp2log-collected data.
It's sampled because there's too much traffic for our storage/processing capacity. Analytics cluster
Uses Hadoop for batch processing of logs, and (mostly) uses Hive to expose the data to analysts.
This diagram is a useful (and frequent) reference: [image: Analytics cluster diagram]
Analytics cluster diagram
Note the loopy-back lines in the diagram -- these are batch jobs to do various things (geocoding, anonymizing, etc.). Hadoop
Hadoop = a distributed file system + a framework for distributed computation Hive
*Hypothetical analyst question: How to get the top referrals for an article?*
Hive maps a SQL-like language onto Hadoop MapReduce jobs.
*Example Hive query to answer the above question:*
SELECT SUBSTR(referer,30) AS SOURCE, COUNT(DISTINCT ip) AS hitsFROM webrequestWHERE uri_path = "/wiki/London" AND uri_host = "en.wikipedia.org" AND referer LIKE "http://en.wikipedia.org/wiki/%" AND http_status = 200 AND webrequest_source = ‘text’ AND year = 2014 AND month= 07 AND day = 14 AND hour = 14GROUP BY SUBSTR(referer,30)ORDER BY hits DESC LIMIT 50;
This is nice because it lets you run SQL on top of text data. Kafka cluster
This serves as a replacement for udp2log.
Kafka is a reliable, horizontally scalable, distributed pub/sub buffer.
Processes up to 200k messages per second at 30 MB per second.
Data is consumed every ten minutes into Hadoop Camus
A job that runs on Hadoop to consume from Kafka, and write the data into HDFS.
Launches a MapReduce job every hour(?).
Can inspect the data as it's coming in -- lets it handle data based on time/content. Oozie
Oozie is a Hadoop job scheduler that allows the composition of complex workflows.
Jobs are launched based on the existence of new data sets, rather than simply based on time or periodic intervals. This lets us can trigger Oozie whenever a Camus job completes. Hue
Web GUI for interacting with Hadoop, Hive, Oozie, etc.
Provides a Hive query interface, a Pig script interface, a way to launch jobs, browse the file system, and install add-ons.
A command-line interface is also available for all of the above.

MediaWiki Vagrant
To play with this in Vagrant:
- Edit your *.settings.yaml* file and add: vagrant_ram: 2048
- Comment out include role::mediawiki in puppet/manifests/site.pp (unless you really need it on your VM)
- Run: vagrant enable-role analytics
- Run: vagrant up
Hive and Hadoop should now be available in Vagrant!
On Thu, Jun 25, 2015 at 10:41 AM, James Douglas jdouglas@wikimedia.org wrote:
Ooh!
https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequests_sampled
On Thu, Jun 25, 2015 at 10:28 AM, James Douglas jdouglas@wikimedia.org wrote:
This looks possibly relevant: https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Overview
On Thu, Jun 25, 2015 at 10:03 AM, James Douglas <jdouglas@wikimedia.org> wrote:
> The varnish logs == request logs == also in HDFS.
Ah ha, thanks!
> To get access you'll need a phabricator ticket asking for stat1002 and analytics cluster access, with Ottomata CCd to make the patch and Dan CCd to confirm you need it.
Cool, I'll get on that. In the meantime, where can I learn about the infrastructure?
On Thu, Jun 25, 2015 at 10:01 AM, Oliver Keyes okeyes@wikimedia.org wrote:
The varnish logs == request logs == also in HDFS. To get access you'll need a phabricator ticket asking for stat1002 and analytics cluster access, with Ottomata CCd to make the patch and Dan CCd to confirm you need it.
-- Oliver Keyes Research Analyst Wikimedia Foundation
Wikimedia-search mailing list Wikimedia-search@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
Then yeah, the unsampled logs will have those, and will be less of a colossal PITA to deal with, too.
On 25 June 2015 at 15:50, James Douglas jdouglas@wikimedia.org wrote:
We're looking for HTTP requests coming from users who are leaving the Portal, heading to some other WMF site.
e.g. Clicking on enwiki, using one of the search forms, etc.
See also: https://phabricator.wikimedia.org/T100673
On Thu, Jun 25, 2015 at 12:31 PM, Oliver Keyes okeyes@wikimedia.org wrote:
What exactly are you looking for or trying to do? There is, as you've seen, lotsa stuff to learn ;p
On 25 June 2015 at 14:54, James Douglas jdouglas@wikimedia.org wrote:
I discovered a presentation from last year that has answered most of my initial questions. Here are my notes:
Notes from "Hadoop and Beyond", June 25, 2015
- Video: https://www.youtube.com/watch?v=tx1pagZOsiM
- Slides: https://docs.google.com/presentation/d/1ZPmfN-kmfqWEJUMIRg2feSstFPY45js4AnYaf3NbLNE/
webrequest logs
This is a log for every WMF HTTP request. It can max out beyond 200k requests per second, which is *a lot*.

udp2log
Doesn't scale, because every instance must process every message, every packet.
Because it uses UDP, there's no guarantee it won't drop data.

Wikimedia Statistics
http://stats.wikimedia.org
Most data here is generated from udp2log-collected data.