Apologies for crossposting
Heya,
The Analytics Team is planning to deploy "tab as field delimiter" to replace the current space as field delimiter on the varnish/squid/nginx servers. We would like to do this on February 1st. The reason for this change is that we need a consistent number of fields in each webrequest log line. Right now, some fields contain spaces, which requires a lot of post-processing cleanup and slows down the generation of reports.
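To illustrate the field-count problem described above, here is a minimal sketch. The log line layout and all values are made up for the example and are not the actual webrequest format; the point is only that a field containing spaces (such as a user agent) inflates the field count when the delimiter is a space, but not when it is a tab.

```python
# Hypothetical six-field log record; the layout and values are illustrative
# assumptions, not the real webrequest log format.
FIELDS = [
    "cp1001.example",
    "12345",
    "2013-01-25T13:41:00",
    "GET",
    "http://en.wikipedia.org/wiki/Main_Page",
    "Mozilla/5.0 (X11; Linux x86_64)",  # this field contains spaces
]

space_line = " ".join(FIELDS)
tab_line = "\t".join(FIELDS)


def count_fields(line, delimiter):
    """Count the fields a consumer sees when splitting on the delimiter."""
    return len(line.split(delimiter))


space_count = count_fields(space_line, " ")
tab_count = count_fields(tab_line, "\t")

print("space-delimited:", space_count, "fields")
print("tab-delimited:  ", tab_count, "fields")
```

With a space delimiter the consumer sees more fields than were written, so it has to guess which pieces belong together; with a tab delimiter the count matches the number of fields written.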
What is affected and maintained by Analytics:
* udp-filter already has support for the tab character
* webstatscollector: we compiled a new version of filter to add support for the tab character
* wikistats: we will fix the scripts on an ongoing basis.
* udp2log: we have a patch ready for inserting sequence numbers separated by tab.
In particular, I would like to have feedback on three questions:
1) Are there important reasons not to use tab as field delimiter?
2) Are there important pieces of logging that expect a space instead of a tab, that need to be fixed, and that I did not mention in this email?
3) Is February 1st a good date to deploy this change? (Assuming that all preps are finished)
Best,
Diederik
Just to clarify, will this affect the stats at http://dumps.wikimedia.org/other/pagecounts-raw/ ? Changing the format of that will probably break third-party scripts.
--
-bawolff
No, the output format of http://dumps.wikimedia.org/other/pagecounts-raw/ will stay the same.
Best,
Diederik
On Fri, Jan 25, 2013 at 12:48 PM, bawolff bawolff+wn@gmail.com wrote:
Just to clarify, will this affect the stats at http://dumps.wikimedia.org/other/pagecounts-raw/ ? Changing the format of that will probably break third party scripts. -- -bawolff
On Fri, Jan 25, 2013 at 1:41 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
On Fri, Jan 25, 2013 at 12:51 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
No, the output format of http://dumps.wikimedia.org/other/pagecounts-raw/ will stay the same.
It seems that page names are coming through with spaces now, where they didn't before. See https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Format_...
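The symptom reported above can be sketched as follows. This assumes that a pagecounts-raw line consists of four space-separated fields (project, page title, request count, bytes transferred); that layout, the function name, and the sample values are assumptions for illustration only. Under that assumption, a page title containing a space yields an extra field, which is the kind of breakage third-party scripts would run into.

```python
def parse_pagecount_line(line):
    """Parse one line, assuming 'project title count bytes' separated by spaces.

    Returns a (project, title, count, size) tuple, or None if the line does
    not have exactly four fields or the numeric fields are not digits.
    """
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, count, size = parts
    if not (count.isdigit() and size.isdigit()):
        return None
    return project, title, int(count), int(size)


# Made-up sample lines: the second has a space inside the page title.
good = parse_pagecount_line("en Main_Page 42 123456")
bad = parse_pagecount_line("en Main Page 42 123456")

print(good)
print(bad)
```

A consumer written this way rejects the malformed line instead of silently shifting the count and byte columns by one position.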
This bug has been fixed, see https://bugzilla.wikimedia.org/show_bug.cgi?id=45178
I will post a message on the Village Pump as well.
Best, Diederik
On Sun, Feb 3, 2013 at 3:44 PM, Brad Jorsch bjorsch@wikimedia.org wrote:
On Fri, Jan 25, 2013 at 12:51 PM, Diederik van Liere dvanliere@wikimedia.org wrote:
No, the output format of http://dumps.wikimedia.org/other/pagecounts-raw/ will stay the same.
It seems that page names are coming through with spaces now, where they didn't before. See https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Format_...
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l