On Wed, Oct 20, 2010 at 5:51 AM, Domas Mituzas <midom.lists(a)gmail.com> wrote:
It seems like an awful lot of trouble to teach every software author
that they need to follow a particular convention just so the stats
engine will work as intended. It would seem like it would be much
simpler to teach the stats engine to simply detect and ignore this
special case. Or is there a reason that doing so is not possible?
Heh, apparently stats became a big deal lately, so one with powers to
change that can feel important! ;-)
Anyway, there are a few choices to resolve it on the stats side:
1) Implement pulling of a namespace map for each project, build out
an efficient rules engine (in C) for dealing with this (do note, every
project will have a different namespace for this URL). Also, make it
extensible, so each developer can declare which names will be
not-a-pageview ;-) There's nothing as fun as writing that kind of
code, and do note, it won't be just five (or fifty) lines.
<snip>
3) Not care about inflated per-project numbers, or have people adjust
the numbers, as the source data is there (They can filter out banner
loader themselves!)
I think my comment about "stats engine" may have been confusing. I
tend to think of the entire process chain as part of the stats engine,
even though it is implemented as distinct collection and
interpretation bits.
There is no reason that the filtering has to be done in the stats
collector. It could be done there, but given the language variants
that is likely to be hard to code and slow, as you rightly point out.
I think I had more in mind that it be filtered at the interpretation
side of the stats process. In other words, that Zachte (or whoever)
generate a list of pages that are ignored for the purposes of counting
stats. That would seem to be an easier place to deal with an
exclusion list and to pull all language versions of those page names,
and such. Having such an exclusion list for interpretation will be
necessary anyway if we plan to reprocess the existing logs that don't
follow the suggested convention. (I'm assuming we don't want to
simply throw out three weeks of logs.)
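
To make the interpretation-side idea concrete, here is a rough sketch of
what I mean, not a claim about how the existing scripts work. It assumes
each log line looks like "<project> <page> <views> <bytes>", and the
exclusion entries (including the localized page names) are made up purely
for illustration:

```python
from collections import defaultdict

# Hypothetical exclusion list: project code -> localized page names that
# should not count as page views. The real names would have to be pulled
# per project/language.
EXCLUDED = {
    "en": {"Special:BannerLoader"},
    "de": {"Spezial:BannerLoader"},
}


def filter_counts(lines, excluded=EXCLUDED):
    """Sum views per project, skipping pages on the exclusion list.

    Assumes each line has the form "<project> <page> <views> <bytes>".
    """
    totals = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip lines that do not match the assumed format
        project, page, views, _size = parts
        if page in excluded.get(project, ()):
            continue  # excluded page: not a real page view
        totals[project] += int(views)
    return dict(totals)
```

Because this runs over the stored counts rather than in the collector,
the same list could be applied when reprocessing the logs we already have.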
-Robert Rohde