One significant potential source of error in Leon's (marvelous!) new hitcount stats is the possibility that one reader is for whatever reason fetching the same page multiple times (perhaps due to nothing more than a prolonged edit).
Obviously it would be best to filter out multiple fetches of the same page from the same IP address over some interval, probably one day. (Yes, this could then undercount hits from behind NAT firewalls and proxies, but I think it'd still be worth it overall.)
I know that Leon's scheme is currently not logging IP addresses, and given AOL's recent high-profile screwup I have to agree that not logging IP addresses in this context is probably a good idea. But what if we logged a one-way hash of the IP address, that couldn't be correlated with anything else?
On 8/30/06, Steve Summit scs@eskimo.com wrote:
But what if we logged a one-way hash of the IP address, that couldn't be correlated with anything else?
There are only about four billion possible IP addresses. Anyone could just do a brute-force execution of whatever hashing algorithm we use on every IP address. Really, though, there's no harm in storing IP address-pageview links for a short period of time, like a day.
However, this wouldn't require that, and indeed, a server-side solution would be impossible: 99.9% of page hits won't go to the server to start with. Since JavaScript is being used anyway, you can just have the script only run the first time you visit a given page per session.
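Roughly, that client-side idea might look something like the sketch below. It assumes the counter is pinged by fetching an image (as with the index.png mentioned later in the thread) and uses the browser's sessionStorage for the per-session memory; the storage key and counter URL are made up for illustration, so this is the approach, not the counter's actual code.

  // Sketch only: report a page view at most once per page per browser session.
  // The storage key and counter URL below are hypothetical.
  function reportViewOncePerSession(pageTitle) {
      var key = 'hitcounted:' + pageTitle;
      if (window.sessionStorage && sessionStorage.getItem(key)) {
          return; // this page was already handled in this session
      }
      if (Math.random() < 1 / 1000) { // the 1-in-1000 sampling discussed below
          var img = new Image();
          img.src = 'http://counter.example.org/index.png?page=' +
                    encodeURIComponent(pageTitle);
      }
      if (window.sessionStorage) {
          sessionStorage.setItem(key, '1');
      }
  }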
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote:
However, this wouldn't require that, and indeed, a server-side solution would be impossible: 99.9% of page hits won't go to the server to start with. Since JavaScript is being used anyway, you can just have the script only run the first time you visit a given page per session.
Actually, now that I think about this, does it sufficiently model the data we want to collect? Are we interested only in "how many people visit a certain page" and not also in "how many times a certain page is viewed"? If 5 users spend a whole day arguing back and forth on Wikipedia talk:Pokémon, is 5 or 200 a more interesting/useful/relevant metric for that page?
We should probably start thinking about exactly why we want this data, and what we should do with the results of it.
Steve
On Wed, Aug 30, 2006 at 05:24:51PM +0200, Steve Bennett wrote:
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote:
However, this wouldn't require that, and indeed, a server-side solution would be impossible: 99.9% of page hits won't go to the server to start with. Since JavaScript is being used anyway, you can just have the script only run the first time you visit a given page per session.
Actually, now that I think about this, does it sufficiently model the data we want to collect? Are we interested only in "how many people visit a certain page" and not also in "how many times a certain page is viewed"? If 5 users spend a whole day arguing back and forth on Wikipedia talk:Pokémon, is 5 or 200 a more interesting/useful/relevant metric for that page?
Yes.
We should probably start thinking about exactly why we want this data, and what we should do with the results of it.
Indeed; they're two separate, and both useful, measurements needed by different audiences.
Cheers, -- jra
Steve Bennett wrote:
Actually, now that I think about this, does it sufficiently model the data we want to collect? Are we interested only in "how many people visit a certain page" and not also in "how many times a certain page is viewed"? If 5 users spend a whole day arguing back and forth on Wikipedia talk:Pokémon, is 5 or 200 a more interesting/useful/relevant metric for that page?
Me, I think I'm much more interested in the former. Among other things, it's an objective measure of something at least vaguely akin to the elusive concept of "notability", and one big reason for filtering out multiple hits from the same browser is therefore to make it harder for people to deliberately skew the statistic.
The latter statistic -- assuming the argument takes the form of actual edits -- is already derivable directly from the page history, isn't it?
Simetrical wrote:
On 8/30/06, Steve Summit scs@eskimo.com wrote:
But what if we logged a one-way hash of the IP address, that couldn't be correlated with anything else?
There are only about four billion possible IP addresses. Anyone could just do a brute-force execution of whatever hashing algorithm we use on every IP address.
Well, no, not just "anyone". :-)
Really, though, there's no harm in storing IP address-pageview links for a short period of time, like a day.
I would tend to agree. But three people at AOL lost their jobs because of something they honestly thought there was "no harm" in doing. And it's very difficult (if not impossible) to guarantee that something gets kept for only a day.
However, this wouldn't require that, and indeed, a server-side solution would be impossible: 99.9% of page hits won't go to the server to start with.
Not sure what you mean here.
Since JavaScript is being used anyway, you can just have the script only run the first time you visit a given page per session.
But that would be considerably more work to implement, and would require arbitrary amounts of state kept in the browser, and would break down if the browser were restarted (or perhaps just if the tab or window were closed).
On 8/30/06, Steve Summit scs@eskimo.com wrote:
Well, no, not just "anyone". :-)
Anyone *could*. Most people just wouldn't know *how*.
I would tend to agree. But three people at AOL lost their jobs because of something they honestly thought there was "no harm" in doing. And it's very difficult (if not impossible) to guarantee that something gets kept for only a day.
If it's possible to guarantee it gets kept, it's possible to guarantee it only gets kept for a day.
However, this wouldn't require that, and indeed, a server-side solution would be impossible: 99.9% of page hits won't go to the server to start with.
Not sure what you mean here.
What effect would it have if I reloaded the page fifty times? I wouldn't send fifty messages to the view-logging server instead of one; I would have a 4.88% chance of sending *one* message, rather than a 0.1% chance. The server doesn't know that I reloaded the page fifty times: it just knows that each message it does receive stands for an *average* of 1000 views (averaging in the unreported ones). It can't, therefore, discard the extra 49 page loads; it never received them. The client has to discard them if anyone's going to.
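For the arithmetic behind those figures (just a check of the numbers, assuming independent 1-in-1000 sampling; the 1-in-2 line anticipates the factor-of-two example further down):

  // Probability that at least one of n views of a page gets reported,
  // when each view is independently reported with probability p.
  function pAtLeastOneReport(p, n) {
      return 1 - Math.pow(1 - p, n);
  }

  pAtLeastOneReport(0.001, 1);  // 0.0010 -> the 0.1% chance for a single view
  pAtLeastOneReport(0.001, 50); // 0.0488 -> the ~4.88% chance for fifty reloads
  pAtLeastOneReport(0.5, 2);    // 0.75   -> the 75% vs. 50% case for a factor of two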
But that would be considerably more work to implement, and would require arbitrary amounts of state kept in the browser, and would break down if the browser were restarted (or perhaps just if the tab or window were closed).
That's not a bug, it's a feature: it shouldn't be the same page hit if I leave and then return. And more work than "impossible" is rather difficult.
Simetrical wrote:
On 8/30/06, Steve Summit scs@eskimo.com wrote:
Well, no, not just "anyone". :-)
Anyone *could*. Most people just wouldn't know *how*.
Ah. So you can high jump 8 feet, can you?
And it's very difficult (if not impossible) to guarantee that something gets kept for only a day.
If it's possible to guarantee it gets kept, it's possible to guarantee it only gets kept for a day.
False (unless you're splitting hairs).
However, this wouldn't require that, and indeed, a server-side solution would be impossible: 99.9% of page hits won't go to the server to start with.
Not sure what you mean here.
What effect would it have if I reloaded the page fifty times? I wouldn't send fifty messages to the view-logging server instead of one; I would have a 4.88% chance of sending *one* message, rather than a 0.1% chance.
Okay, but that's true only as long as (a) the stats factor is in the thousands, which it doesn't have to be (and isn't for some Wikimedia projects), and (b) nobody's trying to deliberately skew the results. But also, it only *matters* if you're trying to keep (not discard) the extra hits, i.e. if you do want to say something like "M people viewed it N times" as opposed to "M people viewed it at least once". If you're interested in discarding redundant hits, it obviously doesn't matter whether the browser or the server does it.
On 8/30/06, Steve Summit scs@eskimo.com wrote:
If it's possible to guarantee it gets kept, it's possible to guarantee it only gets kept for a day.
False (unless you're splitting hairs).
  // If you remove the line after the next or refactor this code, we will
  // flay the living flesh from your bones
  $db->write("$IP visited this page, yay");
  $db->check_if_stuff_is_over_a_day_old_and_deal_with_it();
  // If you remove the above line or refactor this code, we will flay the
  // living flesh from your bones
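Less theatrically, the same guarantee might look something like this sketch, where db stands in for whatever storage the counter would actually use (a hypothetical wrapper, not MediaWiki's real database layer); the point is just that the purge runs on the same code path as the write:

  // Sketch only; `db` is a hypothetical storage handle with insert() and
  // removeWhere(), not the counter's real backend.
  var ONE_DAY_MS = 24 * 60 * 60 * 1000;

  function recordHit(db, ipHash, pageTitle) {
      var now = Date.now();
      db.insert({ ipHash: ipHash, page: pageTitle, ts: now });
      // Purge on every write, so nothing outlives a day even if no
      // separate cleanup job ever runs.
      db.removeWhere(function (row) { return row.ts < now - ONE_DAY_MS; });
  }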
Okay, but that's true only as long as (a) the stats factor is in the thousands,
No, it's true as long as it's above one. Even if it's just two, someone making two page views would have a 75% chance of getting one hit through, instead of a 50% chance: a major difference.
(b) nobody's trying to deliberately skew the results.
If anybody is, we're screwed anyway if we're doing sampling.
But also, it only *matters* if you're trying to keep (not discard) the extra hits, i.e. if you do want to say something like "M people viewed it N times" as opposed to "M people viewed it at least once".
Um, this entire discussion is about the latter.
If you're interested in discarding redundant hits, it obviously doesn't matter whether the browser or the server does it.
Except that the server can't do it.
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
H(secret + ip) can only be inverted by exhaustive search of both the secret and the IP (or the secret if you happen to have some known H(), IP pairs)... and the secret can be much longer than 32 bits.
Except that presumably anyone with access to the actual encoded IPs will have access to the secret as well, yes? Or are we talking about letting *anyone* see the encoded IP-pageview correlations? In which case, that is kind of a privacy violation, in the AOL style.
(You could always change the secret, of course . . . first check if H(secret(1) + ip) exists, and if it does, use H(secret(2) + ip) instead if that doesn't exist, and so forth . . . but then there's no point in making it public, and we're back to the "anyone who knows the encoded IPs knows the secret anyway" thing.)
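For concreteness, the H(secret + ip) construction under discussion could be a keyed hash along these lines (a sketch only, using Node-style crypto and HMAC-SHA256 as an illustration; the secret shown is a placeholder). Without the secret, a plain hash of the IP would fall to the brute-force walk over the ~4 billion addresses mentioned earlier:

  // Sketch of H(secret + ip) as an HMAC: the stored token can't be reversed
  // just by hashing every possible IPv4 address, because the attacker would
  // also need the server-side secret. The secret below is a placeholder.
  var crypto = require('crypto');

  var SECRET = 'a-long-random-server-side-secret-goes-here';

  function hashIp(ip) {
      return crypto.createHmac('sha256', SECRET).update(ip).digest('hex');
  }

  hashIp('192.0.2.1'); // the same IP always yields the same opaque token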
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote:
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
H(secret + ip) can only be inverted by exhaustive search of both the secret and the IP (or the secret if you happen to have some known H(), IP pairs)... and the secret can be much longer than 32 bits.
Except that presumably anyone with access to the actual encoded IPs will have access to the secret as well, yes? Or are we talking about letting *anyone* see the encoded IP-pageview correlations? In which case, that is kind of a privacy violation, in the AOL style.
It can easily be configured so that anyone with access to the secret also has privileged access to the server, and anyone with privileged access to the server could already be logging IPs.
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
It can easily be configured so that anyone with access to the secret also has privileged access to the server, and anyone with privileged access to the server could already be logging IPs.
Yes, but again, there's no good reason to allow anyone without privileged access to the server to see the IPs in the first place, encoded or not, so why bother encoding them for storage? *If* you're going to allow people to view the connections the way AOL did, you may as well assign arbitrary numbers (say, chronologically) rather than some encoded form of the IP, since that's easier to implement *and* more secure, if only marginally.
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote:
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
It can easily be configured so that anyone with access to the secret also has privileged access to the server, and anyone with privileged access to the server could already be logging IPs.
Yes, but again, there's no good reason to allow anyone without privileged access to the server to see the IPs in the first place, encoded or not, so why bother encoding them for storage? *If* you're going to allow people to view the connections the way AOL did, you may as well assign arbitrary numbers (say, chronologically) rather than some encoded form of the IP, since that's easier to implement *and* more secure, if only marginally.
It's not easier to implement numbering IPs, actually. Hashing is memoryless.
The reason to use it for storage is the above-mentioned paranoia about being able to make sure things are not retained too long...
It's all a silly and pointless argument in my view, and it's really off topic for this list.
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
It's not easier to implement numbering IPs, actually.
Autoblock messages use the table row number. :)
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote:
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
It's not easier to implement numbering IPs, actually.
Autoblock messages use the table row number. :)
Do you have any clue about the details of the counter's internal operations?
Why must you argue with someone who does?
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
Do you have any clue about the details of the counter's internal operations?
Why must you argue with someone who does?
1) No.
2) Because I didn't know that you did and was thinking hypothetically, not concretely. (But the counter currently doesn't track IPs, does it? So why is the current implementation relevant? . . . Well, never mind, it doesn't matter. Whatever works, really, I don't much care what happens behind the curtains unless I have to deal with it, which I don't.)
Simetrical wrote:
On 8/30/06, Steve Summit scs@eskimo.com wrote:
Simetrical wrote:
If it's possible to guarantee it gets kept, it's possible to guarantee it only gets kept for a day.
False (unless you're splitting hairs).
  // If you remove the line after the next or refactor this code,
  // we will flay the living flesh from your bones
  $db->write("$IP visited this page, yay");
  $db->check_if_stuff_is_over_a_day_old_and_deal_with_it();
You forgot
  // If you make a copy of $db on your laptop for testing,
  // we will flay the living flesh from your bones
and more importantly
  // We are holding everyone on the planet hostage to ensure
  // that they never make backups of $db, but if something
  // happens to us, you're on the hook for that, too.
(b) nobody's trying to deliberately skew the results.
If anybody is, we're screwed anyway if we're doing sampling.
Actually, we were both wrong on that. It's "If anyone's trying to deliberately skew the results, we're screwed if we're doing hashing."
If you're interested in discarding redundant hits, it obviously doesn't matter whether the browser or the server does it.
Except that the server can't do it.
One of us is being very obtuse here, but I'm not sure which.
If I'm trying to count "approximate number of people who have viewed the page at least once", without overcounting people who have viewed the page multiple times, then depending on the numbers the server probably doesn't have to do the filtering; but under other circumstances it might very well have to, and it's certainly strange to say that it "can't".
Saying "the server can't do it" is sort of like saying "Javier Sotomayor can't high jump seven feet".
On 8/30/06, Steve Summit scs@eskimo.com wrote:
You forgot
  // If you make a copy of $db on your laptop for testing,
  // we will flay the living flesh from your bones
and more importantly
  // We are holding everyone on the planet hostage to ensure
  // that they never make backups of $db, but if something
  // happens to us, you're on the hook for that, too.
Well, you'd have to give instructions to anyone with shell access, too, and keep careful track of all . . . okay, but it would be roughly possible. We already do it for checkuser IPs.
One of us is being very obtuse here, but I'm not sure which.
Neither am I . . .
If I'm trying to count "approximate number of people who have viewed the page at least once", without possibly overcounting people who have viewed the page multiple times, then depending on the numbers it may be true that the server probably doesn't have to, but under other circumstances it might very well have to, and it's certainly strange to say that it "can't".
If you're only sending one view in every X to the view-counting server, then it doesn't know how many times you've viewed the page. It therefore can't exclude multiple consecutive views by the same IP from its count. Yes? No?
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote: [snip]
If you're only sending one view in every X to the view-counting server, then it doesn't know how many times you've viewed the page. It therefore can't exclude multiple consecutive views by the same IP from its count. Yes? No?
Correct.
It could exclude duplicate sampled hits, but then it's not clear what the heck it's measuring.. it's not measuring unique views, it's not measuring pure views. It would just be a strange and useless metric.
Which is what I said up the thread.
Greg Maxwell wrote:
It could exclude duplicate sampled hits, but then it's not clear what the heck it's measuring.. it's not measuring unique views, it's not measuring pure views.
Sampled unique hits. What's wrong with that? (No worse than sampled hits, it seems to me.)
But enough of this, at any rate.
On 8/30/06, Steve Summit scs@eskimo.com wrote:
Greg Maxwell wrote:
It could exclude duplicate sampled hits, but then it's not clear what the heck it's measuring.. it's not measuring unique views, it's not measuring pure views.
Sampled unique hits. What's wrong with that? (No worse than sampled hits, it seems to me.)
You can't sample unique hits, at least not without first capturing all hits.
You could have unique sampled hits, but that doesn't map onto any metric anyone uses... and it will be misunderstood as sampled unique hits, which is another thing entirely.
Gregory Maxwell schrieb:
On 8/30/06, Steve Summit scs@eskimo.com wrote:
Greg Maxwell wrote:
It could exclude duplicate sampled hits, but then it's not clear what the heck it's measuring.. it's not measuring unique views, it's not measuring pure views.
Sampled unique hits. What's wrong with that? (No worse than sampled hits, it seems to me.)
You can't sample unique hits, at least not without first capturing all hits.
Possibly it already does: I think the index.png image gets cached in the browser. I wanted to avoid that once by adding a timestamp to the image URL. -- Leon
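The timestamp trick Leon describes is just a matter of making each request's URL unique, e.g. (a sketch; the image path is whatever the counter actually serves):

  // Cache-busting sketch: append the current time so the browser re-requests
  // index.png on every page view instead of serving it from its cache.
  var img = new Image();
  img.src = '/index.png?ts=' + new Date().getTime();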
On 8/31/06, Leon Weber leon.weber@leonweber.de wrote:
Possibly it already does: I think the index.png image gets cached in the browser. I wanted to avoid that once by adding a timestamp to the image URL.
Of course, their whole request might get cached as well. :) Unless they shift-reload, and that should re-grab the image as well.
On 8/30/06, Gregory Maxwell gmaxwell@gmail.com wrote:
You can't sample unique hits, at least not without first capturing all hits.
Or checking for uniqueness on the JavaScript level. I doubt this is a big issue, though, at least for reasonably large articles; undoubtedly the vast majority of hits we get are readers, who won't reload the page a lot.
On Wed, Aug 30, 2006 at 12:24:50PM -0400, Simetrical wrote:
On 8/30/06, Steve Summit scs@eskimo.com wrote:
If it's possible to guarantee it gets kept, it's possible to guarantee it only gets kept for a day.
False (unless you're splitting hairs).
  // If you remove the line after the next or refactor this code, we will
  // flay the living flesh from your bones
  $db->write("$IP visited this page, yay");
  $db->check_if_stuff_is_over_a_day_old_and_deal_with_it();
  // If you remove the above line or refactor this code, we will flay the
  // living flesh from your bones
Sim, you're discussing what *code* will do. Steve is discussing what *people* can do -- see also "capability creep" and "US Government".
If you're interested in discarding redundant hits, it obviously doesn't matter whether the browser or the server does it.
Except that the server can't do it.
And to clarify the misunderstanding that I believe Steve has on this point:
<whisper> Squids </whisper>
Cheers, -- jra
On 30/08/06, Steve Summit scs@eskimo.com wrote:
I would tend to agree. But three people at AOL lost their jobs because of something they honestly thought there was "no harm" in doing. And it's very difficult (if not impossible) to guarantee that something gets kept for only a day.
I plead ignorance, sir. Do provide a URL?
Wait, since when have AOL had ethics of any sort!?
Rob Church
On 8/30/06, Rob Church robchur@gmail.com wrote:
I plead ignorance, sir. Do provide a URL?
I believe you introduced me to www.justfuckinggoogleit.com, Rob. ;)
On 30/08/06, Simetrical Simetrical+wikitech@gmail.com wrote:
On 8/30/06, Rob Church robchur@gmail.com wrote:
I plead ignorance, sir. Do provide a URL?
I believe you introduced me to www.justfuckinggoogleit.com, Rob. ;)
Yes, but because I'm special and so well loved, and so damn irritating when I don't get my own way, you'd never DREAM of using that on me, would you now?
Rob Church
Simetrical wrote:
On 8/30/06, Rob Church robchur@gmail.com wrote:
On 30/08/06, Steve Summit scs@eskimo.com wrote:
I would tend to agree. But three people at AOL lost their jobs because of something they honestly thought there was "no harm" in doing. And it's very difficult (if not impossible) to guarantee that something gets kept for only a day.
I plead ignorance, sir. Do provide a URL?
To the personnel implications of the AOL debacle:
http://www.theregister.co.uk/2006/08/30/online_anonymity/ http://www.ovum.com/news/euronews.asp?id=4770
To my claim about data retention:
http://catless.ncl.ac.uk/Risks/23.76.html#subj1
I believe you introduced me to www.justfuckinggoogleit.com, Rob. ;)
Heh. Hadn't come across that one before. Thanks.
Actually, even though the story was (it seemed) all over the media last week, it was surprisingly hard to find more than a couple of hits today. Besides "AOL" and "CTO" and "resign", another couple of keywords to use for anyone who wants to read further would be "Maureen Govern".
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote:
There are only about four billion possible IP addresses. Anyone could just do a brute-force execution of whatever hashing algorithm we use on every IP address. Really, though, there's no harm in storing IP address-pageview links for a short period of time, like a day.
[snip]
H(secret + ip) can only be inverted by exhaustive search of both the secret and the IP (or the secret if you happen to have some known H(), IP pairs)... and the secret can be much longer than 32 bits.
However, the fuss about the AOL logs showed that, at least for search strings, mere correlation of requests was enough to leak too much data. I do not believe that the same is true for page hits, but that's the consideration.
To me it seems a bit of a foolish argument, though... any one of our admins could add such a bug... any upstream ISP could sniff the traffic... and we all know that the US Government is already doing so. ;) But it is what it is... and for some reason people don't like the prospect of the world figuring out that they have a venereal disease. Silly people.
Moin,
On Wednesday 30 August 2006 18:12, Gregory Maxwell wrote:
On 8/30/06, Simetrical Simetrical+wikitech@gmail.com wrote:
There are only about four billion possible IP addresses. Anyone could just do a brute-force execution of whatever hashing algorithm we use on every IP address. Really, though, there's no harm in storing IP address-pageview links for a short period of time, like a day.
[snip]
H(secret + ip) can only be inverted by exhaustive search of both the secret and the IP (or the secret if you happen to have some known H(), IP pairs)... and the secret can be much longer than 32 bits.
So, if you can't guarantee that the hashes of the IP (including the log) don't leak out, how can you guarantee that the secret doesn't leak out? Answer: You can't.
The only safe way to not leak this information is not to store it in the first place.
If you log this data, expect law enforcement knocking on your door next week and asking "for all information pertaining to the viewing of pages X, Y, Z, ... (continue for 1000 more), or IP addresses U, V, W, ... (continue for 1000 more), in regard to $alleged_terrorist_attack_of_the_week".
However, the fuss about the AOL logs showed that, at least for search strings, mere correlation of requests was enough to leak too much data. I do not believe that the same is true for page hits, but that's the consideration.
To me it seems a bit of a foolish argument, though... any one of our admins could add such a bug... any upstream ISP could sniff the traffic... and we all know that the US Government is already doing so. ;) But it is what it is... and for some reason people don't like the prospect of the world figuring out that they have a venereal disease. Silly people.
Maybe they just don't want the whole [censored] world to know what they read, search, use, write, or like. See: AOL.
The next time you accidentally enter your CC number, SSN, or any other data that identifies you into the MediaWiki search box, consider how much better you would feel if nobody recorded, logged, backed up, stored, processed, and then made public that data.
Just because someone _could_ already collect the data doesn't mean the Wikimedia Foundation should do so, too.
Best wishes,
Tels
On 8/30/06, Tels nospam-abuse@bloodgate.com wrote:
So, if you can't guarantee that the hashes of the IP (including the log) don't leak out, how can you guarantee that the secret doesn't leak out? Answer: You can't.
The only safe way to not leak this information is not to store it in the first place.
Silly, you store the hashes but not the secret.
Moin,
On Wednesday 30 August 2006 19:44, Gregory Maxwell wrote:
On 8/30/06, Tels nospam-abuse@bloodgate.com wrote:
So, if you can't guarantee that the hashes of the IP (including the log) don't leak out, how can you guarantee that the secret doesn't leak out? Answer: You can't.
The only safe way to not leak this information is not to store it in the first place.
Silly, you store the hashes but not the secret.
The machine doing the hashes needs to know the secret, and to make a meaningful analysis you can't change it. (Well, maybe you could change it once a month.)
Still the secret is there and it can be leaked, subpoenaed or just plain be sent out by a SNAFU.
best wishes,
tels
Tels wrote:
Moin,
On Wednesday 30 August 2006 19:44, Gregory Maxwell wrote:
On 8/30/06, Tels nospam-abuse@bloodgate.com wrote:
So, if you can't guarantee that the hashes of the IP (including the log) don't leak out, how can you guarantee that the secret doesn't leak out? Answer: You can't.
The only safe way to not leak this information is not to store it in the first place.
Silly, you store the hashes but not the secret.
The machine doing the hashes needs to know the secret, and to make a meaningful analysis you can't change it. (Well, maybe you could change it once a month.)
Still the secret is there and it can be leaked, subpoenaed or just plain be sent out by a SNAFU.
Store the secret on flash memory embedded on a chip with a standalone processor, like a smart card. Have the processor do the hashes itself, don't provide any interface to obtain the secret. Put the processor in a box with a tamper switch and a small incendiary device, nothing but a serial line leading out. Easy.
-- Tim Starling
On Fri, Sep 01, 2006 at 12:05:16AM +1000, Tim Starling wrote:
Still the secret is there and it can be leaked, subpoenaed or just plain be sent out by a SNAFU.
Store the secret on flash memory embedded on a chip with a standalone processor, like a smart card. Have the processor do the hashes itself, don't provide any interface to obtain the secret. Put the processor in a box with a tamper switch and a small incendiary device, nothing but a serial line leading out. Easy.
Thank you, Tim.
I needed a laugh to start my morning. :-)
Cheers, -- jra
On 31/08/06, Jay R. Ashworth jra@baylink.com wrote:
On Fri, Sep 01, 2006 at 12:05:16AM +1000, Tim Starling wrote:
Still the secret is there and it can be leaked, subpoenaed or just plain be sent out by a SNAFU.
Store the secret on flash memory embedded on a chip with a standalone processor, like a smart card. Have the processor do the hashes itself, don't provide any interface to obtain the secret. Put the processor in a box with a tamper switch and a small incendiary device, nothing but a serial line leading out. Easy.
Thank you, Tim.
I needed a laugh to start my morning. :-)
I thought it was a nice whimsical response to the thread. But Tim, you forgot the armed guards and the obvious but overlooked loophole which allows Tom Cruise and his unwashed hippie minions to retrieve the secret in Mission Impossible IV: Project Xenu.
Rob Church
On Thu, Aug 31, 2006 at 04:41:47PM +0100, Rob Church wrote:
On 31/08/06, Jay R. Ashworth jra@baylink.com wrote:
On Fri, Sep 01, 2006 at 12:05:16AM +1000, Tim Starling wrote:
Still the secret is there and it can be leaked, subpoenaed or just plain be sent out by a SNAFU.
Store the secret on flash memory embedded on a chip with a standalone processor, like a smart card. Have the processor do the hashes itself, don't provide any interface to obtain the secret. Put the processor in a box with a tamper switch and a small incendiary device, nothing but a serial line leading out. Easy.
Thank you, Tim.
I needed a laugh to start my morning. :-)
I thought it was a nice whimsical response to the thread. But Tim, you forgot the armed guards and the obvious but overlooked loophole which allows Tom Cruise and his unwashed hippie minions to retrieve the secret in Mission Impossible IV: Project Xenu.
Two! Two laughs, ha, ha, ha, ha...
Cheers, -- jra
I think it's unlikely to significantly skew the results. A few extra hits compared to the thousands received on popular pages aren't an issue. I even tried to artificially inflate the results for one page and had no luck whatsoever :) The only concern would be if certain "types" of pages encouraged rapid refreshing: if for some reason Pokémon pages were refreshed much more often than normal pages, they would be over-reported. But if it's just individual random editors who skew the results of whatever page they edit, there should be no overall bias.
Steve
On 8/30/06, Steve Summit scs@eskimo.com wrote:
One significant potential source of error in Leon's (marvelous!) new hitcount stats is the possibility that one reader is for whatever reason fetching the same page multiple times (perhaps due to nothing more than a prolonged edit).
Obviously it would be best to filter out multiple fetches of the same page from the same IP address over some interval, probably one day. (Yes, this could then undercount hits from behind NAT firewalls and proxies, but I think it'd still be worth it overall.)
I know that Leon's scheme is currently not logging IP addresses, and given AOL's recent high-profile screwup I have to agree that not logging IP addresses in this context is probably a good idea. But what if we logged a one-way hash of the IP address, that couldn't be correlated with anything else?
On 8/30/06, Steve Summit scs@eskimo.com wrote:
One significant potential source of error in Leon's (marvelous!) new hitcount stats is the possibility that one reader is for whatever reason fetching the same page multiple times (perhaps due to nothing more than a prolonged edit).
[snip]
No.
The tool measures estimated views rather than unique impressions, so this is not an error. Someone sitting around hitting reload over and over again generates additional views, so those views should be sampled just as often as any other.
The major source of error here is that we are sampling at far too low a rate. We believe that the page view distribution for Wikipedia follows a power law. As such, coarse sampling can only give us useful data about the relative ranking of the most popular items, so it's good that the web interface only displays the top 1000 at most... However, with only 34,000 samples of enwiki viewing collected over four days, many of the samples are scattered randomly across pages deep into the tail, which we cannot speak about accurately.
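A back-of-the-envelope way to see how coarse that is (assuming independent sampling at a nominal 1-in-1000 rate; the view counts below are illustrative):

  // If a page's true view count is V and each view is sampled independently
  // with probability p, the estimate k/p (k = samples seen) has a relative
  // standard error of roughly 1/sqrt(p*V).
  function relativeError(p, trueViews) {
      return 1 / Math.sqrt(p * trueViews);
  }

  relativeError(1 / 1000, 100000); // ~0.10 -> a very popular page is known to ~10%
  relativeError(1 / 1000, 1000);   // ~1.00 -> a mid-tail page is basically noise
  relativeError(1 / 1000, 100);    // ~3.16 -> deep in the tail, hopeless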
Changes will be made so that we can substantially increase the sampling rate.
I pity the journalist who sees our data and runs an absolutely idiotic story based on it.
On Wed, Aug 30, 2006 at 12:00:11PM -0400, Gregory Maxwell wrote:
four days, many of the samples are scattered randomly across pages deep into the tail, which we cannot speak about accurately.
"deep into the tail".
Damn, that's a nice phrase...
Cheers, -- jr "Lucifer's Hammer?" a