Had to run and missed a couple of important items. One is that you can
calculate the likelihood that a link is missing. (It's similar to
Google's PageRank.) If the likelihood turns out to be too small you
simply don't report anything. You can also skip reporting if you don't
have any intervening search or search result. You can also analyze the
link structure at the log server and skip logging of uninteresting
items, but even more interesting, this can be done client-side if
sufficient information is embedded in the pages. For this, note that a
high likelihood of a missing link implies few existing inbound links,
so those links can simply be embedded on the page itself.
Analyzing a dumb log would be very costly indeed.
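To make the client-side filtering idea concrete, here is a minimal
sketch, assuming the page embeds its own inbound-link list and using a
made-up scoring rule (the function names, the threshold, and the
"fewer inbound links means higher likelihood" prior are all my
illustrative assumptions, not an actual PageRank computation):

```python
def missing_link_likelihood(inbound_links):
    """Crude prior: the fewer inbound links a page already has,
    the more likely one is missing (hypothetical scoring rule)."""
    return 1.0 / (1.0 + len(inbound_links))

def should_report(previous_page, inbound_links, threshold=0.1):
    """Decide client-side whether a page visit is worth reporting.

    Skip reporting when the link already exists, or when the
    likelihood of a missing link is too small to bother the server.
    """
    if previous_page in inbound_links:
        return False  # a link is already there, nothing to report
    return missing_link_likelihood(inbound_links) >= threshold
```

A page with few inbound links and a predecessor not among them would
pass the filter, while a heavily linked page would be skipped without
any request to the log server.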
John
John at Darkstar wrote:
I tried to convince myself to stay out of this thread, but this was
somewhat interesting. ;)
I'm not quite sure this will work out for every case, but my rough idea
is like this:
Imagine a user trying to get an answer to some kind of problem. He
searches with Google and lands on the most obvious article about (for
example) breweries, even though he really wants to know something about
a specific beer (Groelch or Mac or whatever). He can't find anything
about it, so he makes an additional search (on the page, hopefully),
gets a result list, reads through a lot of articles and then finally
finds what he is searching for. Then he leaves.
Now, imagine a function that pushes newly visited pages onto a small
page list, and a function that pops that list each time the search
result page is visited. The page list is stored in a cookie. This small
page list is then reported to a special logging server by an AJAX
request. It can't just piggyback on a page load, as the final page
usually will not lead to a new request; the user simply leaves.
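A minimal sketch of that push/pop bookkeeping (in a real deployment
this would be JavaScript in the browser, with the list serialized into
a cookie and reported via XMLHttpRequest; here `flushed` just stands in
for the AJAX report, and the class and page names are made up):

```python
class PageTrail:
    """Per-user page list, as it might be kept in a cookie."""

    def __init__(self):
        self.pages = []    # the trail accumulated since the last search
        self.flushed = []  # stands in for reports sent to the log server

    def visit(self, page, is_search_result=False):
        """Record a page view.

        Visiting the search result page pops (flushes) the accumulated
        list as one report; any other page is pushed onto the trail.
        """
        if is_search_result:
            if self.pages:
                self.flushed.append(self.pages)
                self.pages = []
        else:
            self.pages.append(page)
```

For example, visiting "Breweries", then the search result page, then
"Grolsch" would report the one-page trail ["Breweries"] and start a
new trail at "Grolsch".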
Later a lot of such page lists can be analyzed and compared to the
known link structure. If a pair of pages consistently emerges in the
log without having a parent-child relation, then you know a link is
missing.
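The server-side comparison could be sketched like this, assuming the
log is a list of reported trails and the known link structure is a set
of (from, to) page pairs (the function name and the co-occurrence
threshold are illustrative assumptions):

```python
from collections import Counter
from itertools import combinations

def missing_link_candidates(trails, links, min_count=2):
    """Find page pairs that keep showing up in the same trail
    but have no link between them in either direction."""
    pair_counts = Counter()
    for trail in trails:
        # count each unordered pair of distinct pages in the trail
        for a, b in combinations(sorted(set(trail)), 2):
            pair_counts[(a, b)] += 1
    return sorted(pair for pair, n in pair_counts.items()
                  if n >= min_count
                  and pair not in links
                  and (pair[1], pair[0]) not in links)
```

Pairs that co-occur often enough and are not already connected are the
candidates for a missing link; everything else is dropped.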
Some guesstimates say that you need more than 100 page views before
something like this can detect obvious missing links. For Norwegian
(bokmål) Wikipedia that is about 2-3 months of statistics for half the
article base, but note that the accumulated stats could be rectified by
the page-redirect information from the database, as a link is very
seldom dropped; it is usually added.
Well, something like that. I was wondering about running a test case,
but given some previous discussion I concluded that I would get a
go-ahead on this.
It is also possible to analyze the article relations where the user
goes back to Google's result list, but that is somewhat more involved.
John
Platonides wrote:
John at Darkstar wrote:
If someone wants to work on this I have some ideas for making something
useful out of this log, but I'm a bit short on time. Basically there
are two ideas that are really useful: one is to figure out which
articles are most interesting to show in a portal, and the other is how
to detect articles with missing links between them.
John
How are you planning to detect articles which 'should have links'
between them?
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l