Had to run and missed a couple of important items. One is that you can
calculate the likelihood that a link is missing. (It's similar to
Google's PageRank.) If the likelihood turns out to be too small you
simply don't report anything. You can also skip reporting if you don't
have any intervening search or search result. You can also analyze the
link structure at the log server and skip logging of uninteresting
items, but even more interesting, this can be done client-side if
sufficient information is embedded in the pages. For this, note that a
high likelihood of a missing link implies few existing inbound links,
so those links can simply be embedded on the page itself.
Analyzing a dumb log would be very costly indeed.
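To make the client-side filtering idea concrete, here is a minimal
sketch, assuming the page embeds its own inbound-link list and using a
made-up scoring rule (the function names, the threshold, and the
"fewer inbound links means higher likelihood" prior are all my
illustrative assumptions, not an actual PageRank computation):

```python
def missing_link_likelihood(inbound_links):
    """Crude prior: the fewer inbound links a page already has,
    the more likely one is missing (hypothetical scoring rule)."""
    return 1.0 / (1.0 + len(inbound_links))

def should_report(previous_page, inbound_links, threshold=0.1):
    """Decide client-side whether a page visit is worth reporting.

    Skip reporting when the link already exists, or when the
    likelihood of a missing link is too small to bother the server.
    """
    if previous_page in inbound_links:
        return False  # a link is already there, nothing to report
    return missing_link_likelihood(inbound_links) >= threshold
```

A page with few inbound links and a predecessor not among them would
pass the filter, while a heavily linked page would be skipped without
any request to the log server.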
John
John at Darkstar wrote:
I tried to convince myself to stay out of this thread, but this was
somewhat interesting. ;)
I'm not quite sure this will work out for every case, but my rough idea
is like this:
Imagine a user trying to get an answer to some kind of problem. He
searches with Google and lands on the most obvious article about (for
example) breweries, even though he really wants to know something about
a specific beer (Groelch or Mac or whatever). He can't find anything
about it, so he makes an additional search (on the page, hopefully),
gets a result list, reads through a lot of articles and then finally
finds what he is searching for. Then he leaves.
Now, imagine a function that pushes newly visited pages onto a small
page list, and a function that pops that list each time the search
result page is visited. The page list is stored in a cookie. This small
page list is then reported to a special logging server by an AJAX
request. It can't just piggyback on a page load, as the final page
usually will not lead to a new request; the user simply leaves.
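A minimal sketch of that push/pop bookkeeping (in a real deployment
this would be JavaScript in the browser, with the list serialized into
a cookie and reported via XMLHttpRequest; here `flushed` just stands in
for the AJAX report, and the class and page names are made up):

```python
class PageTrail:
    """Per-user page list, as it might be kept in a cookie."""

    def __init__(self):
        self.pages = []    # the trail accumulated since the last search
        self.flushed = []  # stands in for reports sent to the log server

    def visit(self, page, is_search_result=False):
        """Record a page view.

        Visiting the search result page pops (flushes) the accumulated
        list as one report; any other page is pushed onto the trail.
        """
        if is_search_result:
            if self.pages:
                self.flushed.append(self.pages)
                self.pages = []
        else:
            self.pages.append(page)
```

For example, visiting "Breweries", then the search result page, then
"Grolsch" would report the one-page trail ["Breweries"] and start a
new trail at "Grolsch".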
Later a lot of such page lists can be analyzed and compared to the
known link structure. If a pair of pages consistently emerges in the
log without having a parent-child relation, then you know a link is
missing.
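The server-side comparison could be sketched like this, assuming the
log is a list of reported trails and the known link structure is a set
of (from, to) page pairs (the function name and the co-occurrence
threshold are illustrative assumptions):

```python
from collections import Counter
from itertools import combinations

def missing_link_candidates(trails, links, min_count=2):
    """Find page pairs that keep showing up in the same trail
    but have no link between them in either direction."""
    pair_counts = Counter()
    for trail in trails:
        # count each unordered pair of distinct pages in the trail
        for a, b in combinations(sorted(set(trail)), 2):
            pair_counts[(a, b)] += 1
    return sorted(pair for pair, n in pair_counts.items()
                  if n >= min_count
                  and pair not in links
                  and (pair[1], pair[0]) not in links)
```

Pairs that co-occur often enough and are not already connected are the
candidates for a missing link; everything else is dropped.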
Some guesstimates say that you need more than 100 page views before
something like this can detect obvious missing links. For Norwegian
(bokmål) Wikipedia that is about 2-3 months of statistics for half the
article base, but note that the accumulated stats could be rectified by
the page-redirect information from the database, as a link is very
seldom dropped; it is usually added.
Well, something like that. I was wondering about running a test case,
but given some previous discussion I concluded that I would get a
go-ahead on this.
It is also possible to analyze the article relations where the user
goes back to Google's result list, but that is somewhat more involved.
John
Platonides wrote:
John at Darkstar wrote:
If someone wants to work on this I have some ideas for making something
useful out of this log, but I'm a bit short on time. Basically there
are two ideas that are really useful: one is to figure out which
articles are most interesting to show in a portal, and the other is how
to detect articles with missing links between them.
John
How are you planning to detect articles which 'should have links'
between them?
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l