I tried to convince myself to stay out of this thread, but this was somewhat interesting. ;)
I'm not quite sure this will work out for every case, but my rough idea is like this:
Imagine a user trying to get an answer to some kind of problem. He searches with Google and lands on the most obvious article about (for example) breweries, even if he really wants to know something about a specific beer (Grolsch or Mac or whatever). He can't find anything about it, so he makes an additional search (on the site, hopefully), gets a result list, reads through a lot of articles and then finally finds what he is searching for. Then he leaves.
Now, imagine a function that pushes each newly visited page onto a small page list, and a function that pops that list each time the search result page is visited. The page list is stored in a cookie. This small page list is then reported to a special logging server by an AJAX request. It can't just piggyback on a normal page request, as the final page usually will not lead to a new request; the user simply leaves.
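To make the push/pop idea a bit more concrete, here is a rough sketch in Python of how such a page list might behave. This is only one possible reading of the idea: the class name `PageTrail`, the `max_len` limit, the rule that the entry page is never popped, and the JSON cookie encoding are all assumptions of mine, and the real thing would live in client-side script rather than Python.

```python
import json


class PageTrail:
    """Hypothetical short list of visited pages, meant to fit in a cookie."""

    def __init__(self, max_len=5):
        self.max_len = max_len  # assumed cap so the cookie stays small
        self.pages = []

    def push(self, title):
        """Record a newly visited page, ignoring an immediate repeat."""
        if self.pages and self.pages[-1] == title:
            return
        self.pages.append(title)
        # Drop the oldest entries once the list grows past max_len.
        del self.pages[:-self.max_len]

    def on_search_results(self):
        """Called when the search result page is visited.

        Assumption: returning to the results means the last article was
        a dead end, so it is popped. The entry page is always kept.
        """
        if len(self.pages) > 1:
            self.pages.pop()

    def to_cookie(self):
        """Serialize the list as it might be stored in a cookie value."""
        return json.dumps(self.pages)


trail = PageTrail()
trail.push("Brewery")        # landed here from the external search
trail.on_search_results()    # first on-site search, nothing to discard
trail.push("Mac")            # not what the user wanted
trail.on_search_results()    # back to the results, "Mac" is dropped
trail.push("Grolsch")        # the user finds the answer and leaves
print(trail.to_cookie())
```

With this reading, what is left in the list when the user leaves is the entry page and the page that finally answered the question, which is the pair one would want to report to the logging server.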
Later, a lot of such page lists can be analyzed and compared to the known link structure. If a pair of pages consistently emerges in the log without having a parent-child relation, then you know a link is missing.
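The comparison against the link structure could look roughly like the sketch below. It is not a worked-out design: the function name `missing_link_candidates`, the `min_count` threshold, the choice to count each unordered pair once per page list, and the representation of links as a set of (from, to) tuples are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def missing_link_candidates(trails, links, min_count=3):
    """Return (pair, count) for page pairs that co-occur often but are unlinked.

    trails: iterable of page lists as reported by clients.
    links:  set of (from_page, to_page) tuples from the known link structure.
    """
    counts = Counter()
    for trail in trails:
        # Count every unordered pair of distinct pages once per page list.
        for a, b in combinations(sorted(set(trail)), 2):
            counts[(a, b)] += 1

    candidates = []
    for (a, b), n in counts.most_common():
        linked = (a, b) in links or (b, a) in links
        if n >= min_count and not linked:
            candidates.append(((a, b), n))
    return candidates


# Made-up sample data, only to show the shape of the input.
trails = [["Brewery", "Grolsch"]] * 3 + [["Brewery", "Beer"]] * 3
links = {("Brewery", "Beer")}
print(missing_link_candidates(trails, links))
```

In this sample the Brewery/Beer pair is ignored because a link already exists in one direction, while the Brewery/Grolsch pair is reported as a candidate for a missing link.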
Some guesstimates say that you need more than 100 page views before something like this can detect obvious missing links. For Norwegian (bokmål) Wikipedia that is about 2-3 months of statistics for half the article base. Note, however, that the accumulated stats would be corrected using the page redirect information from the database, since a link is very seldom dropped; it is usually added.
Well, something like that. I was wondering about running a test case but given some previous discussion I concluded that I would get a go on this.
It is also possible to analyze the article relations where the user goes back to Google's result list, but that is somewhat more involved.
John
Platonides wrote:
John at Darkstar wrote:
If someone wants to work on this I have some ideas to make something useful out of this log, but I'm a bit short on time. Basically it's two ideas that are really useful; one is to figure out which articles are most interesting to show in a portal, and the other is how to detect articles with missing links between them. John
How are you planning to detect articles which 'should have links' between them?
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l