I've noticed lately that the blurbs Google is generating for Wikipedia articles not only no longer reflect the article intros (for a long time they were putting out whatever tripe DMOZ had about the article) but are now selectively quoting the most opinionated piece of POV-tripe phrasing that exists in the article.
This seems to be most easily demonstrated with recent movie articles, where Google inevitably quotes some loud-mouthed reviewer as the blurb for the article. For example:
http://www.google.com/search?q=the+box+movie+wikipedia "And it's so rare that a movie is an F. I mean, if it's an F, it shouldn't even be released." On the topic of the negative reaction to The Box, Mintz blamed.
http://www.google.com/search?q=District+9+movie+wikipedia "Sara Vilkomerson of The New York Observer writes, "District 9 is the most exciting science fiction movie to come along in ages; definitely the most"
(At least that one attributes it; a lot of the ones I've seen just seem to cut straight to the POV.)
For a while I thought it was just extracting text beginning at "Searchterm is something" looking backwards from the end of the article, but it seems to be more than that. Some older examples where I've seen this now seem to be returning different results; I don't know if it's a timing thing or just chance.
Anyone know how to influence Google's blurb generation to get more sensible results?
2009/11/16 Gregory Maxwell gmaxwell@gmail.com:
> For a while I thought it was just extracting text beginning at "Searchterm is something" looking backwards from the end of the article, but it seems to be more than that. Some older examples where I've seen this now seem to be returning different results; I don't know if it's a timing thing or just chance.
I think it's an unfortunate collision of your search terms and Wikipedia's preferred vocabulary. We've long standardised on "film" not "movie", so any incidence of the latter is likely to be in direct quotes, and is very *unlikely* to be in the lead section. Direct quotes tend to be reviews, pro or con, so when the algorithm tries to find extracts showing as many search terms as possible, it ends up apparently cherry-picking these.
The effect becomes clearer when we compare the results using "film" instead of "movie".
[the box film wikipedia]
The Box (2009 film) - Wikipedia, the free encyclopedia "The Box is a 2009 science fiction horror film based on the 1970 short story "Button, Button" by Richard Matheson, which was previously adapted into an ..."
[the box movie wikipedia]
The Box (2009 film) - Wikipedia, the free encyclopedia "And it's so rare that a movie is an F. I mean, if it's an F, it shouldn't even be released." On the topic of the negative reaction to The Box, Mintz blamed ..."
Note that the first query has all its keywords in the header, so the result shows the article's first line (which contains three of them anyway). The second has all the keywords *except* 'movie' in the header, so the algorithm looks for an extract specifically using that word. (I don't know how it chooses that extract, though.)
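To make the hypothesis concrete, here is a rough Python sketch of a snippet picker that behaves this way. It is only a toy model of my guess, not Google's actual algorithm, which I don't know. The names `tokenize` and `pick_snippet` are made up for illustration, and the sentences are abridged from the extracts quoted in this thread.

```python
import re

# Title and (abridged) sentences taken from the search results quoted in this thread.
TITLE = "The Box (2009 film) - Wikipedia, the free encyclopedia"
SENTENCES = [
    'The Box is a 2009 science fiction horror film based on the 1970 '
    'short story "Button, Button" by Richard Matheson.',
    "And it's so rare that a movie is an F.",
    "On the topic of the negative reaction to The Box, Mintz blamed "
    "the film's ending.",
]


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def pick_snippet(title, sentences, query):
    """Hypothetical model: show the lead unless a query term is missing
    from the title, in which case prefer a sentence containing it."""
    missing = tokenize(query) - tokenize(title)
    if not missing:
        # Every keyword is already in the title: show the first line.
        return sentences[0]
    # max() returns the first maximal item, so ties go to the earliest sentence.
    best = max(sentences, key=lambda s: len(missing & tokenize(s)))
    if not missing & tokenize(best):
        # No sentence contains a missing term: fall back to the first line.
        return sentences[0]
    return best


for query in ("the box film wikipedia", "the box movie wikipedia"):
    print(query, "->", pick_snippet(TITLE, SENTENCES, query))
```

With "film" the toy picker returns the lead sentence; with "movie" it returns the review quote, because that is the only sentence containing the one keyword the title lacks.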
We can test this by using a different keyword and seeing how it builds the extract:
[the box mintz wikipedia]
The Box (2009 film) - Wikipedia, the free encyclopedia "On the topic of the negative reaction to The Box, Mintz blamed the film's ending and was quoted as saying "People really thought this was a stinker". ..."
Might this explain the effect? No idea how to *solve* it, though...