So, on a lighter note, I recently got sick & tired of running site: search after site: -wiki search in Google, and began looking for some way to automate it.
I discovered that one can make a 'custom' Google search: https://secure.wikimedia.org/wikipedia/en/wiki/Google_Co-op
It allows one essentially to tell Google to increase the score of any hits in certain domains, and blacklist other domains. It has a number of neat features - for example, I can tell it to blacklist any domain on https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Mirrors_and_forks/A... . You might think that a parameter like '-wiki' or '-wikipedia' would do the same thing, but alas!
In particular, I've created a CSE focused on anime & manga topics: http://www.google.com/cse/home?cx=009114923999563836576:1eorkzz2gp4
I started with all the links listed in https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:WikiProject_Anime_a... and then began running searches on random topics and pruning based on that - chucking sites into the blacklist sinbin, or finding good sites omitted from the list and adding them to the whitelist. At last count, I had 200 sites on the nice list and 311 on the naughty list (but this counts things like the Mirrors page as a single link, though they ban dozens or hundreds of sites).
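The pruning described above amounts to sorting each hit's domain into a nice list or a naughty list. A minimal local sketch of that sorting follows. It is an illustration only, not a description of how Google applies CSE filters; the function names and the `.example` domains are hypothetical placeholders.

```python
from urllib.parse import urlsplit


def host_matches(host, domains):
    """True if host equals one of the listed domains or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in domains)


def classify(url, nice, naughty):
    """Sort a result URL into 'blocked', 'boosted', or 'neutral'.

    Assumption: the naughty list takes priority when a host is on both lists.
    """
    host = urlsplit(url).hostname or ""
    if host_matches(host, naughty):
        return "blocked"
    if host_matches(host, nice):
        return "boosted"
    return "neutral"


# Hypothetical lists; the real ones held 200 and 311 entries respectively.
NICE = {"reviews.example"}
NAUGHTY = {"mirror.example"}

if __name__ == "__main__":
    for hit in ("http://www.mirror.example/wiki/Amanchu",
                "http://reviews.example/amanchu-launch",
                "http://forum.example/thread/1"):
        print(hit, classify(hit, NICE, NAUGHTY))
```

Hits that come back "neutral" are the ones worth inspecting by hand, since they are the candidates for either list.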
The results are *much* better. To take my most recent use, finding material on [[Amanchu!]] for its AFD (https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Articles_for_deleti...), compare the regular Google search: http://www.google.com/search?q=amanchu
with the CSE search: http://www.google.com/cse?cx=009114923999563836576%3A1eorkzz2gp4&q=amanchu
All the blogs & scanlations & forums in the former are great for someone who just wants to read _Amanchu!_, but for a Wikipedian? It's terrible. Notice that the ANN launch article, which is apparently the most substantive English coverage in a RS*, is the first hit in the CSE but the fifth in the regular Google search, and you can keep scrolling down and find mostly chaff. And the weekly sales ranking that puts _Amanchu!_ at #8 nationally, that shows up in the first page in the CSE? I've no idea where it is in the regular Google hits.
Or take a critical classic: _The Wings of Honneamise_ (https://secure.wikimedia.org/wikipedia/en/wiki/Royal_Space_Force:_The_Wings_...).
Google: http://www.google.com/search?q=wings%20of%20honneamise
CSE: http://www.google.com/cse?cx=009114923999563836576%3A1eorkzz2gp4&q=wings...
Google has on its first page WP, IMDb, Amazon, video links, Tucows (!), ads, and just 2 reviews a Wikipedian might find useful.
CSE has 9 or 10 good review sources from respectable publications like Ex.org or the New York Times, and even the questionable hits like RottenTomatoes have their good points - RT would lead one to the famous critic Roger Ebert's *very* flattering review of _Wings of Honneamise_. And it'll take you straight to Ebert's review on page 2, whereas in regular Google search, you have to go to page 7 or 8.
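The comparison links in both examples follow one pattern: the regular search takes only a `q` parameter, while the CSE link adds the engine ID as `cx`. A small sketch for generating such a pair for any topic is below. The function names are hypothetical, and the URL layout is assumed from the links quoted in this message rather than from any Google documentation.

```python
from urllib.parse import quote, urlencode

# Engine ID copied from the CSE links quoted in this message.
CSE_ID = "009114923999563836576:1eorkzz2gp4"


def google_url(query):
    """Build a regular Google search link for the query."""
    # quote_via=quote encodes spaces as %20, matching the links above.
    return "http://www.google.com/search?" + urlencode({"q": query}, quote_via=quote)


def cse_url(query, cx=CSE_ID):
    """Build the matching custom-search link; the colon in cx becomes %3A."""
    return "http://www.google.com/cse?" + urlencode({"cx": cx, "q": query}, quote_via=quote)


if __name__ == "__main__":
    for topic in ("amanchu", "wings of honneamise"):
        print(google_url(topic))
        print(cse_url(topic))
```

Opening the two printed links side by side reproduces the kind of comparison made above for any other article.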
Further examples can be multiplied, but I hope this shows that CSEs can be very useful for finding online sources; I'm sure it would work as well for other subject-areas!
(And since I can't let recent events go, I'll mar my little essay with a final remark: *this* is the sort of thing that will lessen issues like BLPs - not fanaticism like "Caedite eos! Novit enim Dominus qui sunt eius".)
* Unsurprising, really. _Amanchu!_ is Japanese only and likely will stay that way for years; even the anime media can be very language-parochial.
On Sat, Jan 23, 2010 at 3:00 AM, Gwern Branwen gwern0@gmail.com wrote:
...snip... I started with all the links listed in https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:WikiProject_Anime_a... and then began running searches on random topics and pruning based on that - chucking sites into the blacklist sinbin, or finding good sites omitted from the list and adding them to the whitelist. At last count, I had 200 sites on the nice list and 311 on the naughty list (but this counts things like the Mirrors page as a single link, though they ban dozens or hundreds of sites). ...snip...
Perhaps we should encourage more WikiProjects to create lists like the one displayed and add them to a category; someone could then work on a custom search, suitable for use across the project, that is continuously updated with more allow/black lists.
-Peachey
On Fri, Jan 22, 2010 at 8:45 PM, K. Peachey p858snake@yahoo.com.au wrote:
Perhaps we should encourage more WikiProjects to create lists like the one displayed and add them to a category; someone could then work on a custom search, suitable for use across the project, that is continuously updated with more allow/black lists.
-Peachey
That would be an excellent idea, especially if they could then all be {{subst}}ed into a single page - just as I can ban every site listed in the consolidated WP:MIRROR page, so too I can *include* every site listed on a page. It would probably be superior to the current AfD template with just some normal Google/Books/News searches.
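As a rough sketch of what consolidating per-project lists into a single page could look like, the snippet below merges several projects' lists into one allow list and one block list. The function name, the project names, and the `.example` domains are all hypothetical, and the rule that a block by any project overrides an allow is an assumption of this sketch, not something settled in this thread.

```python
def merge_lists(projects):
    """Merge per-project (allow, block) domain lists into two sorted lists.

    projects maps a project name to a pair (allow_domains, block_domains).
    """
    allow, block = set(), set()
    for allowed, blocked in projects.values():
        allow |= {d.lower() for d in allowed}
        block |= {d.lower() for d in blocked}
    # Assumption: a domain blocked by any project stays blocked for everyone.
    allow -= block
    return sorted(allow), sorted(block)


if __name__ == "__main__":
    merged = merge_lists({
        "anime": (["Reviews.example", "news.example"], ["mirror.example"]),
        "film": (["news.example", "mirror.example"], ["forum.example"]),
    })
    print(merged)
```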
On Sat, Jan 23, 2010 at 3:21 AM, Gwern Branwen gwern0@gmail.com wrote:
That would be an excellent idea, especially if they could then all be {{subst}}ed into a single page - just as I can ban every site listed in the consolidated WP:MIRROR page, so too I can *include* every site listed on a page. It would probably be superior to the current AfD template with just some normal Google/Books/News searches.
Does your custom search aggregate books, news, and scholar searches, as well as ordinary web searches? Those are the four Google searches I use most often, and it is interesting to see how some subjects get more coverage in one area of the information metasphere than in others. It is all quite logical when you think about when the topic received most coverage. The one thing I still find badly lacking is Google News - a lot of old newspapers still seem to need to be searched on separate databases. What is the best database out there for searching in old newspapers?
Carcharoth
On Sat, Jan 23, 2010 at 4:31 AM, Carcharoth carcharothwp@googlemail.com wrote:
Does your custom search aggregate books, news, and scholar searches, as well as ordinary web searches?
I put in the Books/News/Scholar URLs, but I'm unsure whether it did anything. For example, AFAIK, a site search of Google Books will only turn up the homepage for a book - the metadata, reviews, etc.; the actual OCR page contents are part of the 'deep web' you can get at only through the actual Google search box. One might think that Google's custom search would recognize the Google service URLs and run the deep web queries rather than just query the surface details - but that seems to be too much to expect. (So I was perhaps a little hasty in suggesting a universal CSE would replace the AfD searches.)
Those are the four Google searches I use most often, and it is interesting to see how some subjects get more coverage in one area of the information metasphere than in others. It is all quite logical when you think about when the topic received most coverage. The one thing I still find badly lacking is Google News - a lot of old newspapers still seem to need to be searched on separate databases. What is the best database out there for searching in old newspapers?
Carcharoth
I don't know of any good non-proprietary old newspaper database, personally.
On Fri, Jan 22, 2010 at 12:00 PM, Gwern Branwen gwern0@gmail.com wrote:
Further examples can be multiplied, but I hope this shows that CSEs can be very useful for finding online sources; I'm sure it would work as well for other subject-areas!
(And since I can't let recent events go, I'll mar my little essay with a final remark: *this* is the sort of thing that will lessen issues like BLPs - not fanaticism like "Caedite eos! Novit enim Dominus qui sunt eius".)
I withdraw my enthusiastic support of CSEs. Apparently Google will without warning or notice arbitrarily delete all but 20 URL filters:
http://www.google.com/support/forum/p/customsearch/thread?tid=29757bc2983d53...
This makes CSE utterly useless for Wikipedia. The obvious workaround, keeping a list of URLs on a subpage and having a CSE load that, will run afoul of the Wikipedia blacklist filter.
Even if I found a workaround, I am sufficiently angry that Google would unilaterally destroy approximately 10-20 hours of my work that I do not think I would use CSE anyway.
The idea is still good, however. The restricted searches proved their utility to me. But I currently don't know of any alternatives.