[WikiEN-l] Custom Google search engines for finding RSs for subject areas

Gwern Branwen gwern0 at gmail.com
Fri Jan 22 17:00:46 UTC 2010


So, on a lighter note, I recently got sick & tired of running site:
search after site: -wiki search in Google, and began looking for some
way to automate it.

I discovered that one can make a 'custom' Google search:
https://secure.wikimedia.org/wikipedia/en/wiki/Google_Co-op

It allows one essentially to tell Google to increase the score of any
hits in certain domains, and blacklist other domains. It has a number
of neat features - for example, I can tell it to blacklist any domain
on https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Mirrors_and_forks/All
. You might think that a parameter like '-wiki' or '-wikipedia' would
do the same thing, but alas!

In particular, I've created a CSE focused on anime & manga topics:
http://www.google.com/cse/home?cx=009114923999563836576:1eorkzz2gp4

I started with all the links listed in
https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:WikiProject_Anime_and_manga/Online_reliable_sources
and then began running searches on random topics and pruning based on
that - chucking sites into the blacklist sinbin, or finding good sites
omitted from the list and adding them to the whitelist. At last count,
I had 200 sites on the nice list and 311 on the naughty list (but this
counts things like the Mirrors page as a single link, though they ban
dozens or hundreds of sites).
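
For readers who want to see the nice-list/naughty-list idea in concrete terms, here is a minimal Python sketch. It is only a local illustration of the concept, not how Google implements a CSE. The function names and the example domains are hypothetical; only animenewsnetwork.com stands in for a site discussed in this message.

```python
from urllib.parse import urlsplit


def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify(url, whitelist, blacklist):
    """Label a result URL as 'blocked', 'boosted', or 'neutral'."""
    host = (urlsplit(url).hostname or "").lower()
    if any(host_matches(host, d) for d in blacklist):
        return "blocked"
    if any(host_matches(host, d) for d in whitelist):
        return "boosted"
    return "neutral"


def rerank(urls, whitelist, blacklist):
    """Drop blacklisted hits and move whitelisted hits to the front.

    sorted() is stable, so the original order is kept within each group.
    """
    kept = [u for u in urls if classify(u, whitelist, blacklist) != "blocked"]
    return sorted(kept, key=lambda u: classify(u, whitelist, blacklist) != "boosted")


if __name__ == "__main__":
    # Hypothetical lists and results, for illustration only.
    whitelist = {"animenewsnetwork.com"}
    blacklist = {"mirror.example"}
    results = [
        "http://forum.example/thread/1",
        "http://www.mirror.example/wiki/Amanchu",
        "http://www.animenewsnetwork.com/news/some-article",
    ]
    for url in rerank(results, whitelist, blacklist):
        print(url)
```

Matching on the domain suffix rather than the exact hostname means a single list entry also covers subdomains such as www, which keeps the lists short.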

The results are *much* better. To take my most recent use, finding
material on [[Amanchu!]] for its AFD
(https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Articles_for_deletion/Amanchu!),
compare the regular Google search:
http://www.google.com/search?q=amanchu

with the CSE search:
http://www.google.com/cse?cx=009114923999563836576%3A1eorkzz2gp4&q=amanchu

All the blogs & scanlations & forums in the former are great for
someone who just wants to read _Amanchu!_, but for a Wikipedian? It's
terrible. Notice that the ANN launch article, which is apparently the
most substantive English coverage in a RS*, is the first hit in the
CSE but the fifth in the regular Google search, and you can keep
scrolling down and find mostly chaff. And the weekly sales ranking
that puts _Amanchu!_ at #8 nationally, that shows up in the first page
in the CSE? I've no idea where it is in the regular Google hits.
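
Anyone wanting to run this kind of side-by-side comparison for other titles could generate both URLs with a few lines of Python. This is a rough sketch: the URL shapes and the cx identifier are copied from the links in this message, the function names are made up for illustration, and the sketch does not check whether Google still serves these addresses.

```python
from urllib.parse import urlencode

# The cx identifier of the anime & manga CSE, as given in this message.
CSE_ID = "009114923999563836576:1eorkzz2gp4"


def google_url(query):
    """URL for a regular Google search on the query."""
    return "http://www.google.com/search?" + urlencode({"q": query})


def cse_url(query, cx=CSE_ID):
    """URL for the same query run through a custom search engine."""
    # urlencode percent-escapes the ':' in cx and turns spaces into '+'.
    return "http://www.google.com/cse?" + urlencode({"cx": cx, "q": query})


if __name__ == "__main__":
    for title in ["amanchu", "wings of honneamise"]:
        print(google_url(title))
        print(cse_url(title))
```

Passing a different cx value to cse_url would point the same query at some other subject area's CSE.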

Or take a critical classic: _The Wings of Honneamise_
(https://secure.wikimedia.org/wikipedia/en/wiki/Royal_Space_Force:_The_Wings_of_Honn%C3%AAamise).

Google:
http://www.google.com/search?q=wings%20of%20honneamise
CSE:
http://www.google.com/cse?cx=009114923999563836576%3A1eorkzz2gp4&q=wings+of+honneamise

Google has on its first page WP, IMDb, Amazon, video links, Tucows
(!), ads, and just 2 reviews a Wikipedian might find useful.

CSE has 9 or 10 good review sources from respectable publications like
Ex.org or the New York Times, and even the questionable hits like
RottenTomatoes have their good points - RT would lead one to the
famous critic Roger Ebert's *very* flattering review of _Wings of
Honneamise_. And it'll take you straight to Ebert's review on page 2,
whereas in regular Google search, you have to go to page 7 or 8.

Further examples can be multiplied, but I hope this shows that CSEs
can be very useful for finding online sources; I'm sure it would work
as well for other subject-areas!

(And since I can't let recent events go, I'll mar my little essay with
a final remark: *this* is the sort of thing that will lessen issues
like BLPs - not fanaticism like "Caedite eos! Novit enim Dominus qui
sunt eius".)

* Unsurprising, really. _Amanchu!_ is Japanese only and likely will
stay that way for years; even the anime media can be very
language-parochial.

-- 
gwern


