[WikiEN-l] finding the "most recognizable" page names
Michael Katz
michaeladamkatz at yahoo.com
Fri Sep 30 03:17:01 UTC 2011
I'm making a crossword-style word game, and I'm trying to automate the process of creating the puzzles, at least somewhat.
I am hoping to find or create a list of English Wikipedia page titles, sorted roughly by how "recognizable" they are, where by recognizable I mean something like, "how likely it is that the average American on the street will be familiar with the name/phrase/subject".
For instance, on a recognizability scale from 0 to 100, I might score (just guessing here):
Lady_Gaga = 90
Lady_Jane_Grey = 10
Lady_and_the_Tramp = 90
Lady_Antebellum = 5
Lady-in-waiting = 70
Lady_Bird_Johnson = 65
Lady_Marmalade = 10
Ladysmith_Black_Mambazo = 10
One suggestion is simply to use page length (either the number of characters of wikitext or the rendered page length) as a proxy for recognizability. That might work, but it feels crude, and it would certainly produce false positives: long but obscure pages such as Bose-Einstein_condensation.
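If I were to try the page-length route myself, I imagine it would look something like the rough, untested Python sketch below. The dump directory also has a file enwiki-latest-page.sql.gz, and my understanding is that its page table carries a page_len column (raw wikitext length in bytes) per title; the field order in the regex is my reading of the schema, so it would need to be checked against the CREATE TABLE statement at the top of the file:

    import gzip
    import re

    # Assumed tuple layout, from the page-table schema: (page_id,
    # page_namespace, page_title, page_restrictions, page_counter,
    # page_is_redirect, page_is_new, page_random, page_touched,
    # page_latest, page_len). Verify against the CREATE TABLE
    # statement at the top of the dump before trusting this.
    row_re = re.compile(
        rb"\((\d+),(\d+),'((?:[^'\\]|\\.)*)','(?:[^'\\]|\\.)*',"
        rb"\d+,(\d+),\d+,[-\d.eE+]+,'\d+',\d+,(\d+)\)")

    lengths = {}
    with gzip.open("enwiki-latest-page.sql.gz", "rb") as f:
        for line in f:
            # The dump is a series of giant INSERT statements.
            if not line.startswith(b"INSERT INTO"):
                continue
            for m in row_re.finditer(line):
                ns, title, is_redirect, page_len = m.group(2, 3, 4, 5)
                # Keep main-namespace articles only, skip redirects.
                if ns == b"0" and is_redirect == b"0":
                    lengths[title] = int(page_len)

    # e.g. lengths.get(b"Lady_Gaga") -> wikitext length in bytes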
Someone suggested that I might instead count incoming page links, and pointed me to http://dumps.wikimedia.org/enwiki/latest/, in particular the file enwiki-latest-pagelinks.sql.gz. I downloaded and skimmed that file, but I'm not confident I'm reading the link structure correctly.
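My best guess from skimming it is that it's a series of MySQL INSERT statements whose tuples are (pl_from, pl_namespace, pl_title), i.e. the linking page's ID, then the target's namespace and title. If that reading is right, a rough, untested sketch like this might tally incoming links:

    import gzip
    import re
    from collections import Counter

    # Assumed tuple layout: (pl_from, pl_namespace, pl_title), where
    # pl_title is the target title without a namespace prefix. Verify
    # against the CREATE TABLE statement at the top of the dump.
    row_re = re.compile(rb"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)")

    counts = Counter()
    with gzip.open("enwiki-latest-pagelinks.sql.gz", "rb") as f:
        for line in f:
            if not line.startswith(b"INSERT INTO"):
                continue
            for m in row_re.finditer(line):
                ns, title = m.group(2), m.group(3)
                if ns == b"0":  # links whose target is an article
                    counts[title] += 1

    # Caveats: redirects are counted as separate titles, red links
    # are included, and the Counter holds every distinct linked
    # title in memory, which is a lot for the full English dump.
    for title, n in counts.most_common(25):
        print(n, title.decode("utf-8", "replace"))

But I'd welcome corrections if the file doesn't actually work that way.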
So my questions are:
(1) Do you know if a list like the one I'm trying to make already exists?
(2) If you were going to build such a list, how would you do it? If it were based on page length, which files would you download and process to make it as efficient as possible? If it were based on incoming links, which files specifically would you use, and how would you determine the link counts?
Thanks for any help.