I'm making a crossword-style word game, and I'm trying to automate the process of
creating the puzzles, at least somewhat.
I am hoping to find or create a list of English Wikipedia page titles, sorted roughly by
how "recognizable" they are, where by recognizable I mean something like,
"how likely it is that the average American on the street will be familiar with the
name/phrase/subject".
For instance, to take a few random examples, on a recognizability scale from 0 to 100 I
might score (just guessing here):
Lady_Gaga = 90
Lady_Jane_Grey = 10
Lady_and_the_Tramp = 90
Lady_Antebellum = 5
Lady-in-waiting = 70
Lady_Bird_Johnson = 65
Lady_Marmalade = 10
Ladysmith_Black_Mambazo = 10
One suggestion was simply to use page length (either the character count of the wikitext
or the rendered page height) as a proxy for recognizability. That might work, but it
feels crude, and it would certainly produce false positives: long pages on unrecognizable
subjects, such as Bose-Einstein_condensation.
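Setting aside whether length is a good proxy, the mechanics of that route would look
roughly like the sketch below. The lengths live in the page-table dump
(enwiki-latest-page.sql.gz from the same directory), whose INSERT statements pack many
rows into parenthesized tuples. Note this is only a sketch: the column positions I use
(title at index 2, page_len at index 10) are an assumption from one dump version, so
the CREATE TABLE statement at the top of the file should be checked first.

```python
def split_rows(insert_line):
    """Split one "INSERT INTO ... VALUES (...),(...);" line into rows.

    Walks the VALUES part character by character, tracking quote and
    backslash-escape state, so commas inside quoted titles don't split
    fields. Returns a list of rows, each a list of raw field strings.
    """
    values = insert_line.split("VALUES", 1)[1]
    rows, row, field = [], [], []
    in_row = in_quote = escaped = False
    for ch in values:
        if not in_row:
            if ch == "(":              # start of a new row tuple
                in_row, row, field = True, [], []
            continue
        if escaped:                    # char after a backslash: keep it literally
            field.append(ch)
            escaped = False
        elif in_quote:
            if ch == "\\":
                escaped = True
            elif ch == "'":
                in_quote = False
            else:
                field.append(ch)
        elif ch == "'":
            in_quote = True
        elif ch == ",":                # field separator inside a tuple
            row.append("".join(field))
            field = []
        elif ch == ")":                # end of the tuple
            row.append("".join(field))
            rows.append(row)
            in_row = False
        else:
            field.append(ch)
    return rows

# Assumed column positions in the page table -- verify against the
# CREATE TABLE statement before running this on a real dump.
TITLE_COL, LEN_COL = 2, 10

def page_lengths(rows):
    """Map page title -> page_len (bytes of wikitext) for one INSERT line."""
    return {row[TITLE_COL]: int(row[LEN_COL]) for row in rows}
```

On the real file you would stream it with gzip.open(path, "rt", errors="replace"),
feed only the lines starting with "INSERT INTO" through split_rows, and sort the
accumulated title-to-length map at the end.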
Someone suggested counting incoming page links instead, and referred me to
http://dumps.wikimedia.org/enwiki/latest/, in particular the file
enwiki-latest-pagelinks.sql.gz. I downloaded that file and looked at it, but couldn't
work out whether or how the link structure is represented in it.
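From what I can tell, each row in that file may be a (pl_from, pl_namespace, pl_title)
tuple, where pl_from is the page id of the page containing the link and pl_title is the
link's target, so a title's incoming-link count would just be the number of rows naming
it. Treat that column layout as an unverified assumption, though: the schema has
apparently changed across dump versions (newer ones reportedly store a pl_target_id
that joins against a separate linktarget table), so the CREATE TABLE header in the file
is the thing to check. If the layout is right, the counting step is a one-pass tally;
here's a sketch over made-up rows:

```python
from collections import Counter

# Assumed pagelinks row layout: (pl_from, pl_namespace, pl_title).
# Confirm against the CREATE TABLE statement in the dump before use.
NS_COL, TITLE_COL = 1, 2

def incoming_link_counts(rows):
    """Count rows per target title, keeping only main-namespace targets."""
    counts = Counter()
    for row in rows:
        if row[NS_COL] == "0":      # namespace 0 = ordinary articles
            counts[row[TITLE_COL]] += 1
    return counts

# Made-up rows standing in for parsed dump tuples:
demo = [
    ("11", "0", "Lady_Gaga"),
    ("12", "0", "Lady_Gaga"),
    ("13", "0", "Lady_Jane_Grey"),
    ("14", "1", "Lady_Gaga"),       # non-article-namespace target, skipped
]
# incoming_link_counts(demo)["Lady_Gaga"] == 2
```

The resulting counter, sorted by count, would be the first cut at a
recognizability-ordered title list.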
So my questions are:
(1) Do you know if a list like the one I'm trying to make already exists?
(2) If you were going to make a list like this, how would you do it? If it were based on
page length, which files would you download and process to make it as efficient as
possible? If it were based on incoming links, which files specifically would you use, and
how would you determine the link counts?
Thanks for any help.