I'm making a crossword-style word game, and I'm trying to automate the process of
creating the puzzles, at least somewhat.
I am hoping to find or create a list of English Wikipedia page titles, sorted roughly by
how "recognizable" they are, where by recognizable I mean something like,
"how likely it is that the average American on the street will be familiar with the
name/phrase/subject".
For instance, to take a few random examples, on a recognizability scale from 0 to 100 I
might score (just guessing here):
Lady_Gaga = 90
Lady_Jane_Grey = 10
Lady_and_the_Tramp = 90
Lady_Antebellum = 5
Lady-in-waiting = 70
Lady_Bird_Johnson = 65
Lady_Marmalade = 10
Ladysmith_Black_Mambazo = 10
One suggestion was simply to use page length (either the character count of the wikitext
or the rendered page height) as a proxy for recognizability. That might work, but it
feels crude, and it would certainly produce false positives: long pages on unrecognizable
subjects, such as Bose-Einstein_condensation.
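Setting aside whether length is a good proxy, the mechanics of that route would look
roughly like the sketch below. The lengths live in the page-table dump
(enwiki-latest-page.sql.gz from the same directory), whose INSERT statements pack many
rows into parenthesized tuples. Note this is only a sketch: the column positions I use
(title at index 2, page_len at index 10) are an assumption from one dump version, so
the CREATE TABLE statement at the top of the file should be checked first.

```python
def split_rows(insert_line):
    """Split one "INSERT INTO ... VALUES (...),(...);" line into rows.

    Walks the VALUES part character by character, tracking quote and
    backslash-escape state, so commas inside quoted titles don't split
    fields. Returns a list of rows, each a list of raw field strings.
    """
    values = insert_line.split("VALUES", 1)[1]
    rows, row, field = [], [], []
    in_row = in_quote = escaped = False
    for ch in values:
        if not in_row:
            if ch == "(":              # start of a new row tuple
                in_row, row, field = True, [], []
            continue
        if escaped:                    # char after a backslash: keep it literally
            field.append(ch)
            escaped = False
        elif in_quote:
            if ch == "\\":
                escaped = True
            elif ch == "'":
                in_quote = False
            else:
                field.append(ch)
        elif ch == "'":
            in_quote = True
        elif ch == ",":                # field separator inside a tuple
            row.append("".join(field))
            field = []
        elif ch == ")":                # end of the tuple
            row.append("".join(field))
            rows.append(row)
            in_row = False
        else:
            field.append(ch)
    return rows

# Assumed column positions in the page table -- verify against the
# CREATE TABLE statement before running this on a real dump.
TITLE_COL, LEN_COL = 2, 10

def page_lengths(rows):
    """Map page title -> page_len (bytes of wikitext) for one INSERT line."""
    return {row[TITLE_COL]: int(row[LEN_COL]) for row in rows}
```

On the real file you would stream it with gzip.open(path, "rt", errors="replace"),
feed only the lines starting with "INSERT INTO" through split_rows, and sort the
accumulated title-to-length map at the end.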
Someone suggested counting incoming page links instead, and referred me to
http://dumps.wikimedia.org/enwiki/latest/, in particular the file
enwiki-latest-pagelinks.sql.gz. I downloaded that file and looked at it, but couldn't
work out whether or how the link structure is represented in it.
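From what I can tell, each row in that file may be a (pl_from, pl_namespace, pl_title)
tuple, where pl_from is the page id of the page containing the link and pl_title is the
link's target, so a title's incoming-link count would just be the number of rows naming
it. Treat that column layout as an unverified assumption, though: the schema has
apparently changed across dump versions (newer ones reportedly store a pl_target_id
that joins against a separate linktarget table), so the CREATE TABLE header in the file
is the thing to check. If the layout is right, the counting step is a one-pass tally;
here's a sketch over made-up rows:

```python
from collections import Counter

# Assumed pagelinks row layout: (pl_from, pl_namespace, pl_title).
# Confirm against the CREATE TABLE statement in the dump before use.
NS_COL, TITLE_COL = 1, 2

def incoming_link_counts(rows):
    """Count rows per target title, keeping only main-namespace targets."""
    counts = Counter()
    for row in rows:
        if row[NS_COL] == "0":      # namespace 0 = ordinary articles
            counts[row[TITLE_COL]] += 1
    return counts

# Made-up rows standing in for parsed dump tuples:
demo = [
    ("11", "0", "Lady_Gaga"),
    ("12", "0", "Lady_Gaga"),
    ("13", "0", "Lady_Jane_Grey"),
    ("14", "1", "Lady_Gaga"),       # non-article-namespace target, skipped
]
# incoming_link_counts(demo)["Lady_Gaga"] == 2
```

The resulting counter, sorted by count, would be the first cut at a
recognizability-ordered title list.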
So my questions are:
(1) Do you know if a list like the one I'm trying to make already exists?
(2) If you were going to make a list like this, how would you do it? If it were based on
page length, which files would you download and process to make it as efficient as
possible? If it were based on incoming links, which files specifically would you use, and
how would you determine the link counts?
Thanks for any help.