I'm making a crossword-style word game, and I'm trying to automate the process of creating the puzzles, at least somewhat.
I am hoping to find or create a list of English Wikipedia page titles, sorted roughly by how "recognizable" they are, where by recognizable I mean something like, "how likely it is that the average American on the street will be familiar with the name/phrase/subject".
For instance, on a recognizability scale from 0 to 100, I might score a few random examples like this (just guessing here):
Lady_Gaga = 90
Lady_Jane_Grey = 10
Lady_and_the_Tramp = 90
Lady_Antebellum = 5
Lady-in-waiting = 70
Lady_Bird_Johnson = 65
Lady_Marmalade = 10
Ladysmith_Black_Mambazo = 10
One idea would be to simply use the page length (either the number of characters of wikitext or the rendered page length) as a proxy for recognizability. That might work, but it feels kind of crude, and it would certainly produce many false positives, such as Bose-Einstein_condensation.
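For what it's worth, here's roughly what I had in mind for the page-length approach, just as a sketch: querying the MediaWiki API's prop=info for a handful of candidate titles and reading off the reported byte length. (For the full title list I assume a dump would be the right tool rather than the API; this is only to illustrate the proxy.)

```python
# Sketch of the page-length-as-proxy idea: ask the MediaWiki API for the
# byte length of each page via prop=info. Only meant for spot-checking a
# few titles, not for processing all of enwiki.
import requests

API = "https://en.wikipedia.org/w/api.php"

def page_lengths(titles):
    """Return {title: length_in_bytes} for up to ~50 titles per request."""
    params = {
        "action": "query",
        "prop": "info",
        "titles": "|".join(titles),
        "format": "json",
        "formatversion": "2",
    }
    resp = requests.get(API, params=params, timeout=30)
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    # Missing pages have no "length" key, so default them to 0.
    return {p["title"]: p.get("length", 0) for p in pages}

print(page_lengths(["Lady Gaga", "Lady Jane Grey", "Lady and the Tramp"]))
```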
Someone suggested to me that I might count incoming page links, and referred me to http://dumps.wikimedia.org/enwiki/latest/ and in particular the file enwiki-latest-pagelinks.sql.gz. I downloaded that file and looked at it, but I couldn't figure out how (or whether) the linking structure is represented in it.
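In case it helps frame the question, here's the kind of thing I imagined doing with that dump, assuming each row looks like (pl_from, pl_namespace, 'pl_title', ...) with the target title as a quoted string. I'm not sure that assumption holds (the CREATE TABLE statement at the top of the file would need to be checked, and I gather the schema has changed over time), which is partly why I'm asking.

```python
# Sketch of counting incoming links by streaming the gzipped SQL dump and
# tallying the target title of each link row. Assumes the older column
# layout (pl_from, pl_namespace, 'pl_title', ...); newer dumps may instead
# reference a separate linktarget table, in which case this won't work as-is.
import gzip
import re
from collections import Counter

# Capture the first three tuple fields: pl_from, pl_namespace, quoted title.
ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")

def count_incoming_links(path="enwiki-latest-pagelinks.sql.gz"):
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue
            for pl_from, pl_namespace, pl_title in ROW.findall(line):
                if pl_namespace == "0":  # main/article namespace only
                    counts[pl_title] += 1
    return counts

# counts.most_common(100) would then give the most-linked-to titles,
# which is roughly the "recognizability" ordering I'm after.
```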
So my questions are:
(1) Do you know if a list like the one I'm trying to make already exists?
(2) If you were going to make a list like this, how would you do it? If it were based on page length, which files would you download, and how would you process them most efficiently? If it were based on incoming links, which files specifically would you use, and how would you determine the link counts?
Thanks for any help.