[WikiEN-l] finding the "most recognizable" page names
WereSpielChequers
werespielchequers at gmail.com
Fri Sep 30 09:55:59 UTC 2011
Hi Michael,
I don't know if such a list exists, other than lists by largest numbers of
views.
Article size probably reflects the interest of one or a few editors and the
complexity of the information; I doubt it would closely relate to
recognisability. Incoming links are probably a better measure, but the
counts can get awfully skewed by templates, and some links are more
meaningful than others.
Recognisable in the USA is not necessarily the same as recognisable
globally. Ideally, if you want a US specific list you need US specific
data; if you use a global list you could wind up asking Americans about
Johnny Vegas, Abi Titmuss, Jack Straw and Kevin Pietersen. You might also
consider the generation you are targeting. Lady_Bird_Johnson would be better
known among Americans and older people.
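
On the incoming links idea, here is a rough Python sketch of how the counts
might be pulled out of enwiki-latest-pagelinks.sql.gz. It assumes each row in
the dump's INSERT statements starts with the source page id, the target
namespace and the quoted target title; that layout is an assumption, so
check the CREATE TABLE statement at the top of the file before relying on
it. The function name and the demo line are made up for illustration, and
nothing here corrects for the template skew mentioned above.

```python
import re
from collections import Counter

# Assumed row prefix: (source_page_id, target_namespace, 'Target_title' ...
# Verify against the CREATE TABLE statement in the dump you download.
ROW = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")

def count_incoming_links(lines):
    """Count links pointing at each article (namespace 0) title."""
    counts = Counter()
    for line in lines:
        # Only the INSERT statements carry row data.
        if not line.startswith("INSERT INTO"):
            continue
        for _source_id, namespace, title in ROW.findall(line):
            if namespace == "0":
                counts[title] += 1
    return counts

# Tiny invented line in the assumed format, just to show the shape.
demo = [
    "INSERT INTO `pagelinks` VALUES (1,0,'Lady_Gaga'),(2,0,'Lady_Gaga'),"
    "(3,1,'Lady_Gaga'),(4,0,'Lady_Marmalade');\n"
]
counts = count_incoming_links(demo)
```

For the real file, something like gzip.open(path, "rt", encoding="utf-8",
errors="replace") could be passed in as the lines argument so the dump is
streamed rather than loaded into memory.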
I'd suggest using metrics of page views per article, and if you want a
specifically US product screen out articles that don't use American English
spelling. Better still would be to get page views from the USA, or at least
page views ignoring the 6 hours when the US is most likely to be asleep.
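
As a minimal sketch of that last idea, the Python below totals views per
title while skipping a block of six UTC hours. The input format (title,
hour in UTC, views), the function name, the choice of 06:00 to 11:59 UTC as
the "asleep" window and the sample numbers are all assumptions made for
illustration, not real page view data.

```python
from collections import Counter

# Assumption: six UTC hours when most of the US is likely to be asleep.
US_ASLEEP_HOURS_UTC = frozenset(range(6, 12))

def rank_by_views(records, skip_hours=US_ASLEEP_HOURS_UTC):
    """Rank titles by total views, ignoring views in skip_hours.

    records: iterable of (title, hour_utc, views) tuples.
    Returns a list of (title, total_views), highest total first.
    """
    totals = Counter()
    for title, hour_utc, views in records:
        if hour_utc in skip_hours:
            continue  # traffic in these hours is mostly non-US
        totals[title] += views
    return totals.most_common()

# Invented hourly counts, only to show the effect of the filter.
sample = [
    ("Lady_Gaga", 15, 900),
    ("Lady_Gaga", 8, 400),
    ("Kevin_Pietersen", 8, 700),
    ("Kevin_Pietersen", 15, 50),
]
ranking = rank_by_views(sample)
```

With the filter on, a title whose traffic falls mostly in the skipped hours
drops down the ranking, which is the effect being aimed for.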
WereSpielChequers
On 30 September 2011 04:17, Michael Katz <michaeladamkatz at yahoo.com> wrote:
> I'm making a crossword-style word game, and I'm trying to automate the
> process of creating the puzzles, at least somewhat.
>
> I am hoping to find or create a list of English Wikipedia page titles,
> sorted roughly by how "recognizable" they are, where by recognizable I mean
> something like, "how likely it is that the average American on the street
> will be familiar with the name/phrase/subject".
>
> For instance, just to take a random example, on a recognizability scale
> from 0 to 100, I might score (just guessing here):
>
> Lady_Gaga = 90
> Lady_Jane_Grey = 10
> Lady_and_the_Tramp = 90
> Lady_Antebellum = 5
> Lady-in-waiting = 70
> Lady_Bird_Johnson = 65
> Lady_Marmalade = 10
> Ladysmith_Black_Mambazo = 10
>
> One suggestion would just be to use the page length (either number of
> characters or physical rendered page length) as a proxy for recognizability.
> That might work, but it feels kind of crude, and certainly would get many
> false positives, such as Bose-Einstein_condensation.
>
> Someone suggested to me that I might count incoming page links, and
> referred me to http://dumps.wikimedia.org/enwiki/latest/ and in particular
> the file enwiki-latest-pagelinks.sql.gz. I downloaded and looked at that
> file but couldn't understand whether/how the linking structure was
> represented.
>
> So my questions are:
>
> (1) Do you know if a list like the one I'm trying to make already exists?
>
> (2) If you were going to make a list like this how would you do it? If it
> was based on page length, which files would you download and process to make
> it as efficient as possible? If it was based on incoming links, which files
> specifically would you use, and how would you determine the link count?
>
> Thanks for any help.