Thanks for the reply. Can you tell me exactly which dump files you'd look in to find
the number of page views, plus any information about finding the page views within those
files, if it's not obvious? Is there a way to distinguish between editor page views
and user page views? (Perhaps subtract the number of edits made? If so, how can I find the
number of edits made?)
Something about page views seems a little funny, because it seems like there are some very
recognizable things that just aren't looked up much. But perhaps it's my best
hope...
________________________________
From: WereSpielChequers <werespielchequers(a)gmail.com>
To: Michael Katz <michaeladamkatz(a)yahoo.com>; English Wikipedia
<wikien-l(a)lists.wikimedia.org>
Sent: Friday, September 30, 2011 2:55 AM
Subject: Re: [WikiEN-l] finding the "most recognizable" page names
Hi Michael,
I don't know if such a list exists, other than lists by largest numbers of views.
Size of article probably relates to the interest of one or a few editors and the complexity of
the information; I doubt it would closely relate to recognisability. Incoming links are
probably better, but the counts can get awfully skewed by templates, and some links are more
meaningful than others.
Recognisable in the USA is not necessarily the same as recognisable globally. Ideally, if
you want a US-specific list you need US-specific data; if you use a global list you could
wind up asking Americans about Johnny Vegas, Abi Titmuss, Jack Straw and Kevin Pietersen.
You might also consider the generation you are targeting. Lady_Bird_Johnson would be
better known among Americans and older people.
I'd suggest using metrics of page views per article and, if you want a specifically US
product, screening out articles that don't use American English spelling. Better still
would be to get page views from the USA, or at least page views ignoring the 6 hours when
the US is most likely to be asleep.
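
As a rough illustration of the page-view idea, here is a minimal sketch in Python. It assumes (this is not stated anywhere in this thread) that the view data comes as hourly text files whose lines look like "project page_title view_count bytes", separated by spaces, with "en" as the project code for English Wikipedia. The function name count_views and the set of "US asleep" hours are hypothetical choices for the example, not anything official.

```python
from collections import Counter

# Hypothetical choice: six UTC hours treated as "the US is asleep".
SKIP_HOURS_UTC = frozenset({7, 8, 9, 10, 11, 12})


def count_views(hourly_files, project="en", skip_hours=SKIP_HOURS_UTC):
    """Sum view counts per page title.

    hourly_files is an iterable of (utc_hour, lines) pairs, where lines
    yields strings in the ASSUMED format "project title count bytes".
    """
    totals = Counter()
    for utc_hour, lines in hourly_files:
        if utc_hour in skip_hours:
            continue  # ignore hours when the US is most likely asleep
        for line in lines:
            parts = line.split()
            # Skip malformed lines and other language editions.
            if len(parts) != 4 or parts[0] != project:
                continue
            title, views = parts[1], parts[2]
            if views.isdigit():
                totals[title] += int(views)
    return totals
```

Calling count_views(...).most_common(1000) would then give the 1000 most-viewed titles over whatever hours were fed in. Views from outside the US during the remaining hours are still counted, so this only approximates a US-specific list.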
WereSpielChequers
On 30 September 2011 04:17, Michael Katz <michaeladamkatz(a)yahoo.com> wrote:
I'm making a crossword-style word game, and I'm trying to automate the process of
creating the puzzles, at least somewhat.
I am hoping to find or create a list of English Wikipedia page titles, sorted roughly by
how "recognizable" they are, where by recognizable I mean something like,
"how likely it is that the average American on the street will be familiar with the
name/phrase/subject".
For instance, just to take a random example, on a recognizability scale from 0 to 100, I
might score (just guessing here):
Lady_Gaga = 90
Lady_Jane_Grey = 10
Lady_and_the_Tramp = 90
Lady_Antebellum = 5
Lady-in-waiting = 70
Lady_Bird_Johnson = 65
Lady_Marmalade = 10
Ladysmith_Black_Mambazo = 10
One suggestion would just be to use the page length (either number of characters or
physical rendered page length) as a proxy for recognizability. That might work, but it
feels kind of crude, and certainly would get many false positives, such as
Bose-Einstein_condensation.
Someone suggested to me that I might count incoming page links, and referred me to
http://dumps.wikimedia.org/enwiki/latest/ and in particular the file
enwiki-latest-pagelinks.sql.gz. I downloaded and looked at that file but couldn't
understand whether/how the linking structure was represented.
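
For what it's worth, here is a minimal Python sketch of how incoming links could be counted from that file. It rests on an assumption about the layout that should be checked against the actual dump: that the file is a MySQL dump whose INSERT INTO lines carry rows of the form (pl_from,pl_namespace,'pl_title'), meaning "the page with id pl_from links to the title pl_title in namespace pl_namespace". The helper names count_incoming_links and count_from_dump are made up for the example.

```python
import gzip
import re
from collections import Counter

# ASSUMED row layout: (pl_from,pl_namespace,'pl_title'); other dump
# versions may use different columns, so verify against the real file.
ROW = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")


def count_incoming_links(lines, namespace=0):
    """Count how many rows point at each target title in one namespace."""
    counts = Counter()
    for line in lines:
        if not line.startswith("INSERT INTO"):
            continue  # skip schema definitions and comments
        for _source_id, ns, title in ROW.findall(line):
            if int(ns) == namespace:
                # Titles keep their SQL escaping (e.g. \' for a quote).
                counts[title] += 1
    return counts


def count_from_dump(path):
    """Stream a gzipped SQL dump without unpacking it to disk first."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return count_incoming_links(f)
```

Namespace 0 is used here on the assumption that it holds the ordinary articles. Links that come from templates are counted like any other link, so the totals will be inflated for heavily templated pages.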
So my questions are:
(1) Do you know if a list like the one I'm trying to make already exists?
(2) If you were going to make a list like this, how would you do it? If it was based on
page length, which files would you download and process to make it as efficient as
possible? If it was based on incoming links, which files specifically would you use, and
how would you determine the link count?
Thanks for any help.