Dear all,
I would like your suggestions on how to construct some sensible
baselines, most likely based on data sets *external* to Wikipedia
projects, for the *expected* development of Wikipedia language versions.
Such baselines should ideally indicate, given the availability of
language users and content (numbers drawn from external data sets), how
many articles and active users a given language version would be expected
to have.
As previous research has suggested that Wikipedia activities need
mutually-reinforcing cycles of participation, content, and readership, it
is expected that the development of a Wikipedia language version is
conditioned by the availability of (digitally) literate users and (possibly
digitized) content/sources.
So the assumption is:
Wikipedia Activities = Some function of (available users and content)
For example, major non-English written languages such as Arabic,
Chinese, and Spanish may have different numbers of Internet users and
different amounts of digital content. These numbers indicate the basis on
which a Wikipedia language version can develop.
One practical use of this baseline measurement is to better
categorize/curate activities across Wikipedia language versions. With
expected values of Wikipedia development in hand, we can categorize
language versions according to the *external conditions* of
available/potential users and content.
Another use of this baseline measurement is to better compare the
development of different language versions. It should help answer questions
such as whether the Korean language version is *underdeveloped* on
Wikipedia when compared with a language version that has a similar number
of available/potential users and a similar amount of content.
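
To make the comparison idea concrete, here is a minimal sketch, not a
settled method. It assumes a log-linear form for "some function of" with a
single external indicator (Internet users); the form, the function names,
and all numbers are placeholders made up for illustration, not real
measurements. A language version whose residual is strongly negative would
be a candidate for "underdeveloped" relative to this baseline.

```python
import math


def fit_log_linear(xs, ys):
    """Ordinary least squares for log(y) = a + b * log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def log_residuals(names, xs, ys):
    """Observed minus expected log(y) for each language version."""
    a, b = fit_log_linear(xs, ys)
    return {
        name: math.log(y) - (a + b * math.log(x))
        for name, x, y in zip(names, xs, ys)
    }


# Placeholder values only (hypothetical languages, invented numbers).
names = ["lang_a", "lang_b", "lang_c", "lang_d"]
internet_users = [1e6, 1e7, 1e8, 1e9]   # external indicator
articles = [1e4, 1e5, 1e5, 1e7]         # Wikipedia-side measure

residuals = log_residuals(names, internet_users, articles)
for name in sorted(residuals, key=residuals.get):
    print(f"{name}: {residuals[name]:+.2f}")
```

More indicators (books, scholarly publications, and so on) would turn this
into a multiple regression; the residual reading stays the same.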
The closest external baseline currently available is probably the
number of speakers of each language. My hunch is that it does not
adequately take into account the available/potential users and content,
especially those that are digitally ready.
So I invite you to add to the following list any external indicators
(and possibly data sources) that may help to construct such a baseline.
==Indicators==
* Internet users for each language (probably an approximate measurement
based on CLDR Territory-Language information and ITU Internet penetration
rates).
* Number of books published annually in different languages (suggested data
sources? Is there an ISBN database or statistical report broken down by
language of publication?)
* Number of web pages returned by major search engines for the query
"Wikipedia" in different languages, excluding results from Wikimedia
projects.
* Number of scholarly publications across languages (suggested data
sources?)
* Number of major newspaper publications across languages (suggested data
sources?)
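
As a rough illustration of the first indicator, the sketch below sums, over
territories, population x Internet penetration x share of the population
using each language. It assumes that penetration is uniform across language
groups within a territory, which is a strong simplification. The territory
codes, language codes, field names, and numbers are hypothetical
placeholders, not actual CLDR or ITU data.

```python
from collections import defaultdict


def estimate_internet_users(territories):
    """Estimate Internet users per language from per-territory inputs.

    Assumption: Internet penetration is the same for every language
    group inside a territory.
    """
    totals = defaultdict(float)
    for info in territories.values():
        online = info["population"] * info["internet_penetration"]
        for lang, share in info["language_shares"].items():
            totals[lang] += online * share
    return dict(totals)


# Placeholder values only (hypothetical territories and languages).
territories = {
    "T1": {
        "population": 1_000_000,
        "internet_penetration": 0.5,
        "language_shares": {"xx": 0.8, "yy": 0.2},
    },
    "T2": {
        "population": 2_000_000,
        "internet_penetration": 0.25,
        "language_shares": {"xx": 0.1, "yy": 0.9},
    },
}

estimates = estimate_internet_users(territories)
for lang, users in sorted(estimates.items()):
    print(f"{lang}: {users:,.0f}")
```

Because people can use more than one language, the shares within a
territory may sum to more than 1, so the per-language totals should not be
added up to a world total.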
Please share your thoughts!
--
han-teng liao
"[O]nce the Imperial Institute of France and the Royal Society of London
begin to work together on a new encyclopaedia, it will take less than a
year to achieve a lasting peace between France and England." - Henri
Saint-Simon (1810)
"A common ideology based on this Permanent World Encyclopaedia is a
possible means, to some it seems the only means, of dissolving human
conflict into unity." - H.G. Wells (1937)