Indeed, GerardM, I agree with you that a few passionate women or men can kick-start a Wikimedia project, and that different Wikimedia projects face different barriers and paths of development.

I also agree with you that the direction I am pursuing may not help languages that are still at the incubation stage. To be honest, I am not trying to measure the likelihood of success.

What I am trying to measure is closer to the external *difficulty* that must be overcome for success. I have to admit that I approach this question wearing a researcher's hat more than a Wikipedian's hat.

Having said that, I personally believe this approach can be very productive in generating outcomes for major world languages such as Mandarin, Spanish, Hindi, Arabic, Bengali, Russian, Japanese and Punjabi (all these languages have more native speakers than German, BTW). This way, researchers can make them more comparable because of the available external baselines. 

I can envision the outcomes helping these communities identify their strengths and weaknesses. Strategies can then be devised to expand their reach to the available external content or users.

This should also help sociolinguists identify which languages (especially non-national languages such as Kurdish or Cantonese) are more developed than others in the Wikipedia sphere, and seek explanations for their relative success or failure by contrasting the Wikipedia sphere with the wider offline/online sphere. These languages include many of the mid-size Wikipedia language versions, such as Catalan, Cantonese, Tamil, etc.

Thus, I would argue that the analytical direction I want to take would be useful for the many language versions that already have some user base and content. Again, I want them to be aware of both the internal and external state of each language version, thereby contextualizing the differences among them. Baseline statistics drawn from external sources should make the versions more comparable, rather than leaving the comparison as a numbers game among different language groups of Wikipedians.

Also, I have to agree with GerardM that the issue is both practical and political. I would add that it is also political in terms of how funds are distributed within the global Wikimedia/open knowledge movement. I personally believe that only with external numbers about the potential users and content available outside Wikipedia can we realize how much of that external pool has been utilized/recruited into the Wikimedia/Wikipedia projects. This should provide a sensible basis for comparison on which Wikipedians can reflect.

Finally, may I point out that the external environments for languages are also changing, and that knowing this could be useful for the global Wikimedia/open knowledge movement. Based on my research on the competition between Baidu Baike and Chinese Wikipedia in mainland China, I found that the windfall of fast-growing Internet users from late 2005 to 2008 was crucial for any website to thrive in mainland China, a windfall that Chinese Wikipedia missed because of the block by Beijing. From this, I argue that it makes strategic sense to catch the wave of rising Internet users, especially during the period when the penetration rate for a given population rises quickly from 12.8% to 40%. External time-series data points can help identify where language users on the Web are rising (probably Indian languages, now that Chinese languages have reached 40-50%).
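
As a minimal sketch of how such a window could be located in an external time series, the Python snippet below returns the periods in which a population's Internet penetration rate lies between 12.8% and 40%. The function name `growth_window`, the input layout, and the example values are assumptions made for illustration; the numbers are placeholders, not real statistics.

```python
from typing import Dict, List


def growth_window(penetration: Dict[int, float],
                  low: float = 12.8,
                  high: float = 40.0) -> List[int]:
    """Return the periods whose penetration rate (in percent) lies in [low, high].

    `penetration` maps a period index (e.g. a year) to a penetration
    rate in percent. The thresholds default to the 12.8%-40% range
    mentioned in the text above.
    """
    return [period for period in sorted(penetration)
            if low <= penetration[period] <= high]


# Placeholder series keyed by an abstract period index (NOT real data).
example = {0: 5.0, 1: 9.0, 2: 14.0, 3: 22.0, 4: 31.0, 5: 39.0, 6: 46.0}
print(growth_window(example))
```

With real data, one series per language or territory would be passed in, and the returned periods would mark when outreach is most likely to catch the wave.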

Best,
han-teng liao



2014-07-08 12:03 GMT+01:00 Gerard Meijssen <gerard.meijssen@gmail.com>:
Hoi,
At the WMF language committee, the question of whether a language is viable for a Wikimedia project is a practical one. It is also very much a political one. One vitally important difference from your approach is that we distinguish between a first project and a subsequent project. In the latest iteration of the approach we do not consider Wikidata a first project. The relevance is that we do not require localisation of MediaWiki or an Incubator stage for it.

When the question is what it takes for a new project to work, the simple answer is "a few good men". A few projects that are alive and well rely on no more than 3 people.

By not focussing on Wikipedia, it is possible for a Wikisource to become the first project, when this is what those "few good men" want. It is their party.

You may imagine that we thought about the likely success factors for a new project. We did come up with ideas similar to yours. The problem is that it does not help: you can determine the likelihood of success, but that does not guarantee it.

What we certainly do not consider is the number of data sources. Sourcing is very much a luxury in starting projects. Insisting on sourcing at all will kill most initiatives immediately. What is important is that people start writing and reading in their language. With a Wikipedia that gets active participation and readership, there will be a move to a more consistent orthography. In the end, those who write decide.

Wikidata was given its exception because it represents the lowest level of participation with the most effect. Add one label to an item that is used a lot (e.g. human, male, female) and it can be used thousands of times. It is also very obvious to re-use dictionary information to make an impact.
Thanks,
      GerardM



On 8 July 2014 09:27, Han-Teng Liao (OII) <han-teng.liao@oii.ox.ac.uk> wrote:
Dear all,  

     Your suggestions are needed on ways to construct sensible baselines, most likely based on data sets *external* to Wikipedia projects, for the *expected* development of Wikipedia language versions.

      Such baselines should ideally indicate, given the availability of language users and content (numbers based on external data sets), how many articles/active users a given language version should be expected to have.

      As previous research has suggested that Wikipedia activities need mutually-reinforcing cycles of participation, content, and readership, it is expected that the development of a Wikipedia language version is conditioned by the availability of (digitally) literate users and (possibly digitized) content/sources. 

     So the assumption is:

Wikipedia Activities = Some function of (available users and content)

      For example, the major non-English writing languages in the world such as Arabic, Chinese, Spanish, etc., may have different numbers of Internet users and digital content. These numbers indicate the basis on which a Wikipedia language version can develop.

      One practical use of this baseline measurement is to better categorize/curate activities across Wikipedia language versions. We can then better come up with expected values of Wikipedia development, and thus categorize language versions accordingly based on the *external conditions* of available/potential users and content. 

      Another use of this baseline measurement is to better compare the development of different language versions. It should help answer questions such as whether the Korean language version is *underdeveloped* on Wikipedia platforms when compared with a language version that enjoys a similar number of available/potential users and content.
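
One possible way to make this comparison concrete is sketched below in Python. It assumes, purely for illustration, that the unspecified "function" is a log-linear (power-law) fit of article counts against a single external indicator such as Internet users per language; the original proposal does not commit to any functional form. The names `fit_log_linear` and `development_ratio` and all input values are hypothetical placeholders, not real statistics.

```python
import math
from typing import Dict, Tuple


def fit_log_linear(users: Dict[str, float],
                   articles: Dict[str, float]) -> Tuple[float, float]:
    """Least-squares fit of log(articles) = intercept + slope * log(users)."""
    langs = sorted(users)
    xs = [math.log(users[lang]) for lang in langs]
    ys = [math.log(articles[lang]) for lang in langs]
    n = len(langs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def development_ratio(users: Dict[str, float],
                      articles: Dict[str, float]) -> Dict[str, float]:
    """Observed article count divided by the count expected from the fit."""
    intercept, slope = fit_log_linear(users, articles)
    return {
        lang: articles[lang] / math.exp(intercept + slope * math.log(users[lang]))
        for lang in users
    }


# Placeholder inputs for three made-up language codes (NOT real data).
users = {"aa": 1e6, "bb": 1e7, "cc": 1e8}
articles = {"aa": 2e4, "bb": 1e5, "cc": 3e6}
for lang, ratio in sorted(development_ratio(users, articles).items()):
    print(lang, round(ratio, 2))
```

Under this assumption, a ratio below 1 would flag a language version as less developed than its external conditions suggest, and a ratio above 1 as more developed. A real analysis would need more indicators and far more language versions than this toy input.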

     The closest existing external baseline is probably the number of speakers of a language. My hunch is that it does not adequately take into account the available/potential users and content, especially the digitally-ready ones.

      So I welcome you to add to the following list any external indicators (and possibly data sources) that may help to construct such a baseline.
 
==Indicators==
* Internet users for each language (probably an approximate measurement based on CLDR Territory-Language information and ITU Internet penetration rates).

* Number of books published annually in different languages (suggested data sources? Does ISBN have a database or stat report on published languages?)

* Number of web pages returned by major search engines on the queries of "Wikipedia" in different languages, excluding results from Wikimedia projects.

* Number of scholarly publications across languages (suggested data sources?) 

* Number of major newspaper publications across languages (suggested data sources?) 
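
The first indicator above (Internet users for each language) could be approximated along the following lines. This Python sketch assumes that each territory's Internet penetration rate applies uniformly to all language groups within it, which is a strong simplification. The function name, the input layout, and the example values are hypothetical; real inputs would have to be extracted from the CLDR Territory-Language data and from ITU statistics.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def internet_users_by_language(
    territory_language_share: Iterable[Tuple[str, str, float]],
    population: Dict[str, float],
    penetration: Dict[str, float],
) -> Dict[str, float]:
    """Estimate Internet users per language.

    territory_language_share: (territory, language, share) triples, where
        share is the fraction (0..1) of the territory's population using
        the language.
    population: territory -> total population.
    penetration: territory -> Internet penetration rate as a fraction (0..1).

    Assumption: penetration is the same for every language group
    within a territory.
    """
    totals: Dict[str, float] = defaultdict(float)
    for territory, language, share in territory_language_share:
        totals[language] += (
            population[territory] * share * penetration[territory]
        )
    return dict(totals)


# Placeholder territories and languages (NOT real data).
shares = [("T1", "aa", 0.8), ("T1", "bb", 0.2),
          ("T2", "aa", 0.1), ("T2", "bb", 0.9)]
population = {"T1": 1000.0, "T2": 500.0}
penetration = {"T1": 0.5, "T2": 0.2}
print(internet_users_by_language(shares, population, penetration))
```

Because multilingual speakers are counted once per language, the per-language totals can sum to more than the total number of Internet users; that seems acceptable for a baseline of potential users per language version.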

 
    Please share your thoughts! 

-- 
han-teng liao

"[O]nce the Imperial Institute of France and the Royal Society of London begin to work together on a new encyclopaedia, it will take less than a year to achieve a lasting peace between France and England." - Henri Saint-Simon (1810)

"A common ideology based on this Permanent World Encyclopaedia is a possible means, to some it seems the only means, of dissolving human conflict into unity." - H.G. Wells (1937)

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


