Dear all,
I would like your suggestions on how to construct sensible baselines of *expected* development for Wikipedia language versions, most likely based on data sets *external* to Wikipedia projects.
Ideally, such baselines would indicate how many articles and active users a given language version should be expected to have, given the availability of language users and content (as measured by external data sets).
As previous research has suggested that Wikipedia activities need mutually-reinforcing cycles of participation, content, and readership, it is expected that the development of a Wikipedia language version is conditioned by the availability of (digitally) literate users and (possibly digitized) content/sources.
So the assumption is:
Wikipedia Activities = Some function of (available users and content)
For example, major non-English written languages such as Arabic, Chinese, and Spanish may have different numbers of Internet users and different amounts of digital content. These numbers indicate the basis on which a Wikipedia language version can develop.
One practical use of such a baseline is to better categorize/curate activities across Wikipedia language versions. We could derive expected values of Wikipedia development and then categorize language versions according to the *external conditions* of available/potential users and content.
Another use is to better compare the development of different language versions. It should help answer questions such as whether the Korean language version is *underdeveloped* on Wikipedia when compared with a language version that enjoys a similar number of available/potential users and a similar amount of content.
The closest existing external baseline is probably the number of language speakers. My hunch is that it does not adequately account for the available/potential users and content, especially those that are digitally ready.
So I welcome you to add to the following list any external indicators (and possibly data sources) that may help to construct such a baseline.
==Indicators==
* Internet users for each language (probably an approximate measurement based on CLDR Territory-Language information and ITU Internet penetration rates).
* Number of books published annually in different languages (suggested data sources? Does ISBN have a database or stat report on published languages?)
* Number of web pages returned by major search engines for the query "Wikipedia" in different languages, excluding results from Wikimedia projects.
* Number of scholarly publications across languages (suggested data sources?)
* Number of major newspaper publications across languages (suggested data sources?)
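As a rough illustration of the first indicator, a minimal sketch might combine per-territory population, the share of each territory's population using a language, and Internet penetration. The territory codes, language codes, and all numbers below are hypothetical placeholders rather than CLDR or ITU figures, and the function name is my own. It also assumes that penetration is uniform across language groups within a territory, which is unlikely to hold in practice.

```python
from collections import defaultdict

# Hypothetical placeholder figures, NOT real statistics.
# territory -> (population, Internet penetration as a fraction)
TERRITORIES = {
    "AA": (10_000_000, 0.50),
    "BB": (4_000_000, 0.25),
}

# territory -> {language code: fraction of the population using that language}
LANGUAGE_SHARES = {
    "AA": {"xx": 0.9, "yy": 0.2},
    "BB": {"xx": 0.5},
}


def internet_users_by_language(territories, language_shares):
    """Estimate Internet users per language by summing over territories.

    Assumes Internet penetration is the same for every language group
    within a territory, which is a strong simplification.
    """
    totals = defaultdict(float)
    for territory, (population, penetration) in territories.items():
        for language, share in language_shares.get(territory, {}).items():
            totals[language] += population * share * penetration
    return dict(totals)


estimates = internet_users_by_language(TERRITORIES, LANGUAGE_SHARES)
for language, users in sorted(estimates.items()):
    print(f"{language}: {users:,.0f} estimated Internet users")
```

Shares within a territory may sum to more than 1 because of multilingual speakers, so the per-language totals should not be added up into a world total.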
Please share your thoughts!
Well, as I see it, the state of any language version is a combination of the state of its content and community. Going back to the zero-state, in order to have permission to start a language version, there must be a "list of 10,000 important topics" that has to be registered somewhere (sorry, no idea where). This list for the English Wikipedia includes an entry for the singer Michael Jackson, one of the many articles that get lots and lots of page hits daily. Perhaps this is the case for all other languages in the world (I have no idea), but I would assume one measurement going forward from the zero-state would be the number of changes over time involving this list in the specific language, such as:
1) The list itself (do these topics ever change?)
2) The average number of edits and page views of those pages in the specific language
3) The average number of blue links per page on those pages in the specific language
4) The average number of editors *ever* contributing per page on those pages in the specific language
5) The average number of active editors contributing per page on those pages in the specific language ...
Other important measurements could be the number of active editors overall, the number of edits appearing in the recent changes list per day/month/year, the number of pages created or deleted per day/month/year...
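To make the per-page averages above concrete, here is a minimal sketch, assuming the per-page counts for the core topic list have already been collected somehow. The `PageStats` record, its field names, and the sample counts are hypothetical placeholders, not an existing Wikimedia data format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PageStats:
    """Counts for one page of the core topic list in one language version."""
    title: str
    edits: int
    page_views: int
    blue_links: int
    editors_ever: int
    active_editors: int


FIELDS = ("edits", "page_views", "blue_links", "editors_ever", "active_editors")


def summarize(pages):
    """Average each per-page measure over the core topic list."""
    return {name: mean(getattr(page, name) for page in pages) for name in FIELDS}


# Hypothetical placeholder counts, not real page statistics.
pages = [
    PageStats("Topic A", 120, 3000, 40, 35, 4),
    PageStats("Topic B", 80, 1000, 20, 15, 2),
]

summary = summarize(pages)
for name in FIELDS:
    print(f"average {name}: {summary[name]}")
```

Taking such a summary at regular intervals would give the change over time for each measure.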
On Tue, Jul 8, 2014 at 9:27 AM, Han-Teng Liao (OII) < han-teng.liao@oii.ox.ac.uk> wrote:
-- han-teng liao
"[O]nce the Imperial Institute of France and the Royal Society of London begin to work together on a new encyclopaedia, it will take less than a year to achieve a lasting peace between France and England." - Henri Saint-Simon (1810)
"A common ideology based on this Permanent World Encyclopaedia is a possible means, to some it seems the only means, of dissolving human conflict into unity." - H.G. Wells (1937)
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Thanks Jane for the comments and suggestions.
Correct me if I misread your comments/suggestions, Jane.
(1) Did you suggest measurements that are observable *inside* Wikipedia/Wikimedia websites? (2) If so, does it mean that your suggestion of measuring the current state of a language version as "a combination of the state of its content and community" describes only the *internal* state of that version? (3) When you said "zero-state", did you mean the state where the number of articles in a given language version is zero?
Your suggestions appear to me to deal with measuring the current state of a language version. The use of "zero-state" suggests equal grounds for any language version to develop on the Wikipedia platform.
However, my call for help focuses on the current state of things external to the Wikipedia platform. In this context, the term *baseline* suggests that some languages are already *more equal* than others because of the availability of language users and content out there. Since Wikipedia depends on reliable published secondary sources, some languages are *expected* to be more developed than others. What I want to do is to come up with such *expectation values* so that researchers and community members can see which language versions perform better/worse than expected, in comparison to other languages.
While I can agree that on the Wikipedia platform any language may stand on equal ground when it starts from zero, it is my contention that some languages are already *more equal* than others.
In other words, I want to construct sensible baselines *against which* the development of language versions can be better understood. Such baselines should thus capture external factors that are likely to condition the development. Normalizing development metrics with such baselines can then control for these external factors, to see which language versions underperform even when the external availability of content and users is not an issue. It can also help to see which language versions outperform even when the external conditions are not that great.
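One possible way to express this normalization, offered only as a sketch: fit a log-log line between a single external indicator and an observed Wikipedia metric across languages, then report each language's observed/expected ratio. The choice of a single predictor, the log-log form, the function names, and all numbers below are my assumptions for illustration. A real baseline would presumably combine several indicators for both users and content.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def performance_ratios(baseline, observed):
    """Return observed / expected for each language version.

    baseline: language -> external indicator (e.g. estimated Internet users)
    observed: language -> Wikipedia metric (e.g. article count)
    Expected values come from a log-log linear fit across all languages.
    """
    languages = sorted(baseline)
    xs = [math.log(baseline[lang]) for lang in languages]
    ys = [math.log(observed[lang]) for lang in languages]
    intercept, slope = fit_line(xs, ys)
    return {
        lang: observed[lang] / math.exp(intercept + slope * math.log(baseline[lang]))
        for lang in languages
    }


# Hypothetical placeholder values, not real statistics.
baseline = {"aa": 1_000_000, "bb": 10_000_000, "cc": 100_000_000}
observed = {"aa": 10_000, "bb": 50_000, "cc": 1_000_000}

ratios = performance_ratios(baseline, observed)
for lang, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    label = "above" if ratio > 1 else "below"
    print(f"{lang}: {ratio:.2f}x expected ({label} baseline)")
```

A ratio below 1 would flag a language version that underperforms relative to its external conditions, and a ratio above 1 one that outperforms them. Since the expected values are fitted on the same set of languages, the ratios are only meaningful relative to one another.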
Hence, I really appreciate your suggestions as potential indicators of the (internal) development state of a language version of Wikipedia, but they do not appear to capture factors that are external to Wikipedia.
Best,
(in reply to Jane Darnell jane023@gmail.com, 2014-07-08 10:09 GMT+01:00)
I more or less tried to have a go at this on http://wikinewsreporter.wordpress.com/2014/06/30/determining-the-relative-qu... using both internal and external criteria for determining quality. (External being defined as what is considered a good type of work on the topic using outside, non-Wikipedia-specific definitions of quality.)
Sincerely, Laura Hale
(in reply to Han-Teng Liao (OII) <han-teng.liao@oii.ox.ac.uk>, Tue, Jul 8, 2014 at 12:06 PM)
(on Laura Hale's pilot study of measuring quality across several languages used in Spain)
Laura, I enjoyed reading the report in your blog post, which also takes a quantified approach to measuring quality.
If I did not misread your blog post, you incorporated a measurement of the general quality of external links (which I assume could be sources or suggested reading lists) as a component of Wikipedia article quality. I also used a similar research strategy for my thesis chapter.
I also like the initial pick of “Female MEPs for Spain” (women representing Spain who are or have been Members of the European Parliament) for study. It would be interesting to see how the same methodology applies to male MEPs in another gender-minded study.
Can you tell me *whether* and *how* you measure or consider the availability of users/content for these languages in Spain? To me, Spanish is a world language, whereas other languages (esp. Catalan) may have some kind of development in terms of their publishing market. It would be even more interesting to know what gender-minded external publications are available in each language, and to use this information to contextualize the sources used in these Wikipedia articles. It is not uncommon for a Wikipedia article to cite a source in a more dominant or more widely *published* language. I would imagine some Catalan Wikipedia articles also cite Spanish sources, for instance.
This leads to the issue of language/knowledge dependency. I have addressed it only superficially so far, by visualizing interlanguage links, but it is on my mind, though it is a separate issue.
Best, han-teng liao
2014-07-08 11:13 GMT+01:00 Laura Hale laura@fanhistory.com:
I more or less tried to have a go at this on http://wikinewsreporter.wordpress.com/2014/06/30/determining-the-relative-qu... using both internal and external criteria for determining quality. (External being defined as what is considered good type of work on the topic using outside, non-Wikipedia specific definitions of quality.)
Sincerely, Laura Hale
On Tue, Jul 8, 2014 at 12:06 PM, Han-Teng Liao (OII) < han-teng.liao@oii.ox.ac.uk> wrote:
Thanks Jane for the comments and suggestions.
Correct me if I misread your comments/suggestions, Jane.
(1) Did you suggest measurements that are observable *inside* Wikipedia/Wikimedia websites? (2) If so, does it mean that your suggestion of measuring the current state of a language version as "a combination of the state of its content and community" describes only the *internal* state of that version? (3) When you said "zero-state", did you mean the state where the number of articles in a given language version is zero?
Your suggestions appear to me deal with a measurement of the current state of a language version. The use of "zero-state" suggests the equal grounds for any language version to develop on the Wikipedia platform.
However, my call for help focuses on the current external state out there external to Wikipedia platform. In this context, the term *baseline* suggests some languages are already *more equal* than the others because of the availability of language users and content out there. Since Wikipedia depends on reliable published secondary sources, some languages are *expected* to be more developed than the others. What I want to do is to come up such *expectation values* so that researchers and community members can see which language versions perform better/worse than expected, in comparison to other languages.
While I can agree that on the Wikipedia platform, any language may have equal groundings when they start from zero. It is my contestation that some languages are already *more equal* than the other.
In other words, I want to construct sensible baselines *against which* the development of language versions can be better understood. Such baselines thus should capture external factors that are likely to condition the development. Normalization of development metrics using such baselines can then control these external factors to see which language versions underperform even when the external availability content and users is not an issue. It can also help to see which language versions outperform even when the external conditions are not that great.
Hence, I really appreciate your suggestions as potential indicators of the (internal) development state of a language version of Wikipedia, but they do not appear to capture factors that are external to Wikipedia.
Best,
2014-07-08 10:09 GMT+01:00 Jane Darnell jane023@gmail.com:
Well as I see it, the state of any language version is a combination of the state of its content and community. Going back to the zero-state, in order to have permission to start a language version, there must be a "list of 10,000 important topics" that has to be registered somewhere (sorry, no idea where). This list for the English wikipedia includes an entry for the singer Michael Jackson, one of the many articles that gets lots and lots of page hits daily. Perhaps this is the case for all other languages in the world (I have no idea), but I would assume one measurement going forward from the zero-state would be the number of changes over time involving this list in the specific language, such as
- The list itself (do these topics ever change?)
- The average number of edits and page views of those pages in the
specific language 3) The average number of blue links per page on those pages in the specific language 4) The average number of editors *ever* contributing per page on those pages in the specific language 5) The average number of active editors contributing per page on those pages in the specific language ...
Other important measurements could be the number of active editors over all, the number of edits appearing in the recent changes list per day/month/year, the number of pages created or deleted per day/month/year...
On Tue, Jul 8, 2014 at 9:27 AM, Han-Teng Liao (OII) < han-teng.liao@oii.ox.ac.uk> wrote:
Dear all,
Your suggestions are needed on the ways in which one can construct
some sensible baselines, most likely based on data sets *external* to Wikipedia projects, of *expected* Wikipedia language versions development.
Such baselines should ideally indicate, given the availability of
language users and content (some numbers based on external data sets), a certain language version should have expected number of articles/active users.
As previous research has suggested that Wikipedia activities need
mutually-reinforcing cycles of participation, content, and readership, it is expected that the development of a Wikipedia language version is conditioned by the availability of (digitally) literate users and (possibly digitized) content/sources.
So the assumption is:
Wikipedia Activities = Some function of (available users and content)
For example, the major non-English writing languages in the world
such as Arabic, Chinese, Spanish, etc., may have different numbers of Internet users and digital content. These numbers indicate the basis on which a Wikipedia language version can develop.
One practical use of this baseline measurement is to better
categorize/curate activities across Wikipedia language versions. We can then better come up with expected values of Wikipedia development, and thus categorize language versions accordingly based on the *external conditions* of available/potential users and content.
Another use of this baseline measurement is to better compare the
development of different language versions. It should help answer questions such as (1) whether Korean language version is *underdeveloped* on Wikipedia platforms when compared with a language version that enjoys similar number of available/potential users and content.
The current similar external baseline data is probably the number
of language speakers. My hunch is that it is not good enough in taking into accounts the available/potential users and content, especially the digitally-ready one.
So I welcome you to add to the following list, any external
indicators (and possibly data sources) that may help to construct such base line.
==Indicators==
- Internet users for each language (probably approximate measurement
based on CLDR Territory-Language information and ITU internet penetration rates.
- Number of books published annually in different languages (suggested
data sources? Does ISBN have a database or stat report on published languages?)
- Number of web pages returned by major search engines on the queries
of "Wikipedia" in different languages, excluding results from Wikimedia projects.
- Number of scholarly publications across languages (suggested data
sources?)
- Number of major newspaper publications across languages (suggested
data sources?)
Please share your thoughts!
-- han-teng liao
"[O]nce the Imperial Institute of France and the Royal Society of London begin to work together on a new encyclopaedia, it will take less than a year to achieve a lasting peace between France and England." - Henri Saint-Simon (1810)
"A common ideology based on this Permanent World Encyclopaedia is a possible means, to some it seems the only means, of dissolving human conflict into unity." - H.G. Wells (1937)
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
han-teng liao, Sorry, but I had to read your answer a couple of times before I understood what you were getting at. I also missed the previous conversation. For information about the 10,000 things, I would just go to GerardM because he knows all about that stuff. As for page stats on all the projects, you may want to talk to Erik Zachte about his Infodisiac charts.
The reason I was confused is that you can't make a comparison study unless you fix a few variables, and if your approach includes only external sites that are available only in the specific languages, then I don't think you have variables that you can compare across languages. My point about the combination of community and content is that there is no quality content without all of the chaff, and there is no community without quality content.
Therefore, you need to look at both: the social side of editor interactions (or lack thereof) is just as important as the content. Don't forget that all edits result in one way or another from an internet search.
As far as the term "a few good men" goes, I think Gerard was referring to the success of the Dutch Wikipedia, which is pretty good in terms of "number of people in the world who speak and read the language", while 94% of all editors are male. I think Erik Zachte has gathered some numbers on the language-speakers per Wikipedia aspect of this issue. Gender information is only based on survey results, though I think our survey was pretty solid, and though 6% declined to specify their gender, even if you add this to the other 6% then the "few good men" statement still holds.
Jane Lezer van de Prullenbak van de Ingezonden Brieven
On Tue, Jul 8, 2014 at 2:12 PM, h hanteng@gmail.com wrote:
(on Laura Hale's pilot study of measuring quality across several languages used in Spain)
Laura, I enjoyed reading the report on your blog, which also takes a quantified approach to measuring quality.
If I did not misread your blog post, you incorporated the measurement of the general quality of external links (which I assume could be sources or suggested reading lists) as a component of Wikipedia article quality. I also used a similar research strategy for my thesis chapter.
I also like the initial pick of “Female MEPs for Spain” (women representing Spain who are or have been Members of the European Parliament) for study. It would be interesting to see how the same methodology applies to male MEPs in another gender-minded study.
Can you tell me *whether* and *how* you measure or consider the availability of users/content for these languages in Spain? To me, Spanish is a world language, whereas other languages (esp. Catalan) may have some degree of development in terms of their publishing market. It would be even more interesting to know the extent of gender-minded external publications for each language, and to use this information to contextualize the sources used in these Wikipedia articles. It is not uncommon practice for a Wikipedia article to cite a source in a more dominant or more widely *published* language. I would imagine some Catalan Wikipedia articles also cite Spanish sources, for instance.
This leads to the issue of language/knowledge dependency. Although I only addressed this issue superficially by visualizing interlanguage links before, it is on my mind, though it is a separate issue.
Best, han-teng liao
2014-07-08 11:13 GMT+01:00 Laura Hale laura@fanhistory.com:
I more or less tried to have a go at this on http://wikinewsreporter.wordpress.com/2014/06/30/determining-the-relative-qu... using both internal and external criteria for determining quality. (External being defined as what is considered a good type of work on the topic using outside, non-Wikipedia-specific definitions of quality.)
Sincerely, Laura Hale
On Tue, Jul 8, 2014 at 12:06 PM, Han-Teng Liao (OII) < han-teng.liao@oii.ox.ac.uk> wrote:
Thanks Jane for the comments and suggestions.
Correct me if I misread your comments/suggestions, Jane.
(1) Did you suggest measurements that are observable *inside* Wikipedia/Wikimedia websites? (2) If so, does it mean that your suggestion of measuring the current state of a language version as "a combination of the state of its content and community" describes only the *internal* state of that version? (3) When you said "zero-state", did you mean the state where the number of articles in a given language version is zero?
Your suggestions appear to me to deal with measuring the current state of a language version. The use of "zero-state" suggests equal grounds for any language version to develop on the Wikipedia platform.
However, my call for help focuses on the current state of affairs external to the Wikipedia platform. In this context, the term *baseline* suggests that some languages are already *more equal* than others because of the availability of language users and content out there. Since Wikipedia depends on reliable published secondary sources, some languages are *expected* to be more developed than others. What I want to do is to come up with such *expectation values* so that researchers and community members can see which language versions perform better/worse than expected, in comparison to other languages.
While I can agree that on the Wikipedia platform any language may have equal footing when starting from zero, it is my contention that some languages are already *more equal* than others.
In other words, I want to construct sensible baselines *against which* the development of language versions can be better understood. Such baselines should thus capture external factors that are likely to condition the development. Normalizing development metrics against such baselines can then control for these external factors, showing which language versions underperform even when the external availability of content and users is not an issue. It can also help show which language versions outperform even when the external conditions are not that great.
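To make the normalization idea concrete, here is a minimal sketch in Python. The functional form is an assumption on my part: the thread only says "some function of", so I have swapped in a simple log-log least-squares fit of article counts against one external indicator. All figures and language codes are hypothetical, as are the function names.

```python
import math

# Hypothetical figures for illustration only; not real measurements.
# language -> (estimated Internet users, Wikipedia article count)
data = {
    "aa": (1_000_000, 10_000),
    "bb": (10_000_000, 100_000),
    "cc": (100_000_000, 1_000_000),
    "dd": (10_000_000, 400_000),
}


def fit_log_linear(pairs):
    """Ordinary least squares for log10(articles) = a + b * log10(users)."""
    xs = [math.log10(users) for users, _ in pairs]
    ys = [math.log10(articles) for _, articles in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def performance_ratios(data):
    """Observed article count divided by the count the baseline expects."""
    intercept, slope = fit_log_linear(list(data.values()))
    ratios = {}
    for lang, (users, articles) in data.items():
        expected = 10 ** (intercept + slope * math.log10(users))
        ratios[lang] = articles / expected
    return ratios


ratios = performance_ratios(data)
for lang, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    label = "above" if ratio > 1 else "below"
    print(f"{lang}: {ratio:.2f} times the expected size ({label} baseline)")
```

A ratio above 1 marks a language version that is larger than the external conditions alone would predict, and a ratio below 1 marks one that is smaller. A fuller baseline would presumably combine several indicators rather than one.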
Hence, I really appreciate your suggestions as potential indicators of the (internal) development state of a language version of Wikipedia, but they do not appear to capture factors that are external to Wikipedia.
Best,
2014-07-08 10:09 GMT+01:00 Jane Darnell jane023@gmail.com:
Well as I see it, the state of any language version is a combination of the state of its content and community. Going back to the zero-state, in order to have permission to start a language version, there must be a "list of 10,000 important topics" that has to be registered somewhere (sorry, no idea where). This list for the English Wikipedia includes an entry for the singer Michael Jackson, one of the many articles that gets lots and lots of page hits daily. Perhaps this is the case for all other languages in the world (I have no idea), but I would assume one measurement going forward from the zero-state would be the number of changes over time involving this list in the specific language, such as:
1) The list itself (do these topics ever change?)
2) The average number of edits and page views of those pages in the specific language
3) The average number of blue links per page on those pages in the specific language
4) The average number of editors *ever* contributing per page on those pages in the specific language
5) The average number of active editors contributing per page on those pages in the specific language
...
Other important measurements could be the number of active editors overall, the number of edits appearing in the recent changes list per day/month/year, the number of pages created or deleted per day/month/year...
This is such a great discussion. Thanks for starting it, Han-Teng :)
Laura, I just loved your analysis. Makes me realize that I spend way too much time thinking about these things rather than practicing them which is what you showed in your rapid analysis :)
One thing that I was really interested in was how you are thinking about diversity of source languages. It's interesting because I tend to think about this in exactly the opposite way! Basically, it seems that in your analysis you're rewarding articles if they have a diversity of language sources whereas I have always considered sources in terms of the verifiability principle where the source should ideally be in the language of the Wikipedia version so that users can verify whether the source is being accurately reflected in the relevant article.
So I went to the 'verifiability' articles in a few different languages to check whether there is consensus about this on Wikipedia, at least. The English version [1] states that a) English-language sources are preferred because it's the English Wikipedia, b) if a source in another language is used, then editors may request a translation of relevant sections of the source, and c) if other languages are used in quotations, then a translation must be provided.
I looked at a few other language versions of the verifiability article (only 58 language versions have a version of this page) and few mention what to do with other language sources. Afrikaans [2] seems to follow the principles of the English version but Spanish and Catalan, for example, don't mention other language versions of sources.
Anyway, I'd be really interested in what you think about this. Do you think it's valuable to take Wikipedia's (or at least Wikipedia English's) normative framework for evaluating citations or do you think there's value in using another principle?
Thanks!
Best, Heather.
[1] https://en.wikipedia.org/wiki/Wikipedia:Verifiability
[2] https://af.wikipedia.org/wiki/Wikipedia:Verifieerbaarheid
Heather Ford Oxford Internet Institute http://www.oii.ox.ac.uk Doctoral Programme EthnographyMatters http://ethnographymatters.net | Oxford Digital Ethnography Group http://www.oii.ox.ac.uk/research/projects/?id=115 http://hblog.org | @hfordsa http://www.twitter.com/hfordsa
On Wed, Jul 9, 2014 at 1:11 PM, Heather Ford hfordsa@gmail.com wrote:
So I went to the 'verifiability' articles in a few different languages to check whether there is consensus about this on Wikipedia, at least. The English version [1] states that a) English-language sources are preferred because it's the English Wikipedia, b) if a source in another language is used, then editors may request a translation of relevant sections of the source, and c) if other languages are used in quotations, then a translation must be provided.
I looked at a few other language versions of the verifiability article (only 58 language versions have a version of this page) and few mention what to do with other language sources. Afrikaans [2] seems to follow the principles of the English version but Spanish and Catalan, for example, don't mention other language versions of sources.
Anyway, I'd be really interested in what you think about this. Do you think it's valuable to take Wikipedia's (or at least Wikipedia English's) normative framework for evaluating citations or do you think there's value in using another principle?
My understanding is that most Wikipedias prefer sources in their own language first, sources from languages in the region second, English third and other languages last. This may not always be stated explicitly as policy.
For example, as I understand the Euskara, Aragonese and Galician Wikipedias, they prefer sources in their native language first, but these are often difficult to find. Then they prefer Spanish, followed by English. Catalan Wikipedia prefers Catalan-language sources first; then, for reasons of nationalism around the language, it prefers English and then Spanish or French.
The issue when looking at minor languages with fewer than, say, 10 million speakers is that there is an easy preference for English, because for many people this is the second language they speak. In terms of operating effectively inside the Wikimedia movement on a broader scale, if you don't speak English, you're linguistically disadvantaged. Patterns I have observed suggest that English sourcing is easiest when sources in the native language and in nearby languages are not available. This is especially true for topics of specific geographic interest.
The choice of language has the potential to impact the narrative of the article in ways that may not actually be 100% neutral because of the best available sources. This is especially true with biographies, politics and controversies. The preference for one language to the exclusion of another may actually result in something that violates WP:NPOV, which I believe is a pillar across all projects. The use of multiple languages in these cases may result in a more balanced article, especially when the language of the place is given some preference.
In a Spanish context, the Spanish, Catalan, Euskara and English sources from Spain will probably express different political views in general as it pertains to the issue of Catalan separatism. Catalan language and English language sources will probably be much more pro-separatism than Spanish and Euskara language sources.
Also, sources are supposed to be used for the purpose of verification, and preferring sources purely on the basis of language seems problematic. It does make a number of sources non-verifiable unless one has the language skills. That's not useful.
I think the language of sources is worth considering, but any such metric for assessing quality would probably need to be fluid to a degree to address the specific broad topics being assessed. No way should political biographies be assessed the same way as physics articles. That's nuts.
Sincerely, Laura Hale
One thing that troubles me slightly with this conversation is that I think there is a presumption that people will naturally choose to read and write Wikipedia in their native language, but that isn't necessarily so.
Anecdotally it seems many people read English Wikipedia precisely because it is larger and more comprehensive (obviously they must have a reasonable ability to read English). And I would imagine that this is true too for, say, Catalan, where one might imagine they would also know Spanish and might well turn to the larger Spanish Wikipedia most of the time. Generally speaking, speakers of "small" languages (meaning small populations of native speakers) are likely to speak one or more "larger" languages and therefore may preferentially read Wikipedia in those "larger" languages in order to have a broader and deeper array of content. As I don't speak any "small" languages myself, I do not know if those Wikipedias tend to cover the more general topics or whether there is a greater focus on local content unlikely to be covered in "larger" Wikipedias - does anyone know?
If this is true about reading Wikipedia, then it seems likely to flow over into writing Wikipedia as well. Writing for the "larger" language has the benefit of bringing information to more people. So, here, motivation for editing comes into play. I suspect people who write for the "small" language Wikipedias probably have a motivation to keep their language alive, whereas this is unlikely to be a consideration for the large languages. But OTOH if you write for Wikipedia because you are passionate about sharing your knowledge of a topic area (e.g. Pokemon, football, cactus), then it seems that you would write in the Wikipedia with the largest content base on that topic (within your linguistic abilities) as you would have more to build on and a larger community of other editors to work with. Of course, working with others on Wikipedia isn't always easy, and perhaps that might be a factor that might drive an editor to write in a "smaller" language Wikipedia (which might be more work, but with less conflict).
What I don't know is whether any of these issues are microscopic or macroscopic. If they are macroscopic, then they have to be factored into the model of "how a Wikipedia should develop".
My personal view is that the "extremely small" language Wikipedias are unlikely to achieve a broad coverage of general topics because they are unlikely to find a large enough editor community. I think they will underperform whatever level of development they might theoretically be capable of. My rationale is that we know that Wikipedia is written predominantly by people with higher than average levels of education, which almost certainly means they have had to learn one or more larger languages to get there. That opens up the ability to work with other Wikipedias, siphoning off some proportion of the editor base to boost the development of larger Wikipedias at the expense of their native-language Wikipedia. I think it is more realistic to focus on more local content in small language Wikipedias and leave the more general content to the larger Wikipedias.
Note, this is all written on the assumption of not using machine translation. Clearly with machine translation, there is far greater potential for content in languages for which there are machine translation tools. But again, machine translation is less likely to be available for the "very small" languages, so even in that scenario, I think the smaller language Wikipedias will miss out on the content.
Kerry
Hoi, One more thing to consider is the possibility of generating articles on the fly based on information in Wikidata. This is already done in the "Reasonator", and it functions with differing results for 2,225,364 items. In essence it is a small script that can be translated into other languages. Obviously it will not handle grammatical constructs that well, so the text will be awkward. This will however change once Wikidata gains lexical information, as is planned for its future.
To make this work well, the text can be cached and will not be saved as a Wikipedia article. A different text is needed for other classes. Thanks, GerardM
On 10 July 2014 01:22, Kerry Raymond kerry.raymond@gmail.com wrote:
One thing that troubles me slightly with this conversation is that I think there is a presumption that people will naturally choose to read and write Wikipedia in their native language, but that isn’t necessarily so.
Anecdotally it seems many people read English Wikipedia because precisely it is larger and more comprehensive (obviously they must have a reasonable ability to read English). And I would imagine that this is true too for, say, Catalan, where one might imagine they would also know Spanish and might well turn to the larger Spanish Wikipedia most of the time. Generally speaking, speakers of “small” languages (meaning small populations of native speakers) are likely to speak one or more “larger” language and therefore may preferentially read Wikipedia in those “larger” languages in order to have a broader and deeper array of content. As I don’t speak any “small” languages myself, I do not know if those Wikipedias tend to cover the more general topics or whether there is a greater focus on local content unlikely to be covered in “larger” Wikipedias – does anyone know?
If this is true about reading Wikipedia, then it seems likely to flow over into writing Wikipedia as well. Writing for the “larger” language has the benefit of bringing information to more people. So, here, motivation for editing comes into play. I suspect people who write for the “small” language Wikipedias probably have a motivation to keep their language alive, whereas this is unlikely to be a consideration for the large languages. But OTOH if you write for Wikipedia because you are passionate about sharing your knowledge of a topic area (e.g. Pokemon, football, cactus), then it seems that you would write in the Wikipedia with the largest content base on that topic (within your linguistic abilities) as you would have more to build on and a larger community of other editors to work with. Of course, working with others on Wikipedia isn’t always easy, and perhaps that might be a factor that might drive an editor to write in a “smaller” language Wikipedia (which might be more work, but with less conflict).
What I don’t know is whether any of these issues are microscopic or macroscopic. If they are macroscopic, then they have to be factored into the model of “how a Wikipedia should develop”.
My personal view is that the “extremely small” language Wikipedias are unlikely to achieve broad coverage of general topics because they are unlikely to find a large enough editor community. I think they will underperform whatever level of development they might theoretically be capable of. My rationale is that we know Wikipedia is written predominantly by people with higher than average levels of education, which almost certainly means they have had to learn one or more larger languages along the way. That opens up the ability to work on other Wikipedias, siphoning off some proportion of the editor base to boost the development of larger Wikipedias at the expense of their native-language Wikipedia. I think it is more realistic to focus on more local content in small language Wikipedias and leave the more general content to the larger Wikipedias.
Note, this is all written on the assumption of not using machine translation. Clearly with machine translation, there is far greater potential for content in languages for which there are machine translation tools. But again, machine translation is less likely to be available for the “very small” languages, so even in that scenario, I think the smaller language Wikipedias will miss out on the content.
Kerry
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Web browser language settings are an obvious place to start this. This will give you an approximation of user's preferred language (more likely the preferred language of those who configured their software). See http://www.w3.org/International/questions/qa-lang-priorities.en.php for the gory details.
cheers stuart
(user language log: e.g. Accept-Language parameter)
Yes Stuart, locale data could be a nice source to look at, including the HTTP Accept-Language header, from which one can read locale preferences such as "zh-TW,zh;q=0.8,en;q=0.6"
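To make the header example concrete, here is a minimal sketch of how an Accept-Language value could be turned into a ranked list of language tags. The function name `parse_accept_language` is made up for illustration, and the handling of malformed weights (treated as 0) is an assumption rather than anything prescribed in this thread.

```python
def parse_accept_language(header):
    """Return (language_tag, q) pairs sorted by descending preference."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # in HTTP, a missing q parameter means weight 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat malformed weights as lowest priority
        prefs.append((tag.strip(), q, position))
    # Higher q first; ties keep the order in which the tags appeared.
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(tag, q) for tag, q, _ in prefs]


ranked = parse_accept_language("zh-TW,zh;q=0.8,en;q=0.6")
print(ranked)  # [('zh-TW', 1.0), ('zh', 0.8), ('en', 0.6)]
```

Counting the top-ranked tag per request would give one rough distribution of preferred languages, with the caveat already noted that it may reflect whoever configured the software.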
Do you, or does anyone else, have suggestions for external or global datasets that could serve as a proxy for global web user activity by locale/language?
I guess this is a more general question that I may also want to ask the people on the air-l mailing list.
Best, han-teng
2014-07-08 11:13 GMT+01:00 Stuart A. Yeates syeates@gmail.com:
Web browser language settings are an obvious place to start this. This will give you an approximation of user's preferred language (more likely the preferred language of those who configured their software). See http://www.w3.org/International/questions/qa-lang-priorities.en.php for the gory details.
cheers stuart
Hoi, At the WMF language committee, the question of whether a language is viable for a Wikimedia project is a practical one. It is also very much a political one. One vitally important difference from your approach is that we distinguish between a first project and a subsequent project. In the latest iteration of the approach we do not consider Wikidata a first project. The relevance is that we do not require localisation of MediaWiki or an Incubator stage.
When the question is what it takes for a new project to work, the simple answer is "a few good men". There are a few projects that are alive and well that rely on no more than 3 people.
By not focussing on Wikipedia, it is possible that a Wikisource becomes the first project. When this is what those "few good men" want, it is their party.
You may imagine that we thought about the likely success factors for a new project. We did come up with ideas similar to yours. The problem is that this does not help: you can determine the likelihood of success, but that does not guarantee it.
What we certainly do not consider is the number of data sources. Sourcing is very much a luxury in starting projects. Insisting on sourcing at all will kill most initiatives immediately. What is important is that people start writing and reading in their language. With a Wikipedia that gets active participation / readership, there will be a move to a more consistent orthography. Those who write decide in the end.
Wikidata was given its exception because it represents the lowest level of participation with the most effect. Add one label to an item that is used a lot (human, male, female eg) and it can be used thousands of times. It is also very obvious to re-use dictionary information to make an impact. Thanks, GerardM
On 8 July 2014 09:27, Han-Teng Liao (OII) han-teng.liao@oii.ox.ac.uk wrote:
Dear all,
Your suggestions are needed on the ways in which one can construct some sensible baselines, most likely based on data sets *external* to Wikipedia projects, of *expected* Wikipedia language versions development.
Such baselines should ideally indicate, given the availability of language users and content (some numbers based on external data sets), a certain language version should have expected number of articles/active users.
As previous research has suggested that Wikipedia activities need mutually-reinforcing cycles of participation, content, and readership, it is expected that the development of a Wikipedia language version is conditioned by the availability of (digitally) literate users and (possibly digitized) content/sources.
So the assumption is:
Wikipedia Activities = Some function of (available users and content)
For example, the major non-English writing languages in the world such as Arabic, Chinese, Spanish, etc., may have different numbers of Internet users and digital content. These numbers indicate the basis on which a Wikipedia language version can develop.
One practical use of this baseline measurement is to better categorize/curate activities across Wikipedia language versions. We can then better come up with expected values of Wikipedia development, and thus categorize language versions accordingly based on the *external conditions* of available/potential users and content.
Another use of this baseline measurement is to better compare the development of different language versions. It should help answer questions such as (1) whether Korean language version is *underdeveloped* on Wikipedia platforms when compared with a language version that enjoys similar number of available/potential users and content.
The current similar external baseline data is probably the number of language speakers. My hunch is that it is not good enough in taking into accounts the available/potential users and content, especially the digitally-ready one.
So I welcome you to add to the following list, any external indicators (and possibly data sources) that may help to construct such base line.
==Indicators==
- Internet users for each language (probably approximate measurement based on CLDR Territory-Language information and ITU internet penetration rates)
- Number of books published annually in different languages (suggested data sources? Does ISBN have a database or stat report on published languages?)
- Number of web pages returned by major search engines on the queries of "Wikipedia" in different languages, excluding results from Wikimedia projects.
- Number of scholarly publications across languages (suggested data sources?)
- Number of major newspaper publications across languages (suggested data sources?)
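The first indicator in the list can be sketched as a simple sum over territories: population times the share of residents using the language times the internet penetration rate. The snippet below is only an illustration under stated assumptions. The record layout and the name `estimate_internet_users` are hypothetical, the figures are placeholders rather than real CLDR or ITU values, and the calculation assumes that speakers of a language are as likely to be online as any other resident of the territory, which may not hold.

```python
def estimate_internet_users(territories, language):
    """Sum population * language share * internet penetration over territories."""
    total = 0.0
    for territory in territories:
        share = territory["language_share"].get(language, 0.0)
        total += territory["population"] * share * territory["internet_penetration"]
    return total


# Placeholder figures for illustration only, NOT real CLDR or ITU values.
territories = [
    {
        "name": "Territory A",
        "population": 1_000_000,
        "internet_penetration": 0.50,
        "language_share": {"xx": 0.9, "yy": 0.1},
    },
    {
        "name": "Territory B",
        "population": 2_000_000,
        "internet_penetration": 0.25,
        "language_share": {"xx": 0.2},
    },
]

estimate = estimate_internet_users(territories, "xx")
print(round(estimate))  # about 550,000 with these placeholder numbers
```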
Please share your thoughts!
-- han-teng liao
"[O]nce the Imperial Institute of France and the Royal Society of London begin to work together on a new encyclopaedia, it will take less than a year to achieve a lasting peace between France and England." - Henri Saint-Simon (1810)
"A common ideology based on this Permanent World Encyclopaedia is a possible means, to some it seems the only means, of dissolving human conflict into unity." - H.G. Wells (1937)
Indeed, GerardM, I agree with you that a few good women or men with passion can kick-start some Wikimedia projects, and that different Wikimedia projects have different barriers or paths of development.
I also agree with you that the direction I am pursuing may not be helpful to languages still in the incubation stage. To be honest, I am not trying to measure the likelihood of success.
What I am trying to measure is probably akin to the external *difficulty* to be overcome for success. Here I have to admit that I approach this question wearing a researcher hat more so than a Wikipedian hat.
Having said that, I personally believe this approach can be very productive in generating outcomes for major world languages such as Mandarin, Spanish, Hindi, Arabic, Bengali, Russian, Japanese and Punjabi (all these languages have more native speakers than German, BTW). This way, researchers can make them more comparable because of the available external baselines.
I can envision that the outcomes can help these communities identify their strengths and weaknesses. Strategies can then be devised to expand their reach to available external content or users.
This should also help sociolinguists identify which languages (especially non-national languages such as Kurdish or Cantonese) are more developed than others in the Wikipedia sphere, and seek explanations for their relative success/failure by contrasting the Wikipedia sphere with the offline/online sphere. These languages include many of the mid-size language versions of Wikipedia such as Catalan, Cantonese, Tamil, etc.
Thus, I would argue that the analytical direction I want to take would be useful for many language versions which already have some user base and content. Again, I want them to be aware of both the internal and external state of each language version, thereby contextualizing the differences among them. Baseline stats based on external sources should make them more comparable, instead of just number games among different language groups of Wikipedians.
Also, I have to agree with GerardM that the issue is both practical and political. I would add that it is also political in terms of fund dissemination within the global Wikimedia/open knowledge movement. I personally believe that only with external numbers about the potential available users and content outside Wikipedia can we realize how much of the external pool has been utilized/recruited into the internal Wikimedia/Wikipedia projects. This should provide some sensible bases for comparison on which Wikipedians can reflect.
Finally, may I point out that the external environments for languages are also changing, which could be useful for the global Wikimedia/open knowledge movement. Based on my research on the competition between Baidu Baike and Chinese Wikipedia in mainland China, I found that the windfall of fast-growing internet users during late 2005-2008 was crucial for any website to thrive in mainland China, a windfall that Chinese Wikipedia missed because of the block by Beijing. From this, I argue that it makes strategic sense to catch the wave of rising internet users, especially during the time when the penetration rate quickly rises from 12.8% to 40% for a given population. External time-series data points can help point out the rising language users on the Web (probably Indian languages, now that Chinese languages have reached 40-50%).
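As a rough illustration of how such time-series points might be screened, the sketch below picks out the years in which a penetration rate falls inside the 12.8% to 40% band mentioned above. The function name `growth_window` and the sample series are hypothetical placeholders, not real data for any country or language.

```python
def growth_window(series, low=0.128, high=0.40):
    """Return the (year, rate) points whose rate lies within [low, high]."""
    return [(year, rate) for year, rate in sorted(series) if low <= rate <= high]


# Placeholder series for illustration only, NOT real penetration data.
series = [
    (2004, 0.05),
    (2005, 0.10),
    (2006, 0.15),
    (2007, 0.25),
    (2008, 0.38),
    (2009, 0.45),
]

window = growth_window(series)
print(window)  # [(2006, 0.15), (2007, 0.25), (2008, 0.38)]
```

Running the same filter per language, on rates aggregated from territory-level data, would be one way to flag which language communities are currently inside the fast-growth band.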
Best, han-teng liao
2014-07-08 12:03 GMT+01:00 Gerard Meijssen gerard.meijssen@gmail.com:
Hoi, At the WMF language committee, the question if a language is viable for a Wikimedia project is a practical one. It is also very much a political one. One vitally important difference with your approach is that the distinction is between a first project and a subsequent project. In the latest iteration of the approach we do not consider Wikidata a first project. Relevance is that we do not require localisation of MediaWiki or an Incubator stage.
When the question is what it takes for a new project to work? .. the simple answer is "a few good men". There are a few projects that are alive and well that rely on no more than 3 people.
By not focussing on Wikipedia, it is possible that a Wikisource becomes the first project. When this is what those "few good men" want.. It is their party.
You may imagine that we thought about what are the likely success factors for a new project. We did come up with similar ideas that you have. The problem is that it does not help. So you determine the likelihood of success, it does not guarantee it.
What we certainly do not consider is the number of data sources. Sourcing is very much a luxury in starting projects. Insisting on sourcing at all will kill most initiatives immediately. What is important is that people start writing, reading in their language.. With a Wikipedia that gets active participation / readership, there will be a move to a more consistent orthography. Those that write determine in the end.
Wikidata was given its exception because it represents the lowest level of participation with the most effect. Add one label to an item that is used a lot (human, male, female eg) and it can be used thousands of times. It is also very obvious to re-use dictionary information to make an impact. Thanks, GerardM
h, 08/07/2014 13:49:
This should also help sociolinguists to identify which languages [...] that are more developed than others in the Wikipedia sphere, and seeks explanations for their relative success/failure by contrasting the Wikipedia sphere and offline/online sphere.
Agreed on the importance of this (though I wouldn't restrict it to Wikipedia), not only for researchers but also for editors to self-assess. For many years our main tool has been sorting by the "Editors (5+) per million speakers" column in http://stats.wikimedia.org/EN/Sitemap.htm , which however has two main issues: 1) an absurdly high number of editors in some editions adds some noise, though nothing tragic (classic example: Volapük; funny but doesn't really do any harm); 2) the unrealistic baseline of "speakers in millions" (which is not so closely related to what happens on the wiki) means the rank mostly shows how well those languages are doing on the internet, e.g. the classic dominance of Scandinavia and Israel and the classic disuse of Tagalog/Filipino (with some surprises like Northern Sami, which clearly has some strong supporters out there).
Realistic baselines would let me answer simple questions like whether it.wiki is really doing better than de.wiki (35 vs. 33?!); given the similarity of conditions, if not I may conclude there is a large uncultivated land out there just waiting for some seeds (outreach to people not knowing Wikimedia projects enough), if yes I may conclude we've probably exhausted our natural resources and need to focus on using them more efficiently.
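To show how much the choice of baseline can move such a ranking, here is a small sketch that computes editors per million against two different denominators. The numbers are invented placeholders, not figures from stats.wikimedia.org, and `editors_per_million` is a hypothetical helper name.

```python
def editors_per_million(active_editors, baseline_population):
    """Active editors per million people in the chosen baseline population."""
    if baseline_population <= 0:
        raise ValueError("baseline_population must be positive")
    return active_editors / (baseline_population / 1_000_000)


# Placeholder values for one imaginary language edition, NOT real statistics.
active_editors = 500
speakers = 10_000_000        # baseline 1: all speakers
internet_users = 2_500_000   # baseline 2: estimated speakers who are online

per_million_speakers = editors_per_million(active_editors, speakers)
per_million_online = editors_per_million(active_editors, internet_users)
print(per_million_speakers, per_million_online)  # 50.0 200.0
```

The same edition looks four times more active under the second baseline, which is the kind of shift a more realistic denominator would be expected to reveal.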
Nemo
From: Han-Teng Liao (OII) han-teng.liao@oii.ox.ac.uk To: Research into Wikimedia content and communities wiki-research-l@lists.wikimedia.org Sent: Tuesday, 8 July 2014 9:27 Subject: [Wiki-research-l] Constructing sensible baselines for Wikipedia language development analytics
Dear all,
Your suggestions are needed on the ways in which one can construct some sensible baselines, most likely based on data sets *external* to Wikipedia projects, of *expected* Wikipedia language versions development.
Such baselines should ideally indicate, given the availability of language users and content (some numbers based on external data sets), a certain language version should have expected number of articles/active users.
Hello all,
It looks like some of these are questions addressed by an ongoing research line conducted by Kevin Crowston, Nicolas Jullien and me:
Sustainability of Open Collaborative Communities: Analyzing Recruitment Efficiency
http://timreview.ca/article/646
Abstract: Extensive research has been conducted over the past years to improve our understanding of sustainability conditions for large-scale collaborative projects, especially from an economic and governance perspective. However, the influence of recruitment and retention of participants in these projects has received comparatively less attention from researchers. Nevertheless, these concerns are significant for practitioners, especially regarding the apparently decreasing ability of the main open online projects to attract and retain new contributors. A possible explanation for this decrease is that those projects have simply reached a mature state of development. Marwell and Oliver (1993) and Oliver, Marwell, and Teixeira (1985) note that, at the initial stage in collective projects, participants are few and efforts are costly; in the diffusion phase, the number of participants grows, as their efforts are rewarding; and in the mature phase, some inefficiency may appear as the number of contributors is greater than required for the work.
In this article, we examine this possibility. We use original data from 36 Wikipedias in different languages to compare their efficiency in recruiting participants. We chose Wikipedia because the different language projects are at different states of development, but are quite comparable on the other aspects, providing a test of the impact of development on efficiency. Results confirm that most of the largest Wikipedias seem to be characterized by a reduced return to scale. As a result, we can draw interesting conclusions that can be useful for practitioners, facilitators, and managers of collaborative projects in order to identify key factors potentially influencing the adequate development of their communities over the medium-to-long term.
As for external data sources, we integrate in the analysis information from UNESCO and OECD, among others.
Best regards, Felipe.
Thinking about this, en.wiki has an interesting structure: https://en.wikipedia.org/wiki/Category:Redirects_from_non-English-language_t... Maybe if you could measure the usage of these redirects by language, you could estimate the relative population size of the wiki-using speakers of each language who currently use en.wiki? The measurement might be sharpened by a push to classify https://en.wikipedia.org/wiki/Category:Redirects_from_alternative_languages . Presumably other structures exist in other wikis that could be similarly measured.
cheers stuart
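A minimal sketch of the aggregation step in the redirect idea, assuming one already has a list of (language, view count) pairs for the redirects. How those counts would be collected is left open here; the name `usage_share_by_language` and the sample numbers are placeholders for illustration.

```python
from collections import defaultdict


def usage_share_by_language(redirect_views):
    """Aggregate redirect view counts by language and return each language's share."""
    totals = defaultdict(int)
    for language, views in redirect_views:
        totals[language] += views
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {language: views / grand_total for language, views in totals.items()}


# Placeholder (language, views) pairs, NOT measured data.
redirect_views = [("de", 30), ("fr", 50), ("de", 20)]

shares = usage_share_by_language(redirect_views)
print(shares)  # {'de': 0.5, 'fr': 0.5}
```

The shares are only relative: they say nothing about readers who reach en.wiki without ever using a non-English redirect.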