I thought this might be of interest to people on this list.
The email from Kropotkine (a board member of Wikimedia France) was originally in French; the gist of it is:
A book came out titled "The Real Difficulties of the French Language in the 21st Century". The author, Dominique Laurent, publishes spellchecking software. In the course of his research to improve his software, he studied a Wikipedia dump to find out what the most common mistakes in French are, and ended up writing a book to present his findings.
A bit of statistics: the author studied 471 million words in more than 36 million sentences, and in the end analysed about 3 million mistakes made by around 120,000 users who have contributed to Wikipedia. He lists the 700 most common mistakes, their typology, the evolution of mistakes compared with a corpus of texts from 20 years ago, a classification by absolute frequency (how many occurrences of a mistake) and by relative frequency (how many mistakes relative to the number of times the word is used), etc.
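For what it's worth, here is a back-of-the-envelope sketch of those two measures, assuming one already has the corpus tokens and a list of detected mistakes (all names here are illustrative, not from the book):

```python
from collections import Counter

def error_frequencies(tokens, corrections):
    """tokens: every word occurrence in the corpus (lowercased).
    corrections: (misspelling, intended_word) pairs, one per detected mistake."""
    word_counts = Counter(tokens)
    mistake_counts = Counter(intended for _, intended in corrections)

    absolute = mistake_counts.most_common()        # raw occurrence counts per mistake
    relative = {
        word: count / (word_counts[word] + count)  # mistakes per attempted use of the word
        for word, count in mistake_counts.items()
    }
    return absolute, relative
```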
The book can be found here: http://www.synapse-fr.com/boutique2/catalog/product_info.php?products_id=226
Question from Kropotkine I found interesting: how can such a work be used to "train" our spellcheck bots on Wikipedia? :)
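One minimal sketch of what that could look like: load the most frequent mistakes into a lookup table and have a bot flag whole-word matches rather than edit blindly. The table entries below are my own illustrative examples, not taken from the book:

```python
import re

# Hypothetical lookup table distilled from a common-mistakes list; a real
# bot would need many more entries and context checks before editing.
COMMON_MISTAKES = {
    "parmis": "parmi",     # frequent French misspellings
    "malgrés": "malgré",
}

def suggest_corrections(text):
    """Yield (position, misspelling, suggestion); suggest, don't fix silently."""
    for match in re.finditer(r"\b[\w']+\b", text):
        word = match.group(0)
        fix = COMMON_MISTAKES.get(word.lower())
        if fix is not None:
            yield match.start(), word, fix
```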
Cheers,
Delphine
---------- Forwarded message ---------- From: Kropotkine_113 Date: 2012/4/10 Subject: [Discussions WMFr] Wikipédia comme corpus d'étude des difficultés du français To: discussions@lists.wikimedia.fr
Good evening.
Received at the association's headquarters: a book, « Les vraies difficultés du français au XXIe siècle » ("The real difficulties of French in the 21st century"), by Dominique Laurent, Éditions Synapse Développement.
Why am I telling you about it? Because the author "sends us this book for information, considering it a fair return of things, Wikipedia having in this case contributed indirectly to this work". This gentleman, a publisher of professional proofreading software, used a complete dump[1] of Wikipedia's articles to analyse mistakes in French.
471 million words in more than 36 million sentences, and in the end an analysis of nearly 3 million mistakes made by around 120,000 internet users who have contributed to Wikipedia. The 700 most common mistakes, their typology, the evolution of mistakes compared with another corpus of texts dating from 20 years ago, a ranking by absolute frequency (number of occurrences of a mistake) and by relative frequency (number of occurrences of a mistake relative to the number of occurrences of the word), the assignment of an "importance" based on the marking scale of the agrégation de lettres, the French competitive exam for literature teachers (!), etc.[2]
Incidentally, it may be possible to extract some interesting information from it for the correction bots that constantly scan Wikipedia's content (who do you think trains such a bot and might be interested in a copy of the book?), and also, why not, to feed the association's work and reports on the French language.
I admit I haven't yet had time to read the book, but in any case it is a fine tribute to Wikipedia, at least in its database/study-corpus aspect. Wikipedia is not just an encyclopedia; it is also an enormous field for research and analysis.
Do you think it would be a good idea to contact him and suggest that he write a post for the blog, along the lines of "Wikipedia is a gold mine for studies of the French language"? Yes? No?
One last thing: with one error every 170 words, "the error rate is not that high". And that's a professional saying so :)
++
Kropot.
[1] That is, the extraction, as a computer file, of all versions of all articles, and not only of the live version. This makes it possible to spot when a mistake was introduced, whether it was eventually corrected, etc. (a sketch of such a scan follows these notes).
[2] Leafing through it, I also spotted some of Wikipedia's finest spelling trolls ;D
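Since the dump contains every revision of every article (see note [1]), one can locate when a mistake entered an article and when it was fixed. A minimal sketch, assuming revisions come as (timestamp, text) pairs in chronological order:

```python
def trace_mistake(revisions, misspelling):
    """revisions: iterable of (timestamp, text) in chronological order.
    Returns (introduced_at, corrected_at); either may be None."""
    introduced = corrected = None
    present = False
    for timestamp, text in revisions:
        now_present = misspelling in text
        if now_present and not present:
            introduced = timestamp           # mistake first appears
        elif present and not now_present and introduced is not None:
            corrected = timestamp            # mistake disappears again
            break
        present = now_present
    return introduced, corrected
```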
Here's an opinion piece, "The Problem with Wikidata", by Mark Graham, who "is a Research Fellow at the Oxford Internet Institute", which appears on The Atlantic's website. I'm not personally supporting or opposing his views, but I found it an interesting read. http://www.theatlantic.com/technology/archive/2012/04/the-problem-with-wikid...
Pine
I very much look forward to a reply by the Wikidata team and hope the Atlantic will host it.
Dario
File it under "been there, done that": Denny from Wikidata has written a detailed reply right under the article. Just look in the comments section.
Mathias
Mark,
thank you for your well-thought-out criticism. When we first thought about adding structured data to Wikipedia, we were indeed considering giving every language edition its own data space. This way the Arabic and the Hebrew Wikipedia communities would not interfere with each other, nor would the Estonian and the Russian communities. Actually, they wouldn't even interact with each other. They could happily build their niches and propagate their own points of view of the world, and then they would come together in the English Wikipedia, where they would be forced either to abstain from the conversation or to find common ground and compromise. This would not necessarily translate back into the language editions - they could remain in their carefully crafted filter bubbles. Readers unable or unwilling to read other languages, on an article where they are not even aware of the controversies, would return from Wikipedia with the satisfying feeling that they had learned something about the world, and would shake their heads about the ignorant inhabitants of the neighbouring country who believe some obvious misconception about the issue.
We still opted for having one common data space for all language editions. Does this mean we expect the whole world to agree on one common set of true facts, saved and redistributed in Wikidata, the perfect form of Wikiality, and everything else will be considered falsehood and lies? Not in the least.
First, Wikidata will not be about The Truth. I expect the Wikidata community to follow the spirit of the Wikipedia community and require citations and references for the data. We do not expect the editors to agree on the population of Israel, but we do expect them to agree on what specific sources claim about the population of Israel. They will be able to gather several sources with their sometimes contradicting data. So we might have the population according to the Israeli statistics office, according to the Egyptian statistics office, according to the CIA World Factbook, and according to even more sources. Instead of hiding these differences in their respective language editions, we can have one space to gather them all and display them side by side, making the disagreement explicit and visible.
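As a purely illustrative sketch (not the actual Wikidata data model), this is the shape of the idea: contradicting figures sit side by side, each tied to the source that claims it.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    value: object
    source: str                                    # who makes this claim
    qualifiers: dict = field(default_factory=dict) # e.g. {"year": 2012}

# Illustrative figures only, not actual statistics.
population_of_israel = [
    Claim(7_800_000, "Israeli statistics office", {"year": 2012}),
    Claim(7_600_000, "CIA World Factbook", {"year": 2012}),
]
```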
Second, Wikidata will not force anything into the Wikipedias. For every step of the different possible ways the data can flow from Wikidata to the Wikipedias, there will be ways to opt out for every language edition. The language editions can choose to give preference to certain sources. The language editions can opt out of using Wikidata for a specific value and replace it with a locally agreed fact. The language editions can even ignore Wikidata entirely and just continue as they have for the last decade. Wikidata is an offer, not a mandate.
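The opt-out could be as simple as a per-wiki override that takes precedence over the shared value; a sketch under that assumption, with hypothetical names:

```python
def resolve_value(prop, local_overrides, wikidata_values):
    """A language edition's locally agreed fact, when present, wins."""
    if prop in local_overrides:
        return local_overrides[prop]
    return wikidata_values.get(prop)  # otherwise fall back to the shared data space
```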
Third, Wikidata will have a different coverage than Wikipedia. A lot of the issues you mentioned are far too nuanced to be expressed in Wikidata. Take the example of the Bronze Soldier of Tallinn that you mentioned: whereas a text featuring an interpretation of the statue's symbolism can lead to controversy and discussion, which points of data about it would? The material? The height? The date of erection? Its current geolocation? None of these statements are disputed, and they could be used in the Estonian, Russian, and English versions alike. What about your second example, the population of Israel? Does it include Gaza or not? Well, this kind of information can be made explicit in Wikidata. Our knowledge model will enable the editors to state "The population of Israel in 2012, excluding Gaza, was X, according to the following sources". I think that once you consider the limits of what can be stated in Wikidata, and the importance I expect to be given to properly referencing the sources, the number of controversies will be much smaller than many expect now.
Fourth, you rightfully point out that the Wikipedias today are mostly written by a specific contributor demographic. This is true, but it glosses over the fact that it used to be even more specific. With the growth of Wikipedia the contributor demographics have expanded and diversified - not yet as much as one might hope, but it is getting better. One of the points you raised was that Wikipedia does not have many contributors in Africa. We actually hope that Wikidata will improve this situation: since all languages will work on the same data space, contributions from Africa and from Europe will live side by side, and the motivation of contributing to a common space that everyone will benefit from - and not just the much smaller language community one belongs to - might increase the number of contributions coming from regions underrepresented today (compare this to the situation in countries like Uzbekistan, where a language like Russian binds a lot of the attention and possible contributions to the bigger and more successful Wikipedia language edition).
Fifth, in your criticism you imply that languages are good and valid borders for keeping knowledge diversity alive. If this were true, how is it that English-language articles, where communities otherwise separated by language come together, are often of higher quality and reflect a richer diversity than the individual language articles? My own experiences are rooted in the Croatian, Serbian, Bosnian, etc. Wikipedias, each a language edition of its own. The richness of diversity that the English Wikipedia articles show on topics of the Yugoslav wars is not matched by any of the native language editions.
What is particularly interesting about your criticism is that Wikidata is being developed with support from the EU research project RENDER, whose main concern is knowledge diversity. We had discussions about some of our research results in the past, especially the Wikipedia map, not unlike some of your own results. In RENDER we developed the requirements for a data model centred on the idea of being a possibly inconsistent, secondary data source - not one about The Truth.
While I understand your concern from an abstract view of the issue, I challenge you to point to the actual articles that you fear will become poorer in their diversity once Wikidata is operational. You cite your own and your colleagues' research on this issue, so I assume your concerns are based on real use cases.
I am sorry for this long answer, but since I consider that your concerns would be very valid if Wikidata were done in a more naive way, and since I understand that many people will assume Wikidata is being developed in such a naive way, I took the liberty of expanding on our current thinking about how Wikidata could work, and on some of the design decisions in building it.
Thank you for this opportunity! Denny Vrandecic, project director Wikidata
Thanks Mathias, I hadn't gone through the comments.
Delphine,
thanks for sharing this. It's unfortunate that the results are only accessible in the book; I like the idea of asking the author to write an excerpt/blog post about his results, which we could then cover in the Research Newsletter.
In related news, this MBA thesis we mentioned in December looks at misspellings in the English Wikipedia (but uses a much smaller sample and a rudimentary dictionary-lookup approach): http://meta.wikimedia.org/wiki/Research:Newsletter/2011-12-26#Spell-checking...
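For comparison, a rudimentary dictionary-lookup pass of the kind the thesis reportedly used fits in a few lines; the word-list path below is a placeholder:

```python
import re

def find_unknown_words(text, wordlist_path="words.txt"):
    """Flag tokens absent from a word list; crude, but a useful first pass."""
    with open(wordlist_path, encoding="utf-8") as f:
        known = {line.strip().lower() for line in f}
    tokens = {w.lower() for w in re.findall(r"[A-Za-z']+", text)}
    return sorted(tokens - known)
```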
Dario
Hi Delphine, Dario,
I'm very much looking forward to this blog entry too, thanks for sharing!
Dario Taraborelli dtaraborelli@wikimedia.org writes:
thanks for sharing this, it's unfortunate that the results are only accessible in the book
Yes. Also, this is the kind of research where "reproducible research" would be a great plus, as the initial corpus will evolve over time.
http://reproducibleresearch.net/
All best,