The recent elections showed us that language issues and translation are something we have to take very seriously from now on. As a first step towards improving communication, it seems like we should get an idea of which users speak which languages.
We could directly ask them to tell us, but upon reflection, the information is already hidden in our database. A multilingual user is one that actively edits two projects of different languages.
In devising a comprehensive translation strategy, we need to know how interconnected any two given projects are. We also need to know how connected any given project is to English, since it's our working language.
We need to pay special attention to languages that are very 'distant' from English-- distant in the sense of having few members who are fluent in both English and the language in question.
Could someone aid me in getting this data, or explaining why I don't need it or why we already have it, etc?
Specifically, I'm looking for:
# For each non-English-language project, how many of its active users are ALSO active on an English-language project? (The answer should be a single whole number for each project.)
# For any two projects, how many users are active on both? (The answer is a square matrix, roughly 750x750.)
# For any two languages, how many users appear to speak both languages? (The answer is a square matrix, roughly 750x750.)
Does anyone know how to pull this out of the database? It's an important question for us to recruit translators and really just assess "where we are" in terms of inter-project language capabilities.
Alec
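To make the first two requests above concrete, here is a minimal sketch of how the per-project English overlap and the project-by-project matrix could be computed once a set of active usernames per project is available. The `active_users` data, the project names and both function names are made up for illustration; nothing here comes from the real database.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: project name -> set of active usernames.
active_users = {
    "enwiki": {"Alice", "Bob", "Carol"},
    "dewiki": {"Bob", "Dieter"},
    "eswiki": {"Carol", "Bob", "Elena"},
}

def pairwise_overlap(active):
    """Count users active on both projects, for every unordered project pair."""
    overlap = Counter()
    for a, b in combinations(sorted(active), 2):
        overlap[(a, b)] = len(active[a] & active[b])
    return overlap

def overlap_with(active, hub):
    """For each non-hub project, count active users who are also active on the hub."""
    return {p: len(users & active[hub]) for p, users in active.items() if p != hub}

overlap = pairwise_overlap(active_users)
english_overlap = overlap_with(active_users, "enwiki")
```

Only the upper triangle of the matrix is stored, since a plain count of shared users is symmetric. The language-by-language matrix would need an extra project-to-language mapping on top of this.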
On Wed, Jun 15, 2011 at 8:46 AM, Alec Conroy alecmconroy@gmail.com wrote:
We could directly ask them to tell us, but upon reflection, the information is already hidden in our database. A multilingual user is one that actively edits two projects of different languages.
That doesn't follow. Perhaps someone speaks a language, but doesn't edit the corresponding wiki. For instance, I know a decent amount of Hebrew, although I wouldn't call myself fluent in Modern Hebrew. But I'm a native English speaker, and English Wikipedia articles are almost always better than the corresponding Hebrew ones (often even on Judaism-related topics). So I have no reason to read the Hebrew Wikipedia, when it takes more effort for me and the content isn't usually as good. Likewise, some people edit exclusively or almost exclusively on multilingual projects like Commons.
On the other hand, people might edit on projects in languages they don't understand. For instance, they might be running scripts that automatically fix interwikis or such. This is less likely, though, once you exclude bot accounts.
If you want this info, toolserver queries are the right way to do it. It should be pretty easy to pull this kind of info out of the revision or recentchanges tables, although it would require reading a lot of data. The simplest way would be to get a list of usernames for each wiki that have edited in the last X days, then use a script to reverse the lists so that you get a list of languages for each user. You'd probably want to only include unified accounts here. (How many accounts still aren't unified?)
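The "reverse the lists" step described above could look roughly like the sketch below. It assumes the per-wiki username lists have already been fetched; `recent_editors`, `wiki_language` and `languages_per_user` are hypothetical names, and the sample data is invented.

```python
from collections import defaultdict

# Hypothetical per-wiki lists of usernames that edited in the last X days.
recent_editors = {
    "enwiki": ["Alice", "Bob"],
    "hewiki": ["Bob", "Dana"],
    "commonswiki": ["Alice", "Dana"],
}

# Hypothetical wiki -> language mapping; multilingual projects map to None.
wiki_language = {"enwiki": "en", "hewiki": "he", "commonswiki": None}

def languages_per_user(editors_by_wiki, language_of):
    """Invert wiki -> usernames into username -> set of languages."""
    by_user = defaultdict(set)
    for wiki, names in editors_by_wiki.items():
        lang = language_of.get(wiki)
        if lang is None:
            continue  # skip multilingual projects such as Commons
        for name in names:
            by_user[name].add(lang)
    return dict(by_user)

user_languages = languages_per_user(recent_editors, wiki_language)
multilingual = {u for u, langs in user_languages.items() if len(langs) >= 2}
```

Matching on bare usernames is only safe for unified accounts, which is why the restriction to unified accounts matters here.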
Hi Aryeh, thanks for the fast reply. Yes, this will definitely underestimate linguistic capabilities of some users, and overestimate the linguistic capabilities of others--- it's a rough measure at best.
But is there another way to gauge how "easily" two languages should be able to communicate with each other? The best way I can think of is looking for editing patterns that suggest multilingual skills. Even if this isn't a direct measure of language, it's at least a measure of "inter-wiki interaction", which is a good measure to have.
The important point of doing this would be: 1) to identify those users with unique language skills and recruit them 2) to identify projects and languages that are 'most disconnected' from the English hub, so we can make them less disconnected.
Is there an easy way to run this:
For each of the 86,000 'active users': Store a list for their edit counts on each project they've edited
That's actually a fairly small dataset, and it would get us all the data we want. I've been a developer before, but never here. Any idea how I go about getting that info?
(global accounts only is fine, usernames not needed at this point if we have privacy concerns)
Alec
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Alec Conroy wrote:
Is there an easy way to run this:
For each of the 86,000 'active users': Store a list for their edit counts on each project they've edited
That's actually a fairly small dataset, and it would get us all the data we want. I've been a developer before, but never here. Any idea how I go about getting that info?
(global accounts only is fine, usernames not needed at this point if we have privacy concerns)
Alec
Yes, there is. It's not efficient, but it should be no problem to generate.
On 15 June 2011 17:34, Alec Conroy alecmconroy@gmail.com wrote:
The important point of doing this would be:
- to identify those users with unique language skills and recruit them
Recruit them to do what?
- to identify projects and languages that are 'most disconnected' from the English hub, so we can make them less disconnected.
Can we make them less disconnected? How?
-Niklas
There is a lot of cross-wiki collaboration that can be done (whilst supporting the idea of wiki independence) and should be encouraged: Foundation work, cross-wiki translations of material, etc. Alec is largely talking about the board elections though, which were Anglo-centric and could have benefited from extra translation work and contacts on many more language wikis to promote the election in line with "local customs".
I think the idea at the root of this thread is a good one; it's not a perfect metric by any stretch of the imagination, but it could highlight wikis that have little cross-wiki collaboration.
I'd be interested to see the activity intersection between the various wikis and Meta (and other organisational wikis), to see what portion of people are also active in the Foundation.
This could highlight wikis that are under-represented in areas of the Foundation, and give us something to target "outreach" work at.
Tom
On Wed, Jun 15, 2011 at 8:08 AM, Niklas Laxström niklas.laxstrom@gmail.com wrote:
On 15 June 2011 17:34, Alec Conroy alecmconroy@gmail.com wrote:
The important point of doing this would be:
- to identify those users with unique language skills and recruit them
Recruit them to do what?
Recruit them to help the global community communicate with itself. There are currently-unidentified individuals with a special gift that will enable them to unite the global community in a way beyond that of monolingual members. Most recently, we needed a translator army to help us run the elections, but the need for translators isn't going away. Every language we have needs a clear and direct translation path so it can participate in the movement.
- to identify projects and languages that are 'most disconnected' from the English hub, so we can make them less disconnected.
Can we make them less disconnected? How?
First and foremost by pointing out to us that a certain community is isolated. This will hopefully cause members of the global community to reach out to the isolated community. At the same time, it will hopefully inspire members of the isolated community to reach out to the global community.
In extreme cases, it's not inconceivable that the foundation has a direct role to play in helping underrepresented projects communicate with the rest of us.
Alec
On Wed, Jun 15, 2011 at 10:34 AM, Alec Conroy alecmconroy@gmail.com wrote:
Is there an easy way to run this:
For each of the 86,000 'active users': Store a list for their edit counts on each project they've edited
That's actually a fairly small dataset, and it would get us all the data we want. I've been a developer before, but never here. Any idea how I go about getting that info?
Get any toolserver user to run the necessary SQL queries. This page might be helpful, if no one on this list wants to run it for you:
Alec Conroy wrote:
The recent elections showed us that language issues and translation are something we have to take very seriously from now on. As a first step towards improving communication, it seems like we should get an idea of which users speak which languages.
We could directly ask them to tell us, but upon reflection, the information is already hidden in our database. A multilingual user is one that actively edits two projects of different languages.
Many users have already told us, by using babel templates. Those also indicate how much confidence they have in each language (native level, basic skills...).
In devising a comprehensive translation strategy, we need to know how interconnected any two given projects are. We also need to know how connected any given project is to English, since it's our working language.
There's also the motivation factor. I am not much of a translator, although I have fixed bad translations that I simply came across as a user, and that had been sitting there for days. From what I have seen in the past, many translations aren't done by skilled people but just by people who were motivated enough to translate, sometimes at an autotranslation-like level. However, as the people running the event obviously don't know every language, they have to rely on the few translating users, and bad texts pass as 'translated'.
We need to pay special attention to languages that are very 'distant' from English-- distant in the sense of having few members who are fluent in both English and the language in question.
Could someone aid me in getting this data, or explaining why I don't need it or why we already have it, etc?
Specifically, I'm looking for: # For each non-English-language project, how many of its active users are ALSO active on an English-language project? (The answer should be a single whole number for each project.)
First point: define being active. That should be something like 'more than X non-minor edits in the last Y weeks.'
I see a problem in that you are exposing it as a symmetric relationship, while I don't think it should be. I could be very skilled at translating something into my mother tongue, but inept at translating it the opposite way. This is especially true when translating between similar languages, where a non-speaker can easily grasp the meaning.
Also, someone who routinely translates articles from enwiki to xzwiki would have the exact profile you want to discover, but could be skipped for not having enough edits on enwiki.
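As an illustration of the "more than X non-minor edits in the last Y weeks" definition above, a minimal sketch of the activity test might look like this. The `(timestamp, is_minor)` tuple shape and the function name `is_active` are assumptions made for the example; X and Y are left as parameters because the thread has not fixed them.

```python
from datetime import datetime, timedelta

def is_active(edits, now, min_edits, weeks):
    """True if the user made more than `min_edits` non-minor edits in the last `weeks` weeks.

    `edits` is a list of (timestamp, is_minor) tuples, a hypothetical shape.
    """
    cutoff = now - timedelta(weeks=weeks)
    recent = sum(1 for ts, minor in edits if ts >= cutoff and not minor)
    return recent > min_edits

now = datetime(2011, 6, 15)
edits = [
    (datetime(2011, 6, 1), False),
    (datetime(2011, 6, 10), False),
    (datetime(2011, 6, 12), True),   # minor edit, not counted
    (datetime(2011, 3, 1), False),   # outside the four-week window
]
```

With this sample, `is_active(edits, now, 1, 4)` holds and `is_active(edits, now, 2, 4)` does not, since only two non-minor edits fall inside the window.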
# For any two projects, how many users are there who are active on both? (answer is a square matrix, roughly 750x750 ) # For any two languages, how many users appear to speak both languages? (answer is a square matrix, roughly 750x750)
I think the answer would actually be three-dimensional, since for each cell you would have a list of people, the number being just a summary.
Does anyone know how to pull this out of the database? It's an important question for us to recruit translators and really just assess "where we are" in terms of inter-project language capabilities.
Alec
I think I can build you something if you give me appropriate values for the above definition.
Cheers
I think I can build you something if you give me appropriate values for the above definition.
Cheers
Excellent-- so striking while the iron is hot-- I see that [[Special:Statistics]] defines active as "edited within the last 30 days". I'm open to however many users we can realistically get info on-- the more the merrier, at least until I run out of RAM. :)
My initial query might go something like "Select users where lasttouched was within the last month and total edit counts are greater than 500".
And then, adding in the requirement of a second project will narrow that pool. And then adding the constraint of a second project with a second language will narrow the pool even more.
We're looking for the orphan communities that have a lot of editors but little connection to English and Meta.
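The informal query above could be sketched against a toy in-memory table as follows. The `users` table and its `name`, `editcount` and `lasttouched` columns are hypothetical stand-ins, not the real schema; timestamps are assumed to be 14-digit `YYYYMMDDHHMMSS` strings, which compare correctly as plain text.

```python
import sqlite3

# Toy table standing in for the real per-wiki user data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, editcount INTEGER, lasttouched TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        ("Alice", 1200, "20110610120000"),
        ("Bob", 40, "20110612090000"),      # recent, but too few edits
        ("Carol", 900, "20110101000000"),   # enough edits, but not recent
    ],
)

def select_active(conn, since, min_edits):
    """Users touched at or after `since` with more than `min_edits` total edits."""
    rows = conn.execute(
        "SELECT name FROM users WHERE lasttouched >= ? AND editcount > ? ORDER BY name",
        (since, min_edits),
    )
    return [name for (name,) in rows]

active = select_active(conn, "20110516000000", 500)
```

Running the same selection per wiki and intersecting the results would then give the narrower "second project" and "second language" pools.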
Alec Conroy wrote:
I think I can build you something if you give me appropriate values for the above definition.
Cheers
Excellent-- so striking while the iron is hot-- I see that [[Special:Statistics]] defines active as "edited within the last 30 days". I'm open to however many users we can realistically get info on-- the more the merrier, at least until I run out of RAM. :)
My initial query might go something like "Select users where lasttouched was within the last month and total edit counts are greater than 500".
And then, adding in the requirement of a second project will narrow that pool. And then adding the constraint of a second project with a second language will narrow the pool even more.
We're looking for the orphan communities that have a lot of editors but little connection to English and Meta.
I have added a small script at http://www.toolserver.org/~platonides/activeusers/activeusers.php to show active users per project and language. Requisites for appearing there are more than 500 edits (total) and at least one action (usually an edit) in the last month (since May 16, data is cached). Bots appear in the list. I'm still populating the data, but it should be completed by the time you read this.
Dear Alec,
Maybe the Community Department can help you out with your question. We are doing a number of research sprints this summer to map out different aspects of the Wikipedia communities and this sounds like a great question and we have some researchers available to help write the queries. So please contact me and I'll hook you up with the right people. Best, Diederik
I have added a small script at http://www.toolserver.org/~platonides/activeusers/activeusers.php to show active users per project and language. Requisites for appearing there are more than 500 edits (total) and at least one action (usually an edit) in the last month (since May 16, data is cached). Bots appear in the list. I'm still populating the data, but it should be completed by the time you read this.
I have done the intersection part http://toolserver.org/~platonides/activeusers/intersection.php
I find the results to be quite useless for the original goal. Almost all entries are bots. Even intersecting big wikis like de-en [1] or en-es [2], where many people are able to speak both languages, shows only one user. So my conclusion is that people stay on their home wiki, and it is very unusual for someone to pass 500 edits *both* on their own wiki and on a foreign one.
For the record, going through the whole list to get the active users took 30m26.207s. Not bad for 797 wikis. Actually doing the intersection took 3m47.916s. The app doesn't check sul accounts, instead it naively takes equal usernames as being the same person. All wikis were compared for actions after 20110516081337. The drift between that epoch and the point where the query was done was not compensated.
[1] http://wolfsbane.toolserver.org/~platonides/activeusers/intersection.php?pro...
[2] http://wolfsbane.toolserver.org/~platonides/activeusers/intersection.php?pro...
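For readers who want to see the shape of the intersection described above, here is a rough sketch, not the actual code behind intersection.php. It keeps the same naive assumption that equal usernames are the same person and uses the 20110516081337 cutoff and the 500-edit bar from the message; the `stats` data, the `known_bots` set and the function names are invented for illustration.

```python
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"
CUTOFF = datetime.strptime("20110516081337", TS_FORMAT)

# Hypothetical data: wiki -> {username: (total_edits, last_action_timestamp)}.
stats = {
    "dewiki": {"Anna": (800, "20110601101010"), "FixBot": (90000, "20110615000000")},
    "enwiki": {
        "Anna": (650, "20110520000000"),
        "FixBot": (120000, "20110615000000"),
        "Ben": (700, "20110401000000"),  # last action before the cutoff
    },
}
known_bots = {"FixBot"}  # hypothetical bot list

def active_names(users, min_edits=500):
    """Usernames with more than `min_edits` edits and an action at or after CUTOFF."""
    return {
        name
        for name, (edits, last) in users.items()
        if edits > min_edits and datetime.strptime(last, TS_FORMAT) >= CUTOFF
    }

def intersection(stats, a, b, exclude=frozenset()):
    """Naive intersection: equal usernames are assumed to be the same person."""
    return (active_names(stats[a]) & active_names(stats[b])) - set(exclude)
```

Passing `exclude=known_bots` drops the bot entries that dominate the raw result.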
On Thu, Jun 16, 2011 at 9:44 AM, Platonides Platonides@gmail.com wrote:
So my conclusion is that people stay on their home wiki, and it is very unusual for someone to pass 500 edits *both* on their own wiki and on a foreign one.
Agreed, I don't think this is a surprising result.
If we can filter out bots and contribs that have been imported/exported (maybe via the logs?) then I think it would be more useful to lower the bar. Perhaps at least 100 edits on their home wiki, and 10 edits on another?
Steven
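The lowered, asymmetric bar suggested above (at least 100 edits on the home wiki, 10 on another) could be sketched like this. Treating the wiki with the most edits as the "home wiki" is an assumption of this example, and `home_and_secondary` is a hypothetical name.

```python
def home_and_secondary(edit_counts, home_min=100, other_min=10):
    """Return (home_wiki, [other wikis]) for one user, or None if the bar isn't met.

    `edit_counts` maps wiki -> edit count for a single user. The home wiki is
    assumed to be the one with the most edits.
    """
    if not edit_counts:
        return None
    home = max(edit_counts, key=edit_counts.get)
    if edit_counts[home] < home_min:
        return None
    others = sorted(w for w, n in edit_counts.items() if w != home and n >= other_min)
    return (home, others) if others else None
```

Because the result names a home wiki and a list of secondary wikis, it keeps the direction of the relationship instead of flattening it into a symmetric count.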
You might also get better results when you don't limit yourself to recent contributions. For example, I contributed heavily to the Dutch Wikipedia a few years ago, and now contribute heavily to the English. I don't appear in Platonides's list, because I hardly edit nl: at all any more. There may be more people like that.
Jelle Zijlstra
Jelle Zijlstra wrote:
You might also get better results when you don't limit yourself to recent contributions. For example, I contributed heavily to the Dutch Wikipedia a few years ago, and now contribute heavily to the English. I don't appear in Platonides's list, because I hardly edit nl: at all any more. There may be more people like that.
Jelle Zijlstra
Maybe we should broaden the span for being active.
I would say broaden the span and lower the number of contribs required just a little (maybe 300?).
Or look for actives on one wiki.. and then cross check those names with all the other wikis for the same names with over, say, 300 edits (at any time).
Tom
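A minimal sketch of this cross-check, assuming all-time edit counts per wiki are available keyed by username: start from the recently active users on one wiki, then look up the same names everywhere else with more than 300 edits at any time. The data, the usernames and the function name `cross_check` are hypothetical.

```python
def cross_check(recently_active, all_time_counts, home, min_edits=300):
    """Map each recently active user on `home` to the other wikis where the
    same username has more than `min_edits` edits at any time."""
    result = {}
    for name in recently_active:
        others = sorted(
            wiki
            for wiki, counts in all_time_counts.items()
            if wiki != home and counts.get(name, 0) > min_edits
        )
        if others:
            result[name] = others
    return result

# Hypothetical data: recently active enwiki users and all-time counts per wiki.
recently_active_en = {"UserA", "UserB"}
all_time = {
    "enwiki": {"UserA": 9000, "UserB": 400},
    "nlwiki": {"UserA": 5000},
    "dewiki": {"UserB": 12},
}
matches = cross_check(recently_active_en, all_time, "enwiki")
```

This would pick up editors who moved from one wiki to another over the years, the case raised earlier in the thread.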
On Wed, Jun 15, 2011 at 7:42 AM, Platonides Platonides@gmail.com wrote:
Alec Conroy wrote:
We could directly ask them to tell us, but upon reflection, the information is already hidden in our database. A multilingual user is one that actively edits two projects of different languages.
Many users have already told us, by using babel templates. Those also indicate how much confidence they have in each language (native level, basic skills...).
Babel templates are great-- if every user had them, we'd be good. Unfortunately, if you know enough to use a babel template, you probably are already 'tied in' to the global community and thus not in need of outreach. (this assumption may be false).
There's also the motivation factor.
That's saying a mouthful. Just knowing people can translate is not at all the same as being able to expect they'll actually do it. We just found that out, and that's why we need to start building a translator network now, rather than wait till next year.
First point: define being active. That should be something like 'more than X non-minor edits in the last Y weeks.'
I'm flexible. The point of activity is just to weed the data down to a manageable size. If we want to call anyone active at this stage, that'd work. I suggest lasttouched in 30 days, but that's totally arbitrary.
I see a problem in that you are exposing it as a symmetric relationship, while I don't think it should be.
Again, another brilliant caveat. I should say that my initial attempt at getting these kinds of estimates was to look at worldwide language-overlap statistics and just assume that Wikimedians are "average humans", which they clearly aren't. This would get us a very, very rough picture.
Analysis of actual edit patterns will get us a better view, but it'll still be less precise than babel boxes or actual self-identification as a translator. Perhaps at some point we can explicitly ask users to tell us directly their language skills.
Alecmconroy