Hi Aryeh, thanks for the fast reply. Yes, this will definitely underestimate the linguistic capabilities of some users and overestimate those of others; it's a rough measure at best.
But is there another way to estimate how "easily" two language communities should be able to communicate with each other? The best way I can think of is looking for editing patterns that suggest multilingual skills. Even if this isn't a direct measure of language ability, it's at least a measure of "inter-wiki interaction", which is a good measure to have.
The main points of doing this would be: 1) to identify users with unique language skills and recruit them; 2) to identify the projects and languages that are 'most disconnected' from the English hub, so we can make them less disconnected.
Is there an easy way to run this:
For each of the 86,000 'active users': store a list of their edit counts on each project they've edited
That's actually a fairly small dataset, and it would get us all the data we want. I've been a developer before, but never here. Any idea how I go about getting that info?
(global accounts only is fine, usernames not needed at this point if we have privacy concerns)
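To make the request concrete, here is a rough sketch of that loop in Python. It assumes the raw data has already been exported as (user id, project, edit count) rows; the function names, the `min_edits` threshold, and the sample rows are all made up for illustration and are not tied to any real database schema.

```python
from collections import defaultdict


def edit_counts_by_user(rows):
    """Build {user_id: {project: edit_count}} from (user_id, project, count) rows.

    The row layout is an assumption; counts for a repeated
    (user_id, project) pair are summed.
    """
    counts = defaultdict(dict)
    for user_id, project, n in rows:
        counts[user_id][project] = counts[user_id].get(project, 0) + n
    return dict(counts)


def multi_project_users(counts, min_edits=10):
    """Return {user_id: [projects]} for users with at least `min_edits`
    edits on two or more projects. The threshold is an arbitrary example."""
    result = {}
    for user_id, per_project in counts.items():
        active = sorted(p for p, n in per_project.items() if n >= min_edits)
        if len(active) >= 2:
            result[user_id] = active
    return result


if __name__ == "__main__":
    # Hypothetical sample data; anonymous numeric ids stand in for usernames.
    sample = [
        (1, "enwiki", 500),
        (1, "hewiki", 40),
        (2, "enwiki", 12),
        (2, "commonswiki", 3),
    ]
    print(multi_project_users(edit_counts_by_user(sample)))
```

Numeric ids are used instead of usernames so the output could be shared without exposing who is who.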
Alec
On Wed, Jun 15, 2011 at 7:24 AM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Wed, Jun 15, 2011 at 8:46 AM, Alec Conroy <alecmconroy@gmail.com> wrote:
We could directly ask them to tell us, but upon reflection, the information is already hidden in our database. A multilingual user is one that actively edits two projects of different languages.
That doesn't follow. Perhaps someone speaks a language, but doesn't edit the corresponding wiki. For instance, I know a decent amount of Hebrew, although I wouldn't call myself fluent in Modern Hebrew. But I'm a native English speaker, and English Wikipedia articles are almost always better than the corresponding Hebrew ones (often even on Judaism-related topics). So I have no reason to read the Hebrew Wikipedia, when it takes more effort for me and the content isn't usually as good. Likewise, some people edit exclusively or almost exclusively on multilingual projects like Commons.
On the other hand, people might edit on projects in languages they don't understand. For instance, they might be running scripts that automatically fix interwikis or such. This is less likely, though, once you exclude bot accounts.
If you want this info, toolserver queries are the right way to do it. It should be pretty easy to pull this kind of info out of the revision or recentchanges tables, although it would require reading a lot of data. The simplest way would be to get a list of usernames for each wiki that have edited in the last X days, then use a script to reverse the lists so that you get a list of languages for each user. You'd probably want to only include unified accounts here. (How many accounts still aren't unified?)
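A minimal sketch of the "reverse the lists" step, in Python. It assumes the per-wiki lists of recently active usernames have already been fetched into a dict; the function name and the sample data are hypothetical, and the sketch does not query the toolserver itself.

```python
from collections import defaultdict


def invert_wiki_user_lists(editors_by_wiki):
    """Turn {wiki: [usernames]} into {username: sorted list of wikis}.

    Assumes the same username on different wikis is the same person,
    which only holds for unified accounts.
    """
    wikis_by_user = defaultdict(set)
    for wiki, users in editors_by_wiki.items():
        for user in users:
            wikis_by_user[user].add(wiki)
    return {user: sorted(wikis) for user, wikis in wikis_by_user.items()}


if __name__ == "__main__":
    # Made-up input: one list of active usernames per wiki.
    sample = {
        "enwiki": ["UserA", "UserB"],
        "hewiki": ["UserA"],
        "commonswiki": ["UserB", "UserC"],
    }
    for user, wikis in sorted(invert_wiki_user_lists(sample).items()):
        print(user, wikis)
```

Users whose resulting list spans more than one language are the candidates; multilingual projects such as Commons would need to be handled separately.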
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l