Hi Laura,
I am in a research group at the University of Cádiz, Spain. We are developing a tool for retrieving statistical information from MediaWiki installations, called StatMediaWiki.[1] There are two versions: Classic and NP. The first one (text mode) is aimed at small wikis, because its output is a static set of HTML and PNG files. The second one has a simple GUI, and the analyses are generated on the fly (although a pre-processing step is required first). Currently, both generate activity graphs, rankings, etc.
I read your first e-mail to this list a few days ago, about GeoIP location. We have added an analysis which generates a bar graph of countries for anonymous users (for registered users it is not easily possible, as you say). You can see an example of this analysis here[2] for answers.wikia.com, a wiki where people ask questions. The analysis covers the whole wiki, not single pages. I think you are interested in single pages of English Wikipedia, so we would need to add an option to load single pages (instead of pre-processing the whole of Wikipedia, although that is possible too).
If you are interested in using StatMediaWiki, we can work together on the details of this analysis.
Regards, emijrp
[1] http://meta.wikimedia.org/wiki/StatMediaWiki
[2] http://img87.imageshack.us/img87/2522/answersgeoip.png
2010/10/10 Laura Hale laura@fanhistory.com
A copy of this post can be found at http://ozziesport.com/2010/10/expanded-profile-of-australian-en-wp-users/
My dissertation topic involves doing a demographic and geographic study of Australian sport fandom online. There are several sites and social networks where you can get publicly available demographic data to begin to formulate a picture of the user population, and then segment that population out by interest in a league, sport and athlete. I’ve spent a lot of time looking at Twitter, Facebook and LiveJournal. Recently, partly because of a trip to the Wikimedia Foundation http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Screencast and discussions with a few people at UCNISS http://www.ucniss.net/, my interest in who was contributing to Australian sport wiki articles on Wikipedia increased.
Finding out who edited Wikipedia articles using publicly available information is a bit of a challenge. The most reliable information about who edited comes from IP addresses, which can give an idea of the geographic location of the contributor. It is easy enough, with the help of a friend, to create a tool that pulls the history of a Wikipedia article, gets a list of the IP addresses that edited the article, and feeds each IP address into another tool that pulls up the general location of the contributor. (One of my favorite visualizations of this type of information is WikipediaVision http://www.lkozma.net/wpv/index.html.) The data isn’t always accurate, and if I were looking primarily at New Zealand, a country without its own dedicated IP address range, it would be even less reliable. Still, for my purposes, this data works pretty well.
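As a rough illustration of the IP-extraction step described above, here is a minimal Python sketch. It assumes the article history has already been fetched as a plain list of editor names; the function name and the sample data are hypothetical and are not taken from the tool mentioned in the post. Names that parse as IP addresses are treated as anonymous edits, and the geolocation lookup itself is left out.

```python
import ipaddress


def anonymous_editor_ips(editors):
    """Return the unique editor names that are IP addresses, in first-seen order."""
    seen = set()
    ips = []
    for name in editors:
        try:
            ip = ipaddress.ip_address(name.strip())
        except ValueError:
            continue  # a registered account name, not an IP address
        key = str(ip)
        if key not in seen:
            seen.add(key)
            ips.append(key)
    return ips


# Hypothetical history; the addresses come from documentation-only ranges.
history = ["203.0.113.7", "ExampleUser", "198.51.100.23", "203.0.113.7"]
print(anonymous_editor_ips(history))
```

Each address returned could then be passed to whatever geolocation service or database is available.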
This data is still pretty limited. Many articles are edited by registered (non-anonymous) users. Sometimes, it is possible to get demographic and geographic information about Wikipedia contributors by viewing their profile pages. This can be time consuming to do manually if an article has a large number of contributors, as you need to view a lot of user pages. It becomes a deterrent to collecting geographic information about article contributors.
I was looking for a more time-effective and accurate method of collecting the geographic and demographic information about contributors that is publicly available on their user pages. The easiest and quickest way to get this information on a mass scale is to use userbox information. Many userboxes, when included on a user page, put the user into a category. These categories are often then linked through the Wikipedian category structure http://en.wikipedia.org/wiki/Category:Wikipedians. Beyond that, userboxes involve templates. It is easy to get a list of articles (user pages) that a template is included on.
The methodology that I selected from this point is rather straightforward. It involved:
1. Select a category.
2. Copy and paste the list of articles (user pages) in the selected category to an Excel spreadsheet. Sort the list alphabetically. Copy and paste only the user pages to Notepad. Replace * User with blank. Copy and paste this list back to Excel.
3. Create a filter where the cell contains / . Select those cells. Copy them to Notepad, replace / with [tab] in order to remove user subpages from the list. Copy this back to Excel. Select only the column with usernames.
4. Run an advanced filter in order to remove all duplicate rows.
5. Copy this list back to the dedicated spreadsheet. Label all those users with the category from which they were pulled in a unique column.
6. Repeat steps 1 to 5 until all the categories that you want to have included are included.
7. Merge/Group all the rows by username.
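The steps above could probably be automated. The following is only a minimal Python sketch of that idea, not the method actually used for this post. It assumes page titles look like `User:Name` or `User:Name/Subpage`; the function names and the sample category contents are hypothetical.

```python
from collections import defaultdict


def username_from_title(title):
    """Map 'User:Alice/Userboxes' to 'Alice'; return None for non-user pages."""
    prefix = "User:"  # assumed title format
    if not title.startswith(prefix):
        return None
    # Drop the prefix, then drop any subpage part after the first slash.
    return title[len(prefix):].split("/", 1)[0].strip() or None


def merge_categories(pages_by_category):
    """Group category labels by username, removing duplicates (steps 2 to 7)."""
    users = defaultdict(set)
    for category, titles in pages_by_category.items():
        for title in titles:
            name = username_from_title(title)
            if name:
                users[name].add(category)
    return {name: sorted(cats) for name, cats in sorted(users.items())}


# Hypothetical input: one list of page titles per category.
pages = {
    "Wikipedians in Australia": ["User:Alice", "User:Alice/Userboxes", "User:Bob"],
    "Atheist Wikipedians": ["User:Alice", "Category:Something"],
}
for user, categories in merge_categories(pages).items():
    print(user, categories)
```

Using a set per username handles the duplicate removal and the final grouping in one pass, so no separate database step is needed for a data set of this size.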
This method may not be the most efficient way of doing this. It can probably be improved by automating some of these steps. In my case, step 7 could not be completed using Excel. I had to e-mail the file to @woganmay http://twitter.com/woganmay, who I believe converted the file to a MySQL database, used the group feature, converted the results back to CSV and e-mailed the file to me.
In my case, I did not complete this for every category. Some categories did not seem worth the time, as they contained too few user pages. In other cases, the categories were just too big to process. These included all the members of User de, User en, User es, User fr, User it, User jp. Only a selected number of categories were included because of time constraints. Data gathering focused on categories that I perceived would have the greatest number of Australians and other possible contributors to Australian-related articles. When these categories were more exhausted, categories with between 1,00 and 5,000 articles were selected.
There are all sorts of limitations to this data. First, not everyone includes userboxes on their profile pages. This means that there could be a lot more Australians on Wikipedia than indicated by userbox inclusion on a user page. The assumption for the resulting data is that proportional representation exists for various categories. So while there are X Christians and Y Atheists in the data, the assumption is that the relationship between X and Y is proportional to that in the actual population on Wikipedia. Whatever data is available thus has to be viewed as good enough, or supplemented by going to individual user pages to see if other information is available when a user appears in the history of an article but no information about them is found in this data.
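To make the proportionality assumption concrete, here is a small Python sketch using the Atheist and Christian counts from the Religion table in this post. The function names are hypothetical. The code only computes the ratio and shares within the userbox data; whether they carry over to the wider Wikipedia population is exactly the assumption described above, and the code does not test it.

```python
def category_ratio(counts, a, b):
    """Ratio of the count for category a to the count for category b."""
    return counts[a] / counts[b]


def shares(counts):
    """Each category's share of the total across the listed categories."""
    total = sum(counts.values())
    return {name: value / total for name, value in counts.items()}


# Counts taken from the Religion table in this post.
religion = {"Atheist": 97, "Christian": 47}

print(round(category_ratio(religion, "Atheist", "Christian"), 2))
for name, share in shares(religion).items():
    print(name, f"{share:.1%}")
```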
Second, even when userboxes do exist, there are often useful pieces of information that are missing. For example, in an Australian context, there is a userbox for Rugby League fans. There is not, however, a userbox for Australian rules footy fans. There are also no userboxes or categories for fans of NRL or AFL teams. (This type of userbox and category exists for National Hockey League teams.)
About halfway through this process, I realized that this data could be useful for analysis beyond who is editing Wikipedia. At the moment, I’ve only totaled the data I have for Australians. It is pretty fascinating and would be neat to go further with: How does the proportional size of the Australian Wikipedian population compare against the actual population? Does the size of the Australian Atheist versus Christian community accurately reflect the proportions in Australian society? Or is the Australian Wikipedian community demographically distinct from the greater population?
The following tables include the data based on people who were included in Wikipedians in Australia http://en.wikipedia.org/wiki/Category:Wikipedians_in_Australia and its subcategories and Australian Wikipedians http://en.wikipedia.org/wiki/Category:Australian_Wikipedians. A copy of the raw data can be found at October 9 – Wikipedia English Data – Australians.xls http://csv.ozziesport.com/October%209%20-%20Wikipedia%20English%20Data%20-%20Australians.xls. The data is provided without comment though any attempts at explaining the patterns found are very much appreciated.

Country | Count
Bangladesh | 3
Canada | 2
Egypt | 2
India | 1
Indonesia | 2
Ireland | 3
Jamaica | 2
Japan | 5
New Zealand | 17
Papua New Guinea | 1
Republic of Ireland | 5
Singapore | 5
South Africa | 2
South Korea | 1
Sri Lanka | 2
Tanzania | 2
Turkey | 2
United States | 16

State | Count
Australian Capital Territory | 89
Canterbury | 1
New South Wales | 345
Northern Territory | 5
Otago | 1
Queensland | 208
South Australia | 144
Southland | 1
Tasmania | 54
Victoria | 370
Wellington | 2
Western Australia | 145

Degree | Count
BA degrees | 21
BCom degrees | 2
BCS degrees | 3
BE degrees | 18
BMus degrees | 1
BS degrees | 41
MS degrees | 5
PhD degrees | 18

University/Alma Mater | Count
Australian National University | 14
Avondale College | 1
Charles Sturt University | 1
Curtin University of Technology | 7
Deakin University | 6
Flinders University | 7
Griffith University | 1
James Cook University | 2
La Trobe University | 2
Macquarie University | 5
Massey University | 1
Monash University | 19
Royal Melbourne Institute of Technology | 10
University of Adelaide | 4
University of Alberta | 1
University of Canberra | 3
University of Melbourne | 21
University of New England | 4
University of New South Wales | 24
University of Newcastle | 8
University of Sydney | 16
University of Tasmania | 3
University of Technology, Sydney | 4
University of Western Australia | 11
University of Wollongong | 4
Victorian College of the Arts | 1

Student type | Count
Business students | 3
College students | 26
Law students | 9
Medical students | 8
University students | 59

Website | Count
Open Directory Project | 1
OpenStreetMap | 2
Wookieepedia | 1

Religion | Count
Anglican and Episcopalian | 8
Antitheist | 3
Atheist | 97
Buddhist | 13
Catholic | 7
Christian | 47
Eastern Orthodox | 2
Hindu | 1
Jewish | 4
Lutheran | 1
Methodist | 2
Muslim | 4
Non-denominational Christian | 2
Objectivist | 2
Pastafarian | 17
Presbyterian | 3
Protestant | 11
Roman Catholic | 10

Ethnicity and nationality | Count
Argentine | 2
Bangladeshi | 2
British | 3
English | 10
Latino/Hispanic | 1

Skill | Count
Aircraft pilots | 5
Artists | 3
Engineers | 17
Filmmakers | 17
Homebrewers | 10
Mechanical engineers | 1
Professional writers | 1
Surfers | 2

Profession | Count
Accountants | 2
Actor | 5
Actuaries | 2
Aircraft pilots | 5
Biologist | 9
Broadcasters | 5
Chemist | 6
Composers | 28
Computer scientists | 7
Engineers | 17
Filmmakers | 17
Geoscientists | 2
Mechanical engineers | 1
Scientists | 7
Teacher | 18
University teacher | 4
Web designers | 2
Web developers | 1

Interest | Count
Chemistry | 27
Cooking | 1
Physics | 34
Strings (physics) | 6

Sports | Count
Cavers | 2
Cross-country runners | 4
Dancers | 3
Detroit Red Wings fans | 2
Equestrians | 2
Fencers | 2
Geocachers | 8
Hikers | 2
Hunters | 7
Outdoor pursuits | 2
Rugby league fans | 50
Runners | 2
Sailing | 1
Scuba divers | 8
Snowboarders | 2
Swimmers | 16
Swing dancers | 1
Toronto Maple Leafs fans | 1
Ultimate Fighting Championship fans | 2
Vancouver Canucks fans | 3
WikiProject Tennis members | 4

Wikipedia Status | Count
Administrator hopefuls | 41
Administrators | 45
Administrators who will provide copies of deleted articles | 11
Bureaucrats | 1
Contribute to Wikimedia Commons | 1
Create userboxes | 3
Opted out of automatic signing | 4
Reviewers | 10
Rollbackers | 27
Service Award Level 01 | 12
Service Award Level 02 | 14
Service Award Level 03 | 10
Service Award Level 04 | 5
Service Award Level 05 | 6
Service Award Level 06 | 9
Service Award Level 07 | 11
Service Award Level 08 | 3
Service Award Level 09 | 2
Wikimedia Commons administrators | 2

Philosophy | Count
Hindu | 1
Humanist | 6
Materialist | 9
Pastafarian | 16
Theist | 9
-- twitter: purplepopple blog: ozziesport.com
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wiki-research-l