Wiki-research-l March 2015

wiki-research-l@lists.wikimedia.org

31 participants
23 discussions

March 2015 Research Showcase

by Leila Zia

Hi,

This month's research showcase <https://www.mediawiki.org/w/index.php?title=Analytics/Research_and_Data/Sho…> is scheduled for Wednesday, March 25, 11:30 (PST). We will have two presentations: one on user session identification by Aaron Halfaker, and one on mining missing hyperlinks in Wikipedia by Bob West.

As usual, the event will be recorded and publicly streamed on YouTube (links will follow). We will hold a discussion and take questions from remote participants via the Wikimedia Research IRC channel (#wikimedia-research on freenode).

Looking forward to seeing you there.

Leila


Re: [Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps

by aaron shaw

Adding to Giovanni's points (all of which I agree with 100%):

- This would be awesome! The pageviews are super useful for many of us, and cleaning them up a bit would save a lot of redundant work down the road.
- If you don't have to collapse page views incoming from mobile and zero, I would recommend keeping them separate. That said, I haven't spent any time looking into it, and so I confess complete ignorance on this front.
- I agree with you that page ids are better than titles. Great idea.
- I don't think the byte information is/was useful in this dataset, so I agree with dumping that.
- Backfill would be totally great.

Happy to chat more if it seems helpful...

a

On Thu, Mar 19, 2015 at 7:13 PM, Giovanni Luca Ciampaglia <gciampag(a)indiana.edu> wrote:

> Hi Oliver,
>
> Tab-separation would be welcomed. Title normalisation would be *very*
> useful too. Another thing that could potentially save a lot of space would
> be to throw out all malformed requests, pieces of javascript, and similar
> junk. Not sure how difficult that would be though, without doing an actual
> query on the DB for the page id.
>
> For example, an excerpt from 20140101-000000.gz (with only the title and
> views fields):
>
> 'Ø§Ùï¿½ØØ§Ùï¿½Â_Ùï¿½Ø´Ø¨Ø§Ø¨'_Â_Ùï¿½Ùï¿½Ø§Ø·Ø¹Â_Ùï¿½Ø¶ØÙï¿½Ø© 1
> '/javascript:document.location.href='/'_encodeURIComponent(document.getElementById('txt_input_text').value) 9
> '03_Bonnie_&_Clyde 18
> A_Night_at_the_Opera_(Queen_album) 57
> '40s_on_4 2
> '50s_on_5 1
> '71_(film) 4
> '74_Jailbreak 3
> '77 1
> '79-00_éÃ¯Â¿Â½åÃ¯Â¿Â½¶åÃ¯Â¿Â½©åºÃ¯Â¿Â½å_±éÃ¯Â¿Â½Ã¯Â¿Â½vol.8_ACåÃ¯Â¿Â½¬åÃ¯Â¿Â½±åºÃ¯Â¿Â½åÃ¯Â¿Â½Ã¯Â¿Â½æ©Ã¯Â¿Â½æ§Ã¯Â¿Â½ 1
>
> Cheers,
>
> G
>
> Giovanni Luca Ciampaglia
>
> ✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA
> ☞ http://www.glciampaglia.com/
> ✆ +1 812 855-7261
> ✉ gciampag(a)indiana.edu
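Giovanni's suggestion to throw out malformed requests without a database lookup could be approximated with a few cheap checks on the title alone. The sketch below is only an illustration: the function name `looks_malformed` is hypothetical, and the set of forbidden characters is an assumption (characters that MediaWiki page titles are not expected to contain by default) that should be verified against the wiki's actual configuration.

```python
from urllib.parse import unquote_to_bytes

# Assumption: characters not expected in MediaWiki page titles by default.
FORBIDDEN_CHARS = set("#<>[]|{}")


def looks_malformed(encoded_title):
    """Heuristically flag a percent-encoded title as junk (hypothetical helper)."""
    raw = unquote_to_bytes(encoded_title)
    try:
        title = raw.decode("utf-8")
    except UnicodeDecodeError:
        return True  # not valid UTF-8 once percent-decoded
    if not title:
        return True  # empty title
    if any(ch in FORBIDDEN_CHARS for ch in title):
        return True  # contains a character assumed to be illegal in titles
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in title):
        return True  # contains control characters
    return "\ufffd" in title  # contains the Unicode replacement character
```

Checks like these are deliberately conservative. The javascript fragment in the excerpt above, for instance, contains no forbidden character and would pass, which is consistent with Giovanni's caveat that catching everything probably needs a page id lookup.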


[Technical][Request for Comment] A new format for the pageview dumps

by Oliver Keyes

So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of em. But what does that mean for third-party researchers?

Well...not much, at the moment, because the data isn't being released somewhere. But one resource we do have that third-parties use a heck of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.

Due to historical size constraints and decision-making (and by historical I mean: last decade) these have a number of weirdnesses in formatting terms; project identification is done using a notation style not really used anywhere else, mobile/zero/desktop appear on different lines, and the files are space-separated. I'd like to put some volunteer time into spitting out dumps in an easier-to-work-with format, using the new definition, to run in /parallel/ with the existing logs.

*The new format*

At the moment we have the format:

    project_notation - encoded_title - pageviews - bytes

This puts zero and mobile requests to pageX in a different place to desktop requests, requires some reconstruction of project_notation, and contains (for some use cases) extraneous information - that being the byte-count. The files are also headerless, unquoted and space-separated, which saves space but is sometimes...I think the term is "eeeeh-inducing".

What I'd like to use as a new format is:

    full_project_url - encoded_title - desktop_pageviews - mobile_and_zero_pageviews

This file would:

1. Include a header row;
2. Be formatted as a tab-separated, rather than space-separated, file;
3. Exclude bytecounts;
4. Include desktop and mobile pageview counts on the same line;
5. Use the full project URL ("en.wikivoyage.org") instead of the pagecounts-specific notation ("en.v")

So, as a made-up example, instead of:

    de.m.v Florence 32 9024
    de.v Florence 920 7570

we'd end up with:

    de.wikivoyage.org Florence 920 32

In the future we could also work to /normalise/ the title - replacing it with the page title that refers to the actual pageID. This won't impact legacy files, and is currently blocked on the Apps team, but should be viable as soon as that blocker goes away.

I've written a script capable of parsing and reformatting the legacy files, so we should be able to backfill in this new format too, if that's wanted (see below).

*The size constraints*

There really aren't any. Like I said, the historical rationale for a lot of these decisions seems to have been keeping the files small. But by putting requests to the same title from different site versions on the same line, and dropping byte-count, we save enough space that the resulting files are approximately the same size as the old ones - or in many cases, actually smaller.

*What I'm asking for*

Feedback! What do people think of the new format? What would they like to see that they don't? What don't they need, here? How useful would normalisation be? How useful would backfilling be?

*What I'm not asking for*

WMF time! Like I said, this is a spare-time project; I've also got volunteers for Code Review and checking, too (Yuvi and Otto).

The replacement of the old files! Too many people depend on that format and that definition, and I don't want to make them sad.

Thoughts?

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
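To make the proposed reformatting concrete, here is a minimal sketch of the legacy-to-new conversion described in the post. It is not Oliver's script, which is not shown in this thread. The names `split_project` and `convert` are hypothetical, the suffix table contains only the one mapping the post itself gives ("en.v" to "en.wikivoyage.org"), and treating "m" and "zero" as site-version markers is an assumption based on the made-up "de.m.v" example.

```python
from collections import defaultdict

# Illustrative only: the single mapping taken from the post's example.
# A real converter would need the complete, verified table of project codes.
PROJECT_SUFFIXES = {"v": "wikivoyage.org"}
# Assumed site-version markers, following the "de.m.v" example in the post.
SITE_VERSION_MARKERS = {"m", "zero"}
HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def split_project(notation):
    """Return (full_project_url, is_mobile) for a legacy project notation."""
    language, *rest = notation.split(".")
    is_mobile = any(part in SITE_VERSION_MARKERS for part in rest)
    codes = [part for part in rest if part not in SITE_VERSION_MARKERS]
    if len(codes) != 1 or codes[0] not in PROJECT_SUFFIXES:
        raise ValueError(f"unknown project notation: {notation!r}")
    return f"{language}.{PROJECT_SUFFIXES[codes[0]]}", is_mobile


def convert(legacy_lines):
    """Merge space-separated legacy rows into tab-separated rows with a header."""
    counts = defaultdict(lambda: [0, 0])  # (project, title) -> [desktop, mobile]
    for line in legacy_lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) != 4:
            continue  # skip rows that do not have exactly four fields
        notation, title, views, _byte_count = fields  # byte count is dropped
        try:
            project, is_mobile = split_project(notation)
            view_count = int(views)
        except ValueError:
            continue  # skip unknown projects and non-numeric counts
        counts[(project, title)][1 if is_mobile else 0] += view_count
    rows = ["\t".join(HEADER)]
    for (project, title), (desktop, mobile) in sorted(counts.items()):
        rows.append(f"{project}\t{title}\t{desktop}\t{mobile}")
    return "\n".join(rows) + "\n"
```

Calling `convert(["de.m.v Florence 32 9024", "de.v Florence 920 7570"])` yields the header row followed by a single tab-separated row for de.wikivoyage.org, Florence, 920 and 32, matching the made-up example in the post. Rows for projects missing from the illustrative table are silently skipped in this sketch; a real tool would more likely log them.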


Fwd: [Wikimedia Announcements] Announcing Wikimedia Foundation's New Open Access Policy

by Pine W

Forwarding announcement.

Pine

---------- Forwarded message ----------
From: "Manprit Brar" <mbrar(a)wikimedia.org>
Date: Mar 18, 2015 1:19 PM
Subject: [Wikimedia Announcements] Announcing Wikimedia Foundation's New Open Access Policy
To: <wikimediaannounce-l(a)lists.wikimedia.org>
Cc:

Hi all,

We're proud to announce that the Wikimedia Foundation today joins the growing ranks of major institutions with open access policies. Our new Open Access Policy <https://wikimediafoundation.org/wiki/Open_access_policy>[1] will ensure that all research the Wikimedia Foundation supports through grants, equipment, or research collaboration is made widely accessible and reusable. Research, data, and code developed through these collaborations will be made available in open access venues and under a free license <http://freedomdefined.org/>[2] in keeping with the Wikimedia Foundation’s mission to support free knowledge.

You can read more about this effort in today's blog post <https://blog.wikimedia.org/2015/03/18/wikimedia-open-access-policy/>[3].

[1] https://wikimediafoundation.org/wiki/Open_access_policy
[2] http://freedomdefined.org/Definition
[3] https://blog.wikimedia.org/2015/03/18/wikimedia-open-access-policy/

Manprit Brar
Legal Counsel
*Wikimedia Foundation*

*NOTICE*: This message may be confidential or legally privileged. If you have received it by accident, please delete it and let us know about the mistake. As an attorney for the Wikimedia Foundation, for legal/ethical reasons I cannot give legal advice to, or serve as a lawyer for, community members, volunteers, or staff members in their personal capacity. For more on what this means, please see our legal disclaimer <https://meta.wikimedia.org/wiki/Wikimedia_Legal_Disclaimer>.


Research into global Wikimedia content and communities: regional categorization/aggregation

by h

Dear all,

Recently I have gathered and plotted the world's Internet users from 2000 onwards based on three different world geographic categorization schemes: UN, World Bank, and CIA.

http://people.oii.ox.ac.uk/hanteng/2015/03/18/the-new-internet-world-un-wor…

In this blog post, I have noticed that the Wikimedia Foundation's geographic categorization seems to be different from the Stats page and the International development team, singling out the distinct absence/presence of the category of Middle East and North Africa (MENA), a category that is used only in the World Bank categorization scheme.

Personally I do not have a preference for any of the categorization schemes. I believe in open data, open research, and open solutions, so it is up to researchers and practitioners to decide which categorization scheme to use. Nonetheless, it points to a practice gap in the current research into "global Wikimedia content and communities". Thus I raise a few questions in the blog post, as below, in terms of how we can systematically compare and share Wikimedia's global activities in editing, viewing, fundraising and fund dissemination:

The above charts not only show the important baselines for any Websites (assuming that Internet users are also Web users) for their global strategies for growth, but also demonstrate the importance of the choice of geographic categorization scheme. ..... ...., it is relevant and important if the Wikimedia Foundation can release its global activities, including the viewing, editing and fundraising numbers so that researchers and Wikipedians can compare those numbers with the world distribution of Internet users. It would be interesting to see, for example, how the funds are raised from around the world and then distributed across different geographic regions, per Internet user.

I believe that data aggregated to the level of any of the three geographic categorization schemes (CIA, World Bank, and UN) should raise far fewer privacy concerns and thus should be reported regularly: if not monthly, at least annually.

Feel free to forward this to another mailing list of the larger Wikimedia family if you find it relevant.

Best,
han-teng liao


Deadline approaching! OpenSym 2015

by aaron shaw

Please see below for details. Note that there are several research tracks including a Wikipedia Research track.

OPENSYM 2015, THE 11TH INTERNATIONAL SYMPOSIUM ON OPEN COLLABORATION

August 19-21, 2015 | San Francisco, California, U.S.A.
http://opensym.org/os2015 | ACM SIGWEB and ACM SIGSOFT supported

ABOUT THE CONFERENCE

The 11th International Symposium on Open Collaboration (OpenSym 2015) is the premier conference on open collaboration research and practice, including free/libre/open source software, open data, IT-driven open innovation research, wikis and related open collaborative media, and Wikipedia and related Wikimedia projects.

OpenSym brings together the different strands of open collaboration research and practice, seeking to create synergies and inspire new collaborations between computer science and information systems researchers, social scientists, legal scholars, and everyone interested in understanding open collaboration and how it is changing the world.

OpenSym 2015 will be held in San Francisco, California, on *August 19-21, 2015*.

This is the general call for papers and includes the

- research track call for submissions,
- industry and community track call for submissions, and
- doctoral symposium call for submissions.

OpenSym is held in cooperation with ACM SIGWEB and ACM SIGSOFT. As in previous years, the conference proceedings will be archived in the ACM digital library.

RESEARCH TRACK CALL FOR SUBMISSIONS

The conference provides the following peer-reviewed research tracks.

- Free/libre/open source software research, chaired by Carlos Jensen of Oregon State University and Gregorio Robles of Universidad Rey Juan Carlos. This track seeks papers on all aspects of FLOSS. For detailed topics and the research track committee please see http://wp.me/Pezfy-IU.
- IT-driven open innovation research, chaired by Ann Majchrzak of University of Southern California and Arvind Malhotra of University of North Carolina at Chapel Hill. This track is devoted to research on the process of expanding research and development activities beyond the boundaries of single company structures. For detailed topics and the research track committee please see http://wp.me/Pezfy-J3.
- Open data research, chaired by Carl Lagoze of University of Michigan. This track contributes to the increasing awareness of Open Data in research. For detailed topics and the research track committee please see http://wp.me/Pezfy-J5.
- Wikis and open collaboration research, chaired by Kevin Crowston of Syracuse University. This track is dedicated to the science and application of wikis and open collaboration technology outside of the context of Wikipedia. For detailed topics and the research track committee please see http://wp.me/Pezfy-J7.
- Wikipedia and related projects research, chaired by Claudia Müller-Birn of Freie Universität Berlin and Aaron Shaw of Northwestern University. This track addresses research specifically on Wikipedia and associated projects. For detailed topics and the research track committee please see http://wp.me/Pezfy-J9.

Research papers present integrative reviews or original reports of substantive new work: theoretical, empirical, and/or in the design, development and/or deployment of novel concepts, systems, and mechanisms. Research papers will be reviewed by a research track program committee to meet rigorous academic standards of publication. Papers will be reviewed for relevance, conceptual quality, innovation and clarity of presentation.

Authors can submit full papers (5-10 pages), short papers (2-4 pages), and research posters (1-2 pages). For more details on paper types please see http://wp.me/Pezfy-Je.

Submission deadline for all research contributions is *March 29th, 2015*. Authors submit through EasyChair at https://easychair.org/conferences/?conf=opensym2015. Submissions and final contributions must follow the ACM SIG Proceedings template found at http://www.acm.org/sigs/publications/proceedings-templates.

OpenSym seeks to accommodate the needs of the different research disciplines it draws on. Authors whose submissions have been accepted for presentation at the conference have a choice of having

- their paper become part of the official proceedings, archived in the ACM Digital Library, or having
- only a short abstract included in the proceedings (rather than the full submitted paper) in order to preserve future publication possibilities.

DOCTORAL SYMPOSIUM CALL FOR SUBMISSIONS

OpenSym seeks to explore the synergies between all strands of open collaboration research. Thus, we will have a doctoral symposium, in which Ph.D. students from different disciplines can present their work and receive feedback from senior faculty and their peers.

Submission deadline for doctoral symposium position papers is *May 3rd, 2015*. Authors submit through EasyChair at https://easychair.org/conferences/?conf=opensym2015. Submissions and final contributions must follow the ACM SIG Proceedings template found at http://www.acm.org/sigs/publications/proceedings-templates.

More information is available at http://wp.me/Pezfy-Jh.

INDUSTRY AND COMMUNITY TRACK CALL FOR SUBMISSIONS

OpenSym is also seeking submissions for experience reports (full and short), tutorials, workshops, panels, non-research posters, and demos. Such work accepted for presentation or performance at the conference is considered part of the industry and community track. It will be put into the proceedings in an industry and community track section; authors can opt out of the publication, as with research papers, but will still have to provide an abstract (less than one page) for the proceedings.

Submission deadline for industry and community track papers is *April 19, 2015*. Authors submit through EasyChair at https://easychair.org/conferences/?conf=opensym2015. Submissions and final contributions must follow the ACM SIG Proceedings template found at http://www.acm.org/sigs/publications/proceedings-templates.

More information is available at http://wp.me/Pezfy-Jh.

THE OPENSYM CONFERENCE EXPERIENCE

OpenSym 2015 will be held in San Francisco, California, on August 19-21, 2015. Research, industry, and community presentations and performances will be accompanied by keynotes, invited speakers, and a social program in one of the most vibrant cities on this planet.

The open space track is a key ingredient of the event that distinguishes OpenSym from other conferences. It is an integral part of the program that makes it easy to talk to other researchers and practitioners and to stretch your imagination and conversations beyond the limits of your own sub-discipline, exposing you to the full breadth of open collaboration research. The open space track is entirely participant-organized, is open for everyone, and requires no submission or review.

The general chair of the conference is Dirk Riehle of Friedrich-Alexander University Erlangen-Nürnberg. Feel free to contact us with any questions you might have at info(a)opensym.org.


(no subject)

by h

Dear all,

I have some findings suggesting that a "page views per Internet user" measurement may help in comparing different language editions of Wikipedia. Criticism and suggestions are welcome.

-----

http://people.oii.ox.ac.uk/hanteng/2015/03/15/comparing-language-developmen…

Which language version of Wikipedia enjoys the most page views per language Internet user, relative to what would be expected? It is Finnish. In terms of absolute positive and negative gap, English has the widest positive gap whereas Chinese has the largest negative gap.

......

In particular, it is known that Wikipedia (and Google, which often favours Wikipedia) faces local competition in the People's Republic of China and South Korea. Therefore it is understandable that page views may be lower in the Chinese and Korean Wikipedia language projects, simply because some users' need to read user-generated encyclopedias is satisfied by other websites. However, it remains an important question to examine why these particular Latin and Asian languages are under-developed for Wikipedia projects.


Fwd: help with CFP

by Giovanni Luca Ciampaglia

FYI, people looking at revert logs may be interested in this satellite event at ICWSM'15.

Cheers,

Giovanni Luca Ciampaglia

✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA
☞ http://www.glciampaglia.com/
✆ +1 812 855-7261
✉ gciampag(a)indiana.edu

---------- Forwarded message ----------
From: Nicola Perra <nicolaperra(a)gmail.com>
Date: 2015-03-11 10:00 GMT-04:00

*Modeling and Mining Temporal Interactions (M2TI)*
*ICWSM'15 workshop.*
*Oxford, UK, May 26, 2015*
*Webpage:* http://m2ti.weebly.com/

The emergence of the Big Data paradigm together with the framework of complex networks has played a crucial role in providing the tools and datasets to begin understanding human interactions and the dynamics of social systems with wide applications to informatics, social sciences, information technology, and epidemiology. The rise of the social web has resulted in the creation of an unprecedented number of social systems that can be analyzed in great detail. This unique opportunity has fostered an interdisciplinary effort to study their structure and dynamics with network theory taking the forefront. As a result, researchers have unveiled a number of surprising properties about the structure and dynamics of large-scale social systems such as online social networks, scientific and OSS collaboration networks, or mobile phone communication networks.

Due to both theoretical and practical difficulties, until recently, the majority of studies have focused on static representations of the social interactions, where the underlying network structure does not change over time. While this assumption has proven to be useful and allowed for great progress recently, its limitations are now becoming clear. Static representations miss the timing, duration, and concurrency of social interactions, which are crucial to characterize many processes such as the spread of rumors, information and epidemics, the emergence of social norms and memes or even online congestion, resource depletion and mutual influence, among others. In general, neglecting the network dynamics might lead to mischaracterizations of the dynamical processes.

Today, thanks to the ubiquity of online services, social media, GPS enabled smartphones and the rise of wearable computers, rich time-resolved datasets of both user behaviors and interactions are available. Empirical data on human interaction dynamics now include user access logs, email communications, chatting, phone call and SMS metadata, online forum discussions, collaborations in software projects, movie participations, geo-located individual mobility datasets, location check in, and many others. The availability of such data is triggering a new wave of data-driven studies of temporal properties of human behavior, communication, and social interactions.

The mining and study of the temporal characteristics of social dynamics raises new fundamental challenges, both theoretical and computational, with applications to fields such as social sciences, marketing, computer science, and epidemiology. In particular, new tools and frameworks are needed to mine, characterize and model temporal behaviors as well as the profound consequences that individual characteristics and behavioral correlations have on processes under study.

*Topics and Themes*

The list of topics that we aim to cover at the workshop is the following:

- Data mining approaches for time-resolved datasets
- Representation of time-resolved datasets
- Dynamic interplay between social temporal networks and human behavior
- Analysis and characterization of temporal networks
- Modeling of temporal networks
- Behavioral pattern mining
- User modeling and analysis of behavioral patterns
- Processes on temporal networks
- Data driven models of collaboration
- Influence models
- User and product recommendation based on temporal patterns
- Spatio-temporal correlation between interactions and service usage

The meeting will be one day long and will host around 15 contributed talks. Submissions will be evaluated and selected by the Program Committee members, based on the adherence with the theme of the satellite, originality and scientific soundness.

Abstracts are submitted via the EasyChair website: https://easychair.org/conferences/?conf=m2ti

Authors who do not have an EasyChair account should sign up for an account (for identification purposes, make sure to use the same email address as the one used for the conference registration).

We solicit research papers (up to 8 pages) and extended abstracts (max 2 pages with one illustration/figure). Extended abstracts will not be included in the proceedings of the conference. Once the selection process is complete, the authors of the accepted abstracts will be notified by e-mail.

Papers submission: *March 13, 2015*
Acceptance notification: March 27, 2015
Camera-ready paper due: April 3, 2015
Workshop Day: May 26, 2015

*Invited Speakers*:

- Bruno Ribeiro, Carnegie Mellon University
- Suzy Moat, Warwick University

*Organizers*:

- Bruno Goncalves
- Marton Karsai
- Nicola Perra


whoVIS editor-editor interaction visualization prototype, API for word provenance

by Flöck, Fabian

Hi all,

we produced a prototype of an editor-editor interaction network visualization for individual articles, based on the words/tokens deleted and reintroduced by editors. It will be presented as a demo at the WWW conference this year [1], but we would love to also get some feedback on it from this list. It's in an early stage and pretty slow when loading up, so have patience when you try it out here: http://km.aifb.kit.edu/sites/whovis/index.html, and be sure to read the "how to" section on the site. Alternatively you can watch the (semi-professional) screencast I did :P, it explains most of the functions.

The (disagreement) interactions are based on an extended version of the extraction of authorship we do with wikiwho [2], and the graph drawing is done almost exactly after the nice method proposed by Brandes et al. [3]. The code can be found at github, both for the interaction-extraction extension of wikiwho [4] and the visualization itself [5], which basically produces a JSON output for feeding the D3 visualization libraries we use. We have yet to generate output for more articles; so far we only show a handful for demonstration purposes.

The whole thing also fits nicely (and was supposed to go along) with the IEG proposal that Pine had started on editor interaction [6].

word provenance/authorship API prototype:

Also, we have worked a bit on our early prototype for an API for word provenance/authorship. You can get word/token-wise information on which revision each piece of content originated from (and thereby which editor originally authored the word) at

http://193.175.238.123/wikiwho/wikiwho_api.py?revid=<REV_ID>&name=<ARTICLENAME>&format=json

(<ARTICLENAME> -> name of the article in ns:0, in the english wikipedia, <REV_ID> -> rev_id of that article for which you want the authorship information, format is currently only json)

Example: http://193.175.238.123/wikiwho/wikiwho_api.py?revid=649876382&name=Laura_Bu…

Output format is currently:

    {"tokens": [
        {"token": "<FIRST TOKEN IN THE WIKI MARKUP TEXT>", "author_name": "<NAME OF AUTHOR OF THE TOKEN>", "rev_id": "<REV_ID WHEN TOKEN WAS FIRST ADDED>"},
        {"token": "<SECOND TOKEN IN THE WIKI MARKUP TEXT>", "author_name": "<NAME OF AUTHOR OF THE TOKEN>", "rev_id": "<REV_ID WHEN TOKEN WAS FIRST ADDED>"},
        {"token": "<THIRD TOKEN … …
      ],
     "message": null,
     "success": "true",
     "revision": {"article": "<NAME OF REQUESTED ARTICLE>", "time": "<TIMESTAMP OF REQUESTED REV_ID>", "reviid": <REQUESTED REV_ID>, "author": "<AUTHOR OF REQUESTED REV_ID>"}}

DISCLAIMER: there are problems with getting/processing the XML for larger articles right now, so don't be surprised if that gives you an error sometimes (i.e. querying "Barack Obama" for instance and similar sizes will *not* succeed for higher revision numbers). Also, we are working on the speed and providing more precomputed articles (right now almost all are computed on request, although we save intermediary results). Still, for most articles it works fine and the output has been tested for accuracy (cf. [2]).

At some point in the future, this API will also be able to deliver the interaction data that the visualization is built on.

I'm looking forward to your feedback :)

Cheers,
Fabian

[1] http://f-squared.org/wikiwho/demo32.pdf
[2] http://f-squared.org/wikiwho/
[3] http://dl.acm.org/citation.cfm?id=1526808
[4] https://github.com/maribelacosta/wikiwho
[5] https://github.com/wikiwho/whovis
[6] https://meta.wikimedia.org/wiki/Grants:IEG/Editor_Interaction_Data_Extracti…

--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.floeck(a)gesis.org<mailto:fabian.floeck@gesis.org>
www.gesis.org<http://www.gesis.org/>
www.facebook.com/gesis.org<http://www.facebook.com/gesis.org>
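As an illustration of how the documented output format might be consumed, the sketch below counts tokens per original author from a response body. It makes no network request. The function name `tokens_per_author` and all sample values are made up for this example, and the only thing taken from the message is the documented shape of the JSON (a "tokens" list with "author_name" fields, and "success" given as the string "true").

```python
import json
from collections import Counter


def tokens_per_author(response_text):
    """Count how many tokens each author originally contributed (hypothetical helper)."""
    data = json.loads(response_text)
    # The documented format reports success as the string "true".
    if data.get("success") != "true":
        raise RuntimeError(data.get("message") or "request did not succeed")
    return Counter(token["author_name"] for token in data["tokens"])


# Made-up sample that mirrors the documented output format.
sample_response = json.dumps({
    "tokens": [
        {"token": "hello", "author_name": "Alice", "rev_id": "100"},
        {"token": "wiki", "author_name": "Bob", "rev_id": "101"},
        {"token": "world", "author_name": "Alice", "rev_id": "100"},
    ],
    "message": None,
    "success": "true",
    "revision": {"article": "Example", "time": "2015-03-01T00:00:00Z",
                 "reviid": 102, "author": "Bob"},
})

author_counts = tokens_per_author(sample_response)
```

Given the disclaimer about large articles, a real client would presumably also want a timeout and some handling for error responses; those are left out here because the message does not describe what the errors look like.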


Re: [Wiki-research-l] a cautious note on gender stats Re: Fwd: [Gendergap] Wikipedia readers

by aaron shaw

On Mon, Feb 16, 2015 at 10:41 PM, koltzenburg(a)w4w.net <koltzenburg(a)w4w.net> wrote:

> ____Aaron wrote:
> "higher quality survey data"
> well, and how does one recognize low quality and how come it is so low?
> and "quality" by whose epistemological aims and standards?
>
> "causes and mechanisms that drive the gender gap (and related
> participation gaps)"
> which "related participation gaps" do you have in mind here?

Jane's response was helpful and similar to mine. Based on existing surveys, there are demographic and social categories of people who are underrepresented among current editors. I don't have specifics off the top of my head, but if you look at WMF survey results for US editors and compare the findings to US census data (for example), you can get an idea of some categories. Women are underrepresented to an extreme degree, but they are not the only population that does not seem to edit en:WP. I am less knowledgeable about other WPs, but I suspect there are other inequalities and gaps on other wikis.

> where would these gaps be situated in terms of areas of participation?

See above.

> and, again, in which language version(s)?

See above.

On Mon, Feb 16, 2015 at 11:38 PM, Federico Leva (Nemo) <nemowiki(a)gmail.com> wrote:

> Speaking of which, the WMF doesn't have resources to appropriately
> process the 2012 survey data, so results aren't available yet. Did you
> consider offering them to take care of it, at least for the gendergap
> number? You would then be able to publish an update.
>
> https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_Editor_Survey_2012#…

As before, my understanding is that the method by which respondents were selected to participate in the survey does not follow standard survey sampling methods (see this chunk <https://meta.wikimedia.org/wiki/Research:Wikipedia_Editor_Survey_2012#When.…> of the description of the survey). As a result, I do not trust the results of the 2012 survey to generate precise estimates of the gender gap or other demographic details about participation. I've spoken to some very receptive folks at the Foundation about this and I hope that they/we will be able to improve it in the future.

I'm eager to help improve the survey data collection procedures. Unfortunately, I do not have the capacity to analyze the current survey data in greater depth.

The thing that allowed Mako and me to do the study that we published in PLOS ONE was the fact that (1) the old UNU-MERIT & WMF survey sought to include readers as well as editors; *and* (2) at the exact same time Pew carried out a survey in which they asked a nearly identical question about readership. We used the overlapping results about WP readership from both surveys to generate a correction for the data about editorship.

Without similar data on readership and similar data from a representative sample of some reference population (in the case of the Pew survey, US adults), we cannot perform the same correction. As a result, I do not feel comfortable estimating how biased (or unbiased) the 2012 survey results may be.

a

On Tue, Feb 17, 2015 at 12:00 AM, Jane Darnell <jane023(a)gmail.com> wrote:

> Hi Claudia,
> I responded to your questions in the text - hope it's readable.
> Jane
>
> ____WereSpielChequers wrote:
> "the community is more abrasive towards women"
>
> I think he is simply referring to earlier discussions where the
> conclusion was "the community can be perceived to be abrasive" and this
> conclusion, in yet other discussions led to this conclusion, which should
> be rephrased as "the community is more often perceived as abrasive by women
> than by men"
>
> ____Kerry wrote:
> "But I would agree that if an organisation sets a target (25% women in this
> particular case) and then does not put in place a means of measuring the
> progress against that target, one has to question the point of
> establishing a target."
>
> ___Claudia (responding to Kerry):
> I think one has to question the point of not putting in place a means of
> measuring the progress...
> and also ask why, if the issue is a high priority (allegedly, one might
> add, in speeches at meetings, in interviews with the press...) this
> organisation does not fund any top level research... - or does it?
>
> I think here you are forgetting about the "holy shit graph" which shows
> a reduction in the number of active editors over time. This is much more of
> a direct threat to the Wikiverse than the gendergap, which, as has been
> stated before, is only one of many serious gaps in knowledge coverage.
> Oddly, I think it is one of the easiest of all "participatory gaps" to
> measure, but we seem to constantly get stranded in objections to ways that
> previous editor surveys have been held, leading to the strange situation of
> never actually being able to run even one editor survey twice. Since we
> have not yet been able to establish any trend at all, we are only comparing
> apples to oranges.
>
> ____Aaron wrote:
> "higher quality survey data"
> __Claudia (responding to Aaron): ...how does one recognize low quality..?
> Hmm. I just looked and I couldn't find the criticism of the various editor
> surveys. Is this stashed somewhere on meta? Or do we need to sift through
> reams of emails until we find all the various objections? Objections
> galore, as I recall.
>
> ___Claudia: which "related participation gaps" do you have in mind here?
> Off the top of my head, some of these would be
>
> 1) lack of geographical editor coverage such as active editors in rural
> areas or even in whole states such as Wyoming or South Dakota and the whole
> "Global South participation problem" (the Global South participation
> problem is even helped along inadvertently by the new read-only
> "Wikipedia-zero" effect);
> 2) lack of topical expertise on subjects that technically don't lend
> themselves well to the Wikiverse, such as auditory fields (musical
> production) or visual fields (how to paint, how to make movies, how to
> choreograph motion)
> 3) lack of topical expertise on subjects that legally don't lend
> themselves well to the Wikiverse, such as articles about artworks under
> copyright that cannot be illustrated in an article;
> 4) lack of topical editor coverage on subjects previously shut out - there
> is still unwillingness by a whole group to re-enter the Wikiverse after
> being banned (earlier shut-outs such as blocking whole institution-wide ip
> ranges for vandalism or whole areas of expertise such as groups of writers
> for their COI editing, carry with them a history of anti-Wikipedia
> sentiment that lasts a long time in various enclaves)
>
> ___Claudia:
> and, again, in which language version(s)?
> That's easy - the languages that we can technically support but don't yet
> have Wikipedias for and the languages for which we don't even have the
> fonts to display them.
>
> best,
> Claudia
>
>
> On Tue, Feb 17, 2015 at 7:41 AM, <koltzenburg(a)w4w.net> wrote:
>
>> Hi WereSpielChequers, Kerry, Aaron and all,
>>
>> ____WereSpielChequers wrote:
>> "the community is more abrasive towards women"
>>
>> this may be stats expert discourse, but let me show you how the question
>> itself has a gendered slant.
>> imagine what would happen - also in your research design - if it read:
>> "the community is less abrasive towards men" - how does this compare to the
>> first question re who are "the community"?
>>
>> and again, re phasing ten years in 2011 and four years on, which language
>> version(s) are hypotheses based on?
>>
>> ____Kerry wrote:
>> "But I would agree that if an organisation sets a target (25% women in
>> this particular case) and then does not put in place a means of measuring
>> the progress against that target, one has to question the point of
>> establishing a target."
>>
>> I think one has to question the point of not putting in place a means of
>> measuring the progress...
>> and also ask why, if the issue is a high priority (allegedly, one might
>> add, in speeches at meetings, in interviews with the press...) this
>> organisation does not fund any top level research... - or does it?
>>
>> ____Aaron wrote:
>> "higher quality survey data"
>> well, and how does one recognize low quality and how come it is so low?
>> and "quality" by whose epistemological aims and standards?
>>
>> "causes and mechanisms that drive the gender gap (and related
>> participation gaps)"
>> which "related participation gaps" do you have in mind here?
>> where would these gaps be situated in terms of areas of participation?
>> and, again, in which language version(s)?
>>
>> best,
>> Claudia
>>
>> ---------- Original Message -----------
>> From: aaron shaw <aaronshaw(a)northwestern.edu>
>> To: Research into Wikimedia content and communities
>> <wiki-research-l(a)lists.wikimedia.org>
>> Sent: Mon, 16 Feb 2015 20:50:17 -0800
>> Subject: Re: [Wiki-research-l] a cautious note on gender stats Re: Fwd:
>> [Gendergap] Wikipedia readers
>>
>> > Hi all!
>> >
>> > Thanks, Jeremy & Dariusz for following up.
>> >
>> > On Mon, Feb 16, 2015 at 5:58 AM, Dariusz
>> > Jemielniak <darekj(a)alk.edu.pl> wrote:
>> >
>> > > As far as I recall, they did a follow-up on this topic, and maybe a
>> > > publication coming up?
>> >
>> > Sadly, no follow ups at the moment.
>> >
>> > If we want to have a more precise sense of the
>> > demographics of participants the biggest need in
>> > this space is simply higher quality survey data.
>> > My paper with Mako has a lot of detail about why
>> > the 2008 editor survey (and all subsequent editor
>> > surveys, to my knowledge) has some profound limitations.
>> >
>> > The identification and estimation of the effects
>> > of particular causes and mechanisms that drive the
>> > gender gap (and related participation gaps)
>> > presents an even tougher challenge for
>> > researchers and is an area of active inquiry.
>> >
>> > all the best,
>> > Aaron
>> ------- End of Original Message -------
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> Wiki-research-l(a)lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

9 years, 2 months
