Pursuant to prior discussions about the need for a research
policy on Wikipedia, WikiProject Research is drafting a
policy regarding the recruitment of Wikipedia users to
participate in studies.
At this time, we have a proposed policy, and an accompanying
group that would facilitate recruitment of subjects in much
the same way that the Bot Approvals Group approves bots.
The policy proposal can be found at:
http://en.wikipedia.org/wiki/Wikipedia:Research
The Subject Recruitment Approvals Group mentioned in the proposal
is being described at:
http://en.wikipedia.org/wiki/Wikipedia:Subject_Recruitment_Approvals_Group
Before we move forward with seeking approval from the Wikipedia
community, we would like additional input about the proposal,
and would welcome additional help improving it.
Also, please consider participating in WikiProject Research at:
http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Research
--
Bryan Song
GroupLens Research
University of Minnesota
Hi all,
For all Hive users on stat1002/1004: you may have seen a deprecation
warning when launching the Hive client, saying that it is being replaced
with Beeline. The Beeline shell has always been available, but it
required supplying a database connection string every time, which was
pretty annoying. We now have a wrapper script
<https://github.com/wikimedia/operations-puppet/blob/production/modules/role…>
set up to make this easier. The old Hive CLI will continue to exist, but we
encourage moving over to Beeline. You can use it by logging into the
stat1002/1004 boxes as usual and launching `beeline`.
There is some documentation on this here:
https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Beeline.
If you run into any issues using this interface, please ping us on the
Analytics list or #wikimedia-analytics or file a bug on Phabricator
<http://phabricator.wikimedia.org/tag/analytics>.
(If you are wondering "stat1004, what?": there should be an announcement
about it coming up soon!)
Best,
--Madhu :)
We’re glad to announce the release of an aggregate clickstream dataset extracted from English Wikipedia
http://dx.doi.org/10.6084/m9.figshare.1305770
This dataset contains counts of (referer, article) pairs aggregated from the HTTP request logs of English Wikipedia. This snapshot captures 22 million (referer, article) pairs from a total of 4 billion requests collected during the month of January 2015.
This data can be used for various purposes:
• determining the most frequent links people click on for a given article
• determining the most common links people followed to an article
• determining how much of the total traffic to an article clicked on a link in that article
• generating a Markov chain over English Wikipedia
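As an illustration of the first two use cases, counts like these can be aggregated straight from the TSV dump. The column names (`prev`, `curr`, `n`) below are assumptions made for this sketch; check the README accompanying the Figshare release for the actual schema.

```python
import collections
import csv
import io

def top_links(tsv_text, article, k=3):
    """Return the k most-clicked (target, count) pairs whose referer is `article`.

    Assumes tab-separated columns named prev, curr, n (hypothetical names).
    """
    counts = collections.Counter()
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["prev"] == article:
            counts[row["curr"]] += int(row["n"])
    return counts.most_common(k)

# Tiny made-up sample in the assumed format:
sample = "prev\tcurr\tn\nA\tB\t10\nA\tC\t4\nB\tC\t7\n"
print(top_links(sample, "A"))  # [('B', 10), ('C', 4)]
```

Swapping the roles of `prev` and `curr` in the filter gives the most common links people followed *to* an article, and normalizing each article's outgoing counts yields the transition matrix of the Markov chain mentioned above.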
We created a page on Meta for feedback and discussion about this release: https://meta.wikimedia.org/wiki/Research_talk:Wikipedia_clickstream
Ellery and Dario
Hi Moritz,
There are two types of stability you should be aware of: API behavior and
model scores.
You should expect the versioned API behavior to remain stable. If
we choose to make a change to the request or response format, it will
appear under the path "v3/" and so forth. So, if you write code against
the v2/ API (you shouldn't be writing new code against the v1/ API, but you
*can* expect it to be stable), you should expect that it will continue to
work as expected. You can see the Swagger specs for the APIs at these
endpoints: https://ores.wmflabs.org/v1/spec/ or
https://ores.wmflabs.org/v2/spec/. You should expect that the API behavior
described there will not change.
But we may still need to update the models in the future, and that would
likely change the range of scores slightly. We include the versions of the
models in the basic API response so that you can cache and invalidate
scores that you get from the API. We're still working out the right way to
report evaluation metrics to you so that you'll be able to dynamically
adjust any thresholds you set in your own application. FWIW, I do not
foresee us changing our modeling strategy substantially in the short or
mid term. It took us ~3 months of work to prepare for the breaking change
that was announced in this thread.
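The cache-and-invalidate pattern described above can be sketched like this (a minimal illustration; the class and method names are made up for the example, not part of the ORES API):

```python
class ScoreCache:
    """Cache ORES scores keyed by (wiki, model, rev_id), storing the model
    version alongside each score so stale entries can be detected."""

    def __init__(self):
        self._store = {}  # (wiki, model, rev_id) -> (model_version, score)

    def put(self, wiki, model, rev_id, model_version, score):
        self._store[(wiki, model, rev_id)] = (model_version, score)

    def get(self, wiki, model, rev_id, current_version):
        """Return the cached score, or None if missing or scored by an
        older model version (in which case, re-request from the API)."""
        entry = self._store.get((wiki, model, rev_id))
        if entry is not None and entry[0] == current_version:
            return entry[1]
        return None

cache = ScoreCache()
cache.put("enwiki", "damaging", 1234, "0.4.0", 0.92)
print(cache.get("enwiki", "damaging", 1234, "0.4.0"))  # 0.92
print(cache.get("enwiki", "damaging", 1234, "0.5.0"))  # None: model changed
```

Since the model version arrives in every basic API response, comparing it against the cached version is enough to decide when a stored score must be refreshed.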
In the end, we're interested in learning about your needs and concerns so
that we can adjust our process and make changes accordingly. So if you
have concerns with any of the above please let us know.
-Aaron
On Sat, Apr 30, 2016 at 5:50 PM, Moritz Schubotz <physik(a)physikerwelt.de>
wrote:
> Hi Aaron,
>
> Can you say a few words about the stability of the API?
> We are working on a scoring model for user contributions, rather than
> revisions, using Apache Flink:
> http://imwa.gehaxelt.in:9090/pdfs/expose.pdf
> However, it would be nice to have a somewhat compatible API in the end.
>
> Best
> Moritz
>
> On Thu, Apr 7, 2016 at 10:55 AM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
> wrote:
>
> > FYI, the new models (BREAKING CHANGE) are now deployed.
> >
> > On Sun, Apr 3, 2016 at 5:38 AM, Aaron Halfaker <aaron.halfaker(a)gmail.com
> >
> > wrote:
> >
> > > Hey folks, we have a couple of announcements for you today. First is
> that
> > > ORES has a large set of new functionality that you might like to take
> > > advantage of. We'll also want to talk about a *BREAKING CHANGE on April
> > > 7th.*
> > >
> > > Don't know what ORES is? See
> > >
> >
> http://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/
> > >
> > > *New functionality*
> > >
> > > *Scoring UI*
> > > Sometimes you just want to score a few revisions in ORES, and
> > > remembering the URL structure is hard. So, we've built a simple
> > > scoring user interface <https://ores.wmflabs.org/ui/> that will
> > > allow you to more easily score a set of edits.
> > >
> > > *New API version*
> > > We've been consistently getting requests to include more information in
> > > ORES' responses. In order to make space for this new information, we
> > needed
> > > to change the structure of responses. But we wanted to do this without
> > > breaking the tools that are already using ORES. So, we've developed a
> > > versioning scheme that will allow you to take advantage of new
> > > functionality when you are ready. The same old API will continue to be
> > > available at https://ores.wmflabs.org/scores/, but we've added two
> > > additional paths on top of this.
> > >
> > > - https://ores.wmflabs.org/v1/scores/ is a mirror of the old
> scoring
> > > API which will henceforth be referred to as "v1"
> > > - https://ores.wmflabs.org/v2/scores/ implements a new response
> > format
> > > that is consistent between all sub-paths and adds some new
> > functionality
> > >
> > > *Swagger documentation*
> > > Curious about the new functionality available in "v2", or maybe what
> > > changed from "v1"? We've implemented a structured description of both
> > > versions of the scoring API using Swagger -- which is becoming a de
> > > facto standard for this sort of thing. Visit https://ores.wmflabs.org/v1/
> > > or https://ores.wmflabs.org/v2/ to see the Swagger user interface.
> > > Visit https://ores.wmflabs.org/v1/spec/ or
> > > https://ores.wmflabs.org/v2/spec/ to get the specification in a
> > > machine-readable format.
> > >
> > > *Feature values & injection*
> > > Have you wondered what ORES uses to make its predictions? You can now
> > > ask ORES to show you the list of "feature" statistics it uses to score
> > > revisions. For example,
> > > https://ores.wmflabs.org/v2/scores/enwiki/wp10/34567892/?features will
> > > return the score with a mapping of the feature values used by the "wp10"
> > > article quality model in English Wikipedia to score oldid=34567892
> > > <https://en.wikipedia.org/wiki/Special:Diff/34567892>. You can also
> > > "inject" features into the scoring process to see how that affects the
> > > prediction. E.g.,
> > >
> >
> https://ores.wmflabs.org/v2/scores/enwiki/wp10/34567892?features&feature.wi…
> > >
> > > *Breaking change -- new models*
> > > We've been experimenting with new learning algorithms to make ORES work
> > > better and we've found that we get better results with gradient
> boosting
> > > <https://en.wikipedia.org/wiki/Gradient_boosting> and random forest
> > > <https://en.wikipedia.org/wiki/Random_forest> strategies than we do
> with
> > > the current linear svc
> > > <https://en.wikipedia.org/wiki/Support_vector_machine> models. We'd
> like
> > > to get these new, better models deployed as soon as possible, but with
> > the
> > > new algorithm comes a change in the range of probabilities returned by
> > the
> > > model. So, when we deploy this change, any tool that uses hard-coded
> > > thresholds on ORES' prediction probabilities will suddenly start
> behaving
> > > strangely. Regretfully, we haven't found a way around this problem, so
> > > we're announcing the change now and we plan to deploy this *BREAKING
> > > CHANGE on April 7th*. Please subscribe to the AI mailinglist
> > > <https://lists.wikimedia.org/mailman/listinfo/ai> or watch our project
> > > page [[:m:ORES <https://meta.wikimedia.org/wiki/ORES>]] to catch
> > > announcements of future changes and new functionality.
> > >
> > > In order to make sure we don't end up in the same situation the next
> time
> > > we want to change an algorithm, we've included a suite of evaluation
> > > statistics with each model. The filter_rate_at_recall(0.9),
> > > filter_rate_at_recall(0.75), and recall_at_fpr(0.1) thresholds
> represent
> > > three critical thresholds (should review, needs review, and definitely
> > > damaging -- respectively) that can be used to automatically configure
> > your
> > > wiki tool. You can find out these thresholds for your model of choice
> by
> > > adding the ?model_info parameter to requests. So, come breaking
> change,
> > > we strongly recommend basing your thresholds on these statistics in the
> > > future. We'll be working to submit patches to tools that use ORES in
> the
> > > next week to implement this flexibility. Hopefully, all you'll need
> > > to do is work with us on those.
> > >
> > > -halfak & The Revision Scoring team
> > > <
> https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service>
> > >
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l(a)lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
>
> --
> Mit freundlichen Grüßen
> Moritz Schubotz
>
> Telefon (Büro): +49 30 314 22784
> Telefon (Privat):+49 30 488 27330
> E-Mail: schubotz(a)itp.physik.tu-berlin.de
> Web: http://www.physikerwelt.de
> Skype: Schubi87
> ICQ: 200302764
> Msn: Moritz(a)Schubotz.de
(Apologies for cross-posting)
We would like to invite you to attend the 2016 International Conference on
Social Media & Society that will be held on July 11-13 in London, UK.
KEYNOTES
This year, we are honoured to have two featured keynotes:
* Dr. Susan Halford - Director, Web Science Institute, University of
Southampton, UK
* Dr. Helen Kennedy - Professor of Digital Society, University of
Sheffield, UK
PRESENTATIONS
The conference's intensive 3-day program will feature workshops, full &
work-in-progress papers, panels, and posters, covering a wide range of areas
including Communication, Computer Science, Education, Journalism,
Information Science, Management, Political Science, Sociology, etc.
* Accepted workshops:
http://socialmediaandsociety.org/2016-workshops/
* Accepted panels, papers and posters:
http://socialmediaandsociety.org/schedule/
REGISTRATION
The early-bird deadline ends May 1, 2016, so register ASAP. We hope you can
join us for this exciting event and contribute to this emerging research
area!
Register here: http://socialmediaandsociety.org/registration/
If you have any questions about the conference, please email us at:
smsociety16(a)easychair.org
~2016 #SMSociety Organizing Committee
Anatoliy Gruzd, Philip Mai, Marc Esteve Del Valle, Ryerson University,
Canada
Jenna Jacobson, University of Toronto, Canada
Dhiraj Murthy, Evelyn Ruppert, & Ville Takala, Goldsmiths, University of
London, UK
http://SocialMediaAndSociety.org
Hello,
I am writing to get feedback/suggestions on an IEG grant proposal
<https://meta.wikimedia.org/wiki/Grants:IEG/A_graphical_and_interactive_etym…>
I submitted to Wikimedia that might be of interest to the research
community.
I am working on an interactive visualization tool for etymological
relationships and I produced a demo of my interactive visualization
*etytree*:
http://www.epantaleo.com/2015/12/01/etymology-tree/
The aim of the application is to visualize, in one graph, the etymology
of all words deriving from the same ancestor. Users can expand/collapse
the tree to visualize what they are interested in. The textual part
attached to the graph can easily be translated into any language, so the
app would become a multilingual resource. My idea is to use dbnary's
extraction framework (for Wiktionary) and develop a (possibly) smart
pre-processing strategy to translate Wiktionary's textual etymology
sections into a graph database of etymological relationships.
The database of etymological relationships will be available for the
community and can be used as a resource to study the history of languages,
how pronunciation evolved through time, and eventually how semantics
evolved through time.
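As a toy sketch of the kind of structure such a database could expose: grouping words by a shared ancestor gives, for each ancestor, the set of descendants that etytree would render as one tree. Everything below (names and sample data) is hypothetical, not taken from the actual tool:

```python
import collections

def build_descendant_index(etymologies):
    """Map each ancestor to the list of words derived from it.

    etymologies: iterable of (word, ancestor) pairs, e.g. extracted from
    Wiktionary etymology sections.
    """
    index = collections.defaultdict(list)
    for word, ancestor in etymologies:
        index[ancestor].append(word)
    return dict(index)

# Hypothetical sample pairs (Latin roots) for illustration only.
pairs = [("paternal", "pater"), ("paternity", "pater"), ("maternal", "mater")]
print(build_descendant_index(pairs))
# {'pater': ['paternal', 'paternity'], 'mater': ['maternal']}
```

A real graph database would of course also carry edge labels (borrowed from, derived from, cognate with) and language tags, but the ancestor-to-descendants grouping is the core relation the visualization traverses.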
The link to the grant proposal is
https://meta.wikimedia.org/wiki/Grants:IEG/A_graphical_and_interactive_etym…
Feedback from the community is important for receiving a grant from the
Wikimedia Foundation, so please leave feedback there if you are interested
in the project.
Thanks a lot!
Ester Pantaleo
Hi everybody,
We’re preparing for the April 2016 research newsletter and looking for contributors. Please take a look at: https://etherpad.wikimedia.org/p/WRN201604 and add your name next to any paper you are interested in covering. Our target publication date is Wednesday April 27 UTC although actual publication might happen several days later. As usual, short notes and one-paragraph reviews are most welcome.
Highlights from this month:
• Bridging the gap between Wikipedia and academia
• Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes
• Der lexikographische Prozess im deutschen Wiktionary
• Die Offenheitssemantik der Wikipedia
• Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes
• Estudio sobre el contenido de las Ciencias de la Documentación en la Wikipedia en español
• Generating Article Placeholders from Wikidata for Wikipedia - Increasing Access to Free and Open Knowledge
• LlamaFur: Learning Latent Category Matrix to Find Unexpected Relations in Wikipedia
• The Evolution of Wikipedia's Norm Network
• Towards a (De)centralization-Based Typology of Peer Production
• Wikipedia and Stock Return: Wikipedia Usage Pattern Helps to Predict the Individual Stock Movement
• Wikipedia in der Praxis
• Writing a Wikipedia Article on Cultural Competence in Health Care
If you have any questions about the format or process, feel free to get in touch off-list.
Masssly, Tilman Bayer and Dario Taraborelli
[1] http://meta.wikimedia.org/wiki/Research:Newsletter
Dear all,
Is there a central data repository we could use to share research data?
We put our data in a release of the GitHub repository [1], but that might
not be optimal.
Best
Moritz
[1] https://github.com/wikimedia/citolytics/releases/tag/v0.0.2