Hey folks!
This covers the 32nd through 41st weekly updates from the revision scoring team to this mailing list. We've been busy and our reporting fell behind, so here I am getting us caught up! This is going to be a long one. Bear with me.
One major thing we've done in the past few weeks is draft and present a proposal to increase resourcing for the ORES project in the 2017 fiscal year. Currently, the team is just one fully funded staff member (halfak) and a partially funded contractor (Amir1) working with a bunch of volunteers. We're proposing to staff the team with full-time engineers, a liaison, and a tech writer. See a full draft of our proposal and pitch deck here: https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Scoring_Platform_team
New development:
We've expanded support for our "editquality" models to more wikis and improved the performance of some of the models.
- We scaled up the number of observations for Indonesian Wikipedia to 100k[1]
- We added language support for Romanian[2] and built the basic "reverted" model[3]
- We trained and tested "damaging" and "goodfaith" models for Czech Wikipedia[4]
- We implemented some parameters in our training utilities to control memory usage[5]
- We deployed all of the above to Wikimedia Labs[6]. A production deployment is coming soon.
Prompted by the 2016 community wishlist[7], we've implemented a "draftquality" model for evaluating new page creations.
- We researched deletion reasons on English Wikipedia[8] and created a labeled dataset using the deletion log.
- We engineered a set of features to predict the quality of new articles[9] and built a model[10]
- We generated a set of datasets[11,12,13] to make it easier for volunteers and external researchers to help us audit the performance of the model.
- We deployed the model on WMFLabs[14] and announced its presence to a few interested patrollers on English Wikipedia
- We've started the process of deploying the model in production[15,16]
We completed a project exploring the use of advanced natural-language-processing strategies to extract new signal about vandalism, article quality, and problematic new articles. Regretfully, memory issues prevented us from trivially putting this into production[17], so we're looking into alternative strategies[18].
- We implemented a strategy for extracting sentences from Wikitext[19]
- We built sentence banks for personal attacks[20], vandalism[21], spam[22], and Featured Articles[23].
- We built PCFG-based models[24] and analyzed their ability to differentiate[25]
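To give a feel for the sentence-extraction step above, here is a heavily simplified, regex-based sketch. The real implementation handles far more of the Wikitext grammar; the `extract_sentences` function and its markup-stripping rules are illustrative assumptions, not the actual code.

```python
import re

def extract_sentences(wikitext):
    """Roughly extract prose sentences from Wikitext (illustrative only)."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)  # drop simple templates
    # Unwrap [[target|label]] and [[target]] links, keeping the visible text.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)               # bold/italic quote markup
    text = re.sub(r"<[^>]+>", "", text)             # HTML-ish tags
    # Split on sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

extract_sentences("The '''cat''' sat on the [[mat|rug]]. It purred.")
# e.g. yields the two cleaned sentences
```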
We've been working with the Collaboration Team[26] on their Edit Review Improvements project[27]
- We defined and implemented a set of new precision-based test statistics that will inform thresholds used in their new user interface[28]
- But we also decided to continue to report recall-based test statistics as well[29]
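The basic idea behind a precision-based statistic like the ones above is to find the score threshold at which the model reaches a target precision, and then report the recall available at that threshold. This is a minimal sketch of that idea, not the team's actual implementation; the function name and data shape are assumptions.

```python
def threshold_at_precision(scored, target_precision):
    """Find the lowest score threshold whose precision meets the target.

    `scored` is a list of (score, is_damaging) pairs. Returns a
    (threshold, precision, recall) tuple, or None if no threshold qualifies.
    """
    total_positives = sum(1 for _, label in scored if label)
    for threshold in sorted({score for score, _ in scored}):
        predicted = [(s, l) for s, l in scored if s >= threshold]
        if not predicted:
            continue
        true_pos = sum(1 for _, l in predicted if l)
        precision = true_pos / len(predicted)
        recall = true_pos / total_positives if total_positives else 0.0
        if precision >= target_precision:
            # The lowest qualifying threshold keeps recall as high as possible.
            return threshold, precision, recall
    return None
```

Reporting recall alongside the chosen threshold (as the team decided to keep doing) shows how much damage slips past a high-precision filter.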
Based on advice from engineers on the Collaboration Team, we've begun the process of converting Wiki labels[30] into a stand-alone tool on Labs.
- We generalized the gadget interface so that it can handle all languages/wikis[31]
- We implemented a means to auto-configure wikis based on the dbname[32,33] and that allowed us to simplify configuration[34]
- We also implemented some performance improvements through minification and bundling[35]
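Auto-configuring a wiki from its dbname works because dbnames encode the language and project family. Here is a hypothetical sketch of that derivation; the suffix table and `parse_dbname` function are my own illustration, not the Wiki labels code.

```python
# Hypothetical mapping from dbname suffixes to project families.
SUFFIXES = {
    "wiki": "wikipedia",
    "wiktionary": "wiktionary",
    "wikisource": "wikisource",
    "wikinews": "wikinews",
}

def parse_dbname(dbname):
    """Split a dbname like 'rowiki' into ('ro', 'wikipedia')."""
    # Try longer suffixes first so 'wiktionary' isn't shadowed by 'wiki'.
    for suffix, family in sorted(SUFFIXES.items(), key=lambda kv: -len(kv[0])):
        if dbname.endswith(suffix):
            lang = dbname[: -len(suffix)].replace("_", "-")
            return lang, family
    raise ValueError("unrecognized dbname: %s" % dbname)
```

With a rule like this, a deployment only needs the dbname; per-wiki language settings fall out automatically.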
Labeling:
In the past few weeks, we've set up labeling campaigns for a few wikis.
- We deployed an edit types campaign for Catalan Wikipedia[36]
- We deployed an edit quality campaign for Chinese[37] and Romanian[38] Wikipedias
- We deployed a new type of campaign for English Wikipedia -- "discussion quality" asks editors to label talk posts as "toxic" or not[39]
Maintenance and robustness:
We've resolved a large set of logging issues, fixed compatibility with Wikibase, and made minor improvements to performance.
- We addressed a few bugs in the ORES Review Tool[40,44]
- We quieted some errors from our logging in ORES[41,45]
- We updated our code to work with a Wikibase schema change[42]
- We fixed a language fallback pattern in Wiki labels[43]
- We set up monitoring on ORES database disk sizes[46]
- We fixed some issues with scap, phabricator's diffusion and other supporting systems so that we can continue deploying to beta labs[47]
- We split our assets repo so that we can let our WMFLabs deploy get ahead of the Production deployment[48]
- ORES can now minify its JSON responses[49]
- We identified a bug in flask-assets and worked around it in our local installation of Wiki labels[50]
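The JSON minification mentioned above is essentially just stripping whitespace from responses. In Python that's a one-liner with the standard `json` module; the score payload below is a made-up example, not an actual ORES response.

```python
import json

# Hypothetical ORES-style score payload.
score = {"damaging": {"prediction": False,
                      "probability": {"false": 0.92, "true": 0.08}}}

pretty = json.dumps(score, indent=2)
# separators=(",", ":") drops the default spaces after commas and colons.
minified = json.dumps(score, separators=(",", ":"))
```

The minified form parses back to the same object but is meaningfully smaller, which adds up across high-volume scoring requests.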
Communications and outreach:
We had a big presence at the Wikimedia Developer summit, we've drafted a resourcing proposal, and we've made some announcements about upcoming plans for the ORES Review tool.
- We facilitated the "Artificial Intelligence to build and navigate content" track[51]
- We ran a session for building an AI wishlist[52] and captured notes about more than 20 new AI proposals on a new tag in phabricator[53]
- We also ran a session discussing the ethics and dangers of advanced algorithms mediating our processes[54]
- We helped facilitate a session about where to surface current AIs in Wikimedia Projects[55]
- We held a discussion with Legal about licensing labeled data that comes out of Wiki labels[56] and updated the interface to state the CC0 license clearly[57]
- We worked with the Reading Infrastructure team to analyze the consumption of "oresscores" through the MediaWiki API[58]
- We drafted a pitch for increasing the resources for our team[59]
- We worked with the Collaboration team to announce that they'll be experimenting with a new RecentChanges filtering strategy in the ORES Review Tool[60,61]
Sincerely,
Aaron from the Revision Scoring team