Hey Pierce,
Good to hear from you! You're right in your assumption that ORES' wp10 model is built on our work. In addition to the 2013 paper, there's the follow-up work in our 2015 CSCW paper (citation below) that significantly improved the model. The wikiclass library was originally built from that 2015 version.
As Aaron mentions, they've since switched to a Gradient Boosting classifier instead of the Random Forest we used, and added at least one feature (number of citation needed templates).
Reference: the appendix of Warncke-Wang, M., Ayukaev, V. R., Hecht, B., and Terveen, L., "The Success and Failure of Quality Improvement Projects in Peer Production Communities" (CSCW 2015). http://www-users.cs.umn.edu/~morten/publications/cscw2015-improvementproject...
Cheers, Morten
On Fri, Jun 9, 2017 at 9:10 AM, Aaron Halfaker aaron.halfaker@gmail.com wrote:
Hi Pierce!
You're right that the wp10 model is based on Warncke-Wang's work. We've made some extensions to the feature set and changed the modeling strategy since then though.
If you want to see what features a model uses in a basic form, you can run a query to ORES with the "?features" parameter. E.g. https://ores.wikimedia.org/v3/scores/enwiki/779679551/wp10/?features returns:

{
  "enwiki": {
    "models": {
      "wp10": {
        "version": "0.5.0"
      }
    },
    "scores": {
      "779679551": {
        "wp10": {
          "features": {
            "feature.english.stemmed.revision.stems_length": 11621,
            "feature.enwiki.main_article_templates": 0,
            "feature.enwiki.revision.category_links": 11,
            "feature.enwiki.revision.cite_templates": 11,
            "feature.enwiki.revision.cn_templates": 2,
            "feature.enwiki.revision.image_links": 1,
            "feature.enwiki.revision.infobox_templates": 1,
            "feature.wikitext.revision.chars": 19241,
            "feature.wikitext.revision.content_chars": 12961,
            "feature.wikitext.revision.external_links": 24,
            "feature.wikitext.revision.headings_by_level(2)": 11,
            "feature.wikitext.revision.headings_by_level(3)": 0,
            "feature.wikitext.revision.ref_tags": 23,
            "feature.wikitext.revision.templates": 30,
            "feature.wikitext.revision.wikilinks": 66
          },
          "score": {
            "prediction": "C",
            "probability": {
              "B": 0.13747039004562459,
              "C": 0.8331703672870666,
              "FA": 0.007180710735104919,
              "GA": 0.005799232485106759,
              "Start": 0.015370319423127086,
              "Stub": 0.0010089800239699196
            }
          }
        }
      }
    }
  }
}
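If you'd rather pull that programmatically, something like this works (a minimal sketch using the requests library; same revision as above):

import requests

# Fetch the wp10 score for one enwiki revision, with "?features"
# asking ORES to include the (basic) feature values.
url = "https://ores.wikimedia.org/v3/scores/enwiki/779679551/wp10/?features"
doc = requests.get(url).json()

result = doc["enwiki"]["scores"]["779679551"]["wp10"]
for name, value in sorted(result["features"].items()):
    print(name, "=", value)
print("prediction:", result["score"]["prediction"])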
This does not represent the *exact* feature vector. The exact feature vector involves controlling features (e.g. content_chars / chars, or log(ref_tags)); there's a short sketch after the listing below showing how those are constructed. The best way to get the exact feature set is to install the appropriate library (wikiclass, editquality, etc.) from https://github.com/wiki-ai/ and ask it for the feature set. E.g.
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from editquality.feature_lists.enwiki import damaging
>>> for f in damaging:
...     print(f)
...
feature.revision.page.is_articleish
feature.revision.page.is_mainspace
feature.revision.page.is_draftspace
feature.log((wikitext.revision.parent.chars + 1))
feature.log((len(<datasource.tokenized(datasource.revision.parent.text)>) + 1))
feature.log((len(<datasource.wikitext.revision.parent.words>) + 1))
feature.log((len(<datasource.wikitext.revision.parent.uppercase_words>) + 1))
feature.log((wikitext.revision.parent.headings + 1))
feature.log((wikitext.revision.parent.wikilinks + 1))
feature.log((wikitext.revision.parent.external_links + 1))
feature.log((wikitext.revision.parent.templates + 1))
feature.log((wikitext.revision.parent.ref_tags + 1))
feature.revision.parent.chars_per_word
feature.revision.parent.words_per_token
feature.revision.parent.uppercase_words_per_word
feature.revision.parent.markups_per_token
feature.wikitext.revision.diff.markup_delta_sum
feature.wikitext.revision.diff.markup_delta_increase
feature.wikitext.revision.diff.markup_delta_decrease
feature.wikitext.revision.diff.markup_prop_delta_sum
feature.wikitext.revision.diff.markup_prop_delta_increase
feature.wikitext.revision.diff.markup_prop_delta_decrease
feature.wikitext.revision.diff.number_delta_sum
feature.wikitext.revision.diff.number_delta_increase
feature.wikitext.revision.diff.number_delta_decrease
feature.wikitext.revision.diff.number_prop_delta_sum
feature.wikitext.revision.diff.number_prop_delta_increase
feature.wikitext.revision.diff.number_prop_delta_decrease
feature.wikitext.revision.diff.uppercase_word_delta_sum
feature.wikitext.revision.diff.uppercase_word_delta_increase
feature.wikitext.revision.diff.uppercase_word_delta_decrease
feature.wikitext.revision.diff.uppercase_word_prop_delta_sum
feature.wikitext.revision.diff.uppercase_word_prop_delta_increase
feature.wikitext.revision.diff.uppercase_word_prop_delta_decrease
feature.revision.diff.chars_change
feature.revision.diff.tokens_change
feature.revision.diff.words_change
feature.revision.diff.headings_change
feature.revision.diff.external_links_change
feature.revision.diff.wikilinks_change
feature.revision.diff.templates_change
feature.revision.diff.ref_tags_change
feature.revision.diff.longest_new_token
feature.revision.diff.longest_new_repeated_char
feature.revision.user.is_bot
feature.revision.user.has_advanced_rights
feature.revision.user.is_admin
feature.revision.user.is_trusted
feature.revision.user.is_patroller
feature.revision.user.is_curator
feature.revision.user.is_anon
feature.log((temporal.revision.user.seconds_since_registration + 1))
feature.revision.comment.suggests_section_edit
feature.revision.comment.has_link
feature.english.badwords.revision.diff.match_delta_sum
feature.english.badwords.revision.diff.match_delta_increase
feature.english.badwords.revision.diff.match_delta_decrease
feature.english.badwords.revision.diff.match_prop_delta_sum
feature.english.badwords.revision.diff.match_prop_delta_increase
feature.english.badwords.revision.diff.match_prop_delta_decrease
feature.english.informals.revision.diff.match_delta_sum
feature.english.informals.revision.diff.match_delta_increase
feature.english.informals.revision.diff.match_delta_decrease
feature.english.informals.revision.diff.match_prop_delta_sum
feature.english.informals.revision.diff.match_prop_delta_increase
feature.english.informals.revision.diff.match_prop_delta_decrease
feature.english.dictionary.revision.diff.dict_word_delta_sum
feature.english.dictionary.revision.diff.dict_word_delta_increase
feature.english.dictionary.revision.diff.dict_word_delta_decrease
feature.english.dictionary.revision.diff.dict_word_prop_delta_sum
feature.english.dictionary.revision.diff.dict_word_prop_delta_increase
feature.english.dictionary.revision.diff.dict_word_prop_delta_decrease
feature.english.dictionary.revision.diff.non_dict_word_delta_sum
feature.english.dictionary.revision.diff.non_dict_word_delta_increase
feature.english.dictionary.revision.diff.non_dict_word_delta_decrease
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_sum
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_increase
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_decrease
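To make the "controlling features" point above concrete: revscoring features compose arithmetically, so the log-scaled variants in that list are built roughly like this (a sketch; double-check the modifiers API against the revscoring docs for your version):

from revscoring.features import wikitext
from revscoring.features.modifiers import log

# Features support arithmetic, so wrapping one in log() yields a new
# derived feature; its name matches what the listing above prints.
log_ref_tags = log(wikitext.revision.ref_tags + 1)
print(log_ref_tags)
# feature.log((wikitext.revision.ref_tags + 1))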
You can ask ORES to tell you details about each of the models, such as test statistics and the modeling algorithm. E.g. https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&model_info=type|params|version returns:
{ "enwiki": { "models": { "wp10": { "type": "GradientBoosting", "version": "0.5.0", "params": { "balanced_sample": true, "balanced_sample_weight": false, "center": true, "init": null, "learning_rate": 0.01, "loss": "deviance", "max_depth": 7, "max_features": "log2", "max_leaf_nodes": null, "min_samples_leaf": 1, "min_samples_split": 2, "min_weight_fraction_leaf": 0.0, "n_estimators": 700, "presort": "auto", "random_state": null, "scale": true, "subsample": 1.0, "verbose": 0, "warm_start": false } } } } }
For information about how features are extracted, see http://pythonhosted.org/revscoring. For the full process by which models are built, see the Makefile in the appropriate repository. E.g. https://github.com/wiki-ai/wikiclass/blob/master/Makefile#L111
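And if you want to extract a feature vector yourself for arbitrary revisions, revscoring's API extractor can do it. Roughly (a sketch: mwapi, the user agent string, and the exact extract() signature are assumptions to check against the revscoring docs for your version):

import mwapi
from editquality.feature_lists.enwiki import damaging
from revscoring.extractors import api

# The extractor pulls whatever data the features need from the
# MediaWiki API and solves the dependency graph for one revision.
session = mwapi.Session("https://en.wikipedia.org",
                        user_agent="feature extraction demo")
extractor = api.Extractor(session)

values = list(extractor.extract(779679551, damaging))
for feature, value in zip(damaging, values):
    print(feature, "=", value)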
-Aaron
On Fri, Jun 9, 2017 at 10:50 AM, Pierce Edmiston pedmiston@wisc.edu wrote:
Hello,
I'm wondering how to find out the details of the edit and article quality models, specifically the *reverted* and *damaging* edit quality models and the *wp10* article quality model: what algorithm is being used, and what features are being trained on?
I believe the *wp10* model may have originated with Warncke-Wang, Cosley, & Riedl (2013), "Tell me more: An actionable quality model for Wikipedia", in which case I can figure out the model specification and features from the paper. But I'm not sure if the details of the edit quality models have been similarly summarized in any papers or online documentation.
Thanks for your help! Pierce
AI mailing list AI@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/ai