Hey Pierce,

Good to hear from you! You're right in your assumption that ORES' wp10 model is built upon our work. Beyond the 2013 paper, there's our 2015 CSCW paper (citation below), which significantly improved the model. The wikiclass library was originally built from that 2015 version.

As Aaron mentions, they've since switched to a Gradient Boosting classifier instead of the Random Forest we used, and added at least one feature (number of citation needed templates).

References:
The appendix in: Warncke-Wang, M., Ayukaev, V. R., Hecht, B., and Terveen, L. "The Success and Failure of Quality Improvement Projects in Peer Production Communities." CSCW 2015. http://www-users.cs.umn.edu/~morten/publications/cscw2015-improvementprojects.pdf


Cheers,
Morten


On Fri, Jun 9, 2017 at 9:10 AM, Aaron Halfaker <aaron.halfaker@gmail.com> wrote:
Hi Pierce!  

You're right that the wp10 model is based on Warncke-Wang's work.  We've made some extensions to the feature set and changed the modeling strategy since then though.  

If you want to see what features a model uses in a basic form, you can query ORES with the "?features" parameter.  E.g. https://ores.wikimedia.org/v3/scores/enwiki/779679551/wp10/?features returns:
{
  "enwiki": {
    "models": {
      "wp10": {
        "version": "0.5.0"
      }
    },
    "scores": {
      "779679551": {
        "wp10": {
          "features": {
            "feature.english.stemmed.revision.stems_length": 11621,
            "feature.enwiki.main_article_templates": 0,
            "feature.enwiki.revision.category_links": 11,
            "feature.enwiki.revision.cite_templates": 11,
            "feature.enwiki.revision.cn_templates": 2,
            "feature.enwiki.revision.image_links": 1,
            "feature.enwiki.revision.infobox_templates": 1,
            "feature.wikitext.revision.chars": 19241,
            "feature.wikitext.revision.content_chars": 12961,
            "feature.wikitext.revision.external_links": 24,
            "feature.wikitext.revision.headings_by_level(2)": 11,
            "feature.wikitext.revision.headings_by_level(3)": 0,
            "feature.wikitext.revision.ref_tags": 23,
            "feature.wikitext.revision.templates": 30,
            "feature.wikitext.revision.wikilinks": 66
          },
          "score": {
            "prediction": "C",
            "probability": {
              "B": 0.13747039004562459,
              "C": 0.8331703672870666,
              "FA": 0.007180710735104919,
              "GA": 0.005799232485106759,
              "Start": 0.015370319423127086,
              "Stub": 0.0010089800239699196
            }
          }
        }
      }
    }
  }
}
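
If it helps, here's a minimal sketch of fetching the same response programmatically. It assumes the third-party requests library is installed; the key paths follow the JSON layout above:

import requests

rev_id = 779679551
url = "https://ores.wikimedia.org/v3/scores/enwiki/{}/wp10/".format(rev_id)

# Appending "?features" asks ORES to include the extracted feature
# values alongside the prediction.
data = requests.get(url + "?features").json()

score = data["enwiki"]["scores"][str(rev_id)]["wp10"]
for name, value in sorted(score["features"].items()):
    print(name, "=", value)
print("prediction:", score["score"]["prediction"])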

Note that these feature values do not represent the *exact* feature vector.  The exact feature vector involves controlling features (e.g. content_chars / chars or log(ref_tags)).  The best way to get the exact feature set is to install the appropriate library (wikiclass, editquality, etc.) from https://github.com/wiki-ai/ and ask it for the feature set.  E.g.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from editquality.feature_lists.enwiki import damaging
>>> for f in damaging:
...     print(f)
... 
feature.revision.page.is_articleish
feature.revision.page.is_mainspace
feature.revision.page.is_draftspace
feature.log((wikitext.revision.parent.chars + 1))
feature.log((len(<datasource.tokenized(datasource.revision.parent.text)>) + 1))
feature.log((len(<datasource.wikitext.revision.parent.words>) + 1))
feature.log((len(<datasource.wikitext.revision.parent.uppercase_words>) + 1))
feature.log((wikitext.revision.parent.headings + 1))
feature.log((wikitext.revision.parent.wikilinks + 1))
feature.log((wikitext.revision.parent.external_links + 1))
feature.log((wikitext.revision.parent.templates + 1))
feature.log((wikitext.revision.parent.ref_tags + 1))
feature.revision.parent.chars_per_word
feature.revision.parent.words_per_token
feature.revision.parent.uppercase_words_per_word
feature.revision.parent.markups_per_token
feature.wikitext.revision.diff.markup_delta_sum
feature.wikitext.revision.diff.markup_delta_increase
feature.wikitext.revision.diff.markup_delta_decrease
feature.wikitext.revision.diff.markup_prop_delta_sum
feature.wikitext.revision.diff.markup_prop_delta_increase
feature.wikitext.revision.diff.markup_prop_delta_decrease
feature.wikitext.revision.diff.number_delta_sum
feature.wikitext.revision.diff.number_delta_increase
feature.wikitext.revision.diff.number_delta_decrease
feature.wikitext.revision.diff.number_prop_delta_sum
feature.wikitext.revision.diff.number_prop_delta_increase
feature.wikitext.revision.diff.number_prop_delta_decrease
feature.wikitext.revision.diff.uppercase_word_delta_sum
feature.wikitext.revision.diff.uppercase_word_delta_increase
feature.wikitext.revision.diff.uppercase_word_delta_decrease
feature.wikitext.revision.diff.uppercase_word_prop_delta_sum
feature.wikitext.revision.diff.uppercase_word_prop_delta_increase
feature.wikitext.revision.diff.uppercase_word_prop_delta_decrease
feature.revision.diff.chars_change
feature.revision.diff.tokens_change
feature.revision.diff.words_change
feature.revision.diff.headings_change
feature.revision.diff.external_links_change
feature.revision.diff.wikilinks_change
feature.revision.diff.templates_change
feature.revision.diff.ref_tags_change
feature.revision.diff.longest_new_token
feature.revision.diff.longest_new_repeated_char
feature.revision.user.is_bot
feature.revision.user.has_advanced_rights
feature.revision.user.is_admin
feature.revision.user.is_trusted
feature.revision.user.is_patroller
feature.revision.user.is_curator
feature.revision.user.is_anon
feature.log((temporal.revision.user.seconds_since_registration + 1))
feature.revision.comment.suggests_section_edit
feature.revision.comment.has_link
feature.english.badwords.revision.diff.match_delta_sum
feature.english.badwords.revision.diff.match_delta_increase
feature.english.badwords.revision.diff.match_delta_decrease
feature.english.badwords.revision.diff.match_prop_delta_sum
feature.english.badwords.revision.diff.match_prop_delta_increase
feature.english.badwords.revision.diff.match_prop_delta_decrease
feature.english.informals.revision.diff.match_delta_sum
feature.english.informals.revision.diff.match_delta_increase
feature.english.informals.revision.diff.match_delta_decrease
feature.english.informals.revision.diff.match_prop_delta_sum
feature.english.informals.revision.diff.match_prop_delta_increase
feature.english.informals.revision.diff.match_prop_delta_decrease
feature.english.dictionary.revision.diff.dict_word_delta_sum
feature.english.dictionary.revision.diff.dict_word_delta_increase
feature.english.dictionary.revision.diff.dict_word_delta_decrease
feature.english.dictionary.revision.diff.dict_word_prop_delta_sum
feature.english.dictionary.revision.diff.dict_word_prop_delta_increase
feature.english.dictionary.revision.diff.dict_word_prop_delta_decrease
feature.english.dictionary.revision.diff.non_dict_word_delta_sum
feature.english.dictionary.revision.diff.non_dict_word_delta_increase
feature.english.dictionary.revision.diff.non_dict_word_delta_decrease
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_sum
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_increase
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_decrease
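
From there, here's a sketch of extracting the exact feature values for a revision with revscoring's API-backed extractor. This assumes revscoring, mwapi, and editquality are installed, and that the extractor interface matches the current release:

import mwapi
from editquality.feature_lists.enwiki import damaging
from revscoring.extractors import api

# Build an extractor that pulls revision data from the MediaWiki API.
# (Sketch only -- check the installed revscoring version's docs.)
session = mwapi.Session("https://en.wikipedia.org",
                        user_agent="feature extraction example")
extractor = api.Extractor(session)

# extract() returns values in the same order as the feature list,
# i.e. the exact feature vector the model consumes.
values = list(extractor.extract(779679551, damaging))
for feature, value in zip(damaging, values):
    print(feature, "=", value)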

You can also ask ORES for details about each model, such as test statistics and the modeling algorithm.  E.g. https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&model_info=type|params|version returns:

{
  "enwiki": {
    "models": {
      "wp10": {
        "type": "GradientBoosting",
        "version": "0.5.0",
        "params": {
          "balanced_sample": true,
          "balanced_sample_weight": false,
          "center": true,
          "init": null,
          "learning_rate": 0.01,
          "loss": "deviance",
          "max_depth": 7,
          "max_features": "log2",
          "max_leaf_nodes": null,
          "min_samples_leaf": 1,
          "min_samples_split": 2,
          "min_weight_fraction_leaf": 0.0,
          "n_estimators": 700,
          "presort": "auto",
          "random_state": null,
          "scale": true,
          "subsample": 1.0,
          "verbose": 0,
          "warm_start": false
        }
      }
    }
  }
}
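
Here's the same query as a short Python sketch (again assuming the requests library; the key paths follow the JSON above):

import requests

# "model_info" asks ORES for metadata about the model itself.
response = requests.get(
    "https://ores.wikimedia.org/v3/scores/enwiki/",
    params={"models": "wp10", "model_info": "type|params|version"})
info = response.json()["enwiki"]["models"]["wp10"]
print(info["type"], info["version"])   # GradientBoosting 0.5.0
print(info["params"]["n_estimators"])  # 700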

For information about how features are extracted, see http://pythonhosted.org/revscoring.  For the full process by which models are built, see the Makefile in the appropriate repository.  E.g. https://github.com/wiki-ai/wikiclass/blob/master/Makefile#L111

-Aaron

On Fri, Jun 9, 2017 at 10:50 AM, Pierce Edmiston <pedmiston@wisc.edu> wrote:
Hello,

I'm wondering how to find out the details of the edit and article quality models, specifically the reverted and damaging edit quality models, and the wp10 article quality model. In particular, I'd like to know what algorithm is being used and what features it is trained on.

I believe the wp10 model may have originated with Warncke-Wang, Cosley, & Riedl (2013) Tell me more: An actionable quality model for Wikipedia, in which case I can figure out the model specification and features from the paper. But I'm not sure if the details of the edit quality models have been similarly summarized in any papers or in any online documentation.

Thanks for your help!
Pierce

_______________________________________________
AI mailing list
AI@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai