Hey Pierce,
Good to hear from you! You're right in your assumption that ORES' wp10
model is built upon our work. In addition to the 2013 paper, our 2015
CSCW paper (citation below) significantly improved the model; the
wikiclass library was originally built from that 2015 version.
As Aaron mentions, they've since switched to a Gradient Boosting classifier
instead of the Random Forest we used, and added at least one feature
(number of citation needed templates).
References:
The Appendix in Warncke-Wang, M., Ayukaev, V. R., Hecht, B., and Terveen,
L. "The Success and Failure of Quality Improvement Projects in Peer
Production Communities" (CSCW 2015)
Cheers,
Morten
On Fri, Jun 9, 2017 at 9:10 AM, Aaron Halfaker <aaron.halfaker(a)gmail.com>
wrote:
Hi Pierce!
You're right that the wp10 model is based on Warncke-Wang's work. We've
made some extensions to the feature set and changed the modeling strategy
since then though.
If you want to see the features a model uses in a basic form, you can
query ORES with the "?features" parameter. E.g.
https://ores.wikimedia.org/v3/scores/enwiki/779679551/wp10/?features
returns:
{
"enwiki": {
"models": {
"wp10": {
"version": "0.5.0"
}
},
"scores": {
"779679551": {
"wp10": {
"features": {
"feature.english.stemmed.revision.stems_length": 11621,
"feature.enwiki.main_article_templates": 0,
"feature.enwiki.revision.category_links": 11,
"feature.enwiki.revision.cite_templates": 11,
"feature.enwiki.revision.cn_templates": 2,
"feature.enwiki.revision.image_links": 1,
"feature.enwiki.revision.infobox_templates": 1,
"feature.wikitext.revision.chars": 19241,
"feature.wikitext.revision.content_chars": 12961,
"feature.wikitext.revision.external_links": 24,
"feature.wikitext.revision.headings_by_level(2)": 11,
"feature.wikitext.revision.headings_by_level(3)": 0,
"feature.wikitext.revision.ref_tags": 23,
"feature.wikitext.revision.templates": 30,
"feature.wikitext.revision.wikilinks": 66
},
"score": {
"prediction": "C",
"probability": {
"B": 0.13747039004562459,
"C": 0.8331703672870666,
"FA": 0.007180710735104919,
"GA": 0.005799232485106759,
"Start": 0.015370319423127086,
"Stub": 0.0010089800239699196
}
}
}
}
}
}
}
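If you're scripting against the API, a small helper can dig the features and prediction out of that nested response. Here's a minimal sketch; the `extract` helper and the abbreviated `response` dict are just mine for illustration, not part of ORES:

```python
# Abbreviated copy of the v3 "?features" response shape shown above.
response = {
    "enwiki": {
        "scores": {
            "779679551": {
                "wp10": {
                    "features": {
                        "feature.wikitext.revision.ref_tags": 23,
                        "feature.enwiki.revision.cite_templates": 11,
                    },
                    "score": {"prediction": "C"},
                }
            }
        }
    }
}

def extract(response, wiki, rev_id, model):
    """Return (features, prediction) from a v3 scores response."""
    result = response[wiki]["scores"][str(rev_id)][model]
    return result["features"], result["score"]["prediction"]

features, prediction = extract(response, "enwiki", 779679551, "wp10")
print(prediction)  # C
print(features["feature.wikitext.revision.ref_tags"])  # 23
```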
This does not represent the *exact* feature vector. The exact feature
vector involves controlling features (e.g. content_chars / chars or
log(ref_tags)). The best way to get the exact feature set is to install
the appropriate library (wikiclass, editquality, etc.) from
https://github.com/wiki-ai/ and ask for the feature set. E.g.
$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from editquality.feature_lists.enwiki import damaging
>>> for f in damaging:
...     print(f)
...
feature.revision.page.is_articleish
feature.revision.page.is_mainspace
feature.revision.page.is_draftspace
feature.log((wikitext.revision.parent.chars + 1))
feature.log((len(<datasource.tokenized(datasource.revision.parent.text)>) + 1))
feature.log((len(<datasource.wikitext.revision.parent.words>) + 1))
feature.log((len(<datasource.wikitext.revision.parent.uppercase_words>) + 1))
feature.log((wikitext.revision.parent.headings + 1))
feature.log((wikitext.revision.parent.wikilinks + 1))
feature.log((wikitext.revision.parent.external_links + 1))
feature.log((wikitext.revision.parent.templates + 1))
feature.log((wikitext.revision.parent.ref_tags + 1))
feature.revision.parent.chars_per_word
feature.revision.parent.words_per_token
feature.revision.parent.uppercase_words_per_word
feature.revision.parent.markups_per_token
feature.wikitext.revision.diff.markup_delta_sum
feature.wikitext.revision.diff.markup_delta_increase
feature.wikitext.revision.diff.markup_delta_decrease
feature.wikitext.revision.diff.markup_prop_delta_sum
feature.wikitext.revision.diff.markup_prop_delta_increase
feature.wikitext.revision.diff.markup_prop_delta_decrease
feature.wikitext.revision.diff.number_delta_sum
feature.wikitext.revision.diff.number_delta_increase
feature.wikitext.revision.diff.number_delta_decrease
feature.wikitext.revision.diff.number_prop_delta_sum
feature.wikitext.revision.diff.number_prop_delta_increase
feature.wikitext.revision.diff.number_prop_delta_decrease
feature.wikitext.revision.diff.uppercase_word_delta_sum
feature.wikitext.revision.diff.uppercase_word_delta_increase
feature.wikitext.revision.diff.uppercase_word_delta_decrease
feature.wikitext.revision.diff.uppercase_word_prop_delta_sum
feature.wikitext.revision.diff.uppercase_word_prop_delta_increase
feature.wikitext.revision.diff.uppercase_word_prop_delta_decrease
feature.revision.diff.chars_change
feature.revision.diff.tokens_change
feature.revision.diff.words_change
feature.revision.diff.headings_change
feature.revision.diff.external_links_change
feature.revision.diff.wikilinks_change
feature.revision.diff.templates_change
feature.revision.diff.ref_tags_change
feature.revision.diff.longest_new_token
feature.revision.diff.longest_new_repeated_char
feature.revision.user.is_bot
feature.revision.user.has_advanced_rights
feature.revision.user.is_admin
feature.revision.user.is_trusted
feature.revision.user.is_patroller
feature.revision.user.is_curator
feature.revision.user.is_anon
feature.log((temporal.revision.user.seconds_since_registration + 1))
feature.revision.comment.suggests_section_edit
feature.revision.comment.has_link
feature.english.badwords.revision.diff.match_delta_sum
feature.english.badwords.revision.diff.match_delta_increase
feature.english.badwords.revision.diff.match_delta_decrease
feature.english.badwords.revision.diff.match_prop_delta_sum
feature.english.badwords.revision.diff.match_prop_delta_increase
feature.english.badwords.revision.diff.match_prop_delta_decrease
feature.english.informals.revision.diff.match_delta_sum
feature.english.informals.revision.diff.match_delta_increase
feature.english.informals.revision.diff.match_delta_decrease
feature.english.informals.revision.diff.match_prop_delta_sum
feature.english.informals.revision.diff.match_prop_delta_increase
feature.english.informals.revision.diff.match_prop_delta_decrease
feature.english.dictionary.revision.diff.dict_word_delta_sum
feature.english.dictionary.revision.diff.dict_word_delta_increase
feature.english.dictionary.revision.diff.dict_word_delta_decrease
feature.english.dictionary.revision.diff.dict_word_prop_delta_sum
feature.english.dictionary.revision.diff.dict_word_prop_delta_increase
feature.english.dictionary.revision.diff.dict_word_prop_delta_decrease
feature.english.dictionary.revision.diff.non_dict_word_delta_sum
feature.english.dictionary.revision.diff.non_dict_word_delta_increase
feature.english.dictionary.revision.diff.non_dict_word_delta_decrease
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_sum
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_increase
feature.english.dictionary.revision.diff.non_dict_word_prop_delta_decrease
You can ask ORES to tell you details about each of the models such as test
statistics and modeling algorithm. E.g.
https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&model_info=type|params|version
returns:
{
"enwiki": {
"models": {
"wp10": {
"type": "GradientBoosting",
"version": "0.5.0",
"params": {
"balanced_sample": true,
"balanced_sample_weight": false,
"center": true,
"init": null,
"learning_rate": 0.01,
"loss": "deviance",
"max_depth": 7,
"max_features": "log2",
"max_leaf_nodes": null,
"min_samples_leaf": 1,
"min_samples_split": 2,
"min_weight_fraction_leaf": 0.0,
"n_estimators": 700,
"presort": "auto",
"random_state": null,
"scale": true,
"subsample": 1.0,
"verbose": 0,
"warm_start": false
}
}
}
}
}
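If you want to build that query URL programmatically rather than paste it, urlencode's "safe" parameter keeps the pipe separators in model_info from being percent-encoded. This snippet just illustrates the URL shape, it's not an official client:

```python
from urllib.parse import urlencode

# Build the model_info query URL shown above.  safe="|" stops urlencode
# from percent-encoding the pipe-separated model_info fields.
base = "https://ores.wikimedia.org/v3/scores/enwiki/"
params = {"models": "wp10", "model_info": "type|params|version"}
url = base + "?" + urlencode(params, safe="|")
print(url)
# https://ores.wikimedia.org/v3/scores/enwiki/?models=wp10&model_info=type|params|version
```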
For information about how features are extracted, see
http://pythonhosted.org/revscoring. For the full process by which models
are built, see the makefile for the appropriate repository. E.g.
https://github.com/wiki-ai/wikiclass/blob/master/Makefile#L111
-Aaron
On Fri, Jun 9, 2017 at 10:50 AM, Pierce Edmiston <pedmiston(a)wisc.edu>
wrote:
Hello,
I'm wondering how to find out the details of the edit and article quality
models, specifically the *reverted* and *damaging* edit quality models,
and the *wp10* article quality model. I'd like to know what algorithm is
used and which features each model is trained on.
I believe the *wp10* model may have originated with Warncke-Wang,
Cosley, & Riedl (2013), "Tell me more: An actionable quality model for
Wikipedia", in which case I can figure out the model specification and
features from the paper. But I'm not sure whether the details of the edit
quality models have been similarly summarized in any papers or in any
online documentation.
Thanks for your help!
Pierce
_______________________________________________
AI mailing list
AI(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ai