One small note: This will cause the ORES review tool to invalidate its db
cache. So we will probably need to run some maintenance scripts here and
there. You might feel a few bumps in the tool in Wikipedia. We will let you
know beforehand :)
Best
On Sat, Aug 20, 2016 at 3:10 AM Aaron Halfaker <aaron.halfaker(a)gmail.com>
wrote:
Hey folks,
We've been working on generating some updated models for ORES. These
models will behave slightly differently from the models that we currently
have deployed. This is a natural artifact of retraining: even on the
*exact same data*, the learning algorithms have some random properties that
shift the results slightly. So, for the most part, this should be a non-issue for any
tools that use ORES. However, I wanted to take this opportunity to
highlight some of the facilities ORES provides to help automatically detect
and adjust for these types of changes.
*== Versions ==*
ORES provides information about all of the models. This information
includes a model version number. If you are caching ORES scores locally,
we recommend invalidating old scores whenever this model number changes.
For example,
https://ores.wikimedia.org/v2/scores/enwiki/damaging/12345678
currently returns
{
  "scores": {
    "enwiki": {
      "damaging": {
        "scores": {
          "12345678": {
            "prediction": false,
            "probability": {
              "false": 0.7141333465390294,
              "true": 0.28586665346097057
            }
          }
        },
        "version": "0.1.1"
      }
    }
  }
}
This score was generated with the "0.1.1" version of the model. But once
we deploy the new models, the same request will return:
{
  "scores": {
    "enwiki": {
      "damaging": {
        "scores": {
          "12345678": {
            "prediction": false,
            "probability": {
              "false": 0.8204647324045306,
              "true": 0.17953526759546945
            }
          }
        },
        "version": "0.1.2"
      }
    }
  }
}
Note that the version number changes to "0.1.2" and the probabilities
change slightly. In this case, we're essentially re-training the same
model in a similar way, so we increment the "patch" number.
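As a rough sketch of the cache invalidation described above, a tool could
compare the "version" field of each response against the version its cached
scores were generated with. The ScoreCache class and its in-memory layout
are hypothetical and not part of ORES; only the response shape is taken
from the examples above.

```python
class ScoreCache:
    """Hypothetical in-memory cache for ORES scores (not part of ORES)."""

    def __init__(self):
        self.versions = {}  # (wiki, model) -> model version string
        self.scores = {}    # (wiki, model, rev_id) -> score document

    def update(self, response):
        """Store the scores from a v2 response shaped like the examples above.

        If the model version differs from the one we cached against, all
        cached scores for that wiki/model pair are dropped first.
        """
        for wiki, models in response["scores"].items():
            for model, doc in models.items():
                key = (wiki, model)
                if self.versions.get(key) != doc["version"]:
                    # Version changed (or first time seen): invalidate old scores.
                    self.scores = {
                        k: v for k, v in self.scores.items() if k[:2] != key
                    }
                    self.versions[key] = doc["version"]
                for rev_id, score in doc["scores"].items():
                    self.scores[key + (rev_id,)] = score
```

A persistent cache (e.g. a database table) would do the same thing with a
delete keyed on wiki and model.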
However, we're switching modeling strategies for the article quality
models (enwiki-wp10, frwiki-wp10 & ruwiki-wp10), so those versions
increment the minor version from "0.3.2" to "0.4.0". You may see larger
changes in prediction probabilities with those models, but a quick spot
check suggests that the changes are small.
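If a tool wants to react differently to a patch bump than to a minor bump,
something like the following could tell them apart. This is only a sketch:
it assumes every version is three dot-separated integers, as in the
examples above, and the function names are made up for illustration.

```python
def parse_version(version):
    """Split a version string like "0.1.2" into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def change_kind(old, new):
    """Return which component differs first: "major", "minor", "patch" or "none"."""
    names = ("major", "minor", "patch")
    for name, a, b in zip(names, parse_version(old), parse_version(new)):
        if a != b:
            return name
    return "none"


print(change_kind("0.1.1", "0.1.2"))  # the damaging model change described above
print(change_kind("0.3.2", "0.4.0"))  # the wp10 model change described above
```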
*== Test statistics and thresholding ==*
So, many tools that use our edit quality models (reverted, damaging and
goodfaith) will set thresholds for flagging edits for review. In order to
support these tools, we produce test statistics that suggest useful
thresholds.
https://ores.wmflabs.org/v2/scores/enwiki/damaging/?model_info=test_stats
produces:
...
  "filter_rate_at_recall(min_recall=0.75)": {
    "filter_rate": 0.869,
    "recall": 0.752,
    "threshold": 0.492
  },
  "filter_rate_at_recall(min_recall=0.9)": {
    "filter_rate": 0.753,
    "recall": 0.902,
    "threshold": 0.173
  },
...
These two statistics show useful thresholds for detecting damaging edits.
E.g. if you want to be sure that you catch nearly all vandalism (recall of
0.902) and are OK with a higher false-positive rate, set the threshold at
0.173, but if you'd like to catch most vandalism (recall of 0.752) with
almost no false positives, set the threshold at 0.492. These fields can be
read automatically by tools so that they do not need to be manually updated
every time that we deploy a new model.
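A minimal sketch of reading a threshold automatically might look like this.
It assumes the tool has already pulled the test statistics out of the
model_info response as a dict keyed like the fragment above (the enclosing
structure is elided there, so it is not reproduced here). It also assumes
that the threshold applies to the "true" probability and that a score equal
to the threshold counts as flagged. Both function names are hypothetical.

```python
def threshold_for_recall(test_stats, min_recall):
    """Look up the suggested threshold for a given minimum recall.

    test_stats is assumed to be a dict keyed like
    "filter_rate_at_recall(min_recall=0.9)", as in the fragment above.
    """
    key = "filter_rate_at_recall(min_recall={})".format(min_recall)
    return test_stats[key]["threshold"]


def needs_review(score, threshold):
    """Flag an edit when its "true" probability reaches the threshold.

    Comparing with >= is an assumption; the statistics above do not say
    how ties should be treated.
    """
    return score["probability"]["true"] >= threshold


# Values copied from the test statistics fragment above.
test_stats = {
    "filter_rate_at_recall(min_recall=0.75)": {
        "filter_rate": 0.869, "recall": 0.752, "threshold": 0.492},
    "filter_rate_at_recall(min_recall=0.9)": {
        "filter_rate": 0.753, "recall": 0.902, "threshold": 0.173},
}
score = {"prediction": False,
         "probability": {"false": 0.8204647324045306,
                         "true": 0.17953526759546945}}

high_recall = threshold_for_recall(test_stats, 0.9)
print(high_recall, needs_review(score, high_recall))
```

Re-reading the threshold whenever the model version changes keeps the tool
in step with a new deployment without any manual update.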
Let me know if you have any questions and happy hacking!
-Aaron