Hey folks,

We've been working on generating some updated models for ORES.  These models will behave slightly differently from the models that we currently have deployed.  This is a natural artifact of retraining: even on the *exact same data*, some random properties of the learning algorithms produce slightly different models.  So, for the most part, this should be a non-issue for any tools that use ORES.  However, I wanted to take this opportunity to highlight some of the facilities ORES provides to help automatically detect and adjust for these types of changes.

== Versions ==
ORES provides information about all of the models.  This information includes a model version number.  If you are caching ORES scores locally, we recommend invalidating old scores whenever this version number changes.  For example, https://ores.wikimedia.org/v2/scores/enwiki/damaging/12345678 currently returns:

{
  "scores": {
    "enwiki": {
      "damaging": {
        "scores": {
          "12345678": {
            "prediction": false,
            "probability": {
              "false": 0.7141333465390294,
              "true": 0.28586665346097057
            }
          }
        },
        "version": "0.1.1"
      }
    }
  }
}

This score was generated with the "0.1.1" version of the model.  But once we deploy the new models, the same request will return: 
{
  "scores": {
    "enwiki": {
      "damaging": {
        "scores": {
          "12345678": {
            "prediction": false,
            "probability": {
              "false": 0.8204647324045306,
              "true": 0.17953526759546945
            }
          }
        },
        "version": "0.1.2"
      }
    }
  }
}

Note that the version number changes to "0.1.2" and the probabilities change slightly.  In this case, we're essentially re-training the same model in a similar way, so we increment the "patch" number.

However, we're switching modeling strategies for the article quality models (enwiki-wp10, frwiki-wp10 & ruwiki-wp10), so those versions increment the minor version from "0.3.2" to "0.4.0".  You may see larger changes in prediction probabilities with those models than with the patch-level updates, but quick spot-checking suggests that the differences are still not substantial.
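To make the caching advice above concrete, here's a minimal sketch of a local cache that throws away its scores whenever the reported version changes.  `ScoreCache` and its method names are hypothetical (they are not part of ORES), and the two dictionaries are just the parsed responses shown above, so treat this as an illustration rather than a recommended implementation.

```python
class ScoreCache:
    """Hypothetical local cache that is emptied whenever the model version changes."""

    def __init__(self):
        self.version = None  # version of the model that produced the cached scores
        self.scores = {}     # revision ID (str) -> score document

    def update(self, response, wiki, model):
        """Store scores from a parsed v2 response, dropping stale entries first."""
        model_doc = response["scores"][wiki][model]
        if model_doc["version"] != self.version:
            # Scores from a different model version are not comparable: discard them.
            self.scores.clear()
            self.version = model_doc["version"]
        self.scores.update(model_doc["scores"])


# The two responses shown above, as parsed JSON.
old_response = {"scores": {"enwiki": {"damaging": {
    "scores": {"12345678": {
        "prediction": False,
        "probability": {"false": 0.7141333465390294, "true": 0.28586665346097057}}},
    "version": "0.1.1"}}}}

new_response = {"scores": {"enwiki": {"damaging": {
    "scores": {"12345678": {
        "prediction": False,
        "probability": {"false": 0.8204647324045306, "true": 0.17953526759546945}}},
    "version": "0.1.2"}}}}

cache = ScoreCache()
cache.update(old_response, "enwiki", "damaging")
cache.update(new_response, "enwiki", "damaging")  # version changed: old scores dropped
print(cache.version, cache.scores["12345678"]["probability"]["true"])
```

A tool that caches several models would presumably keep one such cache per (wiki, model) pair, since each model carries its own version.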

== Test statistics and thresholding ==
Many tools that use our edit quality models (reverted, damaging and goodfaith) set thresholds for flagging edits for review.  To support these tools, we produce test statistics that suggest useful thresholds.

https://ores.wmflabs.org/v2/scores/enwiki/damaging/?model_info=test_stats produces:

      ...
            "filter_rate_at_recall(min_recall=0.75)": {
              "filter_rate": 0.869,
              "recall": 0.752,
              "threshold": 0.492
            },
            "filter_rate_at_recall(min_recall=0.9)": {
              "filter_rate": 0.753,
              "recall": 0.902,
              "threshold": 0.173
            },
      ...

These two statistics show useful thresholds for detecting damaging edits.  E.g., if you want to be sure that you catch nearly all vandalism (and are OK with a higher false-positive rate), set the threshold at 0.173, which has a recall of about 90%.  If you'd like to catch most vandalism with almost no false positives, set the threshold at 0.492, which has a recall of about 75%.  These fields can be read automatically by tools so that they do not need to be manually updated every time that we deploy a new model.
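As a rough sketch of reading those fields automatically: the helper names below are hypothetical, and `test_stats` holds only the excerpt shown above.  Where that block sits inside the full response is elided in the excerpt, so locating it is left to the caller.  Flagging when the "true" probability is at or above the threshold is also an assumption on my part.

```python
def threshold_for_recall(test_stats, min_recall):
    """Look up the suggested threshold for a given minimum recall."""
    key = f"filter_rate_at_recall(min_recall={min_recall})"
    return test_stats[key]["threshold"]


def needs_review(score, threshold):
    """Assumed rule: flag an edit when P(damaging) reaches the threshold."""
    return score["probability"]["true"] >= threshold


# The excerpt of the test statistics shown above, as parsed JSON.
test_stats = {
    "filter_rate_at_recall(min_recall=0.75)": {
        "filter_rate": 0.869, "recall": 0.752, "threshold": 0.492},
    "filter_rate_at_recall(min_recall=0.9)": {
        "filter_rate": 0.753, "recall": 0.902, "threshold": 0.173},
}

# The score for revision 12345678 from the "0.1.2" response above.
score = {"prediction": False,
         "probability": {"false": 0.8204647324045306, "true": 0.17953526759546945}}

high_recall = threshold_for_recall(test_stats, 0.9)   # 0.173
strict = threshold_for_recall(test_stats, 0.75)       # 0.492

print(needs_review(score, high_recall))  # True: 0.1795... >= 0.173
print(needs_review(score, strict))       # False: 0.1795... < 0.492
```

Because the threshold is looked up from the statistics rather than hard-coded, a tool written this way should pick up new thresholds whenever a new model version is deployed.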

Let me know if you have any questions and happy hacking!

-Aaron