A good machine learning algorithm should try to distinguish categories (good vs. bad edits) within a large sub-category (such as anonymous users). That's supposed to be one of its advantages over a simple scoring formula: different features can be weighted differently (even positively vs. negatively) depending on the exact combination of features.
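To make that concrete, here is a toy sketch of the difference. It is not ORES's actual model: the features (`is_anon`, `chars_removed`), the threshold, and every weight are made up for illustration. The fixed formula gives each feature a single weight, while the tree-style rule lets the same feature push the score up in one branch and down in the other.

```python
# Toy illustration only: hypothetical features and weights, not ORES's model.

def linear_score(is_anon: bool, chars_removed: int) -> float:
    """Fixed scoring formula: each feature has one weight, whatever the context."""
    return 0.4 * is_anon + 0.0004 * chars_removed


def tree_score(is_anon: bool, chars_removed: int) -> float:
    """Tree-style rule: the effect of chars_removed depends on the branch."""
    big_removal = chars_removed > 500  # assumed threshold, purely illustrative
    if is_anon:
        # Within the anon branch, a large removal looks more like blanking.
        return 0.7 if big_removal else 0.3
    # Within the registered branch, a large removal looks more like cleanup.
    return 0.05 if big_removal else 0.1


if __name__ == "__main__":
    for is_anon in (True, False):
        for chars_removed in (0, 1000):
            print(is_anon, chars_removed,
                  round(linear_score(is_anon, chars_removed), 2),
                  tree_score(is_anon, chars_removed))
```

In the formula, removing more text always raises the score by the same amount. In the tree, it raises the score for anons and lowers it for registered users, which is the kind of within-group distinction a learned model can pick up if the signal is there.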
It may have been difficult for ORES to make the distinction because the signal within the anonymous sub-group was too noisy. At the end of the relevant section in the video (around 54m30s), Aaron mentions using a grammar to parse each edit so that more of the edit's content is taken into account, which is where the real distinction among anon users has to be made.
(The other interesting takeaway from his presentation is that if you want to really vandalize Wikipedia successfully, make an account and then wait 8 years—then you'll be free to do anything!)