I've got a tool which parses sockpuppet investigation (SPI) pages and does some analysis. One of the steps is I need to validate that all of the usernames found in the SPI report are valid. I do that by sequentially calling usercontribs on each name with uclimit=1 and seeing if I get a baduser error.
This works, but it's slow because I need to make 1 API call for each user. For a big SPI case, the time to do this swamps everything else. Is there a more efficient way to do this? Some API call where I can give it a bunch of usernames in a batch and have it tell me which ones are invalid? Alternatively, is there a regex I could apply on the client side to test if a username is valid?
The most common type of invalid name I see is when somebody puts down an iprange (i.e. 1.2.4.0/24) as a username. Testing for that client-side would be trivial, but it might miss some others.
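For what it's worth, the trivial client-side check for the IP/range case can be sketched with Python's stdlib `ipaddress` module. The function name is mine, and this only catches IP-shaped names, not the other kinds of invalid usernames:

```python
import ipaddress

def looks_like_ip_or_range(name):
    """Return True if name parses as an IP address or a CIDR range.

    Covers the common case of an IP range pasted where a username
    belongs; it will NOT catch other kinds of invalid names.
    """
    for parse in (ipaddress.ip_address, ipaddress.ip_network):
        try:
            parse(name)
            return True
        except ValueError:
            pass
    return False
```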
On Thu, Aug 19, 2021 at 4:04 PM Roy Smith roy@panix.com wrote:
You can do lookups in batches of 50 (500 if you have the "apihighlimits" right which is commonly granted by the "Bots" group on movement wikis) with https://en.wikipedia.org/w/api.php?action=help&modules=query%2Busers.
Here's a quick example: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=users&format=json&utf8=1&formatversion=2&ususers=Bryan%20Davis%7CBryanDavis%7CBDavis%20(WMF)%7Cbd808
The results will look something like:

```
{
  "batchcomplete": true,
  "query": {
    "users": [
      { "name": "Bryan Davis", "missing": true },
      { "userid": 2619078, "name": "BryanDavis" },
      { "userid": 19474624, "name": "BDavis (WMF)" },
      { "userid": 24257381, "name": "Bd808" }
    ]
  }
}
```
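A minimal batching sketch against this endpoint might look like the following. It is stdlib-only; `chunked` and `find_bad_users` are made-up names, the User-Agent string is a placeholder, and real code would want error handling and retries:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"

def chunked(items, size=50):
    """Split items into batches of at most `size` (50 is the limit
    without the apihighlimits right)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def find_bad_users(names):
    """Return the names the wiki reports back as missing or invalid."""
    bad = []
    for batch in chunked(list(names)):
        query = urlencode({
            "action": "query",
            "list": "users",
            "ususers": "|".join(batch),
            "format": "json",
            "formatversion": "2",
        })
        # WMF APIs want an identifying User-Agent; this one is a placeholder.
        req = Request(f"{API_URL}?{query}",
                      headers={"User-Agent": "spi-username-check-sketch/0.1"})
        with urlopen(req) as resp:
            data = json.load(resp)
        for user in data["query"]["users"]:
            if user.get("missing") or user.get("invalid"):
                bad.append(user["name"])
    return bad
```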
Bryan

--
Bryan Davis
Technical Engagement, Wikimedia Foundation
Principal Software Engineer
Boise, ID USA
[[m:User:BDavis_(WMF)]] | irc: bd808
Ah, cool. That's exactly what I was looking for, thanks.
It turns out this is a little more complicated than it appeared at first; usercontribs and list=users have different concepts of "invalid". If you ask for usercontribs on "1.2.3.4", it's valid, but if you pass in "1.2.3.0/24", you get baduser. list=users, however, returns:
```
{
  "batchcomplete": "",
  "query": {
    "users": [
      { "name": "1.2.3.4", "invalid": "" }
    ]
  }
}
```
which I guess makes sense in that context since it can't map it to a userid. I can work around this, but mentioning it for the sake of some poor developer searching the archives N years from now trying to figure it out :-)
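A sketch of sorting a `list=users` response into buckets. It checks for key *presence* rather than truthiness, since the older format (as in the output above) uses `""` for the flags while formatversion=2 uses `true`; the helper name is mine:

```python
def classify(users_result):
    """Split a list=users "users" array into registered names,
    unregistered-but-valid names, and invalid names.

    Note that IPs and ranges come back flagged "invalid" here even
    though usercontribs happily accepts a bare IP.
    """
    registered, missing, invalid = [], [], []
    for u in users_result:
        if "invalid" in u:
            invalid.append(u["name"])
        elif "missing" in u:
            missing.append(u["name"])
        else:
            registered.append(u["name"])
    return registered, missing, invalid
```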
Sigh. It's even more complicated than that. It looks like the "name" entry doesn't always match the name you passed in the API call, but is subject to case mapping, trailing whitespace stripping, and maybe a few other things?
```
$ curl -s 'https://en.wikipedia.org/w/api.php?action=query&format=json&list=use...' | json_pp
{
  "query" : {
    "users" : [
      {
        "name" : "RoySmith",
        "gender" : "unknown",
        "groups" : [ "sysop", "*", "user", "autoconfirmed" ],
        "userid" : 130326,
        "editcount" : 58645
      },
      { "name" : "Roysmith", "missing" : "" }
    ]
  },
  "batchcomplete" : ""
}
```
I'm assuming the entries in the returned "users" list are guaranteed to be in the same order as the input parameters? I can't find anyplace that says this, but it seems logical. Can somebody confirm that it's true?
Ugh. That's not even true. It looks like all the invalid entries are emitted first, then the valid ones. And duplicates are deduplicated.

So we're down to this: you give it a bunch of names, and it gives you back a bunch of data which may not have the same number of entries as your input list; the entries aren't guaranteed to be in the same order as the input (despite the fact that the Python mwclient library goes out of its way to present the result as an OrderedDict); and the output keys aren't guaranteed to match the input keys.
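One way to cope, sketched below, is to key the results by the returned name and use the `normalized` from/to pairs that (as far as I can tell) action=query includes in the response when it rewrites an input name. Deduplicated inputs then map to the same result dict, and anything unmatched comes back as None. The function name is mine and the fallback behavior is an assumption, not documented API contract:

```python
def map_results_to_inputs(inputs, response):
    """Map each input username back to its result dict from a
    list=users response, despite reordering, dedup, and renaming.

    Relies on query.normalized (a list of {"from": ..., "to": ...}
    pairs) to translate input names to the names the API returns;
    names the API didn't rename are looked up as-is.
    """
    query = response["query"]
    normalized = {n["from"]: n["to"] for n in query.get("normalized", [])}
    by_name = {u["name"]: u for u in query["users"]}
    return {name: by_name.get(normalized.get(name, name))
            for name in inputs}
```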
On Sun, Sep 5, 2021 at 1:18 PM Roy Smith roy@panix.com wrote:
Sigh. It's even more complicated than that. It looks like the "name" entry doesn't always match the name you passed in the API call, but is subject to case mapping, trailing whitespace stripping, and maybe a few other things?
MediaWiki normalizes usernames using non-trivial rules [0]. The level of abstraction in this code will have you chasing through a number of classes to figure out all of the rules. The "simple" version of canonicalizing a username is something like:
- Replace all whitespace characters with underscores (`_`)
- Reduce any runs of multiple underscores to a single underscore
- Trim any leading or trailing underscores from the string
- Capitalize the string
The real rules are a bit more complicated than this [1] and include rejecting names containing certain special characters or runs of characters.
I'm assuming the entries in the returned "users" list are guaranteed to be in the same order as the input parameters? I can't find anyplace that says this, but it seems logical. Can somebody confirm that it's true?
I see you've already figured this out from your follow up message, but for the sake of future readers, no. Each username provided to the query is normalized before querying the database and any invalid usernames are output first [2].
[0]: https://github.com/wikimedia/mediawiki/blob/02f7392231ef40a0f928fbd5ec791eff...
[1]: https://github.com/wikimedia/mediawiki/blob/02f7392231ef40a0f928fbd5ec791eff...
[2]: https://github.com/wikimedia/mediawiki/blob/02f7392231ef40a0f928fbd5ec791eff...
Bryan
That can't be right. I think you meant, "Reduce any runs of multiple underscores to a single SPACE" and then "Trim any leading or trailing spaces"
On Sep 6, 2021, at 12:15 AM, Bryan Davis bd808@wikimedia.org wrote:
- Replace all whitespace characters with underscores (`_`)
- Reduce any runs of multiple underscores to a single underscore
- Trim any leading or trailing underscores from the string
- Capitalize the string
Yes, the canonical form of usernames is with spaces, but the canonical form of page titles is with underscores.
No this hasn't ever confused anyone or caused me any problems, why do you ask?
ACN
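Putting the corrected rules together (usernames canonicalize to spaces, not underscores), a rough client-side approximation might look like this. The real MediaWiki logic also rejects names containing certain characters, applies the wiki's own Unicode case mapping, and more, so treat this strictly as a sketch:

```python
import re

def canonicalize_username(name):
    """Approximate MediaWiki username canonicalization:
    collapse runs of underscores and whitespace into a single space,
    trim leading/trailing spaces, and uppercase the first character
    (leaving the rest of the name untouched).
    """
    name = re.sub(r"[_\s]+", " ", name).strip()
    return name[:1].upper() + name[1:] if name else name
```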
See the first message in this thread: https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/message/274WJ2XCHWZ6544MITG57EATDQHSRS5I/
Cloud mailing list -- cloud@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/cloud.lists.wikimedia.org/