Hey everyone,
As part of T195491 https://phabricator.wikimedia.org/T195491, Erik has been looking into the details of our regex processing and ways to handle ridiculously long-running regex queries. He pulled all the regex queries over the last 90 days to get a sense of what features people are using and what impact certain changes he was considering would have on users. Turns out there are a lot more users than I would have thought—which is good news! And a lot of them look like bots.
He also made the mistake of pointing me to the data and highlighting a common pattern: searches for interwiki links. I couldn't help myself; I started digging around and found that the majority of the searches are looking for those interwiki links, and that the vast majority of regex searches fall into one of three types: interwiki links, URLs, and Library of Congress collection IDs.
Overall, there were 5,613,506 regex searches across all projects and all languages over the 90-day period. That comes out to ~62K/day, which is a lot more than I'd expected, though I hadn't thought about bots using regexes.
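For a concrete sense of what those three buckets look like, here's a toy classifier in Python; the example patterns are hypothetical stand-ins, not queries taken from the data set:

    # Hypothetical stand-ins for the three dominant shapes of regex query
    # described above -- not the actual logged queries.
    SAMPLES = [
        r"\[\[en:",                  # interwiki link, e.g. insource:/\[\[en:/
        r"https?://example\.org/",   # URL
        r"LOC \d+",                  # Library of Congress collection ID
        r"colou?r",                  # something else entirely
    ]

    def classify(pattern):
        if r"\[\[" in pattern:
            return "interwiki link"
        if "http" in pattern or "www" in pattern:
            return "URL"
        if "LOC" in pattern.upper():
            return "LOC collection ID"
        return "other"

    for p in SAMPLES:
        print(p, "->", classify(p))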
Read more on MediaWiki https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Regular_Expression_Searches .
—Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
Thanks for sharing. This is a nice analysis, going from data to insights. How do we drive actions from this report? Do we plan to use this data to make better tools? For example, a list of common pitfalls and how to avoid them: searching for Library of Congress links with regex search instead of the external links query ( https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextlinks ) (and similarly, iwlinks for interwiki links). This could even be pushed to tools actively (either by using the User-Agent to contact the tool devs, or by adding warnings to the API result).
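As a concrete sketch of that kind of guidance (a Python + requests example; note that for the "which pages link here" direction the relevant module is list=exturlusage rather than prop=extlinks, and list=iwbacklinks for interwiki links; the target domain below is just a placeholder):

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    # Find pages that link to loc.gov via the API instead of a regex
    # search over page source. (list=iwbacklinks does the same job for
    # interwiki links.) The domain here is just an illustrative target.
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": "loc.gov",
        "euprotocol": "https",
        "eulimit": "10",
        "format": "json",
    }
    for hit in requests.get(API, params=params, timeout=30).json()["query"]["exturlusage"]:
        print(hit["title"], hit["url"])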
On Wed, May 30, 2018 at 11:51 PM, Trey Jones tjones@wikimedia.org wrote:
Sorry for the late reply, I've been out for a few days.
Right now there's no real impetus to create new action items from this report. The purpose of gathering the data wasn't to generate this specific report; that was just a nice incidental benefit. Obviously, if in the future we run into problems supporting the volume of regex queries we're getting, we now know there's a good chance we could make a major dent in that volume by tracking down the sources of these three regex patterns, finding out what they're trying to do, and helping them do it more efficiently (or throttling them if necessary, though hopefully it wouldn't come to that).
We seem to be doing okay at the moment because the biggest users are running queries that are reasonably efficient: they contain either non-regex parts (like "LOC" or the full URL) or some decent stretch of literal text (like "[[en:"), so we can use "trigram acceleration". That is, the search first finds documents containing the character trigrams of the literal parts of the regex, and then runs the regex only over that smaller result set. This can cut the pool of potential documents from millions down to a few thousand, which is much more manageable.
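As a toy illustration of the two phases (this is not how CirrusSearch/Elasticsearch actually implements it; the documents and index here are made up):

    import re
    from collections import defaultdict

    docs = {
        1: "See [[en:Example]] for details.",
        2: "Nothing relevant here.",
        3: "More links: [[en:Another page]].",
    }

    # Inverted index: character trigram -> set of doc ids containing it.
    def trigrams(text):
        return {text[i:i + 3] for i in range(len(text) - 2)}

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tg in trigrams(text):
            index[tg].add(doc_id)

    def accelerated_search(literal, pattern):
        # Phase 1: candidates must contain every trigram of the literal
        # substring that all matches share (passed in by hand here; the
        # real system derives it from the regex).
        candidates = set(docs)
        for tg in trigrams(literal):
            candidates &= index[tg]
        # Phase 2: run the actual (expensive) regex only over the survivors.
        rx = re.compile(pattern)
        return [d for d in sorted(candidates) if rx.search(docs[d])]

    print(accelerated_search("[[en:", r"\[\[en:[^\]]+\]\]"))  # -> [1, 3]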
—Trey
On Fri, Jun 1, 2018 at 5:22 AM, Eran Rosenthal eranroz89@gmail.com wrote: