On Mon, Jul 6, 2009 at 2:37 PM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrett <agarrett@wikimedia.org> wrote:
Yes.
We wouldn't allow direct searching from the web interface with regexes for two related reasons:
1/ A single search with even the most basic regex would take several minutes, if not hours. It isn't computationally trivial to search for a small literal string in a corpus of over 10 GB, let alone to match a regex against it. Words can be indexed; regexes cannot.
2/ Even if we could find a way to make the former performant, a malicious regex could inflate that time enormously, leading to a denial of service.
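
[For illustration only, not part of the original mail: the classic pathological case for a backtracking engine such as PCRE or Python's re module. Each extra 'a' in the subject roughly doubles the running time, so a short, innocent-looking pattern can pin a CPU core for a very long time.]

import re
import time

# Nested quantifiers force the engine to try every way of splitting the
# run of 'a's between the two '+' operators before it can conclude that
# the trailing 'b' makes a match impossible -- exponential backtracking.
evil = re.compile(r'(a+)+$')
subject = 'a' * 26 + 'b'

start = time.time()
evil.match(subject)        # returns None, but only after an exponential number of attempts
print('elapsed: %.1fs' % (time.time() - start))
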
I seem to recall Gregory Maxwell describing a setup that made this feasible, given an appropriate amount of dedicated hardware. It was run with the entire database in memory; it only permitted "real" regular expressions (compilable to finite-state machines, no backreferences, etc.); and it limited the size of the generated finite-state machine. Running a regex took several minutes, but he'd run a lot of them in parallel, since it was mostly memory-bound, so he got fairly good throughput. Something like that.
But probably not practical without an undue amount of effort and hardware, yeah. :)
Yes. I didn't respond to the initial suggestion because "full PCRE" is simply far too much to ask for.
Basic regexps of the sort that can be compiled into a deterministic finite-state machine (i.e. no backtracking) can be merged together into a single larger state machine.
So long as the combined state machine fits in cache, the entire DB can be scanned in not much more time than it takes to read it from memory, even with hundreds of regexps running in parallel.
So you batch up user requests and then run them in parallel groups. Good throughput, poor latency.
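
To make the batching concrete, here's a rough single-machine sketch (mine, just for illustration; the function and parameter names are made up). It merges many user patterns into one alternation and scans each page once, recording which queries matched; the per-query cap is the crude answer to the '.'-style queries mentioned below. Python's re is a backtracking engine and its non-overlapping scan can shadow one pattern behind another at the same position, so this only illustrates the batching and the cap, not the actual DFA merge; a real version would sit on a DFA-based engine like RE2.

import re

def batch_scan(patterns, pages, max_hits_per_pattern=1000):
    """Scan every page once, reporting which patterns matched it.

    patterns: list of regex strings, assumed to be the 'real' regular
              expression subset (no backreferences, no lookaround) and
              to contain no capturing groups of their own.
    pages:    list of (title, text) pairs.
    Returns:  {pattern_index: [titles, ...]}, with at most
              max_hits_per_pattern titles per pattern.
    """
    # One combined pattern; the named group that fired tells us which
    # query produced the match.
    combined = re.compile('|'.join(
        '(?P<q%d>%s)' % (i, p) for i, p in enumerate(patterns)))

    hits = dict((i, []) for i in range(len(patterns)))
    saturated = set()                    # queries that already hit the cap

    for title, text in pages:
        seen_here = set()
        for m in combined.finditer(text):
            i = int(m.lastgroup[1:])     # 'q7' -> 7
            if i in seen_here or i in saturated:
                continue
            seen_here.add(i)
            hits[i].append(title)
            if len(hits[i]) >= max_hits_per_pattern:
                saturated.add(i)
        if len(saturated) == len(patterns):
            break                        # every query is full; stop early
    return hits
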
Insufficiently selective queries are problematic: I never came up with a really good solution for people feeding in patterns like '.' and stalling the whole process by wasting a lot of memory bandwidth updating the result set. (An obvious mitigation is simply to cap the number of results per query.)
The latency can be reduced by partitioning the database across multiple machines (more aggregate memory bandwidth). By doing this you could achieve arbitrarily low latency and enormous throughput. Dunno if it's actually worthwhile, however.
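
A rough sketch of the partitioning step, reusing the batch_scan() helper above (again mine, not something that was actually run; multiprocessing on one box stands in for separate machines, and num_shards is an arbitrary assumption):

from multiprocessing import Pool

def _scan_shard(args):
    patterns, shard = args
    return batch_scan(patterns, shard)

def partitioned_scan(patterns, pages, num_shards=8):
    """Split the pages across workers, scan the shards in parallel, merge."""
    shards = [pages[i::num_shards] for i in range(num_shards)]
    pool = Pool(num_shards)
    try:
        partial = pool.map(_scan_shard, [(patterns, s) for s in shards])
    finally:
        pool.close()
        pool.join()
    # Merge the per-shard hit lists; a real system would re-apply the
    # result cap here as well.
    merged = dict((i, []) for i in range(len(patterns)))
    for result in partial:
        for i, titles in result.items():
            merged[i].extend(titles)
    return merged
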