On Mon, Jul 6, 2009 at 2:37 PM, Aryeh
Gregor<Simetrical+wikilist(a)gmail.com> wrote:
> On Mon, Jul 6, 2009 at 7:43 AM, Andrew
> Garrett<agarrett(a)wikimedia.org> wrote:
>> Yes.
>> We wouldn't allow direct searching from the web interface with regexes
>> for two related reasons:
>> 1/ A single search with the most basic of regexes would take several
>> minutes, if not hours. It isn't computationally trivial to search for
>> a small string in a very large string of over 10 GB, let alone a
>> regex. Words can be indexed; regexes cannot.
>> 2/ Even if we could find a way to make the former performant, a
>> malicious regex could significantly expand the time taken, leading to
>> a denial of service.
> I seem to recall Gregory Maxwell describing a setup that made this
> feasible, given an appropriate amount of dedicated hardware. It was
> run with the entire database in memory; it only permitted "real"
> regular expressions (compilable to finite-state machines, no
> backreferences etc.); and it limited the size of the finite-state
> machine generated. Running a regex took several minutes, but he'd run
> a lot of them in parallel, since it was mostly memory-bound, so he got
> fairly good throughput. Something like that.
> But probably not practical without an undue amount of effort and
> hardware, yeah. :)
Yes. I didn't comment on the initial message because "full PCRE" is
simply far too much to ask for.
Basic regexps of the sort that can be compiled into a deterministic
finite state machine (i.e. no backtracking) can be merged together
into a single larger state machine.
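As a rough illustration (not the actual setup), here's a Python sketch of
running several pattern automata in lockstep during a single pass over the
text. Plain substring automata stand in for compiled regex DFAs; the tuple
of per-pattern states is effectively the state of the merged machine:

```python
def build_automaton(pattern):
    """KMP-style automaton for finding `pattern` as a substring.
    State = number of characters matched so far; len(pattern) = accept."""
    fail = [0] * (len(pattern) + 1)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k

    def step(state, ch):
        while state and (state == len(pattern) or pattern[state] != ch):
            state = fail[state]
        if state < len(pattern) and pattern[state] == ch:
            state += 1
        return state

    return step, len(pattern)

def scan(text, patterns):
    """One pass over `text`; every automaton advances on each character."""
    machines = [build_automaton(p) for p in patterns]
    states = [0] * len(patterns)
    matched = set()
    for ch in text:
        for i, (step, accept) in enumerate(machines):
            states[i] = step(states[i], ch)
            if states[i] == accept:
                matched.add(patterns[i])
    return matched
```

A real merged machine would precompute the product of the automata so each
character costs one transition rather than one per pattern, but the
single-pass structure is the same.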
So long as the state machine fits in cache, the entire DB can be
scanned in not much more time than it takes to read it in from memory,
even if there are hundreds of parallel regexps.
So you batch up user requests then run them in parallel groups. Good
throughput, poor latency.
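A hedged sketch of the batching idea: Python's `re` is a backtracking
engine rather than a true DFA, and the query names and patterns below are
invented, but combining a batch of queries into one alternation shows the
one-pass, many-queries shape:

```python
import re

# Hypothetical batch of pending user queries (names and patterns illustrative).
queries = {"q0": r"foo\d+", "q1": r"bar(?:baz)?", "q2": r"qu+x"}

# One combined pattern: each query becomes a named alternative.
combined = re.compile("|".join(f"(?P<{n}>{p})" for n, p in queries.items()))

def batch_scan(pages):
    """Single pass over every page; each match is credited to its query."""
    hits = {name: set() for name in queries}
    for page_id, text in pages:
        for m in combined.finditer(text):
            hits[m.lastgroup].add(page_id)
    return hits
```

Every page is read once per batch rather than once per query, which is the
source of the throughput; the latency cost is that a query waits for its
batch to be scheduled.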
Insufficiently selective queries are problematic. I never came up with
a really good solution to people feeding in patterns like '.' and
stalling the whole process by wasting a lot of memory bandwidth
updating the result set. (An obvious solution might just be to limit
the number of results.)
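The result cap could be sketched like this (the cap value and names are
invented, and a substring test stands in for the automaton step): once an
unselective pattern fills its quota, stop recording results for it so it no
longer burns memory bandwidth, while the rest of the batch keeps going:

```python
def batch_scan_capped(pages, patterns, cap=1000):
    """Scan all pages once; drop a pattern from result collection at `cap`."""
    results = {p: [] for p in patterns}
    live = set(patterns)  # patterns still collecting results
    for page_id, text in pages:
        for p in list(live):
            if p in text:  # substring test stands in for the DFA step
                results[p].append(page_id)
                if len(results[p]) >= cap:
                    live.discard(p)  # a '.'-style query hit its cap
        if not live:
            break  # every query is satisfied; stop scanning early
    return results
```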
The latency can be reduced by partitioning the database across
multiple machines (more aggregate memory bandwidth). By doing this
you could achieve arbitrarily low latency and enormous throughput.
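A minimal sketch of the partitioning idea, with threads standing in for
separate machines (in CPython the GIL means threads won't actually multiply
memory bandwidth; real partitioning would shard the corpus across hosts).
All function and parameter names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_shard(shard, pattern):
    # Substring test stands in for running the merged automaton over one shard.
    return [page_id for page_id, text in shard if pattern in text]

def partitioned_scan(pages, pattern, shards=4):
    """Split the corpus into shards, scan them in parallel, merge results."""
    chunks = [pages[i::shards] for i in range(shards)]
    with ThreadPoolExecutor(max_workers=shards) as pool:
        parts = pool.map(lambda c: scan_shard(c, pattern), chunks)
    return sorted(pid for part in parts for pid in part)
```

Each shard is a fraction of the corpus, so per-query latency drops roughly
in proportion to the number of machines, at the price of a merge step.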
Dunno if it's actually worthwhile, however.