(Originally asked at [[Wikipedia talk:Searching]] and [[WP:VPT]].)
Is there any existing way to search Wikipedia, or MediaWiki in general, using full-fledged regular expressions, such as those found in Perl, PCRE, Python, or JavaScript?
I started writing a Perl program that uses Parse::MediaWikiDump, goes over a dump and searches for regexes, but there are two problems with this:
1. Such a program probably already exists, although I don't know where. Can anyone point me to an existing tool? It can be in any other language, not necessarily Perl, but it should be portable - not Windows-only/Mac-only/Linux-only.
2. The info won't be up-to-date. Would it be too much to ask to search the database directly using regexes?
If problem number 2 is too hard to solve and nobody knows the answer to problem number 1, then I guess that I'll publish my Perl dump-searching program for the common good. (Why not Python? Because I know Perl better and Parse::MediaWikiDump works well enough for me.)
-- Amir Elisha Aharoni
"We're living in pieces, I want to live in peace." - T. Moore
On 06/07/2009, at 12:05 PM, Amir E. Aharoni wrote:
- The info won't be up-to-date. Would it be too much to ask to search
the database directly using regexes?
Yes.
We wouldn't allow direct searching from the web interface with regexes for two related reasons:
1/ A single search with the most basic of regexes would take several minutes, if not hours. It isn't computationally trivial to search for a small string in a very large string of over 10 GB, let alone a regex. Words can be indexed, regexes cannot.
2/ Even if we could find a way to make the former performant, a malicious regex could significantly expand this time taken, leading to a denial of service.
-- Andrew Garrett
Contract Developer, Wikimedia Foundation
agarrett@wikimedia.org
http://werdn.us
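The second reason is easy to demonstrate with any backtracking regex engine. A small Python sketch (the pattern and input lengths are arbitrary): nested quantifiers force the engine to try exponentially many ways to partition the input before concluding there is no match.

```python
import re
import time

def time_failed_match(pattern, text):
    """Time how long the engine spends before concluding there is no match."""
    start = time.perf_counter()
    assert re.match(pattern, text) is None
    return time.perf_counter() - start

# Classic pathological pattern: the nested quantifiers make the
# backtracking engine try every way of splitting the run of 'a's.
evil = r"(a+)+$"
t_short = time_failed_match(evil, "a" * 16 + "!")
t_long = time_failed_match(evil, "a" * 24 + "!")
# Eight extra characters multiply the work by roughly 2**8.
```

A handful of such patterns submitted concurrently would saturate a server, which is exactly the denial-of-service concern above.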
On Mon, Jul 6, 2009 at 14:43, Andrew Garrett agarrett@wikimedia.org wrote:
On 06/07/2009, at 12:05 PM, Amir E. Aharoni wrote:
- The info won't be up-to-date. Would it be too much to ask to search
the database directly using regexes?
Yes.
We wouldn't allow direct searching from the web interface with regexes for two related reasons:
1/ A single search with the most basic of regexes would take several minutes, if not hours. [...]
2/ Even if we could find a way to make the former performant, a malicious regex could significantly expand this time taken, leading to a denial of service.
I imagined that that would be the answer. Maybe it is possible to set up a service that searches a backup database, which is almost up-to-date, or to have someone approve the regexes to prevent a DoS.
Just throwing a bunch of ideas into the air.
Mostly I was interested in finding out whether there is an existing script that searches dumps, instead of writing one myself.
-- אמיר אלישע אהרוני Amir Elisha Aharoni
"We're living in pieces, I want to live in peace." - T. Moore
Amir E. Aharoni <amir.aharoni <at> gmail.com> writes:
Mostly I was interested in finding out whether there is an existing script that searches dumps, instead of writing one myself.
I wrote a C++ program that can search through a database dump (with regexes); it's one of the first apps I wrote, so the source is not pretty to read. You can find it here:
http://meta.wikimedia.org/wiki/User:Micke/WikiFind
I would not implement it that way (and I would not write it in C++ :) if I were writing it today, but if it helps you, please go ahead and use it.
Use the second version of the program if you want to pipe the dump in from bzcat.
/Micke
Micke Nordin wrote:
Amir E. Aharoni <amir.aharoni <at> gmail.com> writes:
Mostly I was interested in finding out whether there is an existing script that searches dumps, instead of writing one myself.
I wrote a C++ program that can search through a database dump (with regexes); it's one of the first apps I wrote, so the source is not pretty to read. You can find it here:
http://meta.wikimedia.org/wiki/User:Micke/WikiFind
I would not implement it that way (and I would not write it in C++ :) if I were writing it today, but if it helps you, please go ahead and use it.
Use the second version of the program if you want to pipe the dump in from bzcat.
/Micke
I tried to install it, but I couldn't manage to. If somebody has managed to compile a Vista executable, I would appreciate it if they could send me the file.
Thanks,
Marcus Buck (User:Slomox)
On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrett agarrett@wikimedia.org wrote:
Yes.
We wouldn't allow direct searching from the web interface with regexes for two related reasons:
1/ A single search with the most basic of regexes would take several minutes, if not hours. It isn't computationally trivial to search for a small string in a very large string of over 10 GB, let alone a regex. Words can be indexed, regexes cannot.
2/ Even if we could find a way to make the former performant, a malicious regex could significantly expand this time taken, leading to a denial of service.
I seem to recall Gregory Maxwell describing a setup that made this feasible, given an appropriate amount of dedicated hardware. It ran with the entire database in memory; it only permitted "real" regular expressions (compilable to finite-state machines, no backreferences etc.); and it limited the size of the finite-state machine generated. Running a regex took several minutes, but he'd run a lot of them in parallel, since it was mostly memory-bound, so he got fairly good throughput. Something like that.
But probably not practical without an undue amount of effort and hardware, yeah. :)
On Mon, Jul 6, 2009 at 2:37 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
On Mon, Jul 6, 2009 at 7:43 AM, Andrew Garrett agarrett@wikimedia.org wrote:
Yes.
We wouldn't allow direct searching from the web interface with regexes for two related reasons:
1/ A single search with the most basic of regexes would take several minutes, if not hours. It isn't computationally trivial to search for a small string in a very large string of over 10 GB, let alone a regex. Words can be indexed, regexes cannot.
2/ Even if we could find a way to make the former performant, a malicious regex could significantly expand this time taken, leading to a denial of service.
I seem to recall Gregory Maxwell describing a setup that made this feasible, given an appropriate amount of dedicated hardware. It ran with the entire database in memory; it only permitted "real" regular expressions (compilable to finite-state machines, no backreferences etc.); and it limited the size of the finite-state machine generated. Running a regex took several minutes, but he'd run a lot of them in parallel, since it was mostly memory-bound, so he got fairly good throughput. Something like that.
But probably not practical without an undue amount of effort and hardware, yeah. :)
Yes. I didn't respond to the initial question because "full PCRE" is simply far too much to ask for.
Basic regexps of the sort that can be compiled into a deterministic finite-state machine (i.e. no backtracking) can be merged together into a single larger state machine.
So long as the state machine fits in cache, the entire DB can be scanned in not much more time than it takes to read it in from memory, even if there are hundreds of parallel regexps.
So you batch up user requests then run them in parallel groups. Good throughput, poor latency.
Insufficiently selective queries are problematic. I never came up with a really good solution for people feeding in patterns like '.' and stalling the whole process by wasting a lot of memory bandwidth updating the result set. (An obvious solution might just be to limit the number of results.)
The latency can be reduced by partitioning the database across multiple machines (more aggregate memory bandwidth). By doing this you could achieve arbitrarily low latency and enormous throughput. Dunno if it's actually worthwhile, however.
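The batching idea can be illustrated even without a real DFA engine. Python's re module backtracks rather than compiling to a DFA, but combining many queries into one alternation and walking the text a single time shows the shape of it (a sketch; the function name is mine, and the sub-patterns must not contain named groups of their own):

```python
import re

def batch_scan(text, patterns):
    """Scan `text` once for many patterns at a time.

    Each pattern becomes a named group in one combined alternation,
    so the text is traversed in a single pass; results are then
    grouped per original query.
    """
    combined = re.compile(
        "|".join(f"(?P<q{i}>{p})" for i, p in enumerate(patterns))
    )
    hits = {i: [] for i in range(len(patterns))}
    for m in combined.finditer(text):
        i = int(m.lastgroup[1:])  # which query this match belongs to
        hits[i].append(m.group(0))
    return hits
```

A production version would use a DFA-based engine (RE2-style, linear time, no backreferences) so the merged machine really does run in one pass regardless of how many queries are folded in.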
On Mon, Jul 6, 2009 at 9:05 PM, Amir E. Aharoni amir.aharoni@gmail.com wrote:
- The info won't be up-to-date. Would it be too much to ask to search
the database directly using regexes?
What's your use case? Obviously all the points below are valid and rule out direct regex searching on the entire Wikipedia database, for instance, but I wonder if you could have hybrid cases like "return pages that contain X and regex Y". Since X can be indexed, you're immediately working on a (much) smaller subset.
(Then again, I think AWB already supports this.)
I'm not sure what cases you need full regexes for, though. Obviously regexes like "foo(.*)bar" have wide applicability... but what about the more esoteric forms?
Steve
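The hybrid "indexed word X plus regex Y" idea can be sketched like this (an in-memory dict stands in for a real search index; all names are illustrative):

```python
import re

def hybrid_search(pages, word, pattern):
    """Two-stage search: cheap literal lookup first, regex second.

    `pages` maps title -> text. The substring test stands in for an
    indexed full-text lookup; only the surviving (much smaller)
    subset is regex-scanned.
    """
    regex = re.compile(pattern)
    candidates = (t for t, text in pages.items() if word in text)
    return [t for t in candidates if regex.search(pages[t])]
```

The expensive regex pass then touches only the pages the index already matched, which is what makes the hybrid form tractable.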
On Tue, Jul 7, 2009 at 04:49, Steve Bennett stevagewp@gmail.com wrote:
On Mon, Jul 6, 2009 at 9:05 PM, Amir E. Aharoni amir.aharoni@gmail.com wrote:
- The info won't be up-to-date. Would it be too much to ask to search
the database directly using regexes?
What's your use case? Obviously all the points below are valid and rule out direct regex searching on the entire Wikipedia database, for instance, but I wonder if you could have hybrid cases like "return pages that contain X and regex Y". Since X can be indexed, you're immediately working on a (much) smaller subset.
It is not really that important for me to search the live Wikipedia.
Currently I mostly want to satisfy my linguistic curiosity and gather statistics about the usage of different spellings in the Hebrew Wikipedia (modern [[Hebrew spelling]] is wildly inconsistent). The regular search engine, which is mostly tailored to English, is almost useless for this task. But searching a dump would be enough.
(AWB is ruled out, because I frequently need to run it on GNU/Linux.)
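For the spelling-statistics use case, a dump pass really only needs a per-variant tally. A minimal sketch (the English pair colour/color stands in as a placeholder for Hebrew variants; Python 3's re treats Hebrew letters as word characters, so the \b boundaries work for the real thing too):

```python
import re
from collections import Counter

def spelling_stats(text, variants):
    """Tally occurrences of competing spellings of the same word."""
    counts = Counter()
    for v in variants:
        # \b keeps a variant from matching inside a longer word.
        counts[v] = len(re.findall(rf"\b{re.escape(v)}\b", text))
    return counts
```

Running this over every page text yielded by a dump scanner and summing the Counters would give the corpus-wide distribution.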
wikitech-l@lists.wikimedia.org