On Tue, Jul 7, 2009 at 04:49, Steve Bennett <stevagewp@gmail.com> wrote:
On Mon, Jul 6, 2009 at 9:05 PM, Amir E. Aharoni <amir.aharoni@gmail.com> wrote:
- The info won't be up-to-date. Would it be too much to ask to search
the database directly using regexes?
What's your use case? Obviously all the points below are valid and rule out running a regex search directly over the entire Wikipedia database, for instance, but I wonder if you could have hybrid cases like "return pages that contain X and match regex Y". Since X can be indexed, you're immediately working on a (much) smaller subset.
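To illustrate the hybrid idea, here is a minimal sketch in Python. It is not how the Wikipedia search backend works; the in-memory word index, the function names `build_index` and `hybrid_search`, and the sample pages are all assumptions made up for the example. The indexed term X narrows the candidate set, and the regex Y only runs over those candidates.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(title)
    return index

def hybrid_search(pages, index, term, pattern):
    """Return titles that contain `term` (index lookup) and match `pattern` (regex)."""
    regex = re.compile(pattern)
    # Cheap step: the index gives the candidate pages without scanning any text.
    candidates = index.get(term.lower(), set())
    # Expensive step: the regex runs only on the (much) smaller candidate set.
    return sorted(t for t in candidates if regex.search(pages[t]))

# Hypothetical sample data, purely for illustration.
pages = {
    "A": "The colour of the sky",
    "B": "The color of the sea",
    "C": "Nothing relevant here",
}
index = build_index(pages)
```

With this data, `hybrid_search(pages, index, "sky", r"colou?r")` only ever applies the regex to page "A", because the index lookup for "sky" has already excluded the other two pages.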
It is not really that important for me to search the live Wikipedia.
Currently i mostly want to satisfy my linguistic curiosity and find out statistics about usage of different spellings in the Hebrew Wikipedia (modern [[Hebrew spelling]] is wildly inconsistent). The regular search engine, which is mostly tailored for English, is almost useless for this task. But searching a dump would be enough.
(AWB is ruled out, because i frequently need to run it on GNU/Linux.)
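For searching a dump, something along these lines could work as a starting point. This is only a sketch under assumptions: it uses Python's standard `xml.etree.ElementTree.iterparse` to stream the XML, it assumes the page text sits in `<text>` elements inside `<page>` elements, and the function name `count_spellings`, the variant labels, and the sample data are made up for the example. The English "colour"/"color" patterns are placeholders for whichever Hebrew spelling variants are of interest.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

def count_spellings(xml_file, variants):
    """Count regex matches per named variant across all <text> elements."""
    compiled = {name: re.compile(rx) for name, rx in variants.items()}
    counts = Counter()
    # iterparse streams the file, so the whole dump is never loaded at once.
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        # Strip the XML namespace, which varies between dump schema versions.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "text" and elem.text:
            for name, rx in compiled.items():
                counts[name] += sum(1 for _ in rx.finditer(elem.text))
        elif tag == "page":
            # Drop the finished page's contents to limit memory use.
            elem.clear()
    return counts

# Tiny hypothetical stand-in for a dump file, purely for illustration.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>T1</title><revision>"
    b"<text>colour color colour</text></revision></page>"
    b"<page><title>T2</title><revision>"
    b"<text>no match here</text></revision></page>"
    b"</mediawiki>"
)
counts = count_spellings(
    io.BytesIO(SAMPLE),
    {"ou": r"\bcolour\b", "o": r"\bcolor\b"},
)
# For a real compressed dump, the file object could instead come from
# bz2.open(path, "rb"), where path is the location of the downloaded dump.
```

Since this needs nothing beyond the Python standard library, it runs on GNU/Linux as well.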