On Thu, Sep 11, 2008 at 8:57 PM, Aryeh Gregor
<Simetrical+wikilist(a)gmail.com> wrote:
That's one of the more serious issues. I recall
David was having
issues in deciding on the schema, since he wanted efficient versioned
trees. I don't know what he ended up settling on, but I would imagine
it would have some issues when scaling up to the massive levels needed
by Wikipedia. Few schemas don't. There are lots of fiddly
MediaWiki-specific details that you need to understand properly, like
how external storage works.
Taking a quick look at the schema, there are some fairly clear issues
already. These are the indexes on the thread table:
PRIMARY KEY thread_id (thread_id),
UNIQUE INDEX thread_id (thread_id),
INDEX thread_ancestor (thread_ancestor),
INDEX thread_article_title (thread_article_namespace, thread_article_title),
INDEX thread_modified (thread_modified),
INDEX thread_created (thread_created),
INDEX thread_summary_page (thread_summary_page)
Well, first of all the duplicate key on thread_id doesn't make any
sense at all. Now, I'm not sure how the thread model works exactly,
but I find it surprising that thread_modified isn't on the end of any
indexes, for sorting. I tried to find a sample query:
$where = array(Threads::articleClause($this->article),
'thread.thread_parent is null',
'(thread.thread_summary_page is not null' .
' OR thread.thread_type = '.Threads::TYPE_MOVED.')',
'thread.thread_modified < ' . $startdate->text());
$options = array('ORDER BY thread.thread_modified DESC');
That's only partial, but it gives me a pretty good idea that you would
really want an index on (thread_article_namespace,
thread_article_title, thread_modified) here, at the very least. You
might also need thread_parent or thread_summary_page in there,
depending on the expected distribution of values there.
Poking around a bit more, I find the method
Threads::monthsWhereArticleHasThreads(), which a comment accurately
describes as "Horrible, horrible!" It retrieves *all* threads for the
entire page into an array, iterates through them, returns a PHP array
of arbitrarily many Thread objects -- the Thread class having 25
member variables which are all loaded, AFAICT including the text of
the thread -- and then iterates through that array *again*, collecting
a list of distinct months.
I spot one place where this method is used, and there it apparently
takes the list of months (assuming we haven't OOMed by now due to this
being, e.g., ANI) to . . . well, set some member variables, I can't
figure out exactly what without a fair amount of further digging. But
it suffices to say that I think we can consider LQT's performance to
be a work in progress. :)