On Thu, Sep 11, 2008 at 8:57 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
That's one of the more serious issues. I recall David was having issues in deciding on the schema, since he wanted efficient versioned trees. I don't know what he ended up settling on, but I would imagine it would have some issues when scaling up to the massive levels needed by Wikipedia. Few schemas don't. There are lots of fiddly MediaWiki-specific details that you need to understand properly, like how external storage works.
Taking a quick look at the schema, there are some fairly clear issues already. These are the indexes on the thread table:
PRIMARY KEY thread_id (thread_id), UNIQUE INDEX thread_id (thread_id), INDEX thread_ancestor (thread_ancestor), INDEX thread_article_title (thread_article_namespace, thread_article_title), INDEX thread_modified (thread_modified), INDEX thread_created (thread_created), INDEX thread_summary_page (thread_summary_page)
Well, first of all the duplicate key on thread_id doesn't make any sense at all. Now, I'm not sure how the thread model works exactly, but I find it surprising that thread_modified isn't on the end of any indexes, for sorting. I tried to find a sample query:
$where = array(Threads::articleClause($this->article), 'thread.thread_parent is null', '(thread.thread_summary_page is not null' . ' OR thread.thread_type = '.Threads::TYPE_MOVED.')', 'thread.thread_modified < ' . $startdate->text()); $options = array('ORDER BY thread.thread_modified DESC');
That's only partial, but it gives me a pretty good idea that you would really want an index on (thread_article_namespace, thread_article_title, thread_modified) here, at the very least. You might also need thread_parent or thread_summary_page in there, depending on the expected distribution of values there.
Poking around a bit more, I find the method Threads::monthsWhereArticleHasThreads(), which a comment accurately describes as "Horrible, horrible!" It retrieves *all* threads for the entire page into an array, iterates through them, returns a PHP array of arbitrarily many Thread objects -- the Thread class having 25 member variables which are all loaded, AFAICT including the text of the thread -- and then iterates through that array *again*, collecting a list of distinct months.
I spot one place where this method is used, and there it apparently takes the list of months (assuming we haven't OOMed by now due to this being, e.g., ANI) to . . . well, set some member variables, I can't figure out exactly what without a fair amount of further digging. But it suffices to say that I think we can consider LQT's performance to be a work in progress. :)