Rob Church wrote:
On 21/08/06, Tim Starling t.starling@physics.unimelb.edu.au wrote:
Because it would require a DB query with an execution time proportional to the number of articles in the category? Having categories with lots of members is fine, as long as we don't try to traverse them all the time.
That would be it. I couldn't find an effective method which Domas would agree with.
The efficient way to do it would be to enumerate the category members, saving the results to a table for later lookup. The entries in this special table could have an expiry time. Is that how you were planning on doing it?
I don't bother planning things any more. Having four months' worth of notes turn useless is not a fun experience.
More details? Enumerate them when, and save which results?
Say if you have a large category. You could find a random member like this:
$res = $dbr->select('categorylinks','cl_from',array('cl_to'=>$cat)); $rand = mt_rand(0, $dbr->numRows($res) - 1); $dbr->dataSeek($rand); $randomRow = $dbr->fetchObject($res);
By selecting all members of the category into a buffer, we have implicitly assigned each row of the result set an number between 0 and N-1. This is what I mean by enumeration. Once you have this enumeration, you pick a random number between 0 and N-1 and then quickly retrieve it. But it's slow, because you have to load all the members every time you want to pick a random one.
Here's how you could save the enumeration:
$res = $dbr->select('categorylinks','cl_from',array('cl_to'=>$cat)); $i = 0; $enumRows = array(); while ( $row = $dbr->fetchObject( $res ) ) { $enumRows[] = array( 'ce_index' => $i++, 'ce_from' => $row->cl_from, 'ce_to' => $cat ) } $dbw->insert('categoryenum', $enumRows); $dbw->insert('categoryenum_info', array( 'cei_cat' => $cat, 'cei_count' => $i, 'cei_expiry' => wfTimestamp(TS_MW, time() + $expiry) );
Once you've done that, you can quickly pick random members from the set:
$count = $dbr->selectField( 'categoryenum_info', 'cei_count', array( 'cei_cat' => $cat ) ); $rand = mt_rand( 0, $count - 1 ); $randRow = $dbr->selectRow( 'categoryenum', '*', array( 'ce_index' => $rand, 'ce_to' => $cat ) );
This is fast as long as you have indexes on cei_cat and (ce_to,ce_index). I've left out a few details, such as:
* Expiry. * Refreshing existing enumerations. This could be done with a simple delete/insert, but better performance might be obtained by deleting individual items, and then filling the gaps by moving entries from the end of the enumeration to the middle. * Verification. You can quickly verify that the element that you pick from the cached enumeration is still in the category, and if it's not, you can try again. After a few iterations, you'd have to either give up or declare the enumeration expired and regenerate it.
Hopefully you get the idea though.
Come to think of it, that incremental refresh algorithm above would be fast enough to perform on page save, in LinksUpdate.php. Then you wouldn't need expiry, and it wouldn't be a cache anymore, just a permanent data structure. You could just add a cl_index column to the categorylinks table.
-- Tim Starling