Nick Jenkins wrote:
So I'd like to suggest that we all take a deep breath before starting a discussion under this thread. I think we should discuss the matter on a detached, technical level, and avoid the use of emotionally-charged words such as "fragile" and "bogus". Does that sound reasonable?
Yes, it does. I'll try my best to do this.
To check I understand the technical situation correctly, please let me summarise the available options for storing this information in the database, and their resulting pros and cons:
- Status Quo: Leave as-is for determining Special:UncategorizedPages and Special:Deadendpages : Advantages : We know it works. It's implemented / the patch exists. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantage: It's a slow database query, because it uses a LEFT JOIN.
The query pages have been causing some problems at the sysadmin level, they often have to be killed to preserve site performance.
- Introduce special magic values to indicate if a page has no page links, or categories. (e.g. pagelinks.pl_title = '' and pagelinks.pl_namespace = 0 means no links, or pagelinks.pl_title = '##FROMBLANK' and pagelinks.pl_namespace = -3 means no links, or categorylinks.cl_to = '' means no categories ) : Advantages : It's a quicker database query. The patch exists. No new database tables need to be added. Disadvantage: Compatibility and cryptic-ness. To break that out a little: Concerns that there may be a cleaner solution, and that magic values are cryptic. Concerns that other stuff may break, and that once a magic value is introduced, it has a way of slowly spreading through code, and that every time that value is used, you have to say "-3" is not really "-3", rather it actually means "no links".
I don't think stuff will break. There are in fact no special cases in the core code where you have to check for the special value on read -- see Werdna's patch. There is deleteLinksFromNonexistent() in refreshLinks.inc, but it's been commented out for years because it makes the slaves lag. Hence the persistent (0,'') values in the pagelinks table.
There is another potential advantage: pages like Special:Whatlinkshere could easily be adapted to display pages that link to nowhere, as a user interface option. The required special cases to do this would be much smaller than if there was a separate table. The data could be exposed via api.php in the same way. In a machine-readable API such as api.php, you could simply request links to (-3,'##NOLINKS') and it would give you pages with no links without any code changes.
- Add a purpose-built table to store that the page has no links, and another for no categories: Advantages : It's a quicker database query. It's conceptually clean (no magic values). Disadvantages: Adds 2 more tables, and the schema update means it may take time to apply to the cluster, and adding tables is generally painful and thus something people try to avoid doing. The patch does not exist. Is arguably more complicated that the other options - If links or categories are added, then 2 places have to be updated - the new table, and the current table. Is arguably excessive to introduce two new tables to store effectively 2 extra bits of information per page.
I don't believe it is a quicker database query, I think it would be about the same. "Conceptually clean" is subjective -- special values are used pervasively in many applications, including our own. Nulls are one common example. And like I say above, it allows the interfaces to be implicitly extended to provide this data.
The query count would be slightly higher in some cases with the separate table idea, but not as much as you might think, since the INSERT INTO pagelinks can be skipped when there are no links to insert, in the two table idea, replaced with an equivalent query to insert into the links-to-nowhere table.
Where they differ is the fact that when there *are* some links to insert into the pagelinks table, you have to run a DELETE query on the links-to-nowhere table, or a SELECT on that table to check if you have to DELETE. So that's an extra query in the most common case.
So if you're adding links-to-nowhere tracking to N tables, then you would have:
* N extra queries in the "no outbound links" case in either architecture
plus
* N extra queries in the "some outbound links" case in the extra table architecture, zero extra queries in this case in the magic value architecture.
- To the page table, add a boolean field for "page_has_category", and another one for "page_has_links", similar to the "page_is_redirect" field. Advantages : It's a quicker database query. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantages: These fields would presumably have to be indexed to be of use (?), which would make adding rows to this table slower. If links or categories are initially added or all removed, then 2 places have to be updated - this field, plus the current table. The patch does not exist. Requires a schema update to add the fields, which takes time to apply to the cluster, and is something people try to avoid doing. The page table is a extremely important table to get right.
I wouldn't recommend this one.
Is the above an accurate and fair summary of the situation? If not, please correct me, so we can at least agree on the facts of the available options, before debating opinions about which option is preferable.
Also, I can see why people have divergent opinions over this - simply put, there is no quick and easy solution which has only advantages, and no disadvantages. Which option you prefer comes down to what weighting you assign to the various advantages and disadvantages, and that weighting is different for different people.
Personally I'm inclined to put the small differences between options 2 and 3 down to style, and let the implementing programmer decide which one to use.
-- Tim Starling