I'd like to voice my support for this feature, which makes us fairly evenly split by the numbers. But the problem is, the argument is already rather heated, and I'd like to see this decided on technical grounds, rather than on the basis of "who's got the biggest ego".
So I'd like to suggest that we all take a deep breath before starting a discussion under this thread. I think we should discuss the matter on a detached, technical level, and avoid the use of emotionally-charged words such as "fragile" and "bogus". Does that sound reasonable?
-- Tim Starling
First some food for thought.
The key thing about the feature is that we are using a special value in the target of a link to indicate that the source of the link has no other links.
It is not important what this special value is exactly, but it seemed appropriate to use (0,''), because it's simple, it doesn't require a schema change, and it's an invalid title and so won't appear otherwise.
The problem with it is that, as Brion hints, there are actually quite a lot of instances of this special value already present in the database. Here are the wikis with 10 or more:
dewiki 13456 enwiki 191 enwikinews 53 enwikiversity 12 huwiki 43 incubatorwiki 10 itwiki 15 kawiki 10 kuwiki 39 slwiki 15 zhwiki 40 itwikiversity 11
Now it's all very well to say "fix whatever bug is causing this", but that might not be practical, and is in fact not necessary for the feature in question. We can just use some other special invalid title, one that is more specific to this feature. For instance, we could use (-3,'##FROMBLANK'). It has a negative namespace which could be reserved in Defines.php as say NS_LINKHACK. And it starts with two "#" characters, so it doesn't conflict with any potential links from parser functions or fragments.
-- Tim Starling
On 9/12/07, Tim Starling tstarling@wikimedia.org wrote:
Now it's all very well to say "fix whatever bug is causing this", but that might not be practical, and is in fact not necessary for the feature in question. We can just use some other special invalid title, one that is more specific to this feature. For instance, we could use (-3,'##FROMBLANK'). It has a negative namespace which could be reserved in Defines.php as say NS_LINKHACK. And it starts with two "#" characters, so it doesn't conflict with any potential links from parser functions or fragments.
Well, that's at least not going to confuse people too much, since it's fairly self-explanatory. This is the sort of thing you'd like to use a named constant for, but I guess we don't have those in MySQL, at least not without adding another column.
If we don't want to use the {category,page}links tables, by the way, why create another table that will amount to two booleans per page/category? Create a category table, which I think has been discussed anyway, and then add a boolean column or two to each of the category and page tables. Is there any point in adding two more 1:1 tables to store a couple of bits each?
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Simetrical wrote:
If we don't want to use the {category,page}links tables, by the way, why create another table that will amount to two booleans per page/category? Create a category table, which I think has been discussed anyway, and then add a boolean column or two to each of the category and page tables. Is there any point in adding two more 1:1 tables to store a couple of bits each?
The main advantage would be that we wouldn't have to do a big schema change operation to alter the page table. :)
I'm also not sure if there are issues of lock contention when updating link tables, but I'll leave that to the db-heads. :)
- -- brion vibber (brion @ wikimedia.org)
On 12/09/2007, Tim Starling tstarling@wikimedia.org wrote:
I'd like to voice my support for this feature, which makes us fairly evenly split by the numbers. But the problem is, the argument is already rather heated, and I'd like to see this decided on technical grounds, rather than on the basis of "who's got the biggest ego".
I'd just like to briefly apologise for derailing the discussion with a rant that should have been in a new thread, if anywhere.
Rob Church
So I'd like to suggest that we all take a deep breath before starting a discussion under this thread. I think we should discuss the matter on a detached, technical level, and avoid the use of emotionally-charged words such as "fragile" and "bogus". Does that sound reasonable?
Yes, it does. I'll try my best to do this.
To check I understand the technical situation correctly, please let me summarise the available options for storing this information in the database, and their resulting pros and cons:
1) Status Quo: Leave as-is for determining Special:UncategorizedPages and Special:Deadendpages : Advantages : We know it works. It's implemented / the patch exists. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantage: It's a slow database query, because it uses a LEFT JOIN.
2) Introduce special magic values to indicate if a page has no page links, or categories. (e.g. pagelinks.pl_title = '' and pagelinks.pl_namespace = 0 means no links, or pagelinks.pl_title = '##FROMBLANK' and pagelinks.pl_namespace = -3 means no links, or categorylinks.cl_to = '' means no categories ) : Advantages : It's a quicker database query. The patch exists. No new database tables need to be added. Disadvantage: Compatibility and cryptic-ness. To break that out a little: Concerns that there may be a cleaner solution, and that magic values are cryptic. Concerns that other stuff may break, and that once a magic value is introduced, it has a way of slowly spreading through code, and that every time that value is used, you have to say "-3" is not really "-3", rather it actually means "no links".
3) Add a purpose-built table to store that the page has no links, and another for no categories: Advantages : It's a quicker database query. It's conceptually clean (no magic values). Disadvantages: Adds 2 more tables, and the schema update means it may take time to apply to the cluster, and adding tables is generally painful and thus something people try to avoid doing. The patch does not exist. Is arguably more complicated that the other options - If links or categories are added, then 2 places have to be updated - the new table, and the current table. Is arguably excessive to introduce two new tables to store effectively 2 extra bits of information per page.
4) To the page table, add a boolean field for "page_has_category", and another one for "page_has_links", similar to the "page_is_redirect" field. Advantages : It's a quicker database query. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantages: These fields would presumably have to be indexed to be of use (?), which would make adding rows to this table slower. If links or categories are initially added or all removed, then 2 places have to be updated - this field, plus the current table. The patch does not exist. Requires a schema update to add the fields, which takes time to apply to the cluster, and is something people try to avoid doing. The page table is a extremely important table to get right.
Is the above an accurate and fair summary of the situation? If not, please correct me, so we can at least agree on the facts of the available options, before debating opinions about which option is preferable.
Also, I can see why people have divergent opinions over this - simply put, there is no quick and easy solution which has only advantages, and no disadvantages. Which option you prefer comes down to what weighting you assign to the various advantages and disadvantages, and that weighting is different for different people.
-- All the best, Nick.
On 9/13/07, Nick Jenkins nickpj@gmail.com wrote:
and adding tables is generally painful and thus something people try
to avoid doing.
My understanding is that adding a new field to the page table would take far longer than adding a new table - since adding a new field involves adding that field to all the existing records, whereas adding a new table just involves adding a new table definition.
Checking it out on IRC, domas says this is correct.
On 13/09/2007, Andrew Garrett andrew@epstone.net wrote:
On 9/13/07, Nick Jenkins nickpj@gmail.com wrote:
and adding tables is generally painful and thus something people try
to avoid doing.
My understanding is that adding a new field to the page table would take far longer than adding a new table - since adding a new field involves adding that field to all the existing records, whereas adding a new table just involves adding a new table definition.
Checking it out on IRC, domas says this is correct.
I agree. Adding a new table is fairly easy and quick. Populating it, on the other hand, can be rather slow and painful, but in this case I don't think it can be any slower than adding the rows to the existing table.
Nick Jenkins wrote:
So I'd like to suggest that we all take a deep breath before starting a discussion under this thread. I think we should discuss the matter on a detached, technical level, and avoid the use of emotionally-charged words such as "fragile" and "bogus". Does that sound reasonable?
Yes, it does. I'll try my best to do this.
To check I understand the technical situation correctly, please let me summarise the available options for storing this information in the database, and their resulting pros and cons:
- Status Quo: Leave as-is for determining Special:UncategorizedPages and Special:Deadendpages : Advantages : We know it works. It's implemented / the patch exists. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantage: It's a slow database query, because it uses a LEFT JOIN.
The query pages have been causing some problems at the sysadmin level, they often have to be killed to preserve site performance.
- Introduce special magic values to indicate if a page has no page links, or categories. (e.g. pagelinks.pl_title = '' and pagelinks.pl_namespace = 0 means no links, or pagelinks.pl_title = '##FROMBLANK' and pagelinks.pl_namespace = -3 means no links, or categorylinks.cl_to = '' means no categories ) : Advantages : It's a quicker database query. The patch exists. No new database tables need to be added. Disadvantage: Compatibility and cryptic-ness. To break that out a little: Concerns that there may be a cleaner solution, and that magic values are cryptic. Concerns that other stuff may break, and that once a magic value is introduced, it has a way of slowly spreading through code, and that every time that value is used, you have to say "-3" is not really "-3", rather it actually means "no links".
I don't think stuff will break. There are in fact no special cases in the core code where you have to check for the special value on read -- see Werdna's patch. There is deleteLinksFromNonexistent() in refreshLinks.inc, but it's been commented out for years because it makes the slaves lag. Hence the persistent (0,'') values in the pagelinks table.
There is another potential advantage: pages like Special:Whatlinkshere could easily be adapted to display pages that link to nowhere, as a user interface option. The required special cases to do this would be much smaller than if there was a separate table. The data could be exposed via api.php in the same way. In a machine-readable API such as api.php, you could simply request links to (-3,'##NOLINKS') and it would give you pages with no links without any code changes.
- Add a purpose-built table to store that the page has no links, and another for no categories: Advantages : It's a quicker database query. It's conceptually clean (no magic values). Disadvantages: Adds 2 more tables, and the schema update means it may take time to apply to the cluster, and adding tables is generally painful and thus something people try to avoid doing. The patch does not exist. Is arguably more complicated that the other options - If links or categories are added, then 2 places have to be updated - the new table, and the current table. Is arguably excessive to introduce two new tables to store effectively 2 extra bits of information per page.
I don't believe it is a quicker database query, I think it would be about the same. "Conceptually clean" is subjective -- special values are used pervasively in many applications, including our own. Nulls are one common example. And like I say above, it allows the interfaces to be implicitly extended to provide this data.
The query count would be slightly higher in some cases with the separate table idea, but not as much as you might think, since the INSERT INTO pagelinks can be skipped when there are no links to insert, in the two table idea, replaced with an equivalent query to insert into the links-to-nowhere table.
Where they differ is the fact that when there *are* some links to insert into the pagelinks table, you have to run a DELETE query on the links-to-nowhere table, or a SELECT on that table to check if you have to DELETE. So that's an extra query in the most common case.
So if you're adding links-to-nowhere tracking to N tables, then you would have:
* N extra queries in the "no outbound links" case in either architecture
plus
* N extra queries in the "some outbound links" case in the extra table architecture, zero extra queries in this case in the magic value architecture.
- To the page table, add a boolean field for "page_has_category", and another one for "page_has_links", similar to the "page_is_redirect" field. Advantages : It's a quicker database query. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantages: These fields would presumably have to be indexed to be of use (?), which would make adding rows to this table slower. If links or categories are initially added or all removed, then 2 places have to be updated - this field, plus the current table. The patch does not exist. Requires a schema update to add the fields, which takes time to apply to the cluster, and is something people try to avoid doing. The page table is a extremely important table to get right.
I wouldn't recommend this one.
Is the above an accurate and fair summary of the situation? If not, please correct me, so we can at least agree on the facts of the available options, before debating opinions about which option is preferable.
Also, I can see why people have divergent opinions over this - simply put, there is no quick and easy solution which has only advantages, and no disadvantages. Which option you prefer comes down to what weighting you assign to the various advantages and disadvantages, and that weighting is different for different people.
Personally I'm inclined to put the small differences between options 2 and 3 down to style, and let the implementing programmer decide which one to use.
-- Tim Starling
Why would you need 2 tables? Why not have one with a page_id, link_type unique key. That way you could have page_id = 3535, link_type = 'categories' and such in the "nolinks" table.
Nick Jenkins wrote:
So I'd like to suggest that we all take a deep breath before starting a discussion under this thread. I think we should discuss the matter on a detached, technical level, and avoid the use of emotionally-charged words such as "fragile" and "bogus". Does that sound reasonable?
Yes, it does. I'll try my best to do this.
To check I understand the technical situation correctly, please let me summarise the available options for storing this information in the database, and their resulting pros and cons:
- Status Quo: Leave as-is for determining Special:UncategorizedPages and
Special:Deadendpages : Advantages : We know it works. It's implemented / the patch exists. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantage: It's a slow database query, because it uses a LEFT JOIN.
- Introduce special magic values to indicate if a page has no page links,
or categories. (e.g. pagelinks.pl_title = '' and pagelinks.pl_namespace = 0 means no links, or pagelinks.pl_title = '##FROMBLANK' and pagelinks.pl_namespace = -3 means no links, or categorylinks.cl_to = '' means no categories ) : Advantages : It's a quicker database query. The patch exists. No new database tables need to be added. Disadvantage: Compatibility and cryptic-ness. To break that out a little: Concerns that there may be a cleaner solution, and that magic values are cryptic. Concerns that other stuff may break, and that once a magic value is introduced, it has a way of slowly spreading through code, and that every time that value is used, you have to say "-3" is not really "-3", rather it actually means "no links".
- Add a purpose-built table to store that the page has no links, and
another for no categories: Advantages : It's a quicker database query. It's conceptually clean (no magic values). Disadvantages: Adds 2 more tables, and the schema update means it may take time to apply to the cluster, and adding tables is generally painful and thus something people try to avoid doing. The patch does not exist. Is arguably more complicated that the other options - If links or categories are added, then 2 places have to be updated - the new table, and the current table. Is arguably excessive to introduce two new tables to store effectively 2 extra bits of information per page.
- To the page table, add a boolean field for "page_has_category", and
another one for "page_has_links", similar to the "page_is_redirect" field. Advantages : It's a quicker database query. It's conceptually clean (no magic values). No new database tables need to be added. Disadvantages: These fields would presumably have to be indexed to be of use (?), which would make adding rows to this table slower. If links or categories are initially added or all removed, then 2 places have to be updated - this field, plus the current table. The patch does not exist. Requires a schema update to add the fields, which takes time to apply to the cluster, and is something people try to avoid doing. The page table is a extremely important table to get right.
Is the above an accurate and fair summary of the situation? If not, please correct me, so we can at least agree on the facts of the available options, before debating opinions about which option is preferable.
Also, I can see why people have divergent opinions over this - simply put, there is no quick and easy solution which has only advantages, and no disadvantages. Which option you prefer comes down to what weighting you assign to the various advantages and disadvantages, and that weighting is different for different people.
-- All the best, Nick.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
wikitech-l@lists.wikimedia.org