This is about the question of how to best store entity redirects in the database. Below I try to describe the problem and the possible solutions. Any input welcome.
A quick primer:
Wikibase redirects are really Entity-ID aliases. They correspond to, but are not the same as, MediaWiki's page based redirects.
If Q3 is a redirect to (an alias for) Q5, the page Item:Q3 is also a redirect to Item:Q5. The JSON blob on Item:Q3 would store a redirect entry *instead* of an entity. Entities never *are* redirects.
Wikibase currently stores a mapping of entity ids to page ids in the wb_entity_per_page table (epp table for short).
MediaWiki core stores redirects as a kind of "link table", with rd_from being a page_id, and rd_to+rd_namespace being name+namespace of the redirect target.
Requirements:
* When looking up an EntityId, we need to be able to load the corresponding JSON blob, and for that we need to find the corresponding wiki page (either by id, or by name+namespace). We need to be able to do this cross-wiki, so we may not have the repo's configuration (wrt namespaces, etc) available when constructing the query.
* We need an efficient way to list all entity IDs on a wiki (without redirects). In particular, the mechanism for listing entities must support efficient paging.
* We need an efficient way to resolve redirects in bulk, or at least, to discern redirects from unknown/deleted entity ids.
Options:
1) No redirects in the epp table (current). This means we need to use the name+namespace when loading the entity-or-redirect from a page, since we don't know the page ID if it's a redirects. We also can't use core's redirect table, because for that, we also need to know the page id first. In order to use name+namespace for looking up page IDs for entities, client wikis would need to know the namespace IDs used on the repo, in order to generate queries against the repo's database.
2) Put redirects into the epp table as well, without any special marking. This makes lookups easy, but gives us no efficient way to list all entities without redirects. We'd need to check and skip redirects while iterating. This would add complexity to several maintenance and upgrade scripts.
3) Put redirects into the epp table, with a marker (or target id) in a new column. This would allow for both, simple lookup and efficient listing, but it means adding a column (and an index) to an already large table in production. It also means having the overhead of a column that's mostly null.
4) Put redirects into epp *and* a separate table. Provides simple lookup, but means a potentially slow join when listing entities. This join would happen multiple times each time we need to list all entities, because of paged access - compare how JsonDumpGenerator works.
5) Put redirects into a special table but not into epp. This means fast/simple listing of entities, but requires a not-so-nice "try" logic when looking up entities: if no entry is found in the epp table, we then need to go on and try the entity-redirect table, to see whether the id is redirected or unknown/deleted.
Assessment:
1) is nasty in terms of cross-wiki configuration. It's the simplest solution on the code and database levels, but seems brittle.
2) adds complexity to everything that lists entities. Big performance impact in cases where entity blobs would otherwise not have been loaded, but are loaded now to check whether they contain redirects.
3) is somewhat wasteful on the database level, and needs a schema change deployment on a large table. Don't know how bad that would be, though.
4) may cause performance issues because it adds complexity to big queries on large tables. Needs trivial schema change deployment (new table).
5) adds complexity to the code that reads entity blobs from the database, impacts performance for the "redirect" and "missing entity" cases by adding a database query. Could be acceptable if these cases are rare. Needs trivial schema change deployment (new table).
-- daniel
Hey :)
Already told Daniel on IRC but documenting here for good measure. imho options 2 and 4 are out. I have no objections to the others but also can't make a decision between them.
Cheers Lydia
I think option #3 is best, since the table isn't *that* big (only one row per page) and seems least bad option in terms of code design.
Cheers, Katie
On Mon, Jul 14, 2014 at 12:30 PM, Daniel Kinzler < daniel.kinzler@wikimedia.de> wrote:
This is about the question of how to best store entity redirects in the database. Below I try to describe the problem and the possible solutions. Any input welcome.
A quick primer:
Wikibase redirects are really Entity-ID aliases. They correspond to, but are not the same as, MediaWiki's page based redirects.
If Q3 is a redirect to (an alias for) Q5, the page Item:Q3 is also a redirect to Item:Q5. The JSON blob on Item:Q3 would store a redirect entry *instead* of an entity. Entities never *are* redirects.
Wikibase currently stores a mapping of entity ids to page ids in the wb_entity_per_page table (epp table for short).
MediaWiki core stores redirects as a kind of "link table", with rd_from being a page_id, and rd_to+rd_namespace being name+namespace of the redirect target.
Requirements:
- When looking up an EntityId, we need to be able to load the
corresponding JSON blob, and for that we need to find the corresponding wiki page (either by id, or by name+namespace). We need to be able to do this cross-wiki, so we may not have the repo's configuration (wrt namespaces, etc) available when constructing the query.
- We need an efficient way to list all entity IDs on a wiki (without
redirects). In particular, the mechanism for listing entities must support efficient paging.
- We need an efficient way to resolve redirects in bulk, or at least, to
discern redirects from unknown/deleted entity ids.
Options:
- No redirects in the epp table (current). This means we need to use the
name+namespace when loading the entity-or-redirect from a page, since we don't know the page ID if it's a redirects. We also can't use core's redirect table, because for that, we also need to know the page id first. In order to use name+namespace for looking up page IDs for entities, client wikis would need to know the namespace IDs used on the repo, in order to generate queries against the repo's database.
- Put redirects into the epp table as well, without any special marking.
This makes lookups easy, but gives us no efficient way to list all entities without redirects. We'd need to check and skip redirects while iterating. This would add complexity to several maintenance and upgrade scripts.
- Put redirects into the epp table, with a marker (or target id) in a new
column. This would allow for both, simple lookup and efficient listing, but it means adding a column (and an index) to an already large table in production. It also means having the overhead of a column that's mostly null.
- Put redirects into epp *and* a separate table. Provides simple lookup,
but means a potentially slow join when listing entities. This join would happen multiple times each time we need to list all entities, because of paged access - compare how JsonDumpGenerator works.
- Put redirects into a special table but not into epp. This means
fast/simple listing of entities, but requires a not-so-nice "try" logic when looking up entities: if no entry is found in the epp table, we then need to go on and try the entity-redirect table, to see whether the id is redirected or unknown/deleted.
Assessment:
- is nasty in terms of cross-wiki configuration. It's the simplest
solution on the code and database levels, but seems brittle.
- adds complexity to everything that lists entities. Big performance
impact in cases where entity blobs would otherwise not have been loaded, but are loaded now to check whether they contain redirects.
- is somewhat wasteful on the database level, and needs a schema change
deployment on a large table. Don't know how bad that would be, though.
- may cause performance issues because it adds complexity to big queries
on large tables. Needs trivial schema change deployment (new table).
- adds complexity to the code that reads entity blobs from the database,
impacts performance for the "redirect" and "missing entity" cases by adding a database query. Could be acceptable if these cases are rare. Needs trivial schema change deployment (new table).
-- daniel
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Hi,
my conclusion after a good bit of thinking and discussing:
Option 1 (no redirects in epp) sounded good at first but becomes more and more painful the more I think about it. Let's avoid it. We can and should do better.
Option 5 (redirects in a special table) is my second favorite solution. It mimics core's redirect table. The database can be a bit more efficient. But the code will be more complicated, obviously. It's probably not worth the trouble. Disk space is cheap. Moore's law still applies.
So it is option 3 (redirects in epp). Please make sure to store the target EntityId in the column, not just a boolean flag.
Best
wikidata-tech@lists.wikimedia.org