How to record redirects in the database - Wikidata-tech

14 Jul 2014

      This is about the question of how to best store entity redirects in the
database. Below I try to describe the problem and the possible solutions. Any
input welcome.
A quick primer:
Wikibase redirects are really Entity-ID aliases. They correspond to, but are not
the same as, MediaWiki's page based redirects.
If Q3 is a redirect to (an alias for) Q5, the page Item:Q3 is also a redirect to
Item:Q5. The JSON blob on Item:Q3 would store a redirect entry *instead* of an
entity. Entities never *are* redirects.
Wikibase currently stores a mapping of entity ids to page ids in the
wb_entity_per_page table (epp table for short).
MediaWiki core stores redirects as a kind of "link table", with rd_from being a
page_id, and rd_to+rd_namespace being name+namespace of the redirect target.
Requirements:
* When looking up an EntityId, we need to be able to load the corresponding JSON
blob, and for that we need to find the corresponding wiki page (either by id, or
by name+namespace). We need to be able to do this cross-wiki, so we may not have
the repo's configuration (wrt namespaces, etc) available when constructing the
query.
* We need an efficient way to list all entity IDs on a wiki (without redirects).
In particular, the mechanism for listing entities must support efficient paging.
* We need an efficient way to resolve redirects in bulk, or at least, to discern
redirects from unknown/deleted entity ids.
Options:
1) No redirects in the epp table (current). This means we need to use the
name+namespace when loading the entity-or-redirect from a page, since we don't
know the page ID if it's a redirects. We also can't use core's redirect table,
because for that, we also need to know the page id first. In order to use
name+namespace for looking up page IDs for entities, client wikis would need to
know the namespace IDs used on the repo, in order to generate queries against
the repo's database.
2) Put redirects into the epp table as well, without any special marking. This
makes lookups easy, but gives us no efficient way to list all entities without
redirects. We'd need to check and skip redirects while iterating. This would add
complexity to several maintenance and upgrade scripts.
3) Put redirects into the epp table, with a marker (or target id) in a new
column. This would allow for both, simple lookup and efficient listing, but it
means adding a column (and an index) to an already large table in production. It
also means having the overhead of a column that's mostly null.
4) Put redirects into epp *and* a separate table. Provides simple lookup, but
means a potentially slow join when listing entities. This join would happen
multiple times each time we need to list all entities, because of paged access -
compare how JsonDumpGenerator works.
5) Put redirects into a special table but not into epp. This means fast/simple
listing of entities, but requires a not-so-nice "try" logic when looking up
entities: if no entry is found in the epp table, we then need to go on and try
the entity-redirect table, to see whether the id is redirected or unknown/deleted.
Assessment:
1) is nasty in terms of cross-wiki configuration. It's the simplest solution on
the code and database levels, but seems brittle.
2) adds complexity to everything that lists entities. Big performance impact in
cases where entity blobs would otherwise not have been loaded, but are loaded
now to check whether they contain redirects.
3) is somewhat wasteful on the database level, and needs a schema change
deployment on a large table. Don't know how bad that would be, though.
4) may cause performance issues because it adds complexity to big queries on
large tables. Needs trivial schema change deployment (new table).
5) adds complexity to the code that reads entity blobs from the database,
impacts performance for the "redirect" and "missing entity" cases by adding a
database query. Could be acceptable if these cases are rare. Needs trivial
schema change deployment (new table).
-- daniel
-- 
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.