The next problem with EntityId

Jeroen De Dauw

15 Sep 2013

4:15 p.m.

Hey,

With the recent EntityId refactoring, the assumption that these IDs consist of a prefix (e.g. Q) and a numeric part (e.g. 42) can be removed from the EntityId class. However, this assumption is still present in several other locations in the codebase.

If we want to support entity ids that do not follow the prefix+numeric id schema (for instance using a filename or word as id), then we need to get rid of the occurrences of this assumption. This is not trivial to do, however. The biggest hurdle is that we have a lot of entity ids stored in the db using 2 fields: entity type (string) and numeric id (int). The former is mapped from the id prefix, and the latter is the numeric part. As long as this is there, we need to be able to get a numeric part from the id object, and reconstruct id objects given a type and numeric part. Removing the format assumption would mean keeping the same entity type field, but rather than an int field holding just part of the id, there would be a string field that holds the whole serialization. While it is technically not that hard to change this in the software, the size of the tables of Wikidata.org makes doing a rebuild somewhat hard.
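To make the two storage layouts concrete, here is a minimal sketch in Python (not the actual Wikibase code, which is PHP). The row types, the prefix mapping and all function names are assumptions made up for illustration. It shows that the current layout can only round-trip prefix+number ids, while a full-serialization column needs no format assumption.

```python
import re
from typing import NamedTuple

# Hypothetical prefix <-> entity type mapping, for illustration only.
PREFIX_TO_TYPE = {"Q": "item", "P": "property"}
TYPE_TO_PREFIX = {v: k for k, v in PREFIX_TO_TYPE.items()}


class NumericRow(NamedTuple):
    """Current layout: entity type (string) plus numeric id (int)."""
    entity_type: str
    numeric_id: int


class SerializedRow(NamedTuple):
    """Alternative layout: entity type plus the whole id serialization."""
    entity_type: str
    entity_id: str


def to_numeric_row(serialization: str) -> NumericRow:
    """Split an id such as 'Q42' into type and number; other formats fail."""
    match = re.fullmatch(r"([A-Z])([1-9][0-9]*)", serialization)
    if match is None or match.group(1) not in PREFIX_TO_TYPE:
        raise ValueError(f"not a prefix+number id: {serialization!r}")
    return NumericRow(PREFIX_TO_TYPE[match.group(1)], int(match.group(2)))


def from_numeric_row(row: NumericRow) -> str:
    """Reconstruct the id serialization from type and numeric part."""
    return f"{TYPE_TO_PREFIX[row.entity_type]}{row.numeric_id}"


def to_serialized_row(entity_type: str, serialization: str) -> SerializedRow:
    """Store the id as-is; no assumption about its format is needed."""
    return SerializedRow(entity_type, serialization)
```

With this sketch, to_numeric_row("Example.jpg") raises, whereas to_serialized_row("media", "Example.jpg") works; "media" is again just a made-up type name.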

This means we essentially need to decide whether we want to be able to, at any point, have entity ids that do not follow the prefix+number format. If this is not important, we can leave everything as it is, and keep the assumption around without feeling dirty about it. In that case we also accept the fact that if later on we want an id that does not follow this format, we'll be out of luck. If on the other hand we want to support ids with other formats, we need to start working on getting rid of the remaining assumptions on the old format, and accept that we'll need to write some rebuilding code for a (few) Wikidata.org table(s).

It'd be good to have this decision sooner rather than later, as code touching places where such assumptions are located currently has to take both possible decisions into account, while the two typically suggest quite different approaches.

Cheers

-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --

Daniel Kinzler

16 Sep 2013

5:40 p.m.

Perhaps it would be a viable compromise to drop the "prefix+number" assumption for entity IDs in general, but keep it for the kinds of entities we have right now.

This would mean that there would be a base class and/or interface called something like NumericEntityId, which at least ItemID and PropertyId would derive from or implement, respectively. That interface would then be required by storage services that rely on the table structure described by Jeroen.

This would allow us to keep the current DB setup for the kinds of entities we currently have, and perhaps for all "top level" entities in the future, while also allowing us to have other kinds of entities using different ID schemes (e.g. the IDs of "sub-entities" like Sense or Form for Wiktionary could contain the ID of their "parent" Entity, similar to the way ClaimsIDs contain the EntityId of "their" entity).
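A rough sketch of what such a hierarchy could look like, written in Python for brevity rather than in the PHP of the actual codebase. Only the name NumericEntityId comes from this mail. The method names, the FormId class, its "Q42-F1" format and the to_term_row helper are assumptions made up for illustration.

```python
from abc import ABC, abstractmethod
from typing import Tuple


class EntityId(ABC):
    """General id: only promises an entity type and a serialization."""

    @abstractmethod
    def get_entity_type(self) -> str: ...

    @abstractmethod
    def get_serialization(self) -> str: ...


class NumericEntityId(EntityId):
    """Id that follows the prefix+number scheme (hypothetical base class)."""

    prefix = ""
    entity_type = ""

    def __init__(self, numeric_id: int) -> None:
        self._numeric_id = numeric_id

    def get_numeric_id(self) -> int:
        return self._numeric_id

    def get_entity_type(self) -> str:
        return self.entity_type

    def get_serialization(self) -> str:
        return f"{self.prefix}{self._numeric_id}"


class ItemId(NumericEntityId):
    prefix = "Q"
    entity_type = "item"


class PropertyId(NumericEntityId):
    prefix = "P"
    entity_type = "property"


class FormId(EntityId):
    """Hypothetical sub-entity id that embeds the id of its parent entity."""

    def __init__(self, parent: EntityId, local_part: str) -> None:
        self._parent = parent
        self._local_part = local_part

    def get_entity_type(self) -> str:
        return "form"

    def get_serialization(self) -> str:
        # The "parent-local" format is invented for this sketch.
        return f"{self._parent.get_serialization()}-{self._local_part}"


def to_term_row(entity_id: NumericEntityId) -> Tuple[str, int]:
    """Storage-side helper: needs the numeric part, so it asks for the
    narrower interface instead of the general EntityId."""
    return (entity_id.get_entity_type(), entity_id.get_numeric_id())
```

Here to_term_row(ItemId(42)) yields a row for the existing table layout, while a FormId is a valid EntityId that such a helper is simply not declared to accept.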

-- daniel

Jeroen De Dauw

6:11 p.m.

Hey,

> Perhaps it would be a viable compromise to drop the "prefix+number" assumption for entity IDs in general, but keep it for the kinds of entities we have right now.

We need to decide whether we want to support ids that have another format or not. This is a boolean thing, as you cannot "sort of support it" and expect that to lead to good design. The main goal of this thread is getting an answer to that question.

> This would mean that there would be a base class and/or interface called something like NumericEntityId, which at least ItemID and PropertyId would derive from resp. implement. That interface would then be required by storage services that rely on the table structure described by Jeroen.

This does not solve the issue. As these interfaces would only accept NumericEntityId, callers would need to make sure they only provide such ids. How are they going to do this? I do not see how this could be done nicely in our codebase. And what happens when we have an entity id type that does not implement this? We'd need to go tackle all these assumptions to have it integrate nicely. We'd need to do all the work we need to do now, plus fix all new occurrences of these assumptions that get introduced if we go ahead pretending we can use them without further undermining entity id flexibility. So the approach proposed here is practically the same as stating "we will not support other formats", except that it lies about this intent, and causes _more_ work in case we later decide the decision was wrong and we need to support them anyway.
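To illustrate the concern, a small Python sketch (not code from our PHP codebase). TermStore, save_labels and the simplified id classes are all invented for this example. The point is that generic code holding plain EntityId objects has to branch on the id kind before every call into such a storage service.

```python
from typing import List, Tuple


class EntityId:
    """Stand-in for a general id: only a serialization is guaranteed."""

    def __init__(self, serialization: str) -> None:
        self.serialization = serialization


class NumericEntityId(EntityId):
    """Stand-in for an id that also exposes a numeric part."""

    def __init__(self, prefix: str, numeric_id: int) -> None:
        super().__init__(f"{prefix}{numeric_id}")
        self.numeric_id = numeric_id


class TermStore:
    """Hypothetical storage service tied to the numeric id column."""

    def __init__(self) -> None:
        self.rows: List[Tuple[int, str]] = []

    def save_label(self, entity_id: NumericEntityId, label: str) -> None:
        if not isinstance(entity_id, NumericEntityId):
            raise TypeError(f"cannot store {entity_id.serialization!r}")
        self.rows.append((entity_id.numeric_id, label))


def save_labels(store: TermStore, labels) -> List[EntityId]:
    """Generic caller: has to branch on the id kind at the call site.

    Takes (entity id, label) pairs and returns the ids it could not
    hand to the store.
    """
    skipped = []
    for entity_id, label in labels:
        if isinstance(entity_id, NumericEntityId):
            store.save_label(entity_id, label)
        else:
            skipped.append(entity_id)
    return skipped
```

Every caller shaped like save_labels needs its own answer for the skipped ids, which is the work that would remain if another id format shows up later.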

Cheers

Denny Vrandečić

23 Sep 2013

3:12 p.m.

We probably want to get rid of that assumption, or re-interpret it. If I saw it correctly, we have this in only one table, the wb_terms table (unfortunately our largest one). There, though, we also use the type column for filtering the results, i.e. we would not be getting rid of that column in that table.

We have several options to proceed:

* change the ID column from numeric to string and change the content from 1 to Q1, and have a complex switch implementing this
* not change anything but the semantics of this. We could simply assume that, depending on the type, the meaning of the id is slightly different. If it is an item or property, assume it is Q or P concatenated with the numeric ID; if it is a MediaFile, assume the number means the pageID, etc.
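The second option could look roughly like the following Python sketch. The type names, the INTERPRETERS table and the interpret function are assumptions invented for this example, not existing code. The stored number stays as it is, and the entity type decides what it means.

```python
from typing import Callable, Dict, Tuple, Union

# (kind of value, value): either a full entity id or a page id.
Meaning = Tuple[str, Union[str, int]]

# Hypothetical per-type interpretation of the existing numeric column.
INTERPRETERS: Dict[str, Callable[[int], Meaning]] = {
    "item": lambda n: ("entity_id", f"Q{n}"),
    "property": lambda n: ("entity_id", f"P{n}"),
    "mediafile": lambda n: ("page_id", n),
}


def interpret(entity_type: str, number: int) -> Meaning:
    """Give the stored number its type-dependent meaning."""
    interpreter = INTERPRETERS.get(entity_type)
    if interpreter is None:
        raise ValueError(f"no interpretation for type {entity_type!r}")
    return interpreter(number)
```

Under these assumptions interpret("item", 42) gives ("entity_id", "Q42") and interpret("mediafile", 12345) gives ("page_id", 12345), with the table itself left untouched.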

Just an idea.

Cheers, Denny

-- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.
