We probably want to get rid of that assumption, or re-interpret it. If I
saw it correctly, we have this only in one table, the wb_terms table
(unfortunately our largest one). There, though, we use the column with the
type also for filtering the results, i.e. we would not be getting rid of
that column in that table.
We have several options to proceed:
* change the ID from numeric to string and change the content from 1 to Q1,
and have a complex switch implementing this
* not change anything but the semantics of this. We could simply assume
that, depending on the type, the meaning of the id is slightly different:
if it is an item or property, assume it is Q or P concatenated with the
numeric ID; if it is a MediaFile, assume the number means the page ID; etc.
Just an idea.
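A minimal sketch of the second option, to make the idea concrete. It is
written in Python for brevity and is not Wikibase code: the function name,
the "mediafile" type string and the "page:" rendering are all hypothetical.

```python
# Hypothetical sketch: the stored columns stay (entity type, numeric id),
# and only the interpretation of the number depends on the type.
PREFIX_BY_TYPE = {"item": "Q", "property": "P"}


def interpret_id(entity_type: str, numeric_id: int) -> str:
    """Return a readable id for a stored (type, number) pair."""
    if entity_type in PREFIX_BY_TYPE:
        # Items and properties: prefix concatenated with the numeric id.
        return f"{PREFIX_BY_TYPE[entity_type]}{numeric_id}"
    if entity_type == "mediafile":
        # Assumed convention: the number is the page ID of the file page.
        return f"page:{numeric_id}"
    raise ValueError(f"unknown entity type: {entity_type!r}")
```

With this, interpret_id("item", 1) gives "Q1", while the same number stored
with the media file type would be read as a page ID instead.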
Cheers,
Denny
2013/9/15 Jeroen De Dauw <jeroendedauw(a)gmail.com>
Hey,
With the recent EntityId refactoring, the assumption that these IDs
consist of a prefix (e.g. Q) and a numeric part (e.g. 42) can be removed
from the EntityId class. However, this assumption is still present in
several other locations in the codebase.
If we want to support entity ids that do not follow the prefix+numeric id
schema (for instance using a filename or word as id), then we need to get
rid of the occurrences of this assumption. This is not trivial to do,
however. The biggest hurdle is that we have a lot of entity ids stored in
the db using 2 fields: entity type (string) and numeric id (int). The
former field is mapped from the id prefix, and the latter is the
numeric part. As long as this is there, we need to be able to get a numeric
part from the id object, and reconstruct id objects given a type and
numeric part. Removing the format assumption would mean having the same
entity type field, but rather than having an int field holding just part of
the id, there would be a string field that holds the whole serialization.
While it is technically not that hard to change this in the software, the
size of the tables of Wikidata.org makes doing a rebuild somewhat hard.
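To illustrate the two storage layouts described above, here is a rough
sketch in Python (not the actual Wikibase code). The prefix-to-type mapping
and all function names are assumptions made for the example; the real
tables and classes may look different.

```python
import re

# Assumed mapping between id prefixes and entity types.
TYPE_BY_PREFIX = {"Q": "item", "P": "property"}
PREFIX_BY_TYPE = {t: p for p, t in TYPE_BY_PREFIX.items()}

_PREFIX_NUMERIC = re.compile(r"([A-Za-z])([1-9][0-9]*)")


def to_legacy_row(serialization: str) -> tuple[str, int]:
    """Split an id into (entity type, numeric id), the current layout.

    Only works for ids that follow the prefix+number format.
    """
    match = _PREFIX_NUMERIC.fullmatch(serialization)
    if match is None or match.group(1).upper() not in TYPE_BY_PREFIX:
        raise ValueError(f"not a prefix+number id: {serialization!r}")
    return TYPE_BY_PREFIX[match.group(1).upper()], int(match.group(2))


def from_legacy_row(entity_type: str, numeric_id: int) -> str:
    """Reconstruct the id serialization from a (type, number) row."""
    return f"{PREFIX_BY_TYPE[entity_type]}{numeric_id}"


def rebuild_row(entity_type: str, numeric_id: int) -> tuple[str, str]:
    """Convert one row to the proposed layout: (type, full serialization)."""
    return entity_type, from_legacy_row(entity_type, numeric_id)
```

An id such as "Example.jpg" cannot pass through to_legacy_row, which is the
limitation of the two-field layout; rebuild_row is the per-row step that a
table rebuild would have to apply to every existing row.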
This means we essentially need to choose if we want to be able to, at any
point, have entity ids that do not follow the prefix+number format. If this
is not important, we can leave everything as it is, and keep the assumption
around without feeling dirty about it. In that case we also accept the fact
that if later on we want an id that does not follow this format, we'll be
out of luck. If on the other hand we want to support ids with other
formats, we need to start working on getting rid of the remaining
assumptions on the old format, and accept that we'll need to write some
rebuilding code for a (few) Wikidata.org table(s).
It'd be good to have this decision sooner rather than later, as code
touching places where such assumptions are located awkwardly needs to take
both possible decisions into account, while the two typically suggest quite
different approaches.
Cheers
--
Jeroen De Dauw
http://www.bn2vs.com
Don't panic. Don't be evil. ~=[,,_,,]:3
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 |
http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit
by the Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.