The most important thing for my problem would be to encode quantity and
geopos. The test case is lake sizes to encode proper localized descriptions.
Unless someone already has a working solution, I would encode this as
sparse logarithmic vectors, probably also with log of pairwise differences.
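A minimal sketch of what such a sparse logarithmic vector could look like, assuming it means a one-hot vector over logarithmically spaced bins. The function names, the exponent range and the number of bins per decade are illustrative assumptions, not something from an existing library, and the pairwise-difference part is not shown:

```python
import math


def log_bin_index(value, min_exp=-3, max_exp=9, bins_per_decade=4):
    """Return the index of the logarithmic bin that `value` falls into.

    The range 10**min_exp .. 10**max_exp and bins_per_decade are
    assumptions chosen for illustration only.
    """
    if value <= 0:
        raise ValueError("value must be positive")
    n_bins = (max_exp - min_exp) * bins_per_decade
    pos = (math.log10(value) - min_exp) * bins_per_decade
    # Clamp values outside the covered range into the first or last bin.
    return min(max(int(math.floor(pos)), 0), n_bins - 1)


def encode_quantity(value, **kwargs):
    """Sparse one-hot encoding, stored as {index: 1.0}."""
    return {log_bin_index(value, **kwargs): 1.0}
```

Two quantities that differ by a factor of ten end up exactly bins_per_decade cells apart, so the encoding stays sparse while still reflecting order of magnitude.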
Encoding of qualifiers is interesting, but would require encoding of a
topic map, and that adds an additional layer of complexity.
How to encode the values is not so much the problem as avoiding
reimplementing this yet another time… ;)
On Wed, Sep 27, 2017 at 1:23 PM, Thomas Pellissier Tanon <
thomas(a)pellissier-tanon.fr> wrote:
Just an idea of a very sparse but hopefully not so bad encoding (I have
not actually tested it).
NB: I am going to use the terms defined in the glossary [1] a lot.
A value could be encoded by a vector:
- for entity ids it is a vector V that has the dimension of the number of
existing entities such that V[q] = 1 if, and only if, it is the entity q
and V[q] = 0 if not.
- for time : a vector with year, month, day, hours, minutes, seconds,
is_precision_year, is_precision_month, ..., is_gregorian, is_julian (or
something similar)
- for geo coordinates: latitude, longitude, is_earth, is_moon...
- string/language strings: an encoding depending on your use case
...
Example : To encode "Q2" you would have the vector {0,1,0....}
To encode the year 2000 you would have {2000,0..., is_precision_decade =
0,is_precision_year=1,is_precision_month=0,...,is_gregorian=true,...}
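The two examples above can be sketched roughly as follows. This is untested against real data; the helper names, the 0-based index (Q1 goes to cell 0, so Q2 goes to cell 1 as in the example) and the exact list of time fields are assumptions filled in for illustration:

```python
def encode_entity(entity_id, num_entities):
    """Sparse one-hot for an entity id such as "Q2", stored as {index: 1}."""
    if not entity_id.startswith("Q"):
        raise ValueError("expected an id like 'Q42'")
    q = int(entity_id[1:])
    if not 1 <= q <= num_entities:
        raise ValueError("entity id out of range")
    return {q - 1: 1}


def to_dense(sparse, size):
    """Expand a {index: value} mapping into a plain list of length `size`."""
    dense = [0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


# Hypothetical field layout; the thread only says "or something similar".
TIME_FIELDS = [
    "year", "month", "day", "hours", "minutes", "seconds",
    "is_precision_year", "is_precision_month", "is_precision_day",
    "is_gregorian", "is_julian",
]


def encode_time(year, month=0, day=0, hours=0, minutes=0, seconds=0,
                precision="year", calendar="gregorian"):
    """Dense vector of the time components plus precision/calendar flags."""
    values = {
        "year": year, "month": month, "day": day,
        "hours": hours, "minutes": minutes, "seconds": seconds,
        "is_precision_" + precision: 1,
        "is_" + calendar: 1,
    }
    unknown = set(values) - set(TIME_FIELDS)
    if unknown:
        raise ValueError("unsupported precision or calendar: %s" % unknown)
    return [values.get(name, 0) for name in TIME_FIELDS]
```

With five entities, to_dense(encode_entity("Q2", 5), 5) gives [0, 1, 0, 0, 0]. In practice only the sparse form would be kept, since the dense vector has one cell per existing entity.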
To encode a snak you build a big vector by concatenating one value vector
per property (the vector of the value if the property is P1, then if it is
P2, ...), using the property datatype to pick a good vector shape, and you
add two cells per property to encode is_novalue and is_somevalue.
To encode "P31: Q5" you would have a vector V =
{0,....,0,0,0,0,1,0,....} with 1 only for V[P31_offset + Q5_offset]
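A rough sketch of this snak layout, restricted to entity-valued properties to keep it short. The block layout (one cell per entity followed by the is_novalue and is_somevalue cells) and all names are assumptions made for illustration:

```python
def build_layout(properties, num_entities):
    """Map each property id to the start offset of its block.

    Assumed block layout: num_entities value cells, then is_novalue,
    then is_somevalue.
    """
    block_size = num_entities + 2
    offsets = {prop: i * block_size for i, prop in enumerate(properties)}
    return offsets, block_size * len(properties)


def encode_snak(prop, value, offsets, num_entities):
    """Sparse snak vector as {index: 1}.

    `value` is an entity id like "Q5", or the string "novalue" or
    "somevalue".
    """
    base = offsets[prop]
    if value == "novalue":
        return {base + num_entities: 1}
    if value == "somevalue":
        return {base + num_entities + 1: 1}
    q_offset = int(value[1:]) - 1  # Q1 -> cell 0 of the block
    return {base + q_offset: 1}


offsets, snak_size = build_layout(["P31", "P42"], num_entities=50)
p31_q5 = encode_snak("P31", "Q5", offsets, 50)
```

Here p31_q5 has a single non-zero cell at offsets["P31"] + 4, which plays the role of V[P31_offset + Q5_offset].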
To encode a claim you could concatenate the main snak vector + the
qualifiers vector, which is the merge of the snak vectors of all qualifiers
(i.e. you build the vector for each snak and you sum them), such that the
qualifiers vector encodes all qualifiers at the same time. This makes it
possible to check that a qualifier is set just by picking the right cell in
the vector. But it will do bad things if there are two qualifiers with the
same property and a datatype like time or geocoordinates. But I don't
think it is really a problem.
Example: to encode the claim with "P31: Q5" main snak and qualifiers
"P42: Q42, P42: Q44" we would have a vector V such that V[P31_offset + Q5_offset]
= 1, V[qualifiers_offset + P42_offset + Q42_offset] = 1 and
V[qualifiers_offset + P42_offset + Q44_offset] = 1 and 0 elsewhere.
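The claim example can be sketched like this, again only for entity-valued properties and leaving out the is_novalue / is_somevalue cells for brevity. The names and the concrete sizes (two properties, 50 entities) are illustrative assumptions:

```python
from collections import Counter


def snak_index(prop, entity_id, properties, num_entities):
    """Cell index inside one snak vector: property offset + entity offset."""
    p_offset = properties.index(prop) * num_entities
    q_offset = int(entity_id[1:]) - 1
    return p_offset + q_offset


def encode_claim(main_snak, qualifiers, properties, num_entities):
    """Sparse claim vector: main snak part, then the summed qualifiers part."""
    qualifiers_offset = len(properties) * num_entities
    vec = Counter()
    prop, entity_id = main_snak
    vec[snak_index(prop, entity_id, properties, num_entities)] += 1
    # Summing the qualifier snaks merges them into one qualifiers vector.
    for prop, entity_id in qualifiers:
        index = snak_index(prop, entity_id, properties, num_entities)
        vec[qualifiers_offset + index] += 1
    return dict(vec)


claim = encode_claim(
    ("P31", "Q5"),
    [("P42", "Q42"), ("P42", "Q44")],
    properties=["P31", "P42"],
    num_entities=50,
)
```

Checking whether a given qualifier is set is then a single lookup such as claim.get(qualifiers_offset + index, 0). Because the qualifier snaks are summed, two numeric qualifiers on the same property (time, coordinates) would add up into a meaningless value, which is the limitation mentioned above.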
I am not sure how to encode statement references (merging all of them and
encoding the result just like the qualifiers vector is maybe a first step,
but it is bad if we have multiple references). For the rank you just need
3 booleans: is_preferred, is_normal and is_deprecated.
Cheers,
Thomas
[1] https://www.wikidata.org/wiki/Wikidata:Glossary
On 27 Sep 2017, at 12:41, John Erling Blad
<jeblad(a)gmail.com> wrote:
Is there anyone who has done any work on how to encode statements as
features for neural nets? I'm mostly interested in sparse encoders for
online training of live networks.
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata