You might view (my) problem as learning an embedding for words (and their fragments) driven by the valued statements (those you discard), and then inverting this learned encoder into a language model. When describing an object, it would then be possible to choose better words (lexical choice in natural language generation).

On Mon, Oct 2, 2017 at 5:00 PM, <fn@imm.dtu.dk> wrote:
I have done some work on converting Wikidata items and properties to a low-dimensional representation (graph embedding).

A webservice with a "most-similar" functionality, based on computation in the low-dimensional space, is running at https://tools.wmflabs.org/wembedder/most-similar/

A query may look like:

https://tools.wmflabs.org/wembedder/most-similar/Q20#language=en

It is based on a simple Gensim model (https://github.com/fnielsen/wembedder) and could probably be improved.
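
As an untested sketch, querying such a Gensim model locally might look like the following (the model file name is a placeholder, not the actual artifact shipped with Wembedder):

# Minimal sketch: querying a Gensim embedding of Wikidata items.
# "wembedder.model" is a hypothetical file name.
from gensim.models import Word2Vec

model = Word2Vec.load("wembedder.model")

# Items are stored under their Q-identifiers, so "most similar to
# Norway (Q20)" becomes:
for item, similarity in model.wv.most_similar("Q20", topn=5):
    print(item, similarity)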

It is described in http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/7011/pdf/imm7011.pdf

It is not embedding statements but rather individual items.


There is general research on graph embedding. I have added some of the scientific articles to Wikidata. You can see them with Scholia:

https://tools.wmflabs.org/scholia/topic/Q32081746


best regards
Finn Årup Nielsen
http://people.compute.dtu.dk/faan/


On 09/27/2017 02:14 PM, John Erling Blad wrote:
The most important thing for my problem would be to encode quantities and geographical positions. The test case is lake sizes, for generating properly localized descriptions.

Unless someone already has a working solution, I would encode this as sparse logarithmic vectors, probably also with the log of pairwise differences; a rough sketch follows.
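
One possible reading of "sparse logarithmic vector", as an untested sketch: bucket the log of the quantity into a sparse one-hot vector. The bucket count and range below are arbitrary assumptions.

# Sketch: encode a quantity (e.g. a lake area in km^2) as a sparse
# one-hot vector over logarithmically spaced buckets. The number of
# buckets and the covered range are illustrative assumptions.
import math
from scipy.sparse import dok_matrix

N_BUCKETS = 64                 # hypothetical bucket count
LOG_MIN, LOG_MAX = -3.0, 9.0   # covers roughly 1e-3 to 1e9

def encode_quantity(value):
    vec = dok_matrix((1, N_BUCKETS), dtype=float)
    frac = (math.log10(value) - LOG_MIN) / (LOG_MAX - LOG_MIN)
    bucket = min(N_BUCKETS - 1, max(0, int(frac * N_BUCKETS)))
    vec[0, bucket] = 1.0
    return vec

area_vector = encode_quantity(368.0)  # a 368 km^2 lake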

Encoding of qualifiers is interesting, but would require encoding of a topic map, and that adds an additional layer of complexity.

How to encode the values is not so much the problem; the problem is avoiding reimplementing this yet another time… ;)

On Wed, Sep 27, 2017 at 1:23 PM, Thomas Pellissier Tanon <thomas@pellissier-tanon.fr> wrote:

    Just an idea for a very sparse but hopefully not-too-bad encoding (I
    have not actually tested it).

    NB: I am going to make heavy use of the terms defined in the glossary [1].

    A value could be encoded by a vector:
    - for entity ids: a vector V whose dimension is the number of
    existing entities, such that V[q] = 1 if, and only if, the value is
    the entity q, and V[q] = 0 otherwise.
    - for time: a vector with year, month, day, hour, minute, second,
    is_precision_year, is_precision_month, ..., is_gregorian, is_julian
    (or something similar)
    - for geo coordinates: latitude, longitude, is_earth, is_moon, ...
    - for strings/language-tagged strings: an encoding depending on your
    use case
    ...
    Example: to encode "Q2" you would have the vector {0, 1, 0, ...}.
    To encode the year 2000 you would have {2000, 0, ...,
    is_precision_decade = 0, is_precision_year = 1,
    is_precision_month = 0, ..., is_gregorian = true, ...}
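
    A rough, untested Python sketch of the entity-id and time encodings
    (the dimensions, the field order, and the use of the numeric part of
    the Q-id as index are all illustrative assumptions):

    # Sketch of the value encodings above; sizes and field order are
    # illustrative assumptions, not a tested design.
    import numpy as np
    from scipy.sparse import dok_matrix

    N_ENTITIES = 40_000_000  # hypothetical count of existing entities

    def encode_entity(q_index):
        # One-hot: V[q] = 1 iff the value is entity q.
        v = dok_matrix((1, N_ENTITIES), dtype=np.int8)
        v[0, q_index] = 1
        return v

    PRECISIONS = ["decade", "year", "month", "day", "hour", "minute", "second"]
    CALENDARS = ["gregorian", "julian"]

    def encode_time(year, month=0, day=0, hour=0, minute=0, second=0,
                    precision="year", calendar="gregorian"):
        numeric = [year, month, day, hour, minute, second]
        precision_flags = [1 if p == precision else 0 for p in PRECISIONS]
        calendar_flags = [1 if c == calendar else 0 for c in CALENDARS]
        return np.array(numeric + precision_flags + calendar_flags)

    # The "year 2000" example: {2000, 0, ..., is_precision_year = 1, ...}
    v2000 = encode_time(2000)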

    To encode a snak, you build one big vector by concatenating a value
    vector slot per property: one for P1, one for P2, ... (using the
    property datatype to pick a good vector shape for each slot), plus
    two extra cells per property to encode is_novalue and is_somevalue.
    To encode "P31: Q5" you would have a vector V = {0, ..., 0, 1, 0, ...}
    with 1 only at V[P31_offset + Q5_offset].
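
    As an untested sketch, the snak vector can be kept as a sparse
    {index: value} mapping; the offsets below are hypothetical:

    # Sketch: a snak as a sparse {index: value} mapping into one big
    # concatenated vector. The offsets and the use of the numeric part
    # of the Q-id as index are illustrative assumptions.
    N_ENTITIES = 40_000_000  # hypothetical

    # Each entity-valued property block has N_ENTITIES cells plus two
    # flag cells (is_novalue, is_somevalue).
    PROPERTY_OFFSETS = {"P31": 0, "P42": N_ENTITIES + 2}  # illustrative

    def encode_entity_snak(prop, q_index=None, novalue=False, somevalue=False):
        base = PROPERTY_OFFSETS[prop]
        if novalue:
            return {base + N_ENTITIES: 1}      # is_novalue cell
        if somevalue:
            return {base + N_ENTITIES + 1: 1}  # is_somevalue cell
        return {base + q_index: 1}

    # "P31: Q5" sets a single 1 at P31_offset + Q5_offset:
    snak = encode_entity_snak("P31", q_index=5)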

    To encode a claim, you could concatenate the main snak vector with a
    qualifiers vector that is the merge of the snak vectors of all
    qualifiers (i.e. you build the snak vector for each qualifier and sum
    them), so that the qualifiers vector encodes all qualifiers at the
    same time. This allows checking whether a qualifier is set just by
    picking the right cell in the vector. It will do bad things if there
    are two qualifiers with the same property and a datatype like time or
    geo coordinates, but I don't think that is really a problem.
    Example: to encode the claim with main snak "P31: Q5" and qualifiers
    "P42: Q42, P42: Q44" we would have a vector V such that
    V[P31_offset + Q5_offset] = 1,
    V[qualifiers_offset + P42_offset + Q42_offset] = 1 and
    V[qualifiers_offset + P42_offset + Q44_offset] = 1, and 0 elsewhere.
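
    Continuing the untested sketch above (the qualifier-block offset is
    again hypothetical, and encode_entity_snak() is the function from the
    snak sketch):

    # Sketch: claim = main snak block + summed qualifier block.
    QUALIFIERS_OFFSET = 100_000_000  # hypothetical start of the block

    def encode_claim(main_snak, qualifier_snaks):
        vec = dict(main_snak)
        for snak in qualifier_snaks:
            for index, value in snak.items():
                shifted = QUALIFIERS_OFFSET + index
                # Summing lets repeated qualifiers accumulate; harmless
                # for entity values, lossy for e.g. time values.
                vec[shifted] = vec.get(shifted, 0) + value
        return vec

    # Main snak "P31: Q5" with qualifiers "P42: Q42" and "P42: Q44":
    claim = encode_claim(
        encode_entity_snak("P31", q_index=5),
        [encode_entity_snak("P42", q_index=42),
         encode_entity_snak("P42", q_index=44)],
    )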

    I am not sure how to encode statement references (merging all of them
    and encoding the result just like the qualifiers vector is maybe a
    first step, but it is bad if we have multiple references). For the
    rank you just need 3 booleans: is_preferred, is_normal and
    is_deprecated.
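
    For completeness of the sketch, the rank cells could be appended the
    same way (the offset is again hypothetical):

    # Sketch: rank as three mutually exclusive boolean cells appended
    # to the claim vector.
    RANK_OFFSET = 200_000_000  # hypothetical
    RANKS = {"preferred": 0, "normal": 1, "deprecated": 2}

    def encode_rank(claim_vec, rank):
        vec = dict(claim_vec)
        vec[RANK_OFFSET + RANKS[rank]] = 1
        return vec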

    Cheers,

    Thomas

    [1] https://www.wikidata.org/wiki/Wikidata:Glossary


    > On 27 Sep 2017, at 12:41, John Erling Blad <jeblad@gmail.com> wrote:
    >
    > Is there anyone who has done any work on how to encode statements as features for neural nets? I'm mostly interested in sparse encoders for online training of live networks.



_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata