Re: [Wikidata] An attribute for "famous person"

Finn Aarup Nielsen

2 Aug 2016

4:33 p.m.

On 08/02/2016 03:01 PM, Markus Kroetzsch wrote:

...

On 02.08.2016 13:11, Ghislain ATEMEZING wrote:

...
Thanks Yuri. I will try to define a kind of metric for those having a number of Wikipedia entries. For example, a person with 127 entries would be "famous" while another with just 10 is not "famous"...

Side remark @Stas: it could be very helpful to have the number of Wikimedia project articles stored as a numeric value for a new property in RDF. Doing a SPARQL query that computes this number and does something with it afterwards almost always times out. The number could be very useful as a heuristic "popularity" measure that can also help to give the most "important" items first in a number of queries.

Best,

Markus

The kind of queries alluded to (I think) are ones like:

# Famous Danes
SELECT ?person ?personLabel (count(?articles) as ?rank) WHERE {
  ?person wdt:P27 wd:Q35 .
  OPTIONAL { ?articles schema:about ?person }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
GROUP BY ?person ?personLabel
ORDER BY DESC(?rank)
LIMIT 50

Which is derived from the Repository Fringe Wikitalks by Navino Evans and Ewan McAndrew

This one with property counts times out if not restricted to females:

# Famous Danish females
SELECT ?person ?personLabel (count(?properties) as ?rank) WHERE {
  ?person wdt:P27 wd:Q35 .
  ?person wdt:P21 wd:Q6581072 .
  ?person ?properties ?something
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
GROUP BY ?person ?personLabel
ORDER BY DESC(?rank)
LIMIT 50

Persons in Wikidata may not necessarily be famous. Persons may be in Wikidata because of structural needs.

/Finn

...

...
On Tue, 2 Aug 2016 at 12:52, Yuri Astrakhan (<yastrakhan@wikimedia.org>) wrote:
Any person in Wikidata is "famous" - otherwise they wouldn't be
notable and therefore wouldn't be there :)
If you prefer the stricter notability requirement (as used by
Wikipedia), search only for those that have a Wikipedia page


On Aug 2, 2016 1:44 PM, "Ghislain ATEMEZING" <ghislain.atemezing@gmail.com> wrote:

    Ahoy,
    I am curious to know if there is a way to know that a given
    person is "famous" in Wikidata. I want for example to retrieve
    "all famous French people born after a given date".

    Thanks in advance for your help.

    Best,
    Ghislain
    --
    -------
    "Love all, trust a few, do wrong to none" (W. Shakespeare)
    Web: http://atemezing.org

    _______________________________________________
    Wikidata mailing list
    Wikidata@lists.wikimedia.org
    https://lists.wikimedia.org/mailman/listinfo/wikidata

Andrew Gray

2 Aug

4:41 p.m.

New subject: An attribute for "famous person"

I'd agree with both interpretations - the majority of people in Wikidata are implicitly famous, and so a property saying so wouldn't make sense, but there are (as Finn notes) a lot of non-famous people created for structural reasons.

Using the existence of Wikipedia articles as a threshold, as suggested, seems a pretty good test - it's flawed, of course, but it's easy to check for and works as a first approximation of "probably is actually famous".
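
As a rough illustration of the threshold test discussed here, the Python sketch below counts Wikipedia sitelinks in a record shaped like the Wikidata JSON export (a `sitelinks` object keyed by site id, such as `enwiki`). The function names, the exclusion list of non-Wikipedia sites, and the sample record are assumptions made for this example; they are not part of any Wikidata tool.

```python
# Site ids that end in "wiki" but are not Wikipedias.
# Assumption: this list is illustrative and may be incomplete.
NON_WIKIPEDIA_SITES = {
    "commonswiki", "specieswiki", "metawiki",
    "wikidatawiki", "mediawikiwiki", "sourceswiki",
}


def wikipedia_sitelinks(entity):
    """Return the site ids of the entity's Wikipedia sitelinks, sorted."""
    sitelinks = entity.get("sitelinks", {})
    return sorted(
        site for site in sitelinks
        if site.endswith("wiki") and site not in NON_WIKIPEDIA_SITES
    )


def probably_famous(entity, min_articles=1):
    """First approximation: at least `min_articles` Wikipedia articles."""
    return len(wikipedia_sitelinks(entity)) >= min_articles


# Hypothetical record with made-up titles, for illustration only.
example = {
    "id": "Q0",
    "sitelinks": {
        "enwiki": {"title": "Example Person"},
        "dawiki": {"title": "Eksempel Person"},
        "commonswiki": {"title": "Category:Example Person"},
    },
}
```

Raising `min_articles` turns the same check into the kind of graded metric Ghislain describes, at the cost of picking an arbitrary cut-off.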

Andrew.

Daniel Kinzler

6:06 p.m.

New subject: An attribute for "famous person"

Am 02.08.2016 um 18:41 schrieb Andrew Gray:

...

I'd agree with both interpretations - the majority of people in Wikidata are ...

Using the existence of Wikipedia articles as a threshold, as suggested, seems a pretty good test - it's flawed, of course, but it's easy to check for and works as a first approximation of "probably is actually famous".

If we want to have the number of sitelinks in RDF, let's please make sure that this number is associated with the item *document* URI, not with the concept URI. After all, the person doesn't have links, the item document does.

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

Markus Kroetzsch

6:19 p.m.

New subject: An attribute for "famous person"

Oh, there is a little misunderstanding here. I have not suggested to create a property "number of sitelinks in this document". What I propose instead is to create a property "number of sitelinks for the document associated with this entity". The domain of this suggested property is entity. The advantage of this proposal over the thing that you understood is that it makes queries much simpler, since you usually want to sort items by this value, not documents. One could also have a property for number of sitelinks per document, but I don't think it has such a clear use case.

Markus

Daniel Kinzler

6:59 p.m.

New subject: An attribute for "famous person"

"number of sitelinks for the document associated with this entity" strikes me as semantically odd, which was the point of my earlier mail. I'd much rather have "number of sitelinks in this document". You are right that the primary use would be to "rank" items, and that it would be more convenient to have the count associated directly with the item (the entity), but I fear it will lead to a blurring of the line between information about the entity, and information about the document. That is already a common point of confusion, and I'd rather keep that separation very clear. I also don't think that one level of indirection would be horribly complicated.

To me it's just natural to include the sitelink info on the same level as we provide a timestamp or revision id: for the document.

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

Markus Kroetzsch

7:41 p.m.

New subject: An attribute for "famous person"

I just proposed the simple and straightforward way to solve the practical problem at hand. It leads to shorter, more readable queries that execute faster. (I don't claim originality for this; it is the obvious solution to the problem and most people would arrive at exactly the same conclusion).

Your concern is based on the assumption that there is some kind of psychological effect that a particular RDF encoding would have on users. I don't think that there is any such effect. Our users will not confuse the city of Paris with an RDF document just because of some data in the RDF store.
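
The two modellings under debate can be sketched with a handful of in-memory triples. This is only an illustration in Python: `ex:sitelinkCount` is a hypothetical predicate, the count `3` is made up, and the prefixed names are plain strings rather than real IRIs. It shows that the document-level variant needs one extra hop through `schema:about`, which is the "one level of indirection" mentioned in the thread.

```python
# Each triple is (subject, predicate, object).
# "ex:sitelinkCount" is a hypothetical predicate; the count 3 is made up.
COUNT = "ex:sitelinkCount"
ABOUT = "schema:about"

# Variant 1: count attached to the entity (item) itself.
entity_level = [
    ("wd:Q35", COUNT, 3),
]

# Variant 2: count attached to the item document, which points to the entity.
document_level = [
    ("wdata:Q35", ABOUT, "wd:Q35"),
    ("wdata:Q35", COUNT, 3),
]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All subjects of triples matching the given predicate and object."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def count_on_entity(triples, entity):
    """Variant 1 lookup: a single pattern on the entity."""
    values = objects(triples, entity, COUNT)
    return values[0] if values else None


def count_on_document(triples, entity):
    """Variant 2 lookup: find the document about the entity, then its count."""
    for document in subjects(triples, ABOUT, entity):
        values = objects(triples, document, COUNT)
        if values:
            return values[0]
    return None
```

Both lookups return the same number; the difference is one additional join per item when sorting items by the count.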

Markus

-- Prof. Dr. Markus Kroetzsch Knowledge-Based Systems Group Faculty of Computer Science TU Dresden +49 351 463 38486 https://iccl.inf.tu-dresden.de/web/KBS/en

Yuri Astrakhan

8:28 p.m.

New subject: An attribute for "famous person"

Is there a way we could have more than just the number of language links? Eg number of incoming links from other wikipedia pages?

Stas Malyshev

3 Aug

12:51 a.m.

New subject: An attribute for "famous person"

Hi!

...

Is there a way we could have more than just the number of language links? Eg number of incoming links from other wikipedia pages?

If we implement T129046 we can have any page props we want to :) Of course, adding them for the whole DB would require new dump and either data reload or some manual work... which can be solved too with some effort. I wonder if Wikidata Toolkit has tools to do filtering like this (i.e. "get me specific triple from all entities in this dump"). If not, it probably should have ;)

-- Stas Malyshev smalyshev@wikimedia.org

Markus Kroetzsch

5:42 a.m.

New subject: An attribute for "famous person"

Are you asking if WDTK can make RDF exports that contain only some parts of the RDF data? Yes, this is possible. We did not put too much effort in the RDF export though since WMF has reimplemented this anyway. Probably would need some de-dusting.

If you were asking about triple filtering on RDF data that you already have (not "filtered generation" of new RDF), then this is not something that WDTK aims at (since WDTK does not read RDF data in the first place). However, you can often achieve this with grep if the RDF data is in ntriples format.
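
For data that is already in N-Triples, the grep-style filtering mentioned above can also be sketched in a few lines of Python. This is a minimal sketch, not a WDTK feature: the function name and the sample lines are made up for illustration. It relies on the N-Triples rule that every statement sits on one line and that subject and predicate contain no whitespace.

```python
def filter_by_predicate(lines, predicate_iri):
    """Yield the N-Triples lines whose predicate is exactly `predicate_iri`."""
    wanted = "<" + predicate_iri + ">"
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace in N-Triples,
        # so the predicate is always the second whitespace-separated field.
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1] == wanted:
            yield line


# Illustrative sample data, not taken from a real dump.
sample = [
    '<https://da.wikipedia.org/wiki/Danmark> <http://schema.org/about> <http://www.wikidata.org/entity/Q35> .',
    '<http://www.wikidata.org/entity/Q35> <http://www.w3.org/2000/01/rdf-schema#label> "Denmark"@en .',
    '# a comment line',
]

matches = list(filter_by_predicate(sample, "http://schema.org/about"))
```

Comparing the second field, rather than searching the whole line, avoids false matches when the predicate IRI also appears as an object. The approach does not carry over to Turtle, where one statement may span several lines.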

Markus

Stas Malyshev

5:51 a.m.

New subject: An attribute for "famous person"

Hi!

...

If you were asking about triple filtering on RDF data that you already have (not "filtered generation" of new RDF), then this is not something that WDTK aims at (since WDTK does not read RDF data in the first place). However, you can often achieve this with grep if the RDF data is in ntriples format.

In simple cases - like just finding one predicate - probably yes, but in some cases statements can span multiple lines/clauses, etc. And, we don't actually have a dump in ntriples format, only in Turtle - which is not super-greppable. We could probably produce one manually but I wonder if we already have a tool that deals with a lot of this stuff... Not reading RDF may be a problem, I thought it could read RDF too, not just write. I wonder if it's worth adding, for other use cases too. But I'll think more about it.

-- Stas Malyshev smalyshev@wikimedia.org

Markus Kroetzsch

5:08 a.m.

New subject: An attribute for "famous person"

On 02.08.2016 22:28, Yuri Astrakhan wrote:

...

Is there a way we could have more than just the number of language links? Eg number of incoming links from other wikipedia pages?

One could have other data added to the store, but this may be more work depending on what you want. You ask about links from "wikipedia pages". If you really mean this (and not Wikidata items), then this would be a lot of work to do since one would have to update RDF when (any) Wikipedia page changes. I guess we do not have infrastructure for doing this in a live update mode. Also note that the number of these links is different in each language, so one would have to store many numbers. Overall, this link count would really be (meta)data about Wikipedia pages and their relations, and not so much about Wikidata. I think you could get such Wikipedia-specific data from DBpedia, but I am not sure how well their live endpoint keeps track of this data (since it is tricky). Maybe an offline solution that combines RDF dumps is the most practical approach for now if you really need this data.

Even storing the number of incoming links (properties) from other Wikidata items would actually be tricky. Currently, the RDF data about each item only depends on the content of this item's Wikidata page. The number of inlinks depends on other Wikidata pages, and therefore it is much more work to keep it up to date when there are edits.
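
The difference in update cost can be made concrete with a small Python sketch. The record layout is a simplified assumption: `links_to` is a hypothetical field listing the items an item's statements point at, whereas real Wikidata JSON nests this inside claims. The sitelink count is a function of a single record, while the inlink count of one item changes when some other item is edited.

```python
from collections import Counter

# Simplified, hypothetical item records: "links_to" lists the items that
# this item's statements point at; real Wikidata JSON is more deeply nested.
items = {
    "Q1": {"sitelinks": {"enwiki": {}, "dawiki": {}}, "links_to": ["Q2"]},
    "Q2": {"sitelinks": {"enwiki": {}}, "links_to": []},
    "Q3": {"sitelinks": {}, "links_to": ["Q2", "Q1"]},
}


def sitelink_count(item):
    """Depends only on the item's own record."""
    return len(item["sitelinks"])


def inlink_counts(all_items):
    """Depends on every record: each item may link to any other."""
    counts = Counter()
    for item in all_items.values():
        for target in item["links_to"]:
            counts[target] += 1
    return counts


before = inlink_counts(items)["Q2"]
# Editing Q3 alone changes the inlink count of Q2, not of Q3.
items["Q3"]["links_to"].remove("Q2")
after = inlink_counts(items)["Q2"]
```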

Markus

Andrew Gray

2 Aug

8:50 p.m.

New subject: An attribute for "famous person"

On 2 August 2016 at 20:41, Markus Kroetzsch markus.kroetzsch@tu-dresden.de wrote:

...

I just proposed the simple and straightforward way to solve the practical problem at hand. It leads to shorter, more readable queries that execute faster. (I don't claim originality for this; it is the obvious solution to the problem and most people would arrive at exactly the same conclusion).

As an aside, storing "pagecount" as a simple number also allows us to do a query to trivially find entries with "pagecount=0" - ie, no project links. I know this used to be challenging for WDQ; is it an easy thing to do in SPARQL or would this be useful added functionality?

-- - Andrew Gray andrew.gray@dunelm.org.uk

Stas Malyshev

3 Aug

12:49 a.m.

New subject: An attribute for "famous person"

Hi!

...

Oh, there is a little misunderstanding here. I have not suggested to create a property "number of sitelinks in this document". What I propose instead is to create a property "number of sitelinks for the document associated with this entity". The domain of this suggested property is

I think this is covered by https://phabricator.wikimedia.org/T129046 - which seeks to add page props (which already have sitelinks count I think but we can define any that we want) to RDF. I kind of neglected it due to the lack of demand, but it should not be that hard to do.

-- Stas Malyshev smalyshev@wikimedia.org

Markus Kroetzsch

5:35 a.m.

New subject: An attribute for "famous person"

If you think it is best to implement a more general feature that adds even more properties, then I am sure nobody will complain, but it sounds like more work to me. The number I was asking for is something that you can easily compute from the data that you process already. You can also compute the number in a SPARQL query from the RDF. It is a completely redundant piece of information. Its only purpose is to make SPARQL queries that currently time out run fast. In databases, such things are called "materialized views".

This leads to a slightly different perspective than the one you'd have in T129046. By adding page props, you want to add "new" information from another source, and questions like data modelling etc. come to the fore. With a materialized view, you just add some query results back to the database for technical reasons that are specific to the database. The two motivations might lead to different requirements at some point (e.g., if you want to add another materialized query result to the RDF you may have to extend page props, which involves more dependencies than if you just extend the RDF converter).
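
The "materialized view" idea can be sketched as a small step in an RDF converter: derive the count from the record being processed and write it back as one extra triple. The Python below is only an illustration under stated assumptions. The predicate IRI `http://example.org/sitelinkCount` is hypothetical, the sample record is made up, and this is not how the actual converter is implemented.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"
# Hypothetical predicate IRI, used only for this sketch.
SITELINK_COUNT = "http://example.org/sitelinkCount"


def sitelink_count_triple(entity):
    """Build one N-Triples line carrying the precomputed sitelink count."""
    count = len(entity.get("sitelinks", {}))
    subject = "<" + ENTITY_PREFIX + entity["id"] + ">"
    literal = '"%d"^^<%s>' % (count, XSD_INTEGER)
    return "%s <%s> %s ." % (subject, SITELINK_COUNT, literal)


# Made-up record for illustration.
entity = {"id": "Q35", "sitelinks": {"enwiki": {}, "dawiki": {}, "dewiki": {}}}
line = sitelink_count_triple(entity)
```

Because the value is derived from data already in the record, it adds no new information; it only spares each query the grouping and counting step.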

Markus

Stas Malyshev

5:44 a.m.

New subject: An attribute for "famous person"

Hi!

...

If you think it is best to implement a more general feature that adds even more properties, then I am sure nobody will complain, but it sounds like more work to me. The number I was asking for is something that you

I don't think it's *much* more work, and I planned to do this work anyway :) Of course, it may happen that I am wrong about how much work it is, and then I might reconsider.

...

compute the number in a SPARQL query from the RDF. It is a completely redundant piece of information. Its only purpose is to make SPARQL queries that currently time out run fast. In databases, such things are called "materialized views".

Speaking of which, Blazegraph does have support for inferring data, but I don't want to open that particular can of worms just yet.

...

This leads to a slightly different perspective than the one you'd have in T129046. By adding page props, you want to add "new" information from another source, and questions like data modelling etc. come to the fore. With a materialized view, you just add some query results back to the database for technical reasons that are specific to the database. The two motivations might lead to different requirements at some point (e.g., if you want to add another materialized query result to the RDF you may have to extend page props, which involves more dependencies than if you just extend the RDF converter).

While in theory this is true, we don't have any process that allows us to do literally materialized views on current platform (there are named queries but that's not the same I think). Inference "kind of" might be that, but doing it that way probably would be very inefficient for this particular case. There are of course other ways to achieve the same, so I'll look into various options, but so far page props doesn't sound like that bad an idea, to me.

-- Stas Malyshev smalyshev@wikimedia.org

Jane Darnell

6:36 a.m.

New subject: An attribute for "famous person"

Would page props also give me the creation date of the Wikipedia page in that specific sitelink? Because this is something I needed when analyzing the data for the TED speakers challenge. When running an international writing challenge for Wikipedia it would be nice to run a daily or weekly count of all articles created in the challenge. I used the Mix-n-Match "stats" page to compute this against a starting position, but it would be nice to filter out sitelink additions that actually pre-date the challenge start date, and also check for sitelink deletions as well as additions. I think this is only possible if you can compare the sitelink creation date with the Wikipedia article creation date. For some background you can see a spot-check analysis I ran for the Women in Red project here: https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Women_in_Red/Analys...

Stas Malyshev

8:04 a.m.

New subject: An attribute for "famous person"

Hi!

On 8/2/16 11:36 PM, Jane Darnell wrote:

...

Would page props also give me the creation date of the Wikipedia page in that specific sitelink? Because this is something I needed when

Don't think so and I don't think such data should be in Wikidata or WDQS database - it's Wikipedia administrative data and should be there.

External service can combine data from these sources but I don't think it falls under WDQS tasks.

-- Stas Malyshev smalyshev@wikimedia.org

Jane Darnell

8:21 a.m.

New subject: An attribute for "famous person"

Too bad, because it would be great for all sorts of project workflows!

On Wed, Aug 3, 2016 at 10:04 AM, Stas Malyshev smalyshev@wikimedia.org wrote:

...

Hi!

On 8/2/16 11:36 PM, Jane Darnell wrote:

...
Would page props also give me the creation date of the Wikipedia page in that specific sitelink? Because this is something I needed when

Don't think so and I don't think such data should be in Wikidata or WDQS database - it's Wikipedia administrative data and should be there.

External service can combine data from these sources but I don't think it falls under WDQS tasks. -- Stas Malyshev smalyshev@wikimedia.org

Yuri Astrakhan

8:41 a.m.

New subject: An attribute for "famous person"

Jane, now we are really going into the field of Elasticsearch's relevancy calculation. When searching, things like popularity (pageviews), incoming links, number of different language wiki articles, article size, article quality (good/selected), and many other aspects could be used to improve the results. I wish these were available together with the WDQS results, possibly as a number similar to Google's "page rank".

On Wed, Aug 3, 2016 at 11:21 AM, Jane Darnell jane023@gmail.com wrote:

...

Too bad, because it would be great for all sorts of project workflows!

On Wed, Aug 3, 2016 at 10:04 AM, Stas Malyshev smalyshev@wikimedia.org wrote:

...
Hi!

On 8/2/16 11:36 PM, Jane Darnell wrote:

...
Would page props also give me the creation date of the Wikipedia page in that specific sitelink? Because this is something I needed when

Don't think so and I don't think such data should be in Wikidata or WDQS database - it's Wikipedia administrative data and should be there.

External service can combine data from these sources but I don't think it falls under WDQS tasks. -- Stas Malyshev smalyshev@wikimedia.org

Dimitris Kontokostas

2:40 p.m.

New subject: An attribute for "famous person"

Apologies for the brief and combined reply but I'm on mobile. Will try and answer the DBpedia-related comments.

DBpedia provides out-degree metrics (number of outgoing links from an article) and article size (in wikitext chars) directly through related extractors that can be used for ranking. In-degree is easy to calculate from the DBpedia dumps, but iirc we did not include it in the last releases (not sure why). Also, in-degree is not provided in DBpedia-Live, but the other metrics are.

These metrics are nice but fail in some cases: iirc, years are heavily linked, but an article about a year (e.g. 2000) is not that important.
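As a small illustration of the degree metrics and of the "year" failure case, here is a Python sketch that counts in-degree and out-degree from a list of page links. The link pairs and article names are hypothetical and do not follow any particular DBpedia dump format; a real dump would need its own parsing step.

```python
from collections import Counter

# Hypothetical (source article, target article) page links, standing in for
# rows parsed from a page-links dump.
PAGE_LINKS = [
    ("Article_A", "Year_2000"),
    ("Article_B", "Year_2000"),
    ("Article_C", "Year_2000"),
    ("Article_A", "Article_B"),
]


def degrees(links):
    """Return (in_degree, out_degree) counters for a list of directed links."""
    out_degree = Counter(source for source, _target in links)
    in_degree = Counter(target for _source, target in links)
    return in_degree, out_degree


in_degree, out_degree = degrees(PAGE_LINKS)
# The year article collects the most incoming links in this toy data,
# even though it is not the most "important" article.
print(in_degree.most_common(1))
```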

I agree that a page rank metric would be the most appropriate in this case. We have pagerank metrics, iirc, for en, de, nl, and we are preparing a Wikidata-based pagerank that will be presented at the DBpedia meeting in Leipzig next month.
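For readers unfamiliar with the metric, a minimal power-iteration PageRank can be sketched as below. This is a textbook-style illustration on a hypothetical four-node link graph, not the DBpedia implementation; the damping factor of 0.85 and the iteration count are conventional choices assumed here.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a list of directed (src, dst) links."""
    nodes = sorted({node for edge in links for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in links:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "Hub", and "Hub" links to "A".
LINKS = [("A", "Hub"), ("B", "Hub"), ("C", "Hub"), ("Hub", "A")]
ranks = pagerank(LINKS)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```

Unlike raw in-degree, the score of a page here depends on the scores of the pages linking to it, so "A" ends up well above "B" and "C" although it has only one incoming link.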

Best, Dimitris

Typed by thumb. Please forgive brevity, errors.

On Aug 3, 2016 11:41, "Yuri Astrakhan" yastrakhan@wikimedia.org wrote:

...

Jane, now we are really going into the field of elastic search's relevancy calculation. When searching, things like popularity (pageviews), incoming links, number of different language wiki articles, article size, article quality (good/selected), and many other aspects could be used to better the results. I wish these were available together with the WDQS results, possibly as a number similar to Google's "page rank".

On Wed, Aug 3, 2016 at 11:21 AM, Jane Darnell jane023@gmail.com wrote:

...
Too bad, because it would be great for all sorts of project workflows!

On Wed, Aug 3, 2016 at 10:04 AM, Stas Malyshev smalyshev@wikimedia.org wrote:

...
Hi!

On 8/2/16 11:36 PM, Jane Darnell wrote:

...
Would page props also give me the creation date of the Wikipedia page in that specific sitelink? Because this is something I needed when

Don't think so and I don't think such data should be in Wikidata or WDQS database - it's Wikipedia administrative data and should be there.

External service can combine data from these sources but I don't think it falls under WDQS tasks. -- Stas Malyshev smalyshev@wikimedia.org

Stas Malyshev

11:54 p.m.

New subject: An attribute for "famous person"

Hi!

...

Jane, now we are really going into the field of elastic search's relevancy calculation. When searching, things like popularity (pageviews), incoming links, number of different language wiki articles, article size, article quality (good/selected), and many other aspects could be used to better the results. I wish these were available together with the WDQS results, possibly as a number similar to Google's "page rank".

See https://phabricator.wikimedia.org/T141813. It is on the TODO list, but I'm not sure if I get to it this Q (maybe, maybe not). If we make it talk to Elastic, I imagine anything available to Elastic would be possible to use, but no promises since I didn't research it yet.

-- Stas Malyshev smalyshev@wikimedia.org

Ghislain ATEMEZING

2 Aug

9:47 p.m.

New subject: An attribute for "famous person"

@Finn: I've tried to modify your second query to "Famous French females" and I get a timeout... Do you have the same issue?

El mar., 2 ago. 2016 a las 18:33, Finn Aarup Nielsen (fn@imm.dtu.dk) escribió:

...

On 08/02/2016 03:01 PM, Markus Kroetzsch wrote:

...
On 02.08.2016 13:11, Ghislain ATEMEZING wrote:

...
Thanks Yuri. I will try to define a kind a metric for those having a number of wikipedia entries. For example, a person with 127 entries would be "famous" while another with just 10 is not "famous"...

Side remark @Stas: it could be very helpful to have the number of Wikimedia project articles stored as a numeric value for a new property in RDF. Doing a SPARQL query that computes this number and does something with it afterwards almost always times out. The number could be very useful as a heuristic "popularity" measure that can also help to give the most "important" items first in a number of queries.

+1. And I think this will also help in many other use cases, such as visualizations.

Best, Ghislain

...

--

------- "Love all, trust a few, do wrong to none" (W. Shakespeare) Web: http://atemezing.org

wikidata@lists.wikimedia.org

21 comments

participants (9)

Andrew Gray
Daniel Kinzler
Dimitris Kontokostas
Finn Aarup Nielsen
Ghislain ATEMEZING
Jane Darnell
Markus Kroetzsch
Stas Malyshev
Yuri Astrakhan