Dear all,
I'm looking for Wikidata bots that perform accuracy audits. For example, comparing persons' dates of birth with the dates indicated in databases linked to the item by an external-id.
I do not even know if such bots exist. Bots are often poorly documented, so I appeal to the community for some examples.
Many thanks.
Ettore Rizza
Ettore RIZZA, 26/09/2018 15:31:
I'm looking for Wikidata bots that perform accuracy audits. For example, comparing persons' dates of birth with the dates indicated in databases linked to the item by an external-id.
This is mostly a screen-scraping job, because most external databases are only accessible in unstructured or poorly structured HTML form.
Federico
"Poorly structured" HTML is not all that bad in 2018, thanks to HTML5, which builds the "rendering decisions made about broken HTML since Netscape 3" into the standard, so that in common languages you can get the same DOM tree as the browser.
If you try to use an official or unofficial API to fetch data from some service in 2018, you will have to add some dependencies, and you just might open a can of whoop-ass that will make you reinstall Anaconda; or maybe you will learn something you'll never be able to unlearn about how XML processing changed between two minor versions of the JDK.
On the other hand, I have often dusted off the old HTML-based parser I made for Flickr and found I could get it to work for other media collections, blogs, etc. just by changing the "semantic model" embodied in the application, which could be as simple as a function or object that knows something about the structure of some documents' URLs.
I cannot understand why so many standards for integrating RDF and HTML have been pushed and gone nowhere, while nobody has promoted the clean solution of "add a CSS media type for RDF" that marks up the semantics of HTML the way JSON-LD does.
Often, though, if you look at it that way, matching patterns against CSS selectors gets you most of the way there these days.
I've had cases where I haven't had to change the rule sets much at all; none of them have been more than 50 lines of code, and most are much less.
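The rule-set idea can be sketched in a few lines. This is a toy, stdlib-only stand-in (a real scraper would use an HTML5 parser and a CSS selector library, and the site name and class names here are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Per-site "semantic model": field name -> (tag, class attribute value).
# These rule sets are invented for illustration; each real site needs its own.
RULES = {
    "examplelib": {
        "name": ("span", "person-name"),
        "birth": ("time", "dob"),
    },
}

def extract(fragment: str, site: str) -> dict:
    """Apply a site's rule set to a well-formed HTML fragment."""
    root = ET.fromstring(fragment)
    found = {}
    for field, (tag, cls) in RULES[site].items():
        for el in root.iter(tag):
            if cls in (el.get("class") or "").split():
                found[field] = (el.text or "").strip()
                break
    return found

page = """<div>
  <span class="person-name">Johannes Vermeer</span>
  <time class="dob">1632-10-31</time>
</div>"""
print(extract(page, "examplelib"))
```

Swapping in another site then means writing a new rule set, not a new scraper, which is what keeps each one under 50 lines.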
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi,
Wikidata is obviously linked to a bunch of unusable external IDs, but also to some very structured data. For the moment I'm interested in the state of the art, even if it is based on crude scraping; why not?
I see for example this request for permission https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot/Symac_bot_4 for a bot able to retrieve information from the BnF (French national library) database. It was refused because of copyright issues, but simply checking the information without extracting anything is allowed, isn't it?
Hi Ettore,
On 26-09-18 14:31, Ettore RIZZA wrote:
Dear all,
I'm looking for Wikidata bots that perform accuracy audits. For example, comparing persons' dates of birth with the dates indicated in databases linked to the item by an external-id.
Let's have a look at the evolution of automated editing. The first step is to add missing data from anywhere; bots importing dates of birth are an example of this. The next step is to add data from somewhere with a source, or to add sources to existing unsourced or badly sourced statements. As far as I can see, that's where we are right now; see for example edits like https://www.wikidata.org/w/index.php?title=Q41264&type=revision&diff... .

Of course, the next step would be to compare existing sourced statements with external data to find differences. But what would the workflow be? Take for example Johannes Vermeer ( https://www.wikidata.org/wiki/Q41264 ). Extremely well documented and researched, but http://www.getty.edu/vow/ULANFullDisplay?find=&role=&nation=&sub... and https://rkd.nl/nl/explore/artists/80476 combined provide 3 different dates of birth and 3 different dates of death. When it comes to these kinds of date mismatches, it's generally first come, first served (the first date added doesn't get replaced). This mismatch could show up in some report. I can check it as a human and maybe make some adjustments, but how would I sign it off to prevent other people from doing the same thing over and over again?
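Any such comparison also has to respect Wikidata's date precision (precision 9 = year, 10 = month, 11 = day), or a year-precision claim would be flagged against every full date. A minimal sketch, where `dates_agree` is a hypothetical helper rather than any existing bot's code:

```python
# Compare a Wikidata date (value + precision) with an external date string.
# Wikidata precision codes: 9 = year, 10 = month, 11 = day.
def dates_agree(wd_value: str, wd_precision: int, ext_date: str) -> bool:
    """True if an external ISO date (YYYY-MM-DD) is consistent with the
    Wikidata claim at the claim's stated precision."""
    # Wikidata time values look like "+1632-10-31T00:00:00Z".
    wd = wd_value.lstrip("+").split("T")[0]      # -> "1632-10-31"
    keep = {9: 4, 10: 7, 11: 10}[wd_precision]   # characters to compare
    return wd[:keep] == ext_date[:keep]

# A year-precision claim agrees with any date in that year:
print(dates_agree("+1632-00-00T00:00:00Z", 9, "1632-10-31"))
# A day-precision claim must match exactly:
print(dates_agree("+1632-10-31T00:00:00Z", 11, "1632-12-15"))
```

Only the claims that fail such a check would need to land in a report for a human to sign off.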
With federated SPARQL queries it becomes much easier to generate reports of mismatches. See for example https://www.wikidata.org/wiki/Property_talk:P1006/Mismatches .
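Such a report boils down to a federated query that joins on the external identifier and filters on disagreement. A sketch that only builds the query text, assuming a placeholder endpoint and predicate IRIs (every external service exposes its own vocabulary, so these would have to be replaced):

```python
# Build a federated SPARQL query reporting items whose Wikidata date of
# birth (P569) disagrees with an external database, joined on an
# external-id property. Endpoint and predicate IRIs are placeholders.
def mismatch_query(prop_id: str, endpoint: str,
                   id_pred: str, date_pred: str) -> str:
    return f"""SELECT ?item ?wdDate ?extDate WHERE {{
  ?item wdt:{prop_id} ?extId ;
        wdt:P569 ?wdDate .
  SERVICE <{endpoint}> {{
    ?ext <{id_pred}> ?extId ;
         <{date_pred}> ?extDate .
  }}
  FILTER (?wdDate != ?extDate)
}}"""

q = mismatch_query(
    "P1006",                               # NTA identifier
    "https://example.org/sparql",          # placeholder endpoint
    "https://example.org/def/identifier",  # placeholder predicates
    "https://example.org/def/birthDate",
)
print(q)
```

The real P1006 mismatch page linked above follows the same join-then-filter shape against the NTA's actual endpoint.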
Maarten
Hi Maarten,
Thank you very much for your answer and your pointers. The page containing a federated SPARQL query (which I did not know existed) is definitely close to what I mean. It is just missing one more step: deciding who is right. If we look at the first result in the table of mismatches https://www.wikidata.org/wiki/Property_talk:P1006/Mismatches (Dmitry Bortniansky https://www.wikidata.org/wiki/Q316505) and draw a little graph, the result is:
[image: Diagram.png]
We can see that the error probably comes from VIAF, which contains a duplicate, and from NTA, which apparently created an authority record based on this bad VIAF ID.
My research is very close to this kind of case, and I am very interested to know what is already implemented in Wikidata.
Cheers,
Ettore Rizza