Hey, There have been several discussions regarding the quality of information in Wikidata. I wanted to work on the quality of Wikidata, but we don't have any good source of information to see where we are ahead and where we are behind. So I thought the best thing I could do is make something that shows people, in detail, how well sourced our data actually is. So here we have *http://tools.wmflabs.org/wd-analyst/index.php*
You can give only a property (let's say P31) and it gives you the four most used values plus an analysis of sources and overall quality (check this out: http://tools.wmflabs.org/wd-analyst/index.php?p=P31); there you can see that about 33% of those statements are sourced, and 29.1% of the sources are based on Wikipedia. You can also give a property and as many values as you want. Let's say you want to compare P27:Q183 (country of citizenship: Germany) and P27:Q30 (US). Check this out: http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183. You can see that US biographies are more abundant (300K versus 200K) but German biographies are more descriptive (3.8 descriptions per item versus 3.2).
One important note: compare P31:Q5 (a trivial statement), where 46% of the statements are not sourced at all and 49% are based on Wikipedia, *but* get these statistics for the population property (P1082: http://tools.wmflabs.org/wd-analyst/index.php?p=P1082). That is not a trivial statement and we need to be careful about it. It turns out there is slightly more than one reference per statement, and only 4% of them are based on Wikipedia. So we can relax and enjoy this highly sourced data.
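If you want to pull several of these reports from a script, the URL parameters compose in the obvious way. Here is a small sketch (the tool serves HTML report pages, so this only builds and fetches the documented p/q URLs):

```python
# Sketch: build wd-analyst report URLs from the documented p/q parameters.
# The tool serves HTML pages; no JSON API is assumed here.
import urllib.request
from urllib.parse import urlencode

BASE = "http://tools.wmflabs.org/wd-analyst/index.php"

def report_url(prop, values=None):
    """Report URL for one property and an optional list of value Q-ids."""
    params = {"p": prop}
    if values:
        params["q"] = "|".join(values)  # multiple values are "|"-separated
    return BASE + "?" + urlencode(params)

# Compare country of citizenship: US vs. Germany, as in the example above.
url = report_url("P27", ["Q30", "Q183"])
print(url)  # ...index.php?p=P27&q=Q30%7CQ183
html = urllib.request.urlopen(url).read().decode("utf-8")
```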
Requests:
- Please tell me whether you want this tool at all
- Please suggest more ways to analyze and catch unsourced material
Future plan (if you agree to keep using this tool):
- Support more datatypes (e.g. date of birth based on year, coordinates)
- Sitelink-based and reference-based analysis (to check how many articles of, let's say, Chinese Wikipedia are unsourced)
- Free-style analysis: there is a database behind this tool that can be used for way more applications. You could get the most unsourced statements of P31 and then go fix them. I'm trying to build a playground for this kind of task; a sketch of what such a query might look like follows this list.
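For instance, a "free-style" query could look roughly like this. This is a sketch only: the table and column names are made up for illustration and are not the tool's real schema:

```python
# Sketch of a "free-style" query against the tool's database.
# The schema (a "statements" table with a source_count column) is
# hypothetical, invented for illustration.
import sqlite3

conn = sqlite3.connect("wd_analyst.db")
rows = conn.execute(
    """
    SELECT value_qid, COUNT(*) AS unsourced
    FROM statements
    WHERE property = 'P31' AND source_count = 0
    GROUP BY value_qid
    ORDER BY unsourced DESC
    LIMIT 10
    """
).fetchall()
for qid, n in rows:
    print(qid, n)  # the most frequent completely unsourced P31 values
```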
I hope you like this and rock on! http://tools.wmflabs.org/wd-analyst/index.php?p=P136&q=Q11399 Best
Very useful, Amir, thanks! I just ran it for occupation=painter (p=P106&q=Q1028181). Am I correct in my interpretation that, in general, painters have fewer claims than the entire population of items with the occupation property?
Hey Jane, Yes. Exactly :)
Best
Hi Amir,
Very nice, thanks! I like the general approach of having a stand-alone tool for analysing the data and maybe pointing you to issues, like a dashboard for Wikidata editors.
What backend technology are you using to produce these results? Is this live data or dumped data? One could also get those numbers from the SPARQL endpoint, but performance might be problematic (since you compute averages over all items; a custom approach would of course be much faster but then you have the data update problem).
An obvious feature request would be to display entity ids as links to the appropriate page, and maybe with their labels (in a language of your choice).
But overall very nice.
Regards,
Markus
Hey Markus,
On Wed, Dec 9, 2015 at 12:12 AM Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Hi Amir,
Very nice, thanks! I like the general approach of having a stand-alone tool for analysing the data, and maybe pointing you to issues. Like a dashboard for Wikidata editors.
What backend technology are you using to produce these results? Is this live data or dumped data? One could also get those numbers from the SPARQL endpoint, but performance might be problematic (since you compute averages over all items; a custom approach would of course be much faster but then you have the data update problem).
I build a database from the weekly JSON dumps. We have some delay in the data, but computationally it's fast. Using the Wikidata database directly would make performance so poor that it would become an easy attack point.
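For a rough idea of what the weekly build does, it is essentially a single pass over the dump, something like this (a simplified sketch, not the actual code from the repo):

```python
# Simplified sketch of a weekly pass over the Wikidata JSON dump.
# The dump is one huge JSON array with one entity per line, so it can be
# streamed line by line instead of being loaded into memory.
import json
from collections import Counter

counts = Counter()
with open("wikidata-dump.json", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")
        if not line or line in ("[", "]"):
            continue  # skip blanks and the array brackets around the dump
        entity = json.loads(line)
        for pid, statements in entity.get("claims", {}).items():
            for st in statements:
                counts[pid, "statements"] += 1
                refs = st.get("references", [])
                if refs:
                    counts[pid, "referenced"] += 1
                # P143 ("imported from") marks Wikipedia-based references
                if any("P143" in r.get("snaks", {}) for r in refs):
                    counts[pid, "wiki-sourced"] += 1

print(counts["P31", "statements"], counts["P31", "wiki-sourced"])
```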
An obvious feature request would be to display entity ids as links to the appropriate page, and maybe with their labels (in a language of your choice).
Done. :)
But overall very nice.
Regards,
Markus
I also published the source code (it's based on Python and PHP). PRs are welcome: https://github.com/Ladsgroup/wd-analyst
Hoi, What would be nice is an option to see progress from one dump to the next, like you can with the statistics by Magnus. Magnus also has data on sources, but that is more global. Thanks, GerardM
Nice tool!
To understand the statistics better: if a claim has two sources, one Wikipedia and one other, how does that show up in the statistics?
The reason I'm wondering is that I would normally care whether a claim is sourced or not (but not how many sources it has), and whether it is sourced only by Wikipedias or by anything else.
E.g. 1) an item with 10 claims, each of them sourced, is "better" than one with 10 claims where a single claim has 10 sources; 2) a statement with a wiki source plus another source is "better" than one with just a wiki source, and just as "good" as one without the wiki source.
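Concretely, what I would care about is a per-claim classification rather than a raw reference count, something like this (a sketch against the dump's claim format; the category names are mine, not the tool's):

```python
# Sketch: classify each claim by the kind of sourcing it has, instead of
# counting references. P143 ("imported from") marks wiki-based references.
def classify(claim):
    refs = claim.get("references", [])
    if not refs:
        return "unsourced"
    if all("P143" in r.get("snaks", {}) for r in refs):
        return "wiki-only"        # only as "good" as the wikis it cites
    return "externally-sourced"   # at least one non-wiki reference
```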
Also, is the wiki ref/source count Wikipedia only, or any Wikimedia project? Whilst (last I checked) the other projects accounted for only 70,000 refs, compared to the 21 million from Wikipedia, they might be significant for certain domains and are just as "bad".
Cheers, André
Hey, I made some significant changes based on the feedback:
* Per suggestion of Nemo_bis, I added reference-based analysis. Here's an example: http://tools.wmflabs.org/wd-analyst/ref.php?p=P143&q=Q328|Q11920&pp=P31
* I added a limit parameter so you can get more results if you want (for both reference-based and property-based analysis), for example: http://tools.wmflabs.org/wd-analyst/index.php?p=P31&q=&limit=50 (the maximum acceptable value is 50)
* Per suggestion of André, I added a column to the database and the results which gives you the percentage of unsourced statements (it obviously doesn't apply to reference-based analysis). For example, https://tools.wmflabs.org/wd-analyst/index.php?p=P1082&q= shows that only 2% of population statements are unsourced
As for Gerard's suggestion: it's definitely a good idea, but the problem is that it's technically hard, because every week the database would grow twice as big. We could store only a limited number of snapshots (e.g. the last three weeks) or apply this to a limited number of property-value pairs. I'm trying to find out which one is better; a sketch of the first option is below.
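The limited-history option could be as simple as pruning old snapshots after each weekly load, roughly like this (a sketch; the table layout, with a dump_date column on every stats row, is hypothetical, not the tool's real schema):

```python
# Sketch: cap history at the last N weekly snapshots so the database
# stops doubling. The "property_stats" table and its "dump_date" column
# are hypothetical.
import sqlite3

KEEP_WEEKS = 3

conn = sqlite3.connect("wd_analyst.db")
conn.execute(
    """
    DELETE FROM property_stats
    WHERE dump_date NOT IN (
        SELECT DISTINCT dump_date FROM property_stats
        ORDER BY dump_date DESC LIMIT ?
    )
    """,
    (KEEP_WEEKS,),
)
conn.commit()  # only the newest KEEP_WEEKS snapshots remain
```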
Best
It's a step in the right direction, but it took a very long time to load on my computer.
After the initial load it was pretty peppy. Then I ran the default example that is grayed in but not active (I had to retype it). Then I got a page that says "results are ready" and how cool they are; it took me a while to figure out what I was looking at and to finally realize it is a comparison of data quality metrics (which I think are all fact counts) between all of the P31 predicates and Q5. The use of the graphic in the first row complicated this for me.
There are a lot of broken links on this page too, such as:
http://tools.wmflabs.org/wd-analyst/sitelink.php https://www.wikidata.org/wiki/P31
and of course there is no merged-in documentation about what P31 and Q5 are. Opaque identifiers are necessary for your project, but some way to look up the P's and Q's hooked up to this would be most welcome.
It's a great start and is completely in the right direction but it could take many sprints of improvement.
Hey, Thanks for your feedback. That's exactly what I'm looking for.
On Mon, Dec 14, 2015 at 5:29 PM Paul Houle ontology2@gmail.com wrote:
It's a step in the right direction, but it took a very long time to load on my computer.
It's maybe related to the recent Labs issues. Now I get a reasonable time: http://tools.pingdom.com/fpt/#!/eq1i3s/http://tools.wmflabs.org/wd-analyst/i...
After the initial load it was pretty peppy. Then I ran the default example that is grayed in but not active (I had to retype it).
I made some modifications that might help.
Then I got a page that says "results are ready" and how cool they are; it took me a while to figure out what I was looking at and to finally realize it is a comparison of data quality metrics (which I think are all fact counts) between all of the P31 predicates and Q5.
I made some changes so you can see things more easily. I'd appreciate suggestions for the wording I should put in the description.
The use of the graphic in the first row complicated this for me.
Please suggest something I can write there for people :)
There are a lot of broken links on this page too, such as:
http://tools.wmflabs.org/wd-analyst/sitelink.php https://www.wikidata.org/wiki/P31
The broken property link should be fixed by now, and the sitelink page is broken because it's not there yet. I'll make it very soon.
and of course there is no merged-in documentation about what P31 and Q5 are. Opaque identifiers are necessary for your project, but some way to look up the P's and Q's hooked up to this would be most welcome.
Done. Now we have labels for everything.
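For anyone who wants to do the same elsewhere: labels for P and Q ids can be fetched in bulk from the standard API with wbgetentities, roughly like this (a sketch, not necessarily how wd-analyst itself does it):

```python
# Sketch: resolve P/Q ids to labels in bulk via the wikidata.org API
# (action=wbgetentities).
import json
import urllib.parse
import urllib.request

def labels(ids, lang="en"):
    query = urllib.parse.urlencode({
        "action": "wbgetentities",
        "ids": "|".join(ids),   # the API accepts up to 50 ids per request
        "props": "labels",
        "languages": lang,
        "format": "json",
    })
    url = "https://www.wikidata.org/w/api.php?" + query
    data = json.load(urllib.request.urlopen(url))
    return {
        eid: ent.get("labels", {}).get(lang, {}).get("value", eid)
        for eid, ent in data["entities"].items()
    }

print(labels(["P31", "Q5"]))  # {'P31': 'instance of', 'Q5': 'human'}
```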
It's a great start and is completely in the right direction but it could take many sprints of improvement.
Amir, Thanks for your work! I like this one, showing how our Sum-of-all-Paintings project is doing compared to sculptures (which have many copyright issues, but you could still put the data on Wikidata): http://tools.wmflabs.org/wd-analyst/index.php?p=p31&q=Q3305213%7CQ860861
Jane
Content created by this tool is licensed under CC-BY v4.0. I made it explicit now :)
Best
Hoi, What is achieved in this way, and on what basis can you license the output of a tool? Thanks, GerardM
On 16 December 2015 at 12:58, Amir Ladsgroup ladsgroup@gmail.com wrote:
Content created by this tools is licensed under CC-BY v4.0. I made it explicit now :)
Best
On Wed, Dec 16, 2015 at 3:11 PM Jane Darnell jane023@gmail.com wrote:
Amir, Thanks for your work! I like this one showing how our Sum-of-all-Paintings project is doing compared to sculptures (which have many copyright issues, but you could still put the data on Wikidata) http://tools.wmflabs.org/wd-analyst/index.php?p=p31&q=Q3305213%7CQ860861
Jane
On Wed, Dec 16, 2015 at 12:23 PM, Amir Ladsgroup ladsgroup@gmail.com wrote:
Hey, Thanks for your feedback. That's exactly what I'm looking for.
On Mon, Dec 14, 2015 at 5:29 PM Paul Houle ontology2@gmail.com wrote:
It's a step in the right direction, but it took a very long time to load on my computer.
That is probably related to Labs' recent issues. Now I get a reasonable load time: http://tools.pingdom.com/fpt/#!/eq1i3s/http://tools.wmflabs.org/wd-analyst/i...
After the initial load it was pretty peppy. Then I ran the default example that appears grayed out in the form but is not active (I had to retype it).
I made some modifications that might help.
Then I get the page that says "results are ready" and how cool they are; it takes me a while to figure out what I am looking at, and I finally realize it is a comparison of data-quality metrics (which I think are all fact counts) between all of the P31 predicates and Q5.
I made some changes so things are easier to see. I would appreciate it if you could suggest some wording for the description.
The use of the graphic on the first row complicated this for me.
Please suggest something I can write there for people :)
There are a lot of broken links on this page too, such as
http://tools.wmflabs.org/wd-analyst/sitelink.php https://www.wikidata.org/wiki/P31
The broken property link should be fixed by now; the sitelink page is broken because it is not there yet. I'll add it very soon.
and of course there is no merged-in documentation about what P31 and Q5 are. Opaque identifiers are necessary for your project, but some way to find the P's and Q's hooked up to this would be most welcome.
Done. Now we have labels for everything (one way to do such a lookup is sketched after this exchange).
It's a great start and is completely in the right direction but it could take many sprints of improvement.
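A label lookup like the one just described can be done with the wbgetentities module of the Wikidata API; a minimal sketch, and not necessarily how wd-analyst implements it:

# Sketch: resolve P-/Q-ids to English labels via the Wikidata API.
# Illustrative only; the tool's own implementation may differ.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://www.wikidata.org/w/api.php"

def labels(ids, lang="en"):
    qs = urlencode({"action": "wbgetentities", "ids": "|".join(ids),
                    "props": "labels", "languages": lang, "format": "json"})
    with urlopen(API + "?" + qs) as resp:
        entities = json.load(resp).get("entities", {})
    return {eid: ent.get("labels", {}).get(lang, {}).get("value")
            for eid, ent in entities.items()}

print(labels(["P31", "Q5"]))  # {'P31': 'instance of', 'Q5': 'human'}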
On Wed, Dec 9, 2015 at 4:36 AM, Gerard Meijssen < gerard.meijssen@gmail.com> wrote:
Hoi, What would be nice is an option to see progress from one dump to the next, like you can with the statistics by Magnus. Magnus also has data on sources, but it is more global. Thanks, GerardM
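Dump-over-dump progress of the kind Gerard describes could be approximated by snapshotting the tool's per-property counts at each dump and diffing. A rough sketch, assuming a hypothetical CSV of snapshots with columns date, property, statements, referenced:

# Sketch: diff per-dump snapshots of the tool's counts.
# The CSV layout (date,property,statements,referenced) is hypothetical.
import csv

def progress(path="snapshots.csv"):
    rows = sorted(csv.DictReader(open(path)),
                  key=lambda r: (r["property"], r["date"]))
    prev = {}
    for row in rows:
        p = row["property"]
        if p in prev:
            delta = int(row["referenced"]) - int(prev[p]["referenced"])
            print(f"{p}: {prev[p]['date']} -> {row['date']}: "
                  f"{delta:+d} referenced statements")
        prev[p] = row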
On 8 December 2015 at 21:41, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Amir,
Very nice, thanks! I like the general approach of having a stand-alone tool for analysing the data, and maybe pointing you to issues, like a dashboard for Wikidata editors.
What backend technology are you using to produce these results? Is this live data or dumped data? One could also get those numbers from the SPARQL endpoint, but performance might be problematic (since you compute averages over all items; a custom approach would of course be much faster but then you have the data update problem).
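For comparison, pulling one of these numbers from the SPARQL endpoint would look roughly like the sketch below; as noted, a count over all P31 statements may well hit the query timeout, so this is illustrative only:

# Sketch: referenced-statement ratio for P31 via the Wikidata Query Service.
# p: and prov:wasDerivedFrom are standard WDQS prefixes; may time out at P31 scale.
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://query.wikidata.org/sparql"

def count(query):
    req = Request(ENDPOINT + "?" + urlencode({"query": query, "format": "json"}),
                  headers={"User-Agent": "wd-quality-sketch/0.1"})
    with urlopen(req) as resp:
        binding = json.load(resp)["results"]["bindings"][0]
    return int(next(iter(binding.values()))["value"])

total = count("SELECT (COUNT(*) AS ?n) WHERE { ?item p:P31 ?st . }")
sourced = count("SELECT (COUNT(DISTINCT ?st) AS ?n) WHERE {"
                " ?item p:P31 ?st . ?st prov:wasDerivedFrom ?ref . }")
print(f"{sourced}/{total} = {sourced / total:.1%} referenced")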
An obvious feature request would be to display entity ids as links to the appropriate page, and maybe with their labels (in a language of your choice).
But overall very nice.
Regards,
Markus
Obviously the data itself can't be licensed, but graphs and other parts can be copyrighted. I'm just trying to make reuse easier.
Best
Useful and very pretty! I can't wait for the analysis by import source. I'll try to dig into the data to find interesting evidence and examples of data to use more.
Nemo