Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

12 Dec 2015

Hey,
I made some significant changes based on the feedback:

* At Nemo_bis's suggestion, I added reference-based analysis. Here's an
example:
<http://tools.wmflabs.org/wd-analyst/ref.php?p=P143&q=Q328|Q11920&pp=P31>
* I added a limit parameter so you can get more results if you want (for both
reference-based and property-based analysis), for example:
http://tools.wmflabs.org/wd-analyst/index.php?p=P31&q=&limit=50 (the maximum
accepted value is 50)
* At André's suggestion, I added a column to the database and to the results
that gives the percentage of unsourced statements. Obviously it
doesn't apply to reference-based analysis. For example,
https://tools.wmflabs.org/wd-analyst/index.php?p=P1082&q= shows that only 2% of
population statements are unsourced
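
A minimal sketch of how the example URLs above are put together, using only the `p`, `q`, `pp` and `limit` parameters that appear in them. The helper name `build_url` is hypothetical, the `pp` value is passed through verbatim, and rejecting a limit above 50 on the client side is just one assumed way to mirror the maximum mentioned above:

```python
from urllib.parse import urlencode

BASE = "http://tools.wmflabs.org/wd-analyst"
MAX_LIMIT = 50  # maximum accepted value, as stated in the message


def build_url(p, q=(), page="index.php", pp=None, limit=None):
    """Build a wd-analyst URL (hypothetical helper, not part of the tool).

    p     -- property id, e.g. "P31"
    q     -- iterable of value ids, joined with "|" as in the examples
    page  -- "index.php" (property-based) or "ref.php" (reference-based)
    pp    -- extra parameter seen in the ref.php example, passed as-is
    limit -- optional number of results to request
    """
    if limit is not None and limit > MAX_LIMIT:
        raise ValueError(f"limit must not exceed {MAX_LIMIT}")
    params = [("p", p), ("q", "|".join(q))]
    if pp is not None:
        params.append(("pp", pp))
    if limit is not None:
        params.append(("limit", str(limit)))
    # Keep "|" unescaped so the result matches the URLs in the message.
    return f"{BASE}/{page}?" + urlencode(params, safe="|")
```

Calling `build_url("P31", limit=50)` reproduces the limit example, and `build_url("P143", ["Q328", "Q11920"], page="ref.php", pp="P31")` reproduces the reference-based one.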

As for Gerard's suggestion: it's definitely a good idea, but the problem is
that it's technically hard, because every week it makes the database twice as
big. We can store only a limited number of weeks (e.g. the last three) or
apply this to a limited number of value-pair properties. I'm trying to find
out which one is better.
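
To illustrate the first option (keeping only the last few weeks), here is a rough sketch of a rolling window of per-dump statistics. The class name `SnapshotHistory` and the statistic keys are invented for illustration; the tool's real database layout is not described in this thread:

```python
from collections import deque


class SnapshotHistory:
    """Keep statistics for only the most recent `keep` dumps (hypothetical)."""

    def __init__(self, keep=3):
        # A bounded deque silently drops the oldest snapshot when full.
        self._snapshots = deque(maxlen=keep)

    def add(self, label, stats):
        """Store a copy of one dump's statistics under a label."""
        self._snapshots.append((label, dict(stats)))

    def labels(self):
        """Labels of the snapshots still retained, oldest first."""
        return [label for label, _ in self._snapshots]

    def progress(self):
        """Difference between the two latest snapshots, per shared key."""
        if len(self._snapshots) < 2:
            return {}
        (_, old), (_, new) = self._snapshots[-2], self._snapshots[-1]
        return {key: new[key] - old[key] for key in new if key in old}
```

With `keep=3`, adding a fourth weekly snapshot discards the first one, so storage stays bounded while the dump-to-dump difference is still available.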

Best

On Thu, Dec 10, 2015 at 12:13 AM André Costa <andre.costa(a)wikimedia.se>
wrote:

...
  Nice tool!

 To understand the statistics better:
 if a claim has two sources, one Wikipedia and one other, how does that
 show up in the statistics?

 The reason I'm wondering is because I would normally care if a claim is
 sourced or not (but not by how many sources) and whether it is sourced by
 only Wikipedias or anything else.

 E.g.
 1) a statement with 10 claims each sourced is "better" than one with 10
 claims where one claim has 10 sources.
 2) a statement with a wiki source + another source is "better" than one
 with just a wiki source and just as "good" as one without the wiki source.
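
The distinction drawn above can be sketched as a three-way classification that ignores how many references a claim has. This is only an illustration: the names `SourceClass`, `classify` and `share`, and the idea of tagging each reference with the string "wiki", are assumptions and not how the tool is known to work:

```python
from enum import Enum


class SourceClass(Enum):
    UNSOURCED = "unsourced"
    WIKI_ONLY = "wiki only"
    NON_WIKI = "has a non-wiki source"


def classify(reference_kinds):
    """Classify one claim from a list with one entry per reference.

    An entry equal to "wiki" marks a Wikimedia-project reference;
    any other value counts as a non-wiki source (assumed encoding).
    """
    if not reference_kinds:
        return SourceClass.UNSOURCED
    if all(kind == "wiki" for kind in reference_kinds):
        return SourceClass.WIKI_ONLY
    return SourceClass.NON_WIKI


def share(claims, target):
    """Percentage of claims that fall into the `target` class."""
    if not claims:
        return 0.0
    hits = sum(classify(refs) is target for refs in claims)
    return 100.0 * hits / len(claims)
```

Under this reading, a claim with a wiki source plus another source lands in the same class as a claim with only the other source, and ten wiki references count no more than one.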

 Also is wiki ref/source Wikipedia only or any Wikimedia project? Whilst
 (last I checked) the others were only 70,000 refs compared to the 21
 million from Wikipedia they might be significant for certain domains and
 are just as "bad".

 Cheers,
 André
 On 9 Dec 2015 10:37, "Gerard Meijssen" <gerard.meijssen(a)gmail.com>
wrote:

  Hoi,
 What would be nice is to have an option to understand progress from one
 dump to the next like you can with the Statistics by Magnus. Magnus also
 has data on sources but this is more global.
 Thanks,
      GerardM

 On 8 December 2015 at 21:41, Markus Krötzsch
 <markus(a)semantic-mediawiki.org> wrote:

  Hi Amir,

 Very nice, thanks! I like the general approach of having a stand-alone
 tool for analysing the data, and maybe pointing you to issues. Like a
 dashboard for Wikidata editors.

 What backend technology are you using to produce these results? Is this
 live data or dumped data? One could also get those numbers from the SPARQL
 endpoint, but performance might be problematic (since you compute averages
 over all items; a custom approach would of course be much faster but then
 you have the data update problem).

 An obvious feature request would be to display entity ids as links to
 the appropriate page, and maybe with their labels (in a language of your
 choice).

 But overall very nice.

 Regards,

 Markus

 On 08.12.2015 18:48, Amir Ladsgroup wrote:

  Hey,
 There have been several discussions regarding the quality of information in
 Wikidata. I wanted to work on the quality of Wikidata, but we don't have any
 good source of information to see where we are ahead and where we are
 behind. So I thought the best thing I can do is to make something to
 show people exactly how well sourced our data is, in detail. So here we
 have *http://tools.wmflabs.org/wd-analyst/index.php*

 You can give only a property (let's say P31) and it gives you the four
 most used values plus an analysis of sources and quality overall (check this
 out <http://tools.wmflabs.org/wd-analyst/index.php?p=P31>),
 and then you can see that about 33% of them are sourced, with 29.1% of
 them based on Wikipedia.
 You can also give a property and multiple values. Let's say you want
 to compare P27:Q183 (country of citizenship: Germany) and P27:Q30 (US).
 Check this out:
 <http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183>.
 You can see US biographies are more abundant (300K versus 200K) but German
 biographies are more descriptive (3.8 descriptions per item versus 3.2
 descriptions per item).

 One important note: for P31:Q5 (a trivial statement), 46% of them are
 not sourced at all and 49% of them are based on Wikipedia, *but* compare
 that with the statistics for the population property (P1082
 <http://tools.wmflabs.org/wd-analyst/index.php?p=P1082>). It's not a
 trivial statement and we need to be careful about them. It turns out
 there is slightly more than one reference per statement and only 4% of
 them are based on Wikipedia. So we can relax and enjoy these
 highly sourced data.

 Requests:

   * Please tell me whether you want this tool at all
   * Please suggest more ways to analyze and catch unsourced material

 Future plans (if you agree to keep using this tool):

   * Support more datatypes (e.g. date of birth based on year,
 coordinates)
   * Sitelink-based and reference-based analysis (to check how many of
     the articles of, let's say, Chinese Wikipedia are unsourced)

   * Free-style analysis: there is a database for this tool that can be
     used for many more applications. You can get the most unsourced
     statements of P31 and then go and fix them. I'm trying to
     build a playground for these kinds of tasks.

 I hope you like this and rock on!
 <http://tools.wmflabs.org/wd-analyst/index.php?p=P136&q=Q11399>
 Best

 _______________________________________________
 Wikidata mailing list
 Wikidata(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata

