Hey Markus,

On Wed, Dec 9, 2015 at 12:12 AM Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Hi Amir,

Very nice, thanks! I like the general approach of having a stand-alone
tool for analysing the data, and maybe pointing you to issues. Like a
dashboard for Wikidata editors.

What backend technology are you using to produce these results? Is this
live data or dumped data? One could also get those numbers from the
SPARQL endpoint, but performance might be problematic (since you compute
averages over all items; a custom approach would of course be much
faster but then you have the data update problem).

I build a database from the weekly JSON dumps. There is some delay in the data, but computationally it's fast. Querying the Wikidata database directly makes performance so poor that it becomes an easy attack point.
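
To give an idea of what working from the dumps involves, here is a minimal sketch in Python. It is not the tool's actual code. It assumes the standard Wikidata JSON dump layout (one big array, one entity per line) and, as my own simplification, counts a statement as "based on Wikipedia" when every one of its references uses P143 (imported from). The function names are made up for illustration.

```python
import json

# Assumption: a reference that uses P143 ("imported from Wikimedia project")
# is what counts as "based on Wikipedia" here.
WIKIPEDIA_IMPORT = "P143"


def iter_entities(lines):
    """Yield entities from a dump that is one JSON array with one entity per line."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # skip the array brackets and empty lines
        yield json.loads(line)


def reference_stats(entities, prop):
    """Count statements of `prop` that are sourced, and sourced only via P143."""
    total = sourced = wikipedia_only = 0
    for entity in entities:
        for statement in entity.get("claims", {}).get(prop, []):
            total += 1
            refs = statement.get("references", [])
            if not refs:
                continue  # statement has no references at all
            sourced += 1
            if all(WIKIPEDIA_IMPORT in ref.get("snaks", {}) for ref in refs):
                wikipedia_only += 1
    return {"total": total, "sourced": sourced, "wikipedia_only": wikipedia_only}
```

In practice you would pass an open (decompressed) dump file as `lines`, e.g. `reference_stats(iter_entities(dump_file), "P31")`, and store the counts in a database instead of returning them.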
 
An obvious feature request would be to display entity ids as links to
the appropriate page, and maybe with their labels (in a language of your
choice).

Done. :)

But overall very nice.

Regards,

Markus


On 08.12.2015 18:48, Amir Ladsgroup wrote:
> Hey,
> There have been several discussions regarding the quality of information in
> Wikidata. I wanted to work on the quality of Wikidata, but we don't have any
> good source of information to see where we are ahead and where we are
> behind. So I thought the best thing I could do is to make something to
> show people in detail how well sourced our data is. So here we
> have *http://tools.wmflabs.org/wd-analyst/index.php*
>
> You can give only a property (let's say P31) and it gives you the four
> most used values + an analysis of sources and overall quality (check this
> out <http://tools.wmflabs.org/wd-analyst/index.php?p=P31>),
> and then you can see that about 33% of them are sourced, with 29.1% of
> them based on Wikipedia.
> You can also give a property and multiple values. Let's say you want
> to compare P27:Q183 (country of citizenship: Germany) and P27:Q30 (US).
> Check this out:
> <http://tools.wmflabs.org/wd-analyst/index.php?p=P27&q=Q30|Q183>. And
> you can see US biographies are more abundant (300K versus 200K) but German
> biographies are more descriptive (3.8 descriptions per item versus 3.2
> descriptions per item).
>
> One important note: compare P31:Q5 (a trivial statement), where 46% of them
> are not sourced at all and 49% of them are based on Wikipedia, *but* get
> these statistics for population properties (P1082
> <http://tools.wmflabs.org/wd-analyst/index.php?p=P1082>). It's not a
> trivial statement and we need to be careful about them. It turns out
> there is slightly more than one reference per statement and only 4% of
> them are based on Wikipedia. So we can relax and enjoy this
> highly sourced data.
>
> Requests:
>
>   * Please tell me whether you want this tool at all
>   * Please suggest more ways to analyze and catch unsourced materials
>
> Future plan (if you agree to keep using this tool):
>
>   * Support more datatypes (e.g. date of birth based on year, coordinates)
>   * Sitelink-based and reference-based analysis (to check how many of
>     the articles of, let's say, Chinese Wikipedia are unsourced)
>
>   * Free-style analysis: There is a database for this tool that can be
>     used for many more applications. You can get the most unsourced
>     statements of P31 and then go and fix them. I'm trying to
>     build a playground for this kind of task.
>
> I hope you like this and rock on!
> <http://tools.wmflabs.org/wd-analyst/index.php?p=P136&q=Q11399>
> Best
>
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>

