P.S. If you want to do this yourself to play with it, below is the relevant information on how I wrote this code (looks a bit clumsy in email, but I don't have time now to set up a tutorial page ;-).
Markus
(1) I modified the example program "EntityStatisticsProcessor" that is part of Wikidata Toolkit [1].
(2) I added a new field to count references:
final HashMap<Reference, Integer> refStatistics = new HashMap<>();
(3) The example program already downloads and processes all items and properties in the most recent dump. You just have to add the counting. Essentially, this is the code I run on every ItemDocument and PropertyDocument:
public void countReferences(StatementDocument statementDocument) {
    for (StatementGroup sg : statementDocument.getStatementGroups()) {
        for (Statement s : sg.getStatements()) {
            for (Reference r : s.getReferences()) {
                if (!refStatistics.containsKey(r)) {
                    refStatistics.put(r, 1);
                } else {
                    refStatistics.put(r, refStatistics.get(r) + 1);
                }
            }
        }
    }
}
(the example already has a method "countStatements" that does these iterations, so you can also insert the code there).
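If you want to try the counting logic without the toolkit, here is a minimal self-contained sketch. It uses plain strings as stand-ins for Reference objects, and the class and method names are made up for illustration. It relies on Map.merge (Java 8), which should behave the same as the containsKey/put pair above. Counting Reference objects in a HashMap only gives sensible results if equal references compare equal via equals/hashCode, which I assume is the case in the toolkit.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of Wikidata Toolkit.
public class ReferenceCountSketch {

    /** Counts how often each key occurs; strings stand in for Reference objects. */
    public static Map<String, Integer> countOccurrences(List<String> references) {
        Map<String, Integer> counts = new HashMap<>();
        for (String r : references) {
            // Stores 1 for a key seen for the first time, otherwise adds 1 to its count.
            counts.merge(r, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> refs = Arrays.asList("source A", "source B", "source A");
        Map<String, Integer> counts = countOccurrences(refs);
        System.out.println(counts.get("source A")); // prints 2
        System.out.println(counts.get("source B")); // prints 1
    }
}
```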
(4) To print the output to a file, I sort the hash map by values first. Here's some standard code for how to do this:
try (PrintStream out = new PrintStream(
        ExampleHelpers.openExampleFileOuputStream("reference-counts.txt"))) {
    List<Entry<Reference, Integer>> list =
            new LinkedList<Entry<Reference, Integer>>(refStatistics.entrySet());

    Collections.sort(list, new Comparator<Entry<Reference, Integer>>() {
        @Override
        public int compare(Entry<Reference, Integer> o1,
                Entry<Reference, Integer> o2) {
            return o2.getValue().compareTo(o1.getValue());
        }
    });

    int singleRefs = 0;
    for (Entry<Reference, Integer> entry : list) {
        if (entry.getValue() > 1) {
            out.println(entry.getValue() + " x " + entry.getKey());
        } else {
            singleRefs++;
        }
    }
    out.println("... and another " + singleRefs
            + " references that occurred just once.");
} catch (IOException e) {
    e.printStackTrace();
}
This code I put into the existing method writeFinalResults() that is called at the end.
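The sorting and reporting step above can also be tried in isolation. The following is only a sketch under assumptions: strings stand in for Reference objects, the class and method names are hypothetical, and it collects the lines in a list instead of writing them to a file. It uses Map.Entry.comparingByValue (Java 8) in place of the anonymous Comparator.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of Wikidata Toolkit.
public class SortByCountSketch {

    /** Returns the map entries ordered by count, highest count first. */
    public static List<Map.Entry<String, Integer>> sortByCountDescending(
            Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());
        list.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return list;
    }

    /** Builds one line per key used more than once, plus a summary of the rest. */
    public static List<String> formatReport(Map<String, Integer> counts) {
        List<String> lines = new ArrayList<>();
        int singleRefs = 0;
        for (Map.Entry<String, Integer> entry : sortByCountDescending(counts)) {
            if (entry.getValue() > 1) {
                lines.add(entry.getValue() + " x " + entry.getKey());
            } else {
                singleRefs++;
            }
        }
        lines.add("... and another " + singleRefs
                + " references that occurred just once.");
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("source A", 3);
        counts.put("source B", 1);
        counts.put("source C", 7);
        for (String line : formatReport(counts)) {
            System.out.println(line);
        }
    }
}
```

Entries with the same count come out in whatever order the hash map yields them, as in the original code.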
As I said, this runs in about 30min on my laptop, but downloading the dump file the first time takes a bit longer.
[1] https://github.com/Wikidata/Wikidata-Toolkit/blob/v0.5.0/wdtk-examples/src/m...
On 07.09.2015 15:49, Markus Krötzsch wrote:
Hi André,
I just made a small counting program with Wikidata Toolkit to count unique references. Running it on the most recent dump took about 30min. I uploaded the results:
http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-coun...
The file lists all references that are used at least 50 times, ordered by number of uses. There were 593778 unique references for 35485364 referenced statements (out of 69942556 statements in total).
416480 of the references are used only once. If you want to see all references used at least twice, this is a slightly longer file:
http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-coun...
Best regards,
Markus
On 07.09.2015 13:25, André Costa wrote:
Hi all!
I'm wondering if there is a way (SQL, API, tool or otherwise) of finding out how often a particular source is used on Wikidata.
The background is a collaboration with two GLAMs where we have used their open (and CC0) datasets to add and/or source statements on Wikidata for items on which they can be considered an authority. Now I figured it would be nice to give them back a number for just how big the impact was.
While I can find out how many items should be affected, I couldn't find an easy way, short of analysing each of them, to find out how many statements were affected.
Any suggestions would be welcome.
Some details: Each reference is a P248 claim + P577 claim (where the latter may change)
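To make the question concrete, the number being asked for could be sketched roughly as below. This is only an illustration under assumptions: each statement is modelled as a list of references, each reference as a map from property ID to a value string, and the class name, method names and the item ID "Q1" are hypothetical placeholders rather than real data or a real API. P248 is "stated in" and P577 is "publication date"; only P248 is compared, since the P577 value may differ between references.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch with a simplified data model.
public class SourceUsageSketch {

    /** Builds a reference with a P248 (stated in) and a P577 (publication date) value. */
    public static Map<String, String> reference(String statedIn, String published) {
        Map<String, String> ref = new HashMap<>();
        ref.put("P248", statedIn);
        ref.put("P577", published);
        return ref;
    }

    /** Counts statements with at least one reference whose P248 value equals sourceId. */
    public static int countStatementsCiting(
            List<List<Map<String, String>>> statements, String sourceId) {
        int count = 0;
        for (List<Map<String, String>> references : statements) {
            for (Map<String, String> ref : references) {
                if (sourceId.equals(ref.get("P248"))) {
                    count++;
                    break; // count each statement once, even if it cites the source twice
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<Map<String, String>>> statements = Arrays.asList(
                Arrays.asList(reference("Q1", "2014"), reference("Q1", "2015")),
                Arrays.asList(reference("Q2", "2015")),
                Collections.<Map<String, String>>emptyList());
        System.out.println(countStatementsCiting(statements, "Q1")); // prints 1
    }
}
```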
Cheers,
André / Lokal_Profil
André Costa | GLAM technician, Wikimedia Sverige | Andre.Costa@wikimedia.se | +46 (0)733-964574
Support free knowledge, become a member of Wikimedia Sverige. Read more at http://blimedlem.wikimedia.se/
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata