P.S. If you want to do this yourself to play with it, below is the
relevant information on how I wrote this code (looks a bit clumsy in
email, but I don't have time now to set up a tutorial page ;-).
Markus
(1) I modified the example program "EntityStatisticsProcessor" that is
part of Wikidata Toolkit [1].
(2) I added a new field to count references:
final HashMap<Reference, Integer> refStatistics = new HashMap<>();
(3) The example program already downloads and processes all items and
properties in the most recent dump. You just have to add the counting.
Essentially, this is the code I run on every ItemDocument and
PropertyDocument:
public void countReferences(StatementDocument statementDocument) {
    for (StatementGroup sg : statementDocument.getStatementGroups()) {
        for (Statement s : sg.getStatements()) {
            for (Reference r : s.getReferences()) {
                if (!refStatistics.containsKey(r)) {
                    refStatistics.put(r, 1);
                } else {
                    refStatistics.put(r, refStatistics.get(r) + 1);
                }
            }
        }
    }
}
(the example already has a method "countStatements" that does these
iterations, so you can also insert the code there).
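If you are on Java 8 or later, the counting step can probably be written more compactly with Map.merge, which replaces the containsKey/put pair. The following is only a minimal sketch: the class name ReferenceCounts and the method countOccurrences are made up for illustration, and plain strings stand in for the toolkit's Reference objects so that the snippet runs without Wikidata Toolkit on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Wikidata Toolkit.
class ReferenceCounts {

    // Counts how often each distinct key occurs in the input.
    // Keys must implement equals() and hashCode() consistently,
    // which is what the HashMap-based counting relies on.
    static <K> Map<K, Integer> countOccurrences(Iterable<K> keys) {
        Map<K, Integer> counts = new HashMap<>();
        for (K key : keys) {
            // Inserts 1 for a new key, otherwise adds 1 to the old count.
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Inside the loop over s.getReferences(), the equivalent single line would be refStatistics.merge(r, 1, Integer::sum).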
(4) To print the output to a file, I sort the hash map by values first.
Here's some standard code for how to do this:
try (PrintStream out = new PrintStream(
        ExampleHelpers.openExampleFileOuputStream("reference-counts.txt"))) {
    List<Entry<Reference, Integer>> list =
            new LinkedList<Entry<Reference, Integer>>(
                    refStatistics.entrySet());
    Collections.sort(list, new Comparator<Entry<Reference, Integer>>() {
        @Override
        public int compare(Entry<Reference, Integer> o1,
                Entry<Reference, Integer> o2) {
            return o2.getValue().compareTo(o1.getValue());
        }
    });
    int singleRefs = 0;
    for (Entry<Reference, Integer> entry : list) {
        if (entry.getValue() > 1) {
            out.println(entry.getValue() + " x " + entry.getKey());
        } else {
            singleRefs++;
        }
    }
    out.println("... and another " + singleRefs
            + " references that occurred just once.");
} catch (IOException e) {
    e.printStackTrace();
}
This code I put into the existing method writeFinalResults() that is
called at the end.
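The sorting step can also be sketched without the anonymous Comparator, assuming Java 8 or later. Again this is only an illustration under assumptions: the class name SortByCount and the method sortedByCountDescending are invented here, and string keys stand in for Reference objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Wikidata Toolkit.
class SortByCount {

    // Returns the map's entries ordered from highest to lowest count.
    // The order among entries with equal counts is not specified.
    static <K> List<Map.Entry<K, Integer>> sortedByCountDescending(
            Map<K, Integer> counts) {
        List<Map.Entry<K, Integer>> entries =
                new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<K, Integer>comparingByValue(
                Comparator.reverseOrder()));
        return entries;
    }
}
```

The returned list can then be printed with the same loop as in step (4).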
As I said, this runs in about 30min on my laptop, but downloading the
dump file the first time takes a bit longer.
[1]
https://github.com/Wikidata/Wikidata-Toolkit/blob/v0.5.0/wdtk-examples/src/…
On 07.09.2015 15:49, Markus Krötzsch wrote:
Hi André,
I just made a small counting program with Wikidata Toolkit to count
unique references. Running it on the most recent dump took about 30min.
I uploaded the results:
http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-cou…
The file lists all references that are used at least 50 times, ordered
by number of uses. There were 593778 unique references for 35485364
referenced statements (out of 69942556 statements in total).
416480 of the references are used only once. If you want to see all
references used at least twice, this is a slightly longer file:
http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-cou…
Best regards,
Markus
On 07.09.2015 13:25, André Costa wrote:
Hi all!
I'm wondering if there is a way (SQL, API, tool or otherwise) for
finding out how often a particular source is used on Wikidata.
The background is a collaboration with two GLAMs where we have used their
open (and CC0) datasets to add and/or source statements on Wikidata for
items on which they can be considered an authority. Now I figured it
would be nice to give them back a number for just how big the impact was.
While I can find out how many items should be affected, I couldn't find
an easy way, short of analysing each of these, to tell how many
statements were affected.
Any suggestions would be welcome.
Some details: Each reference is a P248 (stated in) claim + a P577
(publication date) claim, where the latter may change.
Cheers,
André / Lokal_Profil
André Costa | GLAM technician, Wikimedia Sverige | Andre.Costa(a)wikimedia.se
| +46 (0)733-964574
Support free knowledge, become a member of Wikimedia Sverige.
Read more at blimedlem.wikimedia.se