Hi all!
I'm wondering if there is a way (SQL, API, tool or otherwise) of finding out how often a particular source is used on Wikidata.
The background is a collaboration with two GLAMs where we have used their open (and CC0) datasets to add and/or source statements on Wikidata for items on which they can be considered an authority. Now I figured it would be nice to give them back a number for just how big the impact was.
While I can find out how many items should be affected, I couldn't find an easy way, short of analysing each of these items, to find out how many statements were affected.
Any suggestions would be welcome.
Some details: Each reference is a P248 claim + P577 claim (where the latter may change)
Cheers, André / Lokal_Profil
André Costa | GLAM technician, Wikimedia Sverige | Andre.Costa@wikimedia.se | +46 (0)733-964574
Support free knowledge, become a member of Wikimedia Sverige. Read more at blimedlem.wikimedia.se
It is not an up-to-date version, but there is:
dbtrends.aksw.org
best, Edgard
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On 07.09.2015 14:25, Edgard Marx wrote:
Is not an updated version, but
http://dbtrends.aksw.org
I am getting an error there. Is the server down maybe?
Markus
Hi André,
I just made a small counting program with Wikidata Toolkit to count unique references. Running it on the most recent dump took about 30min. I uploaded the results:
http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-coun...
The file lists all references that are used at least 50 times, ordered by number of uses. There were 593778 unique references for 35485364 referenced statements (out of 69942556 statements in total).
416480 of the references are used only once. If you want to see all references used at least twice, this is a slightly longer file:
http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-coun...
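To put the figures above in proportion: about half of all statements carry a reference, and about 70% of the unique references are used only once. The following is a minimal sketch of that arithmetic; the class and method names are made up for illustration, and only the four numbers come from this message.

```java
import java.util.Locale;

public class ReferenceShareSketch {

    // Returns part/whole expressed as a percentage.
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        // Figures quoted above for the 20150831 dump.
        long totalStatements = 69942556L;
        long referencedStatements = 35485364L;
        long uniqueReferences = 593778L;
        long usedOnce = 416480L;

        System.out.println(String.format(Locale.ROOT,
                "Referenced statements: %.1f%% of all statements",
                percent(referencedStatements, totalStatements)));
        System.out.println(String.format(Locale.ROOT,
                "References used only once: %.1f%% of unique references",
                percent(usedOnce, uniqueReferences)));
    }
}
```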
Best regards,
Markus
P.S. If you want to do this yourself to play with it, below is the relevant information on how I wrote this code (looks a bit clumsy in email, but I don't have time now to set up a tutorial page ;-).
Markus
(1) I modified the example program "EntityStatisticsProcessor" that is part of Wikidata Toolkit [1].
(2) I added a new field to count references:

    final HashMap<Reference, Integer> refStatistics = new HashMap<>();

(3) The example program already downloads and processes all items and properties in the most recent dump. You just have to add the counting. Essentially, this is the code I run on every ItemDocument and PropertyDocument:

    public void countReferences(StatementDocument statementDocument) {
        for (StatementGroup sg : statementDocument.getStatementGroups()) {
            for (Statement s : sg.getStatements()) {
                for (Reference r : s.getReferences()) {
                    if (!refStatistics.containsKey(r)) {
                        refStatistics.put(r, 1);
                    } else {
                        refStatistics.put(r, refStatistics.get(r) + 1);
                    }
                }
            }
        }
    }

(the example already has a method "countStatements" that does these iterations, so you can also insert the code there).
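On Java 8 or later, the containsKey/put pair in the counting step can be collapsed into a single Map.merge call. The following is a minimal sketch of that variant, not the code used in this thread: it uses plain strings as stand-ins for Wikidata Toolkit Reference objects so that it runs without the toolkit, and the class name and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReferenceCountSketch {

    // Counts how often each reference occurs across all statements.
    // R stands in for the toolkit's Reference type; each inner list
    // holds the references of one statement.
    static <R> Map<R, Integer> countReferences(List<List<R>> statements) {
        Map<R, Integer> refStatistics = new HashMap<>();
        for (List<R> referencesOfStatement : statements) {
            for (R r : referencesOfStatement) {
                // Inserts 1 for a new key, otherwise adds 1 to the old count.
                refStatistics.merge(r, 1, Integer::sum);
            }
        }
        return refStatistics;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: three statements, one of them unreferenced.
        List<List<String>> statements = Arrays.asList(
                Arrays.asList("source-A"),
                Arrays.asList("source-A", "source-B"),
                Arrays.<String>asList());
        System.out.println(countReferences(statements));
    }
}
```

As with the HashMap<Reference, Integer> above, this only works if the key type implements equals() and hashCode() consistently, since two references are counted together exactly when they compare as equal.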
(4) To print the output to a file, I sort the hash map by values first. Here's some standard code for how to do this:

    try (PrintStream out = new PrintStream(
            ExampleHelpers.openExampleFileOuputStream("reference-counts.txt"))) {
        List<Entry<Reference, Integer>> list =
                new LinkedList<Entry<Reference, Integer>>(refStatistics.entrySet());

        Collections.sort(list, new Comparator<Entry<Reference, Integer>>() {
            @Override
            public int compare(Entry<Reference, Integer> o1,
                    Entry<Reference, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        int singleRefs = 0;
        for (Entry<Reference, Integer> entry : list) {
            if (entry.getValue() > 1) {
                out.println(entry.getValue() + " x " + entry.getKey());
            } else {
                singleRefs++;
            }
        }
        out.println("... and another " + singleRefs
                + " references that occurred just once.");
    } catch (IOException e) {
        e.printStackTrace();
    }

This code I put into the existing method writeFinalResults() that is called at the end.
As I said, this runs in about 30 min on my laptop, but downloading the dump file for the first time takes a bit longer.
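The sort-by-value step in (4) can also be written without an anonymous Comparator on Java 8 or later. This is a minimal sketch under that assumption, again with strings standing in for Reference objects and with hypothetical sample counts; it prints to standard output instead of a file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SortCountsSketch {

    // Returns the map entries ordered by count, highest count first.
    static <K> List<Map.Entry<K, Integer>> sortByCountDescending(Map<K, Integer> counts) {
        List<Map.Entry<K, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<K, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical counts, standing in for refStatistics.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("source-A", 120);
        counts.put("source-B", 1);
        counts.put("source-C", 57);

        int singleRefs = 0;
        for (Map.Entry<String, Integer> entry : sortByCountDescending(counts)) {
            if (entry.getValue() > 1) {
                System.out.println(entry.getValue() + " x " + entry.getKey());
            } else {
                singleRefs++;
            }
        }
        System.out.println("... and another " + singleRefs
                + " references that occurred just once.");
    }
}
```

An ArrayList is used here rather than a LinkedList; for a copy-then-sort pattern like this one it is the more common choice, though either works.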
[1] https://github.com/Wikidata/Wikidata-Toolkit/blob/v0.5.0/wdtk-examples/src/m...
Hi!
I'm wondering if there is a way (SQL, api, tool or otherwise) for finding out how often a particular source is used on Wikidata.
Something like this probably would work:
This runs the following query:

    PREFIX prov: <http://www.w3.org/ns/prov#>
    PREFIX pr: <http://www.wikidata.org/prop/reference/>
    PREFIX wd: <http://www.wikidata.org/entity/>
    SELECT (count(?ref) as ?mentions) WHERE {
      ?statement prov:wasDerivedFrom ?ref .
      ?ref pr:P248 wd:Q216047 .
      ?ref pr:P577 ?date .
    }

For Q216047 which is "Le Figaro". This counts how many statements reference Le Figaro and also have dates (drop the last clause if non-dated ones are fine too).
On 07.09.2015 21:45, Stas Malyshev wrote:
For Q216047 which is "Le Figaro". This counts how many statements reference Le Figaro and also have dates (drop the last clause if non-dated ones are fine too).
Yes, that's the best technique if you already know which reference you are looking for. And it also supports more general patterns, like the Le Figaro one, which is also very interesting.
A small fix though: I think it would be better to use count(?statement) rather than count(?ref), right?
I have tried a similar query on the public test endpoint on labs earlier, but it timed out for me (I was using a very common reference though ;-). For rarer references, live queries are definitely the better approach.
Markus
Hi!
A small fix though: I think it would be better to use count(?statement) rather than count(?ref), right?
Yes, of course, my mistake - I modified it from a different query and forgot to change it.
I have tried a similar query on the public test endpoint on labs earlier, but it timed out for me (I was using a very common reference though ;-). For rarer references, live queries are definitely the better approach.
Works for me for Q216047; I didn't check others though. For popular references, the labs one may indeed be too slow. A faster one is coming "real soon now" :)
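Putting the fix discussed above together with the option of dropping the date clause, the query text could be generated per source along these lines. This is a minimal sketch, not code from the thread: the class and method names are hypothetical, and only the prefixes, properties and the count(?statement) correction come from the messages above.

```java
public class SourceUsageQuerySketch {

    // Builds a SPARQL query that counts statements whose reference
    // cites the given source via P248. If requireDate is true, only
    // references that also have a P577 date are counted.
    static String buildQuery(String sourceQid, boolean requireDate) {
        if (!sourceQid.matches("Q[1-9][0-9]*")) {
            throw new IllegalArgumentException("Not a Q-id: " + sourceQid);
        }
        StringBuilder sb = new StringBuilder();
        sb.append("PREFIX prov: <http://www.w3.org/ns/prov#>\n");
        sb.append("PREFIX pr: <http://www.wikidata.org/prop/reference/>\n");
        sb.append("PREFIX wd: <http://www.wikidata.org/entity/>\n");
        sb.append("SELECT (COUNT(?statement) AS ?mentions) WHERE {\n");
        sb.append("  ?statement prov:wasDerivedFrom ?ref .\n");
        sb.append("  ?ref pr:P248 wd:").append(sourceQid).append(" .\n");
        if (requireDate) {
            sb.append("  ?ref pr:P577 ?date .\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Q216047 is "Le Figaro", the example used in this thread.
        System.out.println(buildQuery("Q216047", true));
    }
}
```

If a single statement can carry several references that all match, COUNT(DISTINCT ?statement) would avoid counting it more than once; that refinement was not part of the discussion here.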
Hi,
Many thanks for the many answers.
I went for the SPARQL solution in the end, since I only have a short list of Q-numbers which can be sources. With the new query service it also has the benefit of being something I can just turn into a URL and then give to the GLAM to look up on the fly =)
Now I also have an excuse to learn some more SPARQL to play around with this =)
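Turning the query into a URL per source could look roughly like the following. This is a minimal sketch under stated assumptions: the endpoint address is not given anywhere in this thread and is assumed here, the class and method names are hypothetical, and the list of sources contains only the Le Figaro example from above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

public class QueryUrlSketch {

    // ASSUMPTION: SPARQL endpoint of the query service; not stated in the thread.
    static final String ENDPOINT = "https://query.wikidata.org/sparql";

    // Query from the thread with the count(?statement) fix; %s is the source Q-id.
    static final String QUERY_TEMPLATE =
            "PREFIX prov: <http://www.w3.org/ns/prov#>\n"
            + "PREFIX pr: <http://www.wikidata.org/prop/reference/>\n"
            + "PREFIX wd: <http://www.wikidata.org/entity/>\n"
            + "SELECT (COUNT(?statement) AS ?mentions) WHERE {\n"
            + "  ?statement prov:wasDerivedFrom ?ref .\n"
            + "  ?ref pr:P248 wd:%s .\n"
            + "}\n";

    // Returns a URL that runs the counting query for one source.
    static String queryUrl(String sourceQid) {
        String query = String.format(QUERY_TEMPLATE, sourceQid);
        try {
            return ENDPOINT + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Replace with the actual short list of source Q-ids.
        List<String> sources = Arrays.asList("Q216047");
        for (String qid : sources) {
            System.out.println(qid + ": " + queryUrl(qid));
        }
    }
}
```

The date clause (P577) is left out here because the date may vary between references; it can be added back to the template if only dated references should be counted.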
Cheers, André