Hi! We looked at the logs. 21,740,641 requests are coming from a single IP address, with no user agent, which we can't geolocate because it's in the 10 range.
Looking into the actual queries revealed that it's probably a broken bot. Stas said "the query makes no sense and is broken" and that it "looks like somebody trying to download whole DB in very weird way but is doing it all wrong."
We are investigating the issue.
– *Mikhail Popov* // Data Analyst, Discovery
On 06/11/15 18:04, Mikhail Popov wrote:
Hi! We looked at the logs. 21,740,641 requests are coming from a single IP without a user agent that we can't geolocate because it's in the 10 range.
Mikhail,
If by "in the 10 range", you mean an IPv4 address of the form 10.x.x.x, then it's an RFC1918 address, and more than likely coming from inside your own network.
Neil
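For anyone who wants to check an address from the logs against the RFC 1918 private ranges, here is a minimal Python sketch. The function name and the sample addresses are made up for illustration and are not taken from the actual logs.

```python
import ipaddress

# The three private IPv4 blocks reserved by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if `address` is an IPv4 address inside an RFC 1918 block."""
    ip = ipaddress.ip_address(address)
    return ip.version == 4 and any(ip in net for net in RFC1918_NETWORKS)


# Hypothetical addresses, for illustration only.
for candidate in ("10.64.0.12", "8.8.8.8"):
    print(candidate, is_rfc1918(candidate))
```

An address in one of these blocks is not routable on the public internet, which is why geolocation fails and why the traffic most likely originates inside the operator's own network or arrives via an internal proxy.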
Might this be affecting our searches? The following query times out very quickly on Chrome, and runs forever in Firefox before crashing the whole browser (or is there a problem with my query?)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX v: <http://www.wikidata.org/prop/statement/>
PREFIX q: <http://www.wikidata.org/prop/qualifier/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?photographer ?photographer_label ?nat ?nat_label ?dob ?dod WHERE {
  ?photographer wdt:P106 wd:Q33231 .          # find items that have "occupation (P106): photographer (Q33231)"
  OPTIONAL {?photographer wdt:P27 ?nat .}     # country of citizenship (P27)
  OPTIONAL {?photographer wdt:P569 ?dob .}    # date of birth (P569)
  OPTIONAL {?photographer wdt:P570 ?dod .}    # date of death (P570)

  OPTIONAL {?photographer rdfs:label ?photographer_label filter (lang(?photographer_label) = "en") .}
  OPTIONAL {?nat rdfs:label ?nat_label filter (lang(?nat_label) = "en") .}
  #OPTIONAL {?cob rdfs:label ?cob_label filter (lang(?cob_label) = "en") .}
  #OPTIONAL {?state rdfs:label ?state_label filter (lang(?state_label) = "en") .}
}
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi!
Might this be affecting our searches? The following query times out very quickly on Chrome, and runs forever in Firefox before crashing the whole browser (or is there a problem with my query?)
The symptoms you describe suggest that the query returns too many results and the browser runs out of memory. Try the query with LIMIT 10 first and see what happens.
As for the bot activities affecting other users, the effect seems to be negligible, so if this query is slow, it is slow on its own merits :)
Hi David,
I think the issue with your query was with the line
OPTIONAL {?nat rdfs:label ?nat_label filter (lang(?nat_label) = "en") .}
The problem is that if a photographer has no P27 statement, ?nat is left unbound by the earlier OPTIONAL line. When the engine then reaches the line above with ?nat still unbound, the pattern matches anything that has an English label, so it becomes a directive to start binding labels for the *entire database* ... which is why it is just as well that Stas turns over an egg timer for each query. :-)
The way around this is to nest the two OPTIONAL clauses, one inside the other:
OPTIONAL {
  ?photographer wdt:P27 ?nat .
  OPTIONAL {?nat rdfs:label ?nat_label filter (lang(?nat_label) = "en") .}
}
This should now run fine. (Provided you remember to remove the old OPTIONAL line).
All best,
James.
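To see why the flat version blows up, here is a toy Python model of OPTIONAL as a left join. It is only a sketch under stated assumptions: the triples, the plain-string identifiers and the helper names `match` and `optional` are all invented for illustration, language filters are left out, and a real SPARQL engine evaluates queries quite differently internally.

```python
# Toy triple store; every identifier here is invented for illustration.
TRIPLES = [
    ("alice", "occupation", "photographer"),
    ("alice", "citizenship", "france"),
    ("bob", "occupation", "photographer"),  # bob has no citizenship statement
    ("alice", "label", "Alice"),
    ("bob", "label", "Bob"),
    ("france", "label", "France"),
    ("japan", "label", "Japan"),
]


def match(pattern, binding):
    """Yield every extension of `binding` under which `pattern` matches a triple."""
    for triple in TRIPLES:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Variable: bind it, or check it agrees with the existing binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def optional(solutions, extend):
    """Left join: extend each solution if possible, otherwise keep it unchanged."""
    result = []
    for solution in solutions:
        extended = list(extend(solution))
        result.extend(extended or [solution])
    return result


photographers = list(match(("?p", "occupation", "photographer"), {}))

# Flat version: the label OPTIONAL sits beside the citizenship OPTIONAL.
flat = optional(photographers, lambda s: match(("?p", "citizenship", "?nat"), s))
flat = optional(flat, lambda s: match(("?nat", "label", "?nat_label"), s))


# Nested version: labels are only looked up once ?nat has been bound.
def citizenship_with_label(solution):
    for with_nat in match(("?p", "citizenship", "?nat"), solution):
        yield from optional(
            [with_nat], lambda s: match(("?nat", "label", "?nat_label"), s)
        )


nested = optional(photographers, citizenship_with_label)

print(len(flat), len(nested))
```

In this toy data the flat version returns five rows for two photographers, four of them for bob alone (one per label in the store), while the nested version returns one row per photographer. On the real database, "one per label in the store" is what turns into a timeout.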
Thanks, all! I ran each query separately and reassembled the results in a spreadsheet, so I think I got what I was after. Brand new to SPARQL, so I'll try and figure out your correct query above. Thanks again, d
On 11/6/15 1:04 PM, Mikhail Popov wrote:
Looking into the actual queries revealed that it's probably a broken bot. Stas said "the query makes no sense and is broken" and that it "looks like somebody trying to download whole DB in very weird way but is doing it all wrong."
That will always happen, folks always want to dump the entire DB.
Takes a while for clarity to arise.
This has been the DBpedia experience for years.
[1] https://docs.google.com/document/d/12VljKl-yDNBoMGb_FnQWiXDAaZC3VnQHqy-E9iD8... -- DBpedia Usage Report
Hi!
That will always happen, folks always want to dump the entire DB.
That's fine, but the right way to do it is to download it from https://dumps.wikimedia.org/wikidatawiki/entities/, not send queries with syntax errors to the service :) Especially when it's 20M times the same query.
Also, fair warning: the entire DB is big (about 80 GB as a database; I have no idea how much as an uncompressed dump; the compressed dump is about 7 GB), so it may need some horsepower if you want to do non-trivial stuff with it.
-- Stas Malyshev smalyshev@wikimedia.org
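A minimal Python sketch of reading such a dump locally instead of querying the service. The assumptions here do not come from this thread: that the file is the bz2-compressed JSON dump (commonly named latest-all.json.bz2 in that directory), and that it is laid out as a single JSON array with one entity per line, each line ending in a comma. `iter_entities` and `iter_dump` are hypothetical helper names.

```python
import bz2
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one entity dict per line, skipping the enclosing '[' and ']' lines."""
    for raw in lines:
        # Assumed layout: each entity line ends with a trailing comma.
        line = raw.strip().rstrip(b",")
        if line in (b"", b"[", b"]"):
            continue
        yield json.loads(line)


def iter_dump(path: str) -> Iterator[dict]:
    """Stream entities from a bz2-compressed dump without loading it all into memory."""
    with bz2.open(path, "rb") as handle:
        yield from iter_entities(handle)
```

Streaming line by line keeps memory use flat regardless of dump size; usage would look like `for entity in iter_dump("latest-all.json.bz2"): ...`, with the path adjusted to wherever the file was downloaded.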
I sometimes search Twitter for the word "Wikidata" to get updates beyond this mailing list. The past couple of times, I have found that a considerable share of the posts come from fake accounts issuing more or less the same post: "The file numbers are also being added to Wikipedia biographical articles and are incorporated into Wikidata." I tried to report and block these posts, but it doesn't really help.
Has anyone else tried to block these accounts and posts? I imagine it might help. Or is Twitter going down?
https://twitter.com/search?f=tweets&vertical=default&q=wikidata&...
/Finn Årup Nielsen
Hi,
With a modern Twitter client, you can filter those out based on keywords. I agree, this is very annoying. They all use a different API to post those things, so it's not easy to trace it.
Greetings,
Sjoerd de Bruin sjoerddebruin@me.com
On 8 November 2015 at 14:06, Finn Årup Nielsen faan@dtu.dk wrote:
I sometimes search Twitter for the word "Wikidata" to get updates beyond this mailing list. The past couple of times, I have found that a considerable share of the posts come from fake accounts issuing more or less the same post: "The file numbers are also being added to Wikipedia biographical articles and are incorporated into Wikidata." I tried to report and block these posts, but it doesn't really help.
I analysed this phenomenon in a January 2013 blog post:
http://pigsonthewing.org.uk/twitter-spam-case-study/
Sadly, Twitter seem unwilling to deal with such abuse.