Hello all!
Here is (at last!) an update on what we are doing to protect the
stability of Wikidata Query Service.
For four years we have been offering Wikidata users the Query Service, a
powerful tool that allows anyone to query the content of Wikidata
without any identification needed. This means that anyone can use the
service from a script and make heavy or very frequent requests.
However, this freedom has led to the service being overloaded by the
sheer volume of queries, causing the issues and lag that you may have
noticed.
A reminder about the context:
We have had a number of incidents where the public WDQS endpoint was
overloaded by bot traffic. We don't think that any of that activity was
intentionally malicious, but rather that the bot authors most probably
don't understand the cost of their queries and the impact they have on
our infrastructure. We've recently seen more distributed bots, coming
from multiple IPs from cloud providers. This kind of pattern makes it
harder and harder to filter or throttle an individual bot. The impact
has ranged from increased update lag to full service interruption.
What we have been doing:
While we would love to allow anyone to run any query they want at any
time, we're not able to sustain that load, and we need to be more
aggressive in how we throttle clients. We want to be fair to our users
and allow everyone to use the service productively. We also want the
service to be available to the casual user and provide up-to-date access
to the live Wikidata data. And while we would love to throttle only
abusive bots, to be able to do that we need to be able to identify them.
We have two main means of identifying bots:
1) their user agent and IP address
2) the pattern of their queries
Identifying patterns in queries is done manually, by a person inspecting
the logs. It takes time and can only be done after the fact. We can only
start our identification process once the service is already overloaded.
This is not going to scale.
IP addresses are starting to be problematic. We see bots running on
cloud providers and running their workloads on multiple instances, with
multiple IP addresses.
We are left with user agents. But here, we have a problem again. To
block only abusive bots, we would need those bots to use a clearly
identifiable user agent, so that we can throttle or block them and
contact the author to work together on a solution. It is unlikely that
an intentionally abusive bot will voluntarily provide a way to be
blocked. So we need to be more aggressive with bots that use a generic
user agent. We are not blocking those, but we are limiting the number of
requests coming from generic user agents. This is a large bucket, and
sadly it also contains many small bots that generate only a very
reasonable load. So we are also impacting the bots that play fair.
At the moment, if your bot is affected by our restrictions, configure a
custom user agent that identifies you; this should be sufficient to give
you enough bandwidth. If you are still running into issues, please
contact us; we'll find a solution together.
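For bot authors, the fix described above can be sketched as follows (a
minimal sketch assuming Node.js 18+ with a global fetch; the bot name,
URL, and contact address are hypothetical placeholders, not a required
format):

```javascript
// Build a WDQS request with a descriptive User-Agent header, so that
// operators can identify (and, if needed, contact) the bot instead of
// throttling the whole "generic user agent" bucket.
const WDQS_ENDPOINT = 'https://query.wikidata.org/sparql';

function buildRequest(sparql) {
  return {
    url: WDQS_ENDPOINT + '?query=' + encodeURIComponent(sparql),
    options: {
      headers: {
        // Tool name/version plus a way to reach the operator.
        'User-Agent': 'ExampleWikidataBot/1.0 (https://example.org/bot; bot-owner@example.org)',
        'Accept': 'application/sparql-results+json'
      }
    }
  };
}

// Usage (Node.js 18+):
// const req = buildRequest('SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 } LIMIT 3');
// fetch(req.url, req.options).then(r => r.json()).then(console.log);
```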
What's coming next:
First, it is unlikely that we will be able to remove the current
restrictions in the short term. We're sorry for that, but the
alternative - service being unresponsive or severely lagged for everyone
- is worse.
We are exploring a number of alternatives: adding authentication to the
service and granting higher quotas to bots that authenticate, or
creating an asynchronous queue that would allow more expensive queries
to run, but with longer deadlines. And we are in the process of hiring
another engineer to work on these ideas.
Thanks for your patience!
WDQS Team
Possibly of interest to those of you working on lexemes?
--
Andy Mabbett
@pigsonthewing
http://pigsonthewing.org.uk
---------- Forwarded message ---------
From: Marieke van Erp <marieke.van.erp(a)dh.huc.knaw.nl>
Date: Tue, 23 Jul 2019 17:38
Subject: Call for papers: Special Issue on Language Technology and
Knowledge Graphs
To: semantic-web(a)w3.org <semantic-web(a)w3.org>
Special Issue on Language Technology and Knowledge Graphs
Submission deadline: 30 September 2019
Call for Papers
Language understanding and knowledge engineering are among the most
active research and development areas due to the proliferation of big
data. This special issue on Language Technology and Knowledge Graphs is
devoted to gathering and presenting innovative research, systems, and
applications that address the challenges in the broad areas of language
and knowledge intelligence, providing a platform for researchers to
share their recent observations and achievements in the field. Topics
for this special issue include, but are not limited to:
1. Textual Entailment and Knowledge
* Textual entailment
* Fact checking
* Fake news detection
* Argumentation mining
2. Knowledge-Guided NLP
* Question answering and reading comprehension
* Dialogue systems
* Information Retrieval
* Multilinguality
* Recommender systems
* Machine Translation
* Knowledge-Guided Deep Learning
* Complex knowledge-driven Information Extraction tasks, e.g., relation
extraction, event extraction
* Methods and metrics for evaluation of semantic annotations with respect
to ontologies
* Knowledge-driven entity disambiguation and resolution
3. Contextual Knowledge Graphs and Language Technology
* Extracting and modelling temporally bounded information
* Dealing with culturally-aware information
* Handling domain specificity of information
4. Information Extraction for Knowledge Graphs
* Extraction from unstructured versus semi-structured textual sources
(e.g. tables)
* Dealing with the imperfections of Information Extraction techniques in
the Semantic Web setting and their impact
* Multi-source or multilingual Information Extraction for ontology
population
* Information extraction subtasks (e.g., terminology extraction, relation
extraction, coreference resolution) for the Semantic Web
* Methods and metrics for evaluation of Information Extraction for the
Semantic Web
5. Applications and Architectures
* Knowledge-based Information Extraction for specific domains and
applications, e.g. business analytics, healthcare and biomedicine, cultural
heritage etc.
* Information Extraction for social media mining
* Scalability of tools and resources
* Platforms and architectures for automatic and semi-automatic semantic
annotation
* Tools and methodologies for building and managing complex processing
workflows
Types of papers
* Research papers describing well-identified scientific contributions
that are thoroughly evaluated. These papers are typically 15-20 pages
long.
* System and resource papers that focus on the description of systems or
resources relevant to this special issue, where the authors fully detail
the design, construction, implementation, and usage, and demonstrate
their usefulness. These papers are expected to be 8-10 pages long.
Please select "VSI: LT&KGs" in the submission system.
More information and submission:
https://www.journals.elsevier.com/journal-of-web-semantics/call-for-papers/…
For questions contact: marieke.van.erp(a)dh.huc.knaw.nl
Important Dates
* Submission deadline: 30 September 2019
* Author notification: 17 November 2019
* Publication: Q1 2020
Guest Editors
* Marieke van Erp, KNAW Humanities Cluster, DHLab, the Netherlands
* Jeff Z. Pan, University of Aberdeen, UK
* Zhiyuan Liu, Tsinghua University, China
--
Digital Humanities Lab / dhlab.nl
KNAW Humanities Cluster / huc.knaw.nl
http://www.mariekevanerp.com
Hello all,
This message is important for people who still use the config value
mw.config.get( 'wbEntity' ) in their tools or scripts. This value,
deprecated for two years, will be completely dropped on July 24th.
The config value mw.config.get( 'wbEntity' ) has been deprecated in
order to improve page load time, especially on large Items. Currently,
this value accounts for a significant proportion of the HTML on every
entity page. Dropping wbEntity completely will make the first paint (the
time until the page renders) faster, and will make better use of the
server and client caches.
If your code is still using mw.config.get( 'wbEntity' )
<https://phabricator.wikimedia.org/T85499#5330372>, you can replace it
with the wikibase.entityPage.entityLoaded hook (see an example here
<https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/47952b3b22a…>).
On Wednesday, July 24th, we will drop mw.config.get( 'wbEntity' ) from
Wikidata’s codebase, and calling the value will result in an error.
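As a sketch, the migration can look like this in a gadget or user
script (the onEntityLoaded wrapper is a hypothetical helper added for
illustration; wikibase.entityPage.entityLoaded is the hook mentioned in
this message):

```javascript
// Old, deprecated pattern (removed July 24th) — reading the entity JSON
// synchronously from the page config:
//   var entity = JSON.parse( mw.config.get( 'wbEntity' ) );
//
// New pattern: register a callback on the entityLoaded hook, which
// fires once the entity data is available on an entity page.
function onEntityLoaded(callback) {
  mw.hook('wikibase.entityPage.entityLoaded').add(function (entity) {
    // `entity` is the parsed entity data that wbEntity used to provide.
    callback(entity);
  });
}

// Usage in a gadget or user script:
// onEntityLoaded(function (entity) { console.log(entity.id); });
```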
If you have any questions or need help, feel free to leave a comment under the
related task <https://phabricator.wikimedia.org/T85499>.
Thanks for your understanding,
Cheers,
--
Léa Lacroix
Project Manager Community Communication for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.
Hello all,
The Structured Data development team will hold an IRC office hour from
17:00-18:00 UTC on Thursday, 18 July, in #wikimedia-office on the freenode
IRC network. Meta has information on joining the meeting, as well as date
and time conversion [1]. Please join us to discuss forming statements for
structured data, properties that may need to be proposed on Wikidata,
future plans for SDC, or anything else you might want to discuss. I look
forward to seeing you there, I'll post a reminder before the meeting starts.
1. https://meta.wikimedia.org/wiki/IRC_office_hours#Upcoming_office_hours
--
Keegan Peterzell
Community Relations Specialist
Wikimedia Foundation
Hello all,
The next Wikidata IRC office hour will take place on Tuesday, July 16th, at
16:00 UTC (18:00 Berlin time)
<https://www.timeanddate.com/worldclock/fixedtime.html?msg=Wikidata+IRC+offi…>.
As usual, we will meet on the #wikimedia-office
<irc://irc.freenode.net/wikimedia-office> channel to present the state of
development, the plans for the future, and answer all of your questions.
If you can't join, the notes of the meetings will be published after the
meeting.
Cheers,
--
Léa Lacroix
Project Manager Community Communication for Wikidata
Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de
Dear all,
we’re really happy to announce the first ever Coolest Tool Award
<https://meta.wikimedia.org/wiki/Coolest_Tool_Award>!
Tools play an essential role at Wikimedia, and so do the many volunteer
developers who experiment with new ideas, develop & maintain local & global
solutions and bridge workflow gaps for the Wikimedia communities.
There are incredibly many great tools out there. It’s time to celebrate
this and to make the work volunteer developers do more visible to
everyone :-)
The Coolest Tool Award ceremony will take place at Wikimania, as part of
the official program:
https://wikimania.wikimedia.org/wiki/2019:Technology_outreach_%26_innovatio…
The award is organized & selected by this year's Coolest Tool Academy.
The plan is to award the greatest tools in a variety of categories: for
example, “best Commons tool” to recognize the value of a tool for a
specific project, or “best newcomer” to highlight the work of a fairly
new developer.
As no one can possibly know all the cool tools out there, we’re looking
for some help & inspiration: please point us to the tools that you think
are great, for any reason you can think of!
Please use this form:
https://docs.google.com/forms/d/1Ip5Sb_CDvgO6IN2f51V3WjkVYU9Sa-nneX5PoY0sjo…
to recommend tools by July 29, 2019.
You can nominate as many tools as you want by filling out the form multiple
times.
This survey will be conducted via a third-party service, which may subject
it to additional terms. For more information on privacy and data-handling,
see the survey privacy statement:
https://foundation.wikimedia.org/wiki/Coolest_Tool_Award_2019_Survey_Privac…
Thank you very much for your ideas & recommendation(s)!
We will continue to spread the word over the next 1-2 days, but if you get
the chance, please feel welcome to share this information with others too!
If you have any questions, please contact us at
https://meta.wikimedia.org/wiki/Coolest_Tool_Award.
Thanks :-)
Birgit, for the Coolest Tool Academy 2019
--
Birgit Müller (she/her)
Director of Technical Engagement
Wikimedia Foundation <https://wikimediafoundation.org/>
Sorry for cross-posting!
Reminder: Technical Advice IRC meeting this week **Wednesday 3-4 pm UTC**
on #wikimedia-tech.
Questions can be asked in English and German!
The Technical Advice IRC Meeting (TAIM) is a weekly support event for
volunteer developers. Every Wednesday, two full-time developers are
available to help you with all your questions about MediaWiki, gadgets,
tools, and more! This can be anything from "how to get started", through
"who would be the best contact for X", to specific questions about your
project.
If you know already what you would like to discuss or ask, please add your
topic to the next meeting:
https://www.mediawiki.org/wiki/Technical_Advice_IRC_Meeting
Hope to see you there!
--
Raz Shuty
Engineering Manager
Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de
Imagine a world in which every single human being can freely share in
the sum of all knowledge. That's our commitment.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.