No, it's not about Jimmy :P It's about software for parsing
dictionaries. We are currently in the somewhat unstable phase of
switching from the name "dictator" to "dicteator" (the etymology is
"dictionary creator").
One of my strategic goals for the movement itself is to create a
methodology for parsing dictionaries and adding them to the
Wikimedia projects (first Wiktionary, but Wikidata and likely some
other projects in the future), and to build a pool of programmers who
keep that knowledge inside the movement.
So, besides the software itself [1], one of the tasks of Milos
Trifunovic, the programmer who is inserting data for the project
Wiktionary Meets Matica Srpska [2], was to create a white paper about
the process itself, which he did [3].
I know there have been numerous previous additions of dictionaries to
the Wiktionaries, but, as far as I know, no systematic effort was put
into disseminating that knowledge.
We are at the beginning of the process. So far, there are ~26k new
entries on Serbian Wiktionary. The code is in its first usable phase.
By the end of this part of the process I expect from a few hundred
thousand to a few million new entries across Wiktionary editions (the
most optimistic estimate is a few dozen million entries; the estimate
varies that much because of many factors, including future community
involvement; one million is a reachable target).
Keep in mind that there are three different stages of the software,
with various levels of usefulness and complexity (see the sketch after
this list):
1) Adding the content to Wiktionary. This depends on the customs of
the particular Wiktionary, can be changed easily and doesn't require
much sophistication, as it's mostly about Pywikibot and wiki syntax.
2) JSON intermediate storage. This is important, as it's the most
formal way to represent dictionary data for future use, independent of
the destination platform. There is room for further development of the
particular format, and your participation will be appreciated.
3) Parsing a particular dictionary. This is dictionary-specific, but a
number of the methods could be shared for parsing other dictionaries.
The more dictionaries we parse, the more common methods we will
develop.
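To make the stages concrete, here is a minimal Python sketch of how
stage 2 output could feed stage 1. The JSON field names and the
wikitext layout are illustrative assumptions, not dicteator's actual
schema (the white paper [3] describes the real process); the Pywikibot
calls themselves are the standard ones:

    import json
    import pywikibot

    # Stage 2: one dictionary entry in a hypothetical intermediate
    # JSON format (illustrative field names, not the real schema).
    entry_json = '''
    {
      "headword": "reka",
      "language": "Serbian",
      "pos": "noun",
      "definitions": ["large natural stream of water"]
    }
    '''

    def entry_to_wikitext(entry):
        # Stage 1 helper: render an entry as wiki syntax. The actual
        # markup depends on the customs of the target Wiktionary;
        # this layout is a placeholder.
        lines = ["== {} ==".format(entry["language"]),
                 "=== {} ===".format(entry["pos"])]
        lines += ["# " + d for d in entry["definitions"]]
        return "\n".join(lines)

    entry = json.loads(entry_json)
    site = pywikibot.Site("sr", "wiktionary")
    page = pywikibot.Page(site, entry["headword"])
    if not page.exists():  # never overwrite an existing entry
        page.text = entry_to_wikitext(entry)
        page.save(summary="Importing a dictionary entry (bot)")

Stage 3, the parser that produces such JSON from a source dictionary,
is the dictionary-specific part and is deliberately left out of the
sketch.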
So, if you are interested in the matter, please go to the talk page
[4] and give your suggestions. Also, don't hesitate to reach out to me
directly.
[1] https://github.com/Interglider/dictator
[2] https://meta.wikimedia.org/wiki/Grants:PEG/Interglider.ORG/Wiktionary_Meets…
[3] https://meta.wikimedia.org/wiki/Dicteator
[4] https://meta.wikimedia.org/wiki/Talk:Dicteator
*Extended Deadline: November 20, 2015*
*CFP: Semantic Web Journal - Special Issue on Quality Management of
Semantic Web Assets (Data, Services and Systems)*
http://www.semantic-web-journal.net/blog/call-papers-special-issue-quality-…
Submission guidelines
*Deadline: October 31, 2015 > November 20, 2015*
Submissions shall be made through the Semantic Web journal website at
http://www.semantic-web-journal.net. Prospective authors should take
note of the submission guidelines posted at
http://www.semantic-web-journal.net/authors. Note that you need to
request an account on the website to submit a paper. Please
indicate in the cover letter that it is for the Special Issue on Quality
Management of Semantic Web Assets (Data, Services and Systems).
Submissions are possible in the following categories: full research
papers, application reports, reports on tools and systems, and case
studies. While there is no upper limit, paper length must be justified
by content.
Guest editors
* Amrapali Zaveri, University of Leipzig, AKSW Group, Germany
* Dimitris Kontokostas, University of Leipzig, AKSW Group, Germany
* Sebastian Hellmann, University of Leipzig, AKSW Group, Germany
* Jürgen Umbrich, Vienna University of Economics and Business, Austria
*Overview and Topics*
The standardization and adoption of Semantic Web technologies has
resulted in a variety of assets, including an unprecedented volume of
semantically enriched data and the systems and services which consume
or publish this data. Although gathering, processing and publishing
data is a step towards further adoption of the Semantic Web, quality
does not yet play a central role in these assets (e.g., in the data
lifecycle or in system/service development).
Quality management essentially refers to the activities and tasks
involved in guaranteeing a certain level of consistency and in meeting
the quality requirements for the assets. In general, quality management
consists of the following four phases and components: (i) quality
planning, (ii) quality control, (iii) quality assurance and (iv)
quality improvement. The quality planning phase in the Semantic Web
typically involves the design of procedures, strategies and policies to
support the management of the assets. The quality control and assurance
components primarily aim at preventing errors and at meeting the
quality requirements pertaining to the Semantic Web standards. A core
part of both components is quality assessment methods, which provide
the necessary input for the control and assurance tasks.
Quality assessment of Semantic Web Assets (data, services and systems),
in particular, presents new challenges that have not been handled
before in other research areas, so adopting existing approaches to data
quality assessment is not a straightforward solution. These challenges
are related to the openness of the Semantic Web, the diversity of the
information and the unbounded, dynamic set of autonomous data sources,
publishers and consumers (legal and software agents). Additionally,
assessing the quality of available data sources and making that
information explicit is yet another challenge. Moreover, noise in one
data set, or missing links between different data sets, propagates
throughout the Web of Data and imposes great challenges on the data
value chain.
In the case of systems and services, different implementations follow
the specifications for RDF and SPARQL to varying extents, or even
propose and offer new, non-standardized extensions. This causes strong
incompatibilities between systems, e.g., between the SPARQL features
used by query engines and the features supported by RDF stores. This
potential heterogeneity and incompatibility poses several challenges
for quality assessment in and of such systems and services.
Eventually, quality improvement methods are used to further enhance the
value of Semantic Web Assets. One important step in improving the
quality of data is identifying the root cause of a problem and then
designing corresponding data improvement solutions, which select the
most effective and efficient strategies and the related set of
techniques and tools to improve quality. Improving the quality of
products and services entails understanding and improving operational
processes and establishing valid and reliable measures of service
performance.
This Special Issue is addressed to those members of the community
interested in providing novel methodologies or frameworks for managing,
assessing, monitoring, maintaining and improving the quality of
Semantic Web data, services and systems, and in introducing tools and
user interfaces which can effectively assist in this management.
Topics of Interest
We welcome original, high-quality submissions on (but not restricted
to) the following topics:
* Methodologies and frameworks to plan, control, assure or improve the
quality of Semantic Web Assets
* Quality exploration and analysis interfaces
* Quality monitoring
* Developing, deploying and managing quality service ecosystems
* Assessing the quality evolution of Semantic Web Assets
* Large-scale quality assessment of structured datasets
* Crowdsourcing data quality assessment
* Quality assessment leveraging background knowledge
* Use-case driven quality management
* Evaluation of trustworthiness of data
* Web Data and LOD quality benchmarks
* Data Quality improvement methods and frameworks, e.g., linkage,
alignment, cleaning, enrichment, correctness
* Service/system quality improvement methods and frameworks
* Managing sustainability issues in services
* Guarantee of service (availability, performance)
* Systems for transparent management of open data
Today I drafted some short reviews of papers about Wiktionary, from the
backlog of the Wikimedia Research Newsletter. Reviews of the reviews and
edits are welcome before the newsletter is published, in case I missed
or misunderstood something too technical for me. :)
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-09-30/Recen…
I think Wiktionary users can be very proud, as (in its multiple
language editions) it's regularly shown to be an invaluable linguistic
resource, already better than many or all competitors for a wide range
of purposes.
1.2 GLAWI, a free XML-encoded Machine-Readable Dictionary built from the
French Wiktionary
1.3 IWNLP: Inverse Wiktionary for Natural Language Processing
1.4 knoWitiary: A Machine Readable Incarnation of Wiktionary
1.5 Zmorge: A German Morphological Lexicon Extracted from Wiktionary
1.6 Dbnary: Wiktionary as Linked Data for 12 Language Editions with
Enhanced Translation Relations
1.7 Observing Online Dictionary Users: Studies Using Wiktionary Log Files
1.8 Multilingual Open Relation Extraction Using Cross-lingual Projection
Nemo
> Date: Sun, 4 Oct 2015 19:47:58 +0200
> From: "Federico Leva (Nemo)" <nemowiki(a)gmail.com>
> Subject: Re: [Wiktionary-l] Names of Wikimedia languages and other matrix
>
> Milos Rancic, 20/05/2015 10:36:
>> May somebody clarify that Unicode data is free? PD or CC-BY-SA
>> compatible.
>
> Sure. The license __for data and software__ is
> http://unicode.org/copyright.html#Exhibit1, which is a BSD 3-clause
> license with trivial changes to make it even clearer and freer, such as
> the removal of the terms "binary form" and "other materials".
>
> CLDR data is embedded in most free software, e.g. via ICU, packages for
> which are available e.g. in Debian which is notoriously restrictive as
> regards licenses. https://packages.debian.org/search?keywords=icu
>
> Besides, language names are evidently not copyrightable and Unicode
> makes no attempt to state the contrary. (Their license is still useful
> for sad places like EU where the set of names could be considered
> subject to sui generis database rights.)
>
> Nemo
Cross-list posting, as it's relevant to Wiktionary and Wikimedia as a whole.
First of all, please go to the Meta page "Names of Wikimedia languages"
[1] and do your best to proofread or translate items. That's a
strategically important set of lists for the movement. We have to know
the names of Wikimedia languages in Wikimedia languages.
This is the first mobilization for this kind of simple translation: a
few hundred terms, of which this list is the most complex, as it
requires an additional column, "in <this> language".
The next one will be about lexicographical and grammatical terms and
abbreviations. That one is of strategic importance for Wiktionary, as
it allows anyone to generate sane dictionary entries.
After those two lists, we'll be able to start working on the
ornithological dictionary, with somewhat fewer than 400 species.
And now about the number of tanks...
Let's say that there are 250 Wikimedia languages and that we have
three matrix sets: the names of the languages, 100 lexicographical and
grammatical abbreviations and terms, and 400 species from the
ornithological dictionary. And let's say we have those lists translated
into all (250) Wikimedia languages. The numbers are...
* The names of 250 languages *times* in 250 languages (= 62,500 entries
per project) *times* on 250 projects (= 15,625,000 entries on all
projects).
* 100 lexicographical and grammatical terms and abbreviations *times*
250 languages (= 25,000 entries per project) *times* on 250 projects
(= 6,250,000 entries on all projects).
* 400 bird species *times* 250 languages (= 100,000 entries per
project) *times* on 250 projects (= 25,000,000 entries on all
projects).
OK, that calculation is too optimistic. I would be happy if we get
translations in 50 languages. The numbers would then be 125,000
entries for language names, 250,000 entries for lexicographical and
grammatical terms and abbreviations, and 1,000,000 for birds.
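As a minimal sketch, here is the arithmetic above in Python. Note that
in the 50-language scenario the names set shrinks along with the number
of covered languages and projects, which is what reproduces the quoted
figures:

    # entries = terms x target languages (per project) x projects
    def total_entries(terms, languages, projects):
        return terms * languages * projects

    for n in (250, 50):                    # covered Wikimedia languages
        print(total_entries(n, n, n))      # names of the n languages
        print(total_entries(100, n, n))    # lexicographical terms
        print(total_entries(400, n, n))    # bird species
    # n = 250: 15,625,000 / 6,250,000 / 25,000,000
    # n = 50:     125,000 /   250,000 /  1,000,000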
Besides the obvious fact that traditional lexicography isn't that
optimized (note that this is about traditional lexicography, not about
Wiktionary itself, and thus not that fixable) and that we need a
somewhat better method (OmegaWiki, Wikidata; we are developing a proof
of concept as well), there are two other consequences:
1) If we have a set of 400 words and we translate them into 50
languages, we get one million entries. We could be doing that on a
monthly basis. It's not hard at all!
2) In a somewhat more complex form, which requires more work per
matrix set and yields smaller output ("just" the product of the first
and third numbers), this could be used for Wikipedia articles as well.
(You need much more information in an encyclopedic article about the
German language than in a dictionary entry. But it's quite possible to
do. And it's especially important for languages with a small number of
speakers.)
Please go to [1] and help with this translation! Having the names of
Wikimedia languages in Wikimedia languages *is* important, whether
it's about Wiktionary or about generating content. We should know the
names of our languages in our languages.
[1] https://meta.wikimedia.org/wiki/Names_of_Wikimedia_languages