No, it's not about Jimmy :P It's about software for parsing
dictionaries. We are currently in the somewhat unstable phase of
switching from the name "dictator" to "dicteator" (the etymology is
"dictionary creator").
One of my strategic goals for the movement itself is to create a
methodology for parsing dictionaries and adding them to the
Wikimedia projects (first Wiktionary, but Wikidata and likely some
other projects in the future), and to build a pool of programmers who
keep that knowledge inside the movement.
So, besides the software itself [1], one of the tasks of Milos
Trifunovic, the programmer who is inserting data for the project
Wiktionary Meets Matica Srpska [2], was to create a white paper about
the process itself, which he did [3].
I know there have been numerous previous additions of dictionaries to
the Wiktionaries, but, as far as I know, no systematic effort was put
into disseminating that knowledge.
We are at the beginning of the process. So far, there are ~26k new
entries on Serbian Wiktionary. The code is in its first usable phase.
By the end of this part of the process I expect from a few hundred
thousand to a few million new entries across Wiktionary editions (the
most optimistic estimate is a few dozen million entries; the estimate
varies that much because of many factors, including future community
involvement; one million is a reachable target).
Keep in mind that there are three different stages of the software,
with various levels of usefulness and complexity (see the sketch after
this list):
1) Adding the content to Wiktionary. This depends on the customs of
the particular Wiktionary, can be changed easily and doesn't require
much sophistication, as it's mostly about Pywikibot and wiki syntax.
2) JSON intermediate storage. This is important, as it's the most
formal way to represent dictionary data for future use, independent of
the destination platform. There is room for further development of the
particular format, and your participation will be appreciated.
3) Parsing a particular dictionary. This is dictionary-specific, but a
number of the methods could be shared for parsing other dictionaries.
The more dictionaries we parse, the more common methods we will
develop.
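To make the stages concrete, here is a minimal Python sketch of how
stage 2 output could feed stage 1. The JSON field names and the
wikitext layout are illustrative assumptions, not dicteator's actual
schema (the white paper [3] describes the real process); the Pywikibot
calls themselves are the standard ones:

    import json
    import pywikibot

    # Stage 2: one dictionary entry in a hypothetical intermediate
    # JSON format (illustrative field names, not the real schema).
    entry_json = '''
    {
      "headword": "reka",
      "language": "Serbian",
      "pos": "noun",
      "definitions": ["large natural stream of water"]
    }
    '''

    def entry_to_wikitext(entry):
        # Stage 1 helper: render an entry as wiki syntax. The actual
        # markup depends on the customs of the target Wiktionary;
        # this layout is a placeholder.
        lines = ["== {} ==".format(entry["language"]),
                 "=== {} ===".format(entry["pos"])]
        lines += ["# " + d for d in entry["definitions"]]
        return "\n".join(lines)

    entry = json.loads(entry_json)
    site = pywikibot.Site("sr", "wiktionary")
    page = pywikibot.Page(site, entry["headword"])
    if not page.exists():  # never overwrite an existing entry
        page.text = entry_to_wikitext(entry)
        page.save(summary="Importing a dictionary entry (bot)")

Stage 3, the parser that produces such JSON from a source dictionary,
is the dictionary-specific part and is deliberately left out of the
sketch.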
So, if you are interested in the matter, please go to the talk page
[4] and give your suggestions. Also, don't hesitate to reach out to me
directly.
[1] https://github.com/Interglider/dictator
[2] https://meta.wikimedia.org/wiki/Grants:PEG/Interglider.ORG/Wiktionary_Meets…
[3] https://meta.wikimedia.org/wiki/Dicteator
[4] https://meta.wikimedia.org/wiki/Talk:Dicteator
*Extended Deadline: November 20, 2015*
*CFP: Semantic Web Journal - Special Issue on Quality Management of
Semantic Web Assets (Data, Services and Systems)*
http://www.semantic-web-journal.net/blog/call-papers-special-issue-quality-…
Submission guidelines
*Deadline: October 31, 2015 > November 20, 2015*
Submissions shall be made through the Semantic Web journal website at
http://www.semantic-web-journal.net. Prospective authors should take
note of the submission guidelines posted at
http://www.semantic-web-journal.net/authors. Note that you need to
request an account on the website to submit a paper. Please
indicate in the cover letter that it is for the Special Issue on Quality
Management of Semantic Web Assets (Data, Services and Systems).
Submissions are possible in the following categories: full research
papers, application reports, reports on tools and systems, and case
studies. While there is no upper limit, paper length must be justified
by content.
Guest editors
* Amrapali Zaveri, University of Leipzig, AKSW Group, Germany
* Dimitris Kontokostas, University of Leipzig, AKSW Group, Germany
* Sebastian Hellmann, University of Leipzig, AKSW Group, Germany
* Jürgen Umbrich, Vienna University of Economics and Business, Austria
*Overview and Topics*
The standardization and adoption of Semantic Web technologies has
resulted in a variety of assets, including an unprecedented volume of
semantically enriched data and the systems and services which consume
or publish this data. Although gathering, processing and publishing
data is a step towards further adoption of the Semantic Web, quality
does not yet play a central role in these assets (e.g., in the data
lifecycle or in system/service development).
Quality management essentially refers to the activities and tasks
involved in guaranteeing a certain level of consistency and in meeting
the quality requirements for the assets. In general, quality management
consists of the following four phases and components: (i) quality
planning, (ii) quality control, (iii) quality assurance and (iv)
quality improvement. The quality planning phase in the Semantic Web
typically involves the design of procedures, strategies and policies to
support the management of the assets. The quality control and assurance
components primarily aim at preventing errors and at meeting the
quality requirements pertaining to the Semantic Web standards. A core
part of both components is quality assessment methods, which provide
the necessary input for the control and assurance tasks.
Quality assessment of Semantic Web Assets (data, services and systems),
in particular, presents new challenges that have not been handled
before in other research areas, so adopting existing approaches to data
quality assessment is not a straightforward solution. These challenges
are related to the openness of the Semantic Web, the diversity of the
information and the unbounded, dynamic set of autonomous data sources,
publishers and consumers (legal and software agents). Additionally,
assessing the quality of available data sources and making that
information explicit is yet another challenge. Moreover, noise in one
data set, or missing links between different data sets, propagates
throughout the Web of Data and imposes great challenges on the data
value chain.
In the case of systems and services, different implementations follow
the specifications for RDF and SPARQL to varying extents, or even
propose and offer new, non-standardized extensions. This causes strong
incompatibilities between systems, e.g., between the SPARQL features
used by query engines and the features supported by RDF stores. This
potential heterogeneity and incompatibility poses several challenges
for quality assessment in and of such systems and services.
Eventually, quality improvement methods are used to further enhance the
value of Semantic Web Assets. One important step in improving the
quality of data is identifying the root cause of a problem and then
designing corresponding data improvement solutions, which select the
most effective and efficient strategies and the related set of
techniques and tools to improve quality. Improving the quality of
products and services entails understanding and improving operational
processes and establishing valid and reliable measures of service
performance.
This Special Issue is addressed to those members of the community
interested in providing novel methodologies or frameworks for managing,
assessing, monitoring, maintaining and improving the quality of
Semantic Web data, services and systems, and in introducing tools and
user interfaces which can effectively assist in this management.
Topics of Interest
We welcome original, high-quality submissions on (but not restricted
to) the following topics:
* Methodologies and frameworks to plan, control, assure or improve the
quality of Semantic Web Assets
* Quality exploration and analysis interfaces
* Quality monitoring
* Developing, deploying and managing quality service ecosystems
* Assessing the quality evolution of Semantic Web Assets
* Large-scale quality assessment of structured datasets
* Crowdsourcing data quality assessment
* Quality assessment leveraging background knowledge
* Use-case driven quality management
* Evaluation of trustworthiness of data
* Web Data and LOD quality benchmarks
* Data Quality improvement methods and frameworks, e.g., linkage,
alignment, cleaning, enrichment, correctness
* Service/system quality improvement methods and frameworks
* Managing sustainability issues in services
* Guarantee of service (availability, performance)
* Systems for transparent management of open data
Today I drafted some short reviews of papers about Wiktionary, from the
backlog of the Wikimedia Research Newsletter. Reviews of the reviews and
edits are welcome before the newsletter is published, in case I missed
or misunderstood something too technical for me. :)
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2015-09-30/Recen…
I think Wiktionary users can be very proud, as (in its multiple
language editions) it's regularly shown to be an invaluable linguistic
resource, already better than many or all competitors for a wide range
of purposes.
1.2 GLAWI, a free XML-encoded Machine-Readable Dictionary built from the
French Wiktionary
1.3 IWNLP: Inverse Wiktionary for Natural Language Processing
1.4 knoWitiary: A Machine Readable Incarnation of Wiktionary
1.5 Zmorge: A German Morphological Lexicon Extracted from Wiktionary
1.6 Dbnary: Wiktionary as Linked Data for 12 Language Editions with
Enhanced Translation Relations
1.7 Observing Online Dictionary Users: Studies Using Wiktionary Log Files
1.8 Multilingual Open Relation Extraction Using Cross-lingual Projection
Nemo
> Date: Sun, 4 Oct 2015 19:47:58 +0200
> From: "Federico Leva (Nemo)" <nemowiki(a)gmail.com>
> Subject: Re: [Wiktionary-l] Names of Wikimedia languages and other matrix
>
> Milos Rancic, 20/05/2015 10:36:
>> May somebody clarify that Unicode data is free? PD or CC-BY-SA
>> compatible.
>
> Sure. The license __for data and software__ is
> http://unicode.org/copyright.html#Exhibit1, which is a BSD 3-clause
> license with trivial changes to make it even clearer and freer, such as
> the removal of the terms "binary form" and "other materials".
>
> CLDR data is embedded in most free software, e.g. via ICU, packages for
> which are available e.g. in Debian which is notoriously restrictive as
> regards licenses. https://packages.debian.org/search?keywords=icu
>
> Besides, language names are evidently not copyrightable and Unicode
> makes no attempt to state the contrary. (Their license is still useful
> for sad places like EU where the set of names could be considered
> subject to sui generis database rights.)
>
> Nemo
Cross-list posting, as it's relevant to Wiktionary and Wikimedia as a whole.
First of all, please go to the Meta page "Names of Wikimedia languages"
[1] and do your best to proofread or translate items. That's a
strategically important set of lists for the movement. We have to know
the names of Wikimedia languages in Wikimedia languages.
This is the first mobilization for this kind of simple translation: a
few hundred terms, of which this list is the most complex, as it
requires an additional column, "in <this> language".
The next one will be about lexicographical and grammatical terms and
abbreviations. That one is of strategic importance for Wiktionary, as
it allows anyone to generate sane dictionary entries.
After those two lists, we'll be able to start working on the
ornithological dictionary, with somewhat fewer than 400 species.
And now about the number of tanks...
Let's say that there are 250 Wikimedia languages and that we have
three matrix sets: the names of the languages, 100 lexicographical and
grammatical abbreviations and terms, and 400 species from the
ornithological dictionary. And let's say we have those lists translated
into all (250) Wikimedia languages. The numbers are...
* The names of 250 languages *times* in 250 languages (= 62,500 entries
per project) *times* on 250 projects (= 15,625,000 entries on all
projects).
* 100 lexicographical and grammatical terms and abbreviations *times*
250 languages (= 25,000 entries per project) *times* on 250 projects
(= 6,250,000 entries on all projects).
* 400 bird species *times* 250 languages (= 100,000 entries per
project) *times* on 250 projects (= 25,000,000 entries on all
projects).
OK, that calculation is too optimistic. I would be happy if we get
translations in 50 languages. The numbers would then be 125,000
entries for language names, 250,000 entries for lexicographical and
grammatical terms and abbreviations, and 1,000,000 for birds.
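As a minimal sketch, here is the arithmetic above in Python. Note that
in the 50-language scenario the names set shrinks along with the number
of covered languages and projects, which is what reproduces the quoted
figures:

    # entries = terms x target languages (per project) x projects
    def total_entries(terms, languages, projects):
        return terms * languages * projects

    for n in (250, 50):                    # covered Wikimedia languages
        print(total_entries(n, n, n))      # names of the n languages
        print(total_entries(100, n, n))    # lexicographical terms
        print(total_entries(400, n, n))    # bird species
    # n = 250: 15,625,000 / 6,250,000 / 25,000,000
    # n = 50:     125,000 /   250,000 /  1,000,000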
Besides the obvious fact that traditional lexicography isn't that
optimized (note that this is about traditional lexicography, not about
Wiktionary itself, and thus not that fixable) and that we need a
somewhat better method (OmegaWiki, Wikidata; we are developing a proof
of concept as well), there are two other consequences:
1) If we have a set of 400 words and we translate them into 50
languages, we get one million entries. We could be doing that on a
monthly basis. It's not hard at all!
2) In a somewhat more complex form, which requires more work per
matrix set and yields smaller output ("just" the product of the first
and third numbers), this could be used for Wikipedia articles as well.
(You need much more information in an encyclopedic article about the
German language than in a dictionary entry. But it's quite possible to
do. And it's especially important for languages with a small number of
speakers.)
Please go to [1] and help with this translation! Having the names of
Wikimedia languages in Wikimedia languages *is* important, whether
it's about Wiktionary or about generating content. We should know the
names of our languages in our languages.
[1] https://meta.wikimedia.org/wiki/Names_of_Wikimedia_languages