Dear Wikidata community,
We're working on a project called Wikibabel to machine-translate parts of Wikipedia into underserved languages, starting with Swahili.
In hopes that some of our ideas can be helpful to machine translation projects, we wrote a blog post about how we prioritized which pages to translate, and which categories need a human in the loop: https://medium.com/@oirzak/wikibabel-equalizing-information-access-on-a-budget-4038f750e90e
Rumor has it that the Wikidata community has thought deeply about information access. We'd love your feedback on our work. Please let us know about past / ongoing machine translation related projects so we can learn from & collaborate with them.
Best regards, Olya & the Wikibabel crew
Hoi, I am giving a lot of attention to content that deals with Africa. In doing so I also target the Swahili Wikipedia [1] (I have not yet filled in all the red links). At the moment I am adding information to Wikidata about Tanzanian wards, based on sw.wikipedia categories and templates.
Many of the African-language Wikipedias are struggling. By making the lists as complete as possible based on categories and lists, the information becomes more useful and better; it can be, and is, used in the same manner on multiple Wikipedias (at the moment the zu, yo, en and sw Wikipedias). Because the information is made available through Listeria lists, it gets updated as and when new information becomes available.
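For anyone who wants to pull such a list outside Listeria, here is a minimal Python sketch against the Wikidata Query Service. The class item for "ward of Tanzania" is a placeholder (wd:Q000000), so this only illustrates the shape of the query, not the exact modelling; Listeria itself is driven by the same kind of SPARQL query embedded in the wiki page.

import requests

# Sketch: list Tanzanian wards and their parent administrative units from Wikidata.
# wd:Q000000 is a placeholder for the "ward of Tanzania" class; substitute the real item.
QUERY = """
SELECT ?ward ?wardLabel ?parentLabel WHERE {
  ?ward wdt:P31 wd:Q000000 .             # instance of: ward of Tanzania (placeholder)
  OPTIONAL { ?ward wdt:P131 ?parent . }  # located in the administrative territorial entity
  SERVICE wikibase:label { bd:serviceParam wikibase:language "sw,en". }
}
LIMIT 50
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "ward-list-sketch/0.1 (example)"},
)
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["wardLabel"]["value"], "-", row.get("parentLabel", {}).get("value", ""))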
Another notion of mine is that it will help with individual infoboxes, e.g. for politicians, or indeed Tanzanian wards. :)
NB I am a big fan of providing information using machine translation. However, PLEASE consider the lessons learned from the Cebuano Wikipedia and make the texts available in a cached way, not in final form as saved text. Thanks, GerardM
PS when there is something where we can collaborate, please let me know.
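PPS: to make the "cached, not saved" suggestion concrete, here is a rough sketch. The cache key includes the source revision id, so an updated English article automatically yields a fresh translation instead of stale saved text; translate() is a hypothetical stand-in for whatever MT backend is used.

import requests
from functools import lru_cache

API = "https://en.wikipedia.org/w/api.php"

def fetch_current_revision(title):
    """Return (revision_id, wikitext) of the current English article."""
    r = requests.get(API, params={
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "ids|content", "rvslots": "main",
        "format": "json", "formatversion": "2",
    })
    r.raise_for_status()
    rev = r.json()["query"]["pages"][0]["revisions"][0]
    return rev["revid"], rev["slots"]["main"]["content"]

def translate(text, target_lang):
    """Hypothetical stand-in for a machine translation call."""
    raise NotImplementedError("plug in the MT backend of your choice")

@lru_cache(maxsize=10_000)
def _cached_translation(title, rev_id, target_lang):
    # Keyed on rev_id: a new source revision means a new cache entry,
    # so the translation is never frozen as permanent wiki text.
    _, wikitext = fetch_current_revision(title)
    return translate(wikitext, target_lang)

def get_translation(title, target_lang="sw"):
    rev_id, _ = fetch_current_revision(title)
    return _cached_translation(title, rev_id, target_lang)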
[1] https://sw.wikipedia.org/wiki/Mtumiaji:GerardM
Hi Irene, Wikibabel, Gerard, and Wikidatans,
How does Wikidata's new lexicographical project work with regard to Swahili (since it is a Wikipedia / Wikidata language) and with Google Translate / GNMT, given your statement "Our approach leverages Google Translate (https://cloud.google.com/translate/docs/) to make English Wikipedia articles accessible to underserved communities" (from https://medium.com/@oirzak/wikibabel-equalizing-information-access-on-a-budget-4038f750e90e)?
Will a Wikibabel team you help create add Swahili lexemes to the lexicographical project - https://www.wikidata.org/wiki/Wikidata:Lexicographical_data - and will Google GNMT, which is end-to-end translation software (https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYeAdeH-sB_NZGbhyoSWgReACLcB/s1600/image01.gif, from https://ai.googleblog.com/2016/11/zero-shot-translation-with-googles.html), then use this new Swahili lexicographical data by processing it through its algorithms?
(WUaS seeks to facilitate machine translation in all 7,097 living languages, building on Google GNMT; WUaS donated itself to Wikidata for co-development in 2015.)
Cheers, Scott
Hoi, There is no explicit link between the data and the lexicographic data. As a consequence it will not be easy to make use of the existing labels for automated translation services. This has been an explicit architectural decision.
For me it will be interesting to learn how these links will be realised, how existing differences will be reconciled, and how this will impact services such as translation. Thanks, GerardM
Hello Olya and everyone,
Very interesting project! I am working on underserved languages in Wikipedia as well, mainly as part of my research. In our most recent work we experimented with generating Wikipedia summaries from Wikidata facts in underserved languages, which worked quite well [1][2]. The idea is based on the ArticlePlaceholder [3], which displays Wikidata triples on Wikipedia dynamically. Learning from existing Wikipedia articles in the language has the advantage that we can keep cultural and linguistic attributes as they are, which is similar to the human-in-the-loop approach you suggest. While no humans are needed for our summary generation, one of the main drawbacks is that it currently produces just a single introductory sentence. We are planning to experiment with extending this in a project at the end of the year. I am always happy to discuss these topics further, and I would be interested to hear whether there is something in our approach that is helpful for you, and vice versa!
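To make it concrete where the input comes from, here is a crude Python sketch that just pulls an item's label and "instance of" (P31) values from the Wikidata API and fills a fixed template ("ni" is simply "is" in Swahili). This is not the neural model from [1][2], only an illustration of the kind of triples the ArticlePlaceholder surfaces.

import requests

WD_API = "https://www.wikidata.org/w/api.php"

def get_labels(qids, lang="sw"):
    """Fetch labels for a list of item ids, falling back to English."""
    r = requests.get(WD_API, params={
        "action": "wbgetentities", "ids": "|".join(qids), "format": "json",
        "props": "labels", "languages": f"{lang}|en",
    })
    r.raise_for_status()
    out = {}
    for qid, entity in r.json()["entities"].items():
        labels = entity.get("labels", {})
        out[qid] = labels.get(lang, labels.get("en", {})).get("value", qid)
    return out

def one_sentence_stub(qid, lang="sw"):
    """Template-only placeholder sentence: '<label> ni <class>.'"""
    r = requests.get(WD_API, params={
        "action": "wbgetentities", "ids": qid, "format": "json",
        "props": "labels|claims", "languages": f"{lang}|en",
    })
    r.raise_for_status()
    entity = r.json()["entities"][qid]
    labels = entity.get("labels", {})
    label = labels.get(lang, labels.get("en", {})).get("value", qid)
    classes = [
        c["mainsnak"]["datavalue"]["value"]["id"]
        for c in entity.get("claims", {}).get("P31", [])
        if c["mainsnak"].get("snaktype") == "value"
    ]
    if not classes:
        return label
    class_labels = get_labels(classes, lang)
    return f"{label} ni {', '.join(class_labels[c] for c in classes)}."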
All the best, Lucie
[1] preprint: https://2018.eswc-conferences.org/wp-content/uploads/2018/02/ESWC2018_paper_131.pdf
[2] https://arxiv.org/pdf/1803.07116.pdf
[3] https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder
Hoi Lucie, I would really love to work with you on content that is of particular relevance for African Wikipedias. At this time I have added many statements for African politicians; I started with African awards and African geography, and with the structure from wards to country for Tanzania.
What I am really interested in at this time are politicians who held multiple offices.
One reason why the geographical structures are relevant: it is customary to identify places of birth and death precisely, and without these structures it is either fixed text or a higher level, such as a region, for Africans.
By working on the content in Wikidata and sharing Listeria lists, I do find that people become interested. It would help if the representation issues for people who held an office multiple times were resolved; with individual representation, things become even more relevant.
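As a concrete example of the query behind such a list, a sketch for "politicians who held more than one office" could look like the following (run it with the same requests call as the ward sketch earlier in the thread). wd:Q82955 is the item for politician; wd:Q924 is assumed here to be the item for Tanzania, so adjust or drop that filter as needed.

# Politicians with more than one "position held" (P39) statement.
MULTI_OFFICE_QUERY = """
SELECT ?person ?personLabel (COUNT(DISTINCT ?office) AS ?offices) WHERE {
  ?person wdt:P106 wd:Q82955 .   # occupation: politician
  ?person wdt:P27  wd:Q924 .     # country of citizenship: Tanzania (assumed item id)
  ?person wdt:P39  ?office .     # position held
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "sw,en" .
    ?person rdfs:label ?personLabel .
  }
}
GROUP BY ?person ?personLabel
HAVING (COUNT(DISTINCT ?office) > 1)
ORDER BY DESC(?offices)
"""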
NB having texts is a definite boon. Thanks, GerardM
I'm not sure the Wikidata community has thought deeply about it.
One project that does something related to what you're doing is GapFinder ( https://www.mediawiki.org/wiki/GapFinder ). As far as I know, the GapFinder frontend is not actively developed, but the recommendation API behind it is actively maintained and developed; you should ask the Research team for more info (see https://www.mediawiki.org/wiki/Wikimedia_Research ).
Project Tiger is also doing something similar: https://meta.wikimedia.org/wiki/Project_Tiger_Editathon_2018
As a general comment, displaying machine-translated text in a way that makes it appear to have been written by humans is misleading and damaging. I don't know any Swahili, but in the languages that I can read (Russian, Hebrew, Catalan, Spanish, French, German), machine translation is at its best good as an aid for a human writing a translation, and it's never good for actual reading. I also don't understand why you invest credits in pre-machine-translating articles that people can machine-translate for free, but maybe I'm missing something about how your project works.
Hoi, On average there is little or no support for subjects that have to do with Africa. When I check the articles for politicians, for instance, I find that even current presidents, let alone ministers, are missing in African Wikipedias. So it is wonderful that there have been projects that deal with gaps, but what if there is hardly anything?
What this approach brings us is at least information: basic information in lists and infoboxes, maybe an additional line of text.
What we apparently have not done is learn from the Cebuano experience. The biggest issue there was not the quality of the new information; it was the integration with Wikidata: everything was new and it did not link with what we already knew. What we bring in this way is integrated information, and as long as the data is not saved as an article, the quality provided improves as Wikidata gains better intel.
If anything, the experience of the Welsh Wikipedia brings us more than GapFinder or the Tiger editathon, because it is more in line with this approach. Thanks, GerardM
Hi Olya, Lucie, and Wikidatans,
Very interesting projects. And thanks for publishing, Lucie - very helpful!
With regard to Swahili, Arabic (both African languages!) and Esperanto, and leveraging Google Translate / GNMT, I've been looking at this Google GNMT gif image - https://1.bp.blogspot.com/-jwgtcgkgG2o/WDSBrwu9jeI/AAAAAAAABbM/2Eobq-N9_nYeAdeH-sB_NZGbhyoSWgReACLcB/s1600/image01.gif - and wondering how the triplets of Wikidata's Linked Open Data structured Knowledge Base (KB) would stream through this in multiple smaller languages.
I couldn't deduce from this paper - https://arxiv.org/pdf/1803.07116.pdf - from the following passage, for example ...
2.1 Encoding the Triples
The encoder part of the model is a feed-forward architecture that encodes the set of input triples into a fixed-dimensionality vector, which is subsequently used to initialise the decoder. Given a set of un-ordered triples F_E = {f_1, f_2, ..., f_R : f_j = (s_j, p_j, o_j)}, where s_j, p_j and o_j are the one-hot vector representations of the respective subject, property and object of the j-th triple, we compute an embedding h_{f_j} for the j-th triple by forward propagating as follows:

h_{f_j} = q(W_h [W_in s_j ; W_in p_j ; W_in o_j])    (1)
h_{F_E} = W_F [h_{f_1} ; ... ; h_{f_{R-1}} ; h_{f_R}]    (2)

where h_{f_j} is the embedding vector of each triple f_j, h_{F_E} is a fixed-length vector representation for all the input triples F_E, q is a non-linear activation function, [...;...] represents vector concatenation, and W_in, W_h, W_F are trainable weight matrices. Unlike (Chisholm et al., 2017), our encoder is agnostic with respect to the order of input triples. As a result, the order of a particular triple f_j in the triples set does not change its significance towards the computation of the vector representation of the whole triples set, h_{F_E}.
... whether this would address streaming triplets through GNMT?
Would this? And since Swahili, Arabic and Esperanto are all active languages in https://translate.google.com/ , no further coding on the GNMT side would be necessary. (I'm curious how best WUaS could grow small languages that are not yet among Wikipedia/Wikidata's 287-301 languages or GNMT's ~100+ languages.)
How could your Wikidata / Wikibabel work interface with Google GNMT more fully with time, building on your great Wikidata coding/papers?
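For concreteness, below is a small numpy sketch of equations (1) and (2) as quoted above, with toy dimensions I made up; it only shows how a set of triples is packed into one fixed-length vector, and says nothing about how GNMT itself would consume such a vector.

import numpy as np

rng = np.random.default_rng(0)
V, d, h, R, k = 50, 8, 16, 3, 32   # vocab size, embedding dim, hidden dim, number of triples, output dim

W_in = rng.normal(size=(d, V))      # shared embedding matrix for subject, property, object
W_h  = rng.normal(size=(h, 3 * d))  # per-triple projection
W_F  = rng.normal(size=(k, R * h))  # combines all R triple embeddings

def one_hot(i, n=V):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def encode_triples(triples):
    """triples: list of R (s, p, o) index tuples -> fixed-length vector h_FE."""
    h_fs = []
    for s, p, o in triples:
        concat = np.concatenate([W_in @ one_hot(s), W_in @ one_hot(p), W_in @ one_hot(o)])
        h_fs.append(np.tanh(W_h @ concat))   # eq. (1), with q = tanh
    return W_F @ np.concatenate(h_fs)        # eq. (2)

h_FE = encode_triples([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
print(h_FE.shape)   # (32,)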
Cheers, Scott
https://en.wikipedia.org/wiki/User:Scott_WUaS
Dear Gerard, Scott, Lucie and Amir & everyone,
Thank you for the helpful responses!
Gerard - great to hear about your work, and thank you for the reference to the Cebuano Wikipedia. We weren't familiar with that case, but had similar fears. We are putting our machine-translated content on a separate site (Wikibabel) rather than on Wikipedia, and we never expect it to show up nearly as high in search engine results, so the native content will always take precedence.
Scott - for the time being, the Wikibabel experiment will not have a Wikidata or lexicographical portion; for now this is an experiment to see whether machine-translated content in certain categories is useful given the current state of machine translation (and we are doing this on a separate site in order to avoid any contamination). If we measure good user engagement and survey results, we'd love to think about how to integrate this better. We'd love to eventually have an editing interface on top of the translation, alongside existing native content where it exists but has fewer details. If this editing interface becomes popular, then we'll start accumulating a dataset which may be useful to machine translation services. We, however, are not developing anything novel in the machine translation algorithm space.
Lucie - ah, that's what this is! We noticed the large number of recent one sentence articles and were wondering what project that was. Those are awesome! Both in terms of information availability and because it allows us to measure relative interest within the generated set. I would love to discuss further your plans to expand beyond the introductory sentence, and if we can be helpful in any way. Thank you for the publication links as well.
Amir - thank you for the pointers to these projects. Your two points of feedback, if I understand correctly:

1. Machine translation might not be good enough to yield useful information.
2. People can translate the pages for free.

Those are excellent points that we thought about deeply before starting the project (though we find much higher translation quality recently than you perhaps do). Here's what we think:

1. This is exactly the central question of our experiment, and it is very much still open. Machine translation (or at least Google Translate) has improved significantly in the last 12 months or so. The quality of the translation, particularly when there is context (longer sentences), has improved by leaps and bounds. For the Wikibabel project, we spot-checked with Swahili speakers that some pages translate very well (not perfect human level, but very understandable with a few awkward turns of phrase) and some are bad enough not to be useful. Given how little information there is on the Internet in Swahili, particularly on technical topics (which are in some ways easier to translate), and that there aren't many participants in the Swahili Wikipedia, we hypothesize that the best X% of translations will be useful, and that we can tell well-translated pages from poorly translated ones through page analytics and surveys. We are, however, careful to keep this on a separate site (Wikibabel) rather than checking any of it into Wikipedia, because, as you mentioned, that would be misleading.

2. That is absolutely true, and we're fundamentally solving a discoverability problem. If a Swahili speaker currently searches for a term in Swahili, they will not land on the English results for it. They will get some potentially bad results in their language and potentially give up. In general, they would need to know about Wikipedia (or some other good source), that it's better in English, and that Google Translate exists. Some of the folks we're targeting are fairly new to the internet, so this is not a low bar. Moreover, if this is successful we're very interested in being added to free mobile data services such as Free Basics. Those services don't work well with Google Translate, because it requires heavy JavaScript that isn't included in the free data package yet, and some phones they surveyed as common have trouble handling JavaScript well.
Thank you! Olya & the Wikibabel crew
Hi Olya,
There is another topic to consider with translations. Normally a text reflects the reality of a speaker, which doesn't mean that that reality is interesting for a speaker of another language, who might have different circumstances.
For instance, the translation of an article about metabolic pathways might not be so interesting to a farmer in Rwanda, whereas an article about the planting seasons might be more interesting. That article might not exist in English, yet I'm sure the speakers of that language have the knowledge to write an article on the topic.
What I am trying to say is that while translations are interesting to a certain extent, you cannot translate the interest that a person might have. Normally that interest is already shared with the community, so sometimes it is more empowering to give people the tools and explain how to use them, so that the community can create articles about what they find interesting. With that in mind, when articles are translated from English into Swahili, their speakers benefit from our global culture. But when Swahili articles are translated into English, the global culture would also benefit from knowing a local culture better. For that to happen there should be some articles written natively in Swahili.
I also find that it might be more interesting to translate to/from neighbouring languages, for several reasons: a) the languages tend to be similar, so it is easier to use machine translation; b) there is familiarity with the interests that neighbouring cultures have; c) understanding your neighbours is a good basis for peaceful relationships.
In any case I wish you a lot of success in your project, and I hope it benefits many people!
Regards, Micru
Hi Olya,
Sorry for the late reply, but I just wondered if you were aware of Wikitrans[1], which "provides machine-translated versions of Wikipedia articles, completely linked and searchable in the target language, as well as cross language simultaneous Wikipedia searches".
It doesn't use Wikidata, but on the other hand it uses a formalized grammar of the targeted languages. As far as I know the translation software is not open source, but it might be interesting to have a Wikimedia-hosted backup of the translated versions and links toward them in Wikidata, which could then perhaps be used in Wikipedia.
Let me know if this kind of late feedback is welcome or undesired.
Cheers, Mathieu