Responding to Houcemeddine Turki's original thread (apologies for the
delay).
At the moment, it looks like we can't budge on the NFC normalization. That
takes place in a part of the stack that Abstract Wikipedia/Wikifunctions
doesn't control.
However, there is a workaround. We have a built-in type, Z86, which
represents a single code point. If you decompose your string into code
points beforehand and pass them in as a list, your Python function can
rejoin them into a single string before doing its work (and likewise
decompose any returned strings back into lists of code points). To
demonstrate, if you originally had a function like
def f(some_string):
    # do something
    # result = some_other_string
    return result
you could just do
def f(list_of_code_points):
    some_string = ''.join(list_of_code_points)
    # do something
    # result = some_other_string
    return list(result)
We *might* need to add a third-party module to our Python environment to
make this more robust (I haven't tested this thoroughly with wide
characters). I'd be happy to talk more about this if you'd like.
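To make the round trip concrete, here is a minimal runnable sketch; the
function name and the uppercasing step are hypothetical stand-ins for
whatever your real function does:

```python
def uppercase_via_code_points(list_of_code_points):
    # Rejoin the incoming code points into a single string.
    some_string = ''.join(list_of_code_points)
    # Stand-in for the real work your function would do.
    result = some_string.upper()
    # Decompose the result back into a list of code points.
    # In Python 3, iterating a str yields whole code points, so even
    # astral ("wide") characters are not split into surrogate halves.
    return list(result)

print(uppercase_via_code_points(['a', 'b', 'c']))  # ['A', 'B', 'C']
```

Because Python 3 strings are sequences of code points, `list(result)` is
all the decomposition step needs; no third-party module is required for
this part.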
But of course, the easier solution will be to work with NFC-normalized form
if at all possible :-/.
Hope that helps!
-Cory
On Mon, Sep 20, 2021 at 3:10 PM Cory Massaro <cmassaro(a)wikimedia.org> wrote:
I think I have reproduced this issue. I see that if I pass the string
"كَرَّر" into a Python function using Wikifunctions, I get this Unicode
representation in Python: "\u0643\u064e\u0631\u064e\u0651\u0631", which is
NFC-normalized. But if I work with that string in a local Python
interpreter, I get the unnormalized representation (with \u0651 and \u064e
in their original order).
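For reference, the reordering can be reproduced locally with the
standard-library unicodedata module (a minimal sketch, spelling out كَرَّر
code point by code point): NFC sorts a run of combining marks by canonical
combining class, which moves the fatha (class 30) ahead of the shadda
(class 33):

```python
import unicodedata

# kaf, fatha, ra, shadda, fatha, ra -- the order as typed
original = "\u0643\u064e\u0631\u0651\u064e\u0631"
normalized = unicodedata.normalize("NFC", original)

# NFC reorders the shadda/fatha pair on the second letter.
print(normalized == "\u0643\u064e\u0631\u064e\u0651\u0631")  # True

# The canonical combining classes that drive the reordering.
print(unicodedata.combining("\u064e"), unicodedata.combining("\u0651"))  # 30 33
```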
I see that part of our stack expects NFC-normalized input, but it
unfortunately isn't a part of the stack our team controls. We might be able
to bypass this behavior; perhaps one of my colleagues can chime in. I will
ask around.
If not, I can think of things our team could do to work around this, but
we'd need to discuss (and I'm not sure the user experience would be good).
This looks really important for us to address, so thanks for bringing it
up! Will follow up if I get more information.
On Mon, Sep 20, 2021 at 2:46 PM Houcemeddine A. Turki <
turkiabdelwaheb(a)hotmail.fr> wrote:
Dear Sir,
I thank you for your answer. I have just discussed with Mahir Morshed who
is a proficient contributor to Lexicographical Data and has an advanced
knowledge on natural language processing techniques. We found out that the
matter is caused by the use of Unicode NFC Normalization that seems to
alter the order of Arabic diacritics in words. I ask whether this can be
fixed now or not.
Yours Sincerely,
Houcemeddine Turki
------------------------------
*From:* Cory Massaro <cmassaro(a)wikimedia.org>
*Sent:* Monday, September 20, 2021 19:54
*To:* General public mailing list for the discussion of Abstract
Wikipedia and Wikifunctions <abstract-wikipedia(a)lists.wikimedia.org>
*Subject:* [Abstract-wikipedia] Re: Abstract Wikipedia and Arabic
Hi Houcemeddine,
It is awesome that you're writing these functions already! I'm curious
about your first point concerning the order of diacritics in Arabic. Can
you say more? Can you perhaps give an example of what you tried to do, what
you expected to happen, and the result you got? I would like to try to
reproduce (and hopefully fix) that problem!
Thank you,
Cory
On Sun, Sep 19, 2021 at 3:48 AM Houcemeddine A. Turki <
turkiabdelwaheb(a)hotmail.fr> wrote:
Dear all,
I thank you for your contributions to the Wikifunctions Project. As an
end user of the Wikifunctions Project, I have been invited to speak at
WikiArabia about Wikifunctions and Abstract Wikipedia in Arabic. That is
why I developed and implemented several linguistic functions for Arabic
Languages:
- Root and Pattern-Based Generator of Lexemes for Arabic Languages
(Z10157)
- Pattern-Root Compatibility Verifier for Arabic Languages (Z10160)
- IPA Generator for Diacritized Arabic Script Texts in Tunisian
Arabic (Z10163)
This implies the creation of Python codes for the three functions, the
development of test functions and the description of the developed
functions. When developing the functions, I have found several matters that
can be solved in the next few months:
1. When a word assigns two Arabic diacritics to a letter, this can
cause problems for the system. For example, كَرَّر has two Arabic
diacritics (a shaddah and a fatha) on its second letter. The shaddah
should come before the fatha, as its effect should apply first. The
Wikifunctions compilers do not handle this correctly, and this can harm
the processing of the languages using the Arabic script. This should be
fixed.
2. The indentation of the source code must be done by hand after
pasting the code into the field. There is no automatic indentation for
pasted source code. This can harm the user experience.
3. The mobile edition of the website does not work. Lucas Werkmeister
has raised a ticket about this (T291325).
4. All these linguistic functions are taken from reference grammar
books. It will be interesting to have a function that assigns a Wikidata
item as a reference of a Wikifunctions function.
5. The runtime of the website is significantly long. Several
efforts should be made to make this project quicker.
6. It will be interesting to align inputs with their corresponding
Wikidata items to have better semantics for the functions.
7. System messages are not entirely user-friendly. This can be
fixed.
8. The token for the connection to NotWikiLambda does not allow a
long session; it disconnects roughly every fifteen minutes.
Yours Sincerely,
Houcemeddine Turki
_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia(a)lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikime…