Responding to Houcemeddine Turki's original thread (apologies for the delay).

At the moment, it looks like we can't budge on the NFC normalization. That takes place in a part of the stack that Abstract Wikipedia/Wikifunctions doesn't control.

However, there is a workaround. We have a built-in type, Z86, which represents a single code point. If you can decompose your string into code points beforehand and pass them in as a list, your Python function can join that list back into a single string before doing its work (and likewise decompose any returned string into a list of code points). To demonstrate, if you originally had a function like

def f(some_string):
    # do something
    # result = some_other_string
    return result
   
you could just do

def f(list_of_code_points):
    some_string = ''.join(list_of_code_points)
    # do something
    # result = some_other_string
    return list(result)

We might need to add a third-party module to our Python environment to make this more robust (I haven't tested this thoroughly with wide characters). I'd be happy to talk more about this if you'd like.
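
In case it helps, here is a slightly fuller sketch of the same idea that can be tried in a local interpreter. The names with_code_point_lists and echo are made up for illustration, and I'm assuming the list of Z86 code points would reach the function as a Python list of one-character strings, which I haven't confirmed:

```python
import unicodedata


def with_code_point_lists(string_function):
    """Adapt a str -> str function so it takes and returns lists of code points.

    Hypothetical helper: assumes the input arrives as a list of
    one-character Python strings.
    """
    def wrapper(list_of_code_points):
        some_string = ''.join(list_of_code_points)
        result = string_function(some_string)
        # In Python 3, iterating over a str yields one code point at a time.
        return list(result)
    return wrapper


@with_code_point_lists
def echo(some_string):
    # Stand-in for the real work: return the input unchanged.
    return some_string


# Shaddah (U+0651) before fatha (U+064E), i.e. not NFC-normalized.
word = "\u0643\u064e\u0631\u0651\u064e\u0631"
round_tripped = ''.join(echo(list(word)))

print(round_tripped == word)                          # order of diacritics kept
print(unicodedata.normalize("NFC", word) == word)     # False: word is not in NFC
```

Because the string only ever exists inside the Python function, nothing upstream gets a chance to normalize it as a whole.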

But of course, the easier solution will be to work with NFC-normalized form if at all possible :-/.

Hope that helps!
-Cory

On Mon, Sep 20, 2021 at 3:10 PM Cory Massaro <cmassaro@wikimedia.org> wrote:
I think I have reproduced this issue. I see that if I pass the string, "كَرَّر", into a Python function using Wikifunctions, I get this Unicode representation in Python: "\u0643\u064e\u0631\u064e\u0651\u0631", which is NFC normalized. But if I work with that string in a local Python interpreter, I get the unnormalized representation (with \u0651 and \u064e in their original order).
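
For anyone who wants to see the reordering locally, here is a minimal sketch using Python's standard unicodedata module. I'm assuming the original input had the shaddah (U+0651) before the fatha (U+064E); the escape helper is just for display:

```python
import unicodedata

# Assumed original input: shaddah (U+0651) typed before fatha (U+064E).
original = "\u0643\u064e\u0631\u0651\u064e\u0631"
normalized = unicodedata.normalize("NFC", original)


def escape(text):
    """Render each code point as a \\uXXXX escape for easy comparison."""
    return "".join(f"\\u{ord(ch):04x}" for ch in text)


print(escape(original))    # \u0643\u064e\u0631\u0651\u064e\u0631
print(escape(normalized))  # \u0643\u064e\u0631\u064e\u0651\u0631

# Canonical combining classes: the lower class sorts first under normalization.
print(unicodedata.combining("\u064e"), unicodedata.combining("\u0651"))  # 30 33
```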

I see that part of our stack expects NFC-normalized input, but unfortunately it isn't a part of the stack our team controls. We might be able to bypass this behavior; perhaps one of my colleagues can chime in. I will ask around.

If not, I can think of things our team could do to work around this, but we'd need to discuss (and I'm not sure the user experience would be good).

This looks really important for us to address, so thanks for bringing it up! Will follow up if I get more information.

On Mon, Sep 20, 2021 at 2:46 PM Houcemeddine A. Turki <turkiabdelwaheb@hotmail.fr> wrote:
Dear Mr.,
Thank you for your answer. I have just discussed this with Mahir Morshed, who is a proficient contributor to Lexicographical Data and has advanced knowledge of natural language processing techniques. We found that the problem is caused by Unicode NFC normalization, which seems to alter the order of Arabic diacritics in words. Could you tell me whether this can be fixed now?
Yours Sincerely,
Houcemeddine Turki

From: Cory Massaro <cmassaro@wikimedia.org>
Sent: Monday, September 20, 2021 19:54
To: General public mailing list for the discussion of Abstract Wikipedia and Wikifunctions <abstract-wikipedia@lists.wikimedia.org>
Subject: [Abstract-wikipedia] Re: Abstract Wikipedia and Arabic
 
Hi Houcemeddine,

It is awesome that you're writing these functions already! I'm curious about your first point concerning the order of diacritics in Arabic. Can you say more? Can you perhaps give an example of what you tried to do, what you expected to happen, and the result you got? I would like to try to reproduce (and hopefully fix) that problem!

Thank you,
Cory

On Sun, Sep 19, 2021 at 3:48 AM Houcemeddine A. Turki <turkiabdelwaheb@hotmail.fr> wrote:
Dear all,
I thank you for your contributions to the Wikifunctions Project. As an end user of the Wikifunctions Project, I have been invited to speak at WikiArabia about Wikifunctions and Abstract Wikipedia in Arabic. That is why I developed and implemented several linguistic functions for Arabic Languages:
  • Root and Pattern-Based Generator of Lexemes for Arabic Languages (Z10157)
  • Pattern-Root Compatibility Verifier for Arabic Languages (Z10160)
  • IPA Generator for Diacritized Arabic Script Texts in Tunisian Arabic (Z10163)
This involved writing Python code for the three functions, developing test functions, and describing the developed functions. While developing the functions, I found several issues that could be solved in the next few months:
  1. When a word assigns two Arabic diacritics to a letter, the system does not handle it correctly. For example, كَرَّر has two Arabic diacritics (a shaddah and a fatha) on its second letter. The shaddah should be below the fatha, as its effect should come first. The Wikifunctions compilers do not take this into account, and this can harm the processing of languages that use the Arabic script. This should be fixed.
  2. The indentation of source code has to be done by hand after pasting the code into the field. There is no automatic indentation for pasted source code. This can harm the user experience.
  3. The mobile edition of the website does not work. Lucas Werkmeister has raised a ticket about this (T291325).
  4. All these linguistic functions are taken from reference grammar books. It will be interesting to have a function that assigns a Wikidata item as a reference of a Wikifunctions function.
  5. The runtime of the website is significantly long. Effort should be made to make this project faster.
  6. It will be interesting to align inputs with their corresponding Wikidata items to have better semantics for the functions.
  7. System messages are not very user-friendly. This can be fixed.
  8. The token for the connection to NotWikiLambda does not allow a long connection. It disconnects almost every fifteen minutes.
Yours Sincerely,
Houcemeddine Turki
_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia@lists.wikimedia.org
List information: https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikimedia.org/