Responding to Houcemeddine Turki's original thread (apologies for the
delay).
At the moment, it looks like we can't budge on the NFC normalization. That
takes place in a part of the stack that Abstract Wikipedia/Wikifunctions
doesn't control.
However, there is a workaround. We have a built-in type, Z86, which
represents a single code point. If you decompose your string into code
points beforehand and pass them in as a list, your Python function can
rejoin them into a single string before doing its work (and likewise
decompose any returned strings back into lists of code points). To
demonstrate, if you originally had a function like
def f(some_string):
    # do something
    # result = some_other_string
    return result
you could just do
def f(list_of_code_points):
    some_string = ''.join(list_of_code_points)
    # do something
    # result = some_other_string
    return list(result)
We *might* need to add a third-party module to our Python environment to
make this more robust (I haven't tested this thoroughly with wide
characters). I'd be happy to talk more about this if you'd like.
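To make the round trip concrete, here is a minimal runnable sketch; the
function name and the uppercasing step are hypothetical stand-ins for
whatever your real function does:

```python
def uppercase_via_code_points(list_of_code_points):
    # Rejoin the incoming code points into a single string.
    some_string = ''.join(list_of_code_points)
    # Stand-in for the real work your function would do.
    result = some_string.upper()
    # Decompose the result back into a list of code points.
    # In Python 3, iterating a str yields whole code points, so even
    # astral ("wide") characters are not split into surrogate halves.
    return list(result)

print(uppercase_via_code_points(['a', 'b', 'c']))  # ['A', 'B', 'C']
```

Because Python 3 strings are sequences of code points, `list(result)` is
all the decomposition step needs; no third-party module is required for
this part.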
But of course, the easier solution will be to work with NFC-normalized form
if at all possible :-/.
Hope that helps!
-Cory
On Mon, Sep 20, 2021 at 3:10 PM Cory Massaro <cmassaro(a)wikimedia.org> wrote:
I think I have reproduced this issue. I see that if I pass the string
"كَرَّر" into a Python function using Wikifunctions, I get this Unicode
representation in Python: "\u0643\u064e\u0631\u064e\u0651\u0631", which is
NFC-normalized. But if I work with that string in a local Python
interpreter, I get the unnormalized representation (with \u0651 and \u064e
in their original order).
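For reference, the reordering can be reproduced locally with the
standard-library unicodedata module (a minimal sketch, spelling out كَرَّر
code point by code point): NFC sorts a run of combining marks by canonical
combining class, which moves the fatha (class 30) ahead of the shadda
(class 33):

```python
import unicodedata

# kaf, fatha, ra, shadda, fatha, ra -- the order as typed
original = "\u0643\u064e\u0631\u0651\u064e\u0631"
normalized = unicodedata.normalize("NFC", original)

# NFC reorders the shadda/fatha pair on the second letter.
print(normalized == "\u0643\u064e\u0631\u064e\u0651\u0631")  # True

# The canonical combining classes that drive the reordering.
print(unicodedata.combining("\u064e"), unicodedata.combining("\u0651"))  # 30 33
```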
I see that part of our stack expects NFC-normalized input, but it
unfortunately isn't a part of the stack our team controls. We might be able
to bypass this behavior; perhaps one of my colleagues can chime in. I will
ask around.
If not, I can think of things our team could do to work around this, but
we'd need to discuss (and I'm not sure the user experience would be good).
This looks really important for us to address, so thanks for bringing it
up! Will follow up if I get more information.
On Mon, Sep 20, 2021 at 2:46 PM Houcemeddine A. Turki <
turkiabdelwaheb(a)hotmail.fr> wrote:
Dear Sir,
I thank you for your answer. I have just discussed with Mahir Morshed who
is a proficient contributor to Lexicographical Data and has an advanced
knowledge on natural language processing techniques. We found out that the
matter is caused by the use of Unicode NFC Normalization that seems to
alter the order of Arabic diacritics in words. I ask whether this can be
fixed now or not.
Yours Sincerely,
Houcemeddine Turki
------------------------------
*From:* Cory Massaro <cmassaro(a)wikimedia.org>
*Sent:* Monday, September 20, 2021 19:54
*To:* General public mailing list for the discussion of Abstract
Wikipedia and Wikifunctions <abstract-wikipedia(a)lists.wikimedia.org>
*Subject:* [Abstract-wikipedia] Re: Abstract Wikipedia and Arabic
Hi Houcemeddine,
It is awesome that you're writing these functions already! I'm curious
about your first point concerning the order of diacritics in Arabic. Can
you say more? Can you perhaps give an example of what you tried to do, what
you expected to happen, and the result you got? I would like to try to
reproduce (and hopefully fix) that problem!
Thank you,
Cory
On Sun, Sep 19, 2021 at 3:48 AM Houcemeddine A. Turki <
turkiabdelwaheb(a)hotmail.fr> wrote:
Dear all,
I thank you for your contributions to the Wikifunctions Project. As an
end user of the Wikifunctions Project, I have been invited to speak at
WikiArabia about Wikifunctions and Abstract Wikipedia in Arabic. That is
why I developed and implemented several linguistic functions for Arabic
Languages:
- Root and Pattern-Based Generator of Lexemes for Arabic Languages
(Z10157)
- Pattern-Root Compatibility Verifier for Arabic Languages (Z10160)
- IPA Generator for Diacritized Arabic Script Texts in Tunisian
Arabic (Z10163)
This implies the creation of Python codes for the three functions, the
development of test functions and the description of the developed
functions. When developing the functions, I have found several matters that
can be solved in the next few months:
1. When a word assigns two Arabic diacritics to a letter, this can
cause problems for the system. For example, كَرَّر has two Arabic
diacritics (a shaddah and a fatha) on its second letter. The shaddah
should come before the fatha, as its effect should apply first. The
Wikifunctions compilers do not handle this correctly, and this can harm
the processing of the languages using the Arabic script. This should be
fixed.
2. The indentation of the source code must be done by hand after
pasting the code into the field. There is no automatic indentation for
pasted source code. This can harm the user experience.
3. The mobile edition of the website does not work. Lucas Werkmeister
has raised a ticket about this (T291325).
4. All these linguistic functions are taken from reference grammar
books. It will be interesting to have a function that assigns a Wikidata
item as a reference of a Wikifunctions function.
5. The runtime of the website is significantly long. Several
efforts should be made to make this project quicker.
6. It will be interesting to align inputs with their corresponding
Wikidata items to have better semantics for the functions.
7. System messages are not entirely user-friendly. This can be
fixed.
8. The token for the connection to NotWikiLambda does not allow a
long session; it disconnects roughly every fifteen minutes.
Yours Sincerely,
Houcemeddine Turki
_______________________________________________
Abstract-Wikipedia mailing list -- abstract-wikipedia(a)lists.wikimedia.org
List information:
https://lists.wikimedia.org/postorius/lists/abstract-wikipedia.lists.wikime…