Hello Ambassadors - This technical question may be relevant to multiple (particularly CJK) language communities so I'm asking it here.
What is the advice for writing a Lua script that needs to look up data from a big table (~10k rows at first deployment, potentially increasing in the future)? Does one hard-code the data into a Lua script, or is there a recommended data structure for storing those?
The design problem at hand is that the Cantonese Wikipedia wants to re-sort articles by Jyutping rather than Unicode. This will probably involve automating the generation of Jyutping phonetic guides by looking up the Jyutping transcription of common Chinese characters using a Lua module. Where do we store the data?
If another wiki has done similar things, we'd be interested in sharing the infrastructure.
Deryck On behalf of the Cantonese Wikipedia community
Consider using https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual#mw.l... , keeping in mind that Lua isn't really designed with the use case of huge data tables in mind, so you might run into limits if your data is really big.
-- Bawolff
Wikitech-ambassadors mailing list Wikitech-ambassadors@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
Deryck,
I am not sure what you mean by "re-sort" articles, but if what you mean is that categories should be sorted differently, then I don't think Lua is the answer; it would need to be handled on the back end.
More generally, I think that for a problem like yours, Lua is not the right tool. I would recommend investing the time to create a MediaWiki extension instead, and then working through the WMF processes to have it enabled on Cantonese Wikipedia (and possibly the entire family of Cantonese WMF wikis).
Lastly, as far as where to store the data, have you considered Wikidata? I'm not sure whether Wikidata already supports storing pronunciations of words, but I'm assuming that would be of interest to that project anyway.
Hope this helps! Huji
On that note, I'll mention there was a previous attempt by Liangent to do better sorting of categories for zhwiki, which unfortunately never got reviewed and so ended up bitrotting: https://phabricator.wikimedia.org/T46667
I agree generally that if you want to change the sort order for categories, Lua is not the best place to implement that.
-- bawolff
bawolff - Would you be able to point me to an example of mw.loadData?
Also, I've subscribed to https://phabricator.wikimedia.org/T46667 .
Huji - I was inspired by Japanese Wikipedia's approach to sorting - they have a {{DEFAULTSORT:[article name in hiragana]}} on all articles. Since Cantonese pronunciation is even more predictable than Japanese, we could potentially have a template that automatically adds {{DEFAULTSORT:[article title in Jyutping]}} using a Lua lookup table of all common Chinese characters. Exceptional pronunciations should then be coded individually. The Pinyin implementation of this would be equivalent, though it would depend on the zh.wp community agreeing on sorting things by Pinyin.
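The lookup described above can be sketched as follows. This is only an illustration in Python of the logic, not the on-wiki Lua module itself; the function name `jyutping_sort_key`, the two tables, and the handful of sample readings are all hypothetical stand-ins for the real ~10k-row data.

```python
# Hypothetical sample data: a real table would hold roughly 10k characters.
JYUTPING = {
    "香": "hoeng1",
    "港": "gong2",
    "廣": "gwong2",
    "東": "dung1",
    "話": "waa2",
}

# Hypothetical per-title overrides for exceptional pronunciations,
# which a character-by-character lookup cannot get right.
OVERRIDES = {
    "銀行": "ngan4 hong4",
}


def jyutping_sort_key(title):
    """Return a sort key for a title: an override if one exists,
    otherwise the per-character readings joined by spaces."""
    if title in OVERRIDES:
        return OVERRIDES[title]
    # Characters missing from the table are passed through unchanged,
    # so Latin letters and digits in a title still sort predictably.
    return " ".join(JYUTPING.get(ch, ch) for ch in title)


print(jyutping_sort_key("香港"))
print(jyutping_sort_key("銀行"))
```

The override table is consulted first so that individually coded exceptions always win over the automatic lookup.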
In terms of storing the data, Wikidata is not a good answer. First, the Wikidata property creators community has rejected the notion of creating separate properties for each common phonetic transcription system of CJK languages, so retrieving Jyutping transcriptions from Wikidata would be unnecessarily complicated. Second, Wikidata items refer to concepts, not titles. We could theoretically ask the script to go to Lexemes to fetch the phonetic transcription, but that would involve untangling the multiple Lexemes that refer to the same Chinese character. In general, the way Wikidata is structured makes it a bad fit for the problem at hand.
Liangent's formulation of the problem is more general than the one I described, because T46667 aims to allow multiple ways of sorting Chinese characters within the same interface. That would be very welcome too.
The additional information you provided was helpful. I still think the best approach is to have an extension that returns the Jyutping value for the article title. Let's say that extension introduces a new magic word called {{TITLEINJYUPTING}}. That way you can add {{DEFAULTSORT:{{TITLEINJYUPTING}}}} to the bottom of the pages; and for exceptional pronunciations you can use {{DEFAULTSORT:[special Jyutping pronunciation]}} instead. Alternatively, you could make it a parser function like {{DEFAULTSORT:{{#JYUPTING:{{PAGETITLE}}}}}} or something like that.
If Jyutping is as predictable as you state, then making an extension should be a good idea because (a) it can be used by non-WMF wikis too, without having to set up Scribunto, etc. and (b) it will be notably faster, because it runs directly in PHP, and not through the additional layer of Lua.
So in this scenario, are all categories planned to be sorted via Jyutping? If so, we could make a collation, in which case categories would automatically be sorted that way, and you would just put the page in a category in the normal way (by doing [[Category:Foo]] with no sortkey). The downside would be that all categories would have to use Jyutping.
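To illustrate what a different sort order means in practice, here is a small Python sketch comparing code-point order with order by a Jyutping key. It is not a MediaWiki collation (that would live in the back end); the `SORT_KEYS` table and its three sample readings are hypothetical.

```python
# Hypothetical sort keys for three sample titles.
SORT_KEYS = {
    "香港": "hoeng1 gong2",
    "廣東": "gwong2 dung1",
    "澳門": "ou3 mun2",
}

titles = ["香港", "廣東", "澳門"]

# Python compares strings by Unicode code point, which is close to
# what an unmodified "sort by Unicode" order gives.
by_code_point = sorted(titles)

# Sorting by the Jyutping key instead; titles without a key fall
# back to the title itself.
by_jyutping = sorted(titles, key=lambda t: SORT_KEYS.get(t, t))

print(by_code_point)
print(by_jyutping)
```

With a collation the key would be computed for every page automatically, whereas with {{DEFAULTSORT:...}} it has to be emitted on each page.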
mw.loadData format
It's just a normal Lua table. https://www.mediawiki.org/wiki/Module:ExtensionJson is an example. There are of course limits on the maximum size of a page (I think it's 1 or 2 MB). But based on the size of https://raw.githubusercontent.com/MacroYau/PyJyutping/master/pyjyutping/data... you will probably be within the size limits.
> (b) it will be notably faster, because it runs directly on PHP, and not through the additional layer of Lua.
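As a rough sketch of how such a data page could be produced, the Python snippet below turns a character-to-reading mapping into the text of a Lua module that returns a plain table, and measures its size in bytes. The helper name `to_lua_data_module`, the sample entries, and the 1,000,000-byte threshold (the lower of the two guesses above) are assumptions; it also assumes keys and values contain no quotes or backslashes, so no escaping is done.

```python
# Assumed limit for illustration only; the real page-size limit
# should be checked on the target wiki.
ASSUMED_MAX_BYTES = 1_000_000


def to_lua_data_module(table):
    """Render a {character: reading} dict as Lua source that returns
    a plain table. Assumes no quotes or backslashes in the data."""
    lines = ["return {"]
    for char, reading in sorted(table.items()):
        lines.append('\t["{}"] = "{}",'.format(char, reading))
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical two-row sample; a real table would have ~10k rows.
sample = {"香": "hoeng1", "港": "gong2"}
module_text = to_lua_data_module(sample)
size_bytes = len(module_text.encode("utf-8"))

print(module_text)
print(size_bytes, "bytes; within assumed limit:", size_bytes < ASSUMED_MAX_BYTES)
```

Each row in this layout takes on the order of 20 bytes, so a table of about 10k rows would be expected to land well under the assumed limit. On the wiki side, the saved page would be read with `mw.loadData`, passing the data module's page name.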
Personally, I am unconvinced the speed will be significantly different.
-- Brian