MALEX: a possible model for under-resourced Asian languages

Author: Zuraidah Mohd Don (English Language Standards and Quality Council)
Speaker: Zuraidah Mohd Don
Topic: Language, dialect, sociolect, genre
The (SCOPUS / ISI) SOAS GLOCAL CALA 2019 General Session


Like several other Asian languages, Malay is regarded as an “under-resourced” language for linguistic research, particularly research involving speech and language technologies. Computer scientists and engineers need access to reliable information about languages, as do linguists who have branched out into pragmatics and discourse analysis. Linguists have traditionally carried the relevant information around in their heads, but modern research requires a computer-readable resource that can be accessed by researchers who do not possess the information themselves. This paper describes a resource for Malay which could provide a model for other Asian languages.

The MALEX (“MALay LEXicon”) project began with the development of an automatic grammatical tagger for Malay, i.e. a means of associating a grammatical class with each word of a text. Tagging English draws on well-known linguistic knowledge, but for Malay it was necessary to create a computer-readable infrastructure, beginning with a lexicon recording the spelling and grammatical class of each word. Since one of the researchers knew no Malay, it was necessary to include a meaning for each word, and to devise a rudimentary spelling-to-phoneme algorithm. Adding new words required a stemmer to identify the structure of words, and morphological rules to predict their grammatical class. The tagging of a Malay text thus required a collection of tables and procedures that worked together in the analysis of the language, and which have since provided a resource for further research. Current research involves the development of a parser designed to support two procedures that might seem totally unconnected, namely translation into English and the generation of phonetic specifications for speech synthesis.
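The interplay of lexicon, stemmer and morphological rules described above can be illustrated with a minimal sketch in Python. It is not the MALEX implementation: the three-word lexicon, the two affix rules (the suffix -an taken to form nouns, the prefix ber- taken to form verbs) and all names such as LEXICON, AFFIX_RULES, stem and tag are assumptions made purely for illustration.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    word_class: str  # grammatical class
    gloss: str       # English meaning


# Toy lexicon (hypothetical; not MALEX data).
LEXICON = {
    "makan": Entry("verb", "eat"),
    "jalan": Entry("noun", "road"),
    "baca": Entry("verb", "read"),
}

# Illustrative morphological rules: (affix kind, affix, predicted class).
# These two rules are simplifying assumptions, not the MALEX rule set.
AFFIX_RULES = [
    ("suffix", "an", "noun"),
    ("prefix", "ber", "verb"),
]


def stem(word: str) -> Optional[tuple]:
    """Strip one affix; return (stem, predicted class) if the stem is known."""
    for kind, affix, predicted in AFFIX_RULES:
        if kind == "suffix" and word.endswith(affix):
            candidate = word[: -len(affix)]
        elif kind == "prefix" and word.startswith(affix):
            candidate = word[len(affix):]
        else:
            continue
        if candidate in LEXICON:
            return candidate, predicted
    return None


def tag(word: str) -> str:
    """Look the word up in the lexicon, else fall back on stemming."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w].word_class
    analysis = stem(w)
    return analysis[1] if analysis else "unknown"


def tag_text(text: str) -> list:
    """Tag each whitespace-separated token of a text."""
    return [(token, tag(token)) for token in text.split()]
```

Under these assumptions, a word already in the lexicon (makan) receives its listed class, while an unlisted derived word (makanan, berjalan) is tagged by stripping an affix and applying the corresponding rule. A realistic system would need a far larger lexicon, multiple affixes per word, and context to resolve ambiguity.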

In the absence of a unified theory of language, traditional linguistics places no constraints on combinations of theoretical components. For example, a lecturer might cover phoneme theory and X-bar syntax, and go on to the Sapir-Whorf hypothesis and Grice’s maxims, without questioning their compatibility. The MALEX infrastructure, by contrast, is based entirely on the evidence of spoken and written Malay texts, and has built-in compatibility between components. It is for this reason that MALEX could indicate a way forward for researchers working on other under-resourced Asian languages.

Keywords: MALay LEXicon, Sapir-Whorf, Asian language