Through its enabling function and its focus on all of South Africa’s official languages, the South African Centre for Digital Language Resources (SADiLaR) plays a supporting role in the research and development of these languages. This research and development covers language technology in the fields of Engineering and Computer Science, as well as language-related studies in the Humanities and Social Sciences. SADiLaR exercises its mandate by collaborating with several nodes, one of which is the Speech Node at the Council for Scientific and Industrial Research (CSIR), situated in Pretoria.
The Speech Node is involved in localised language technology development. It focuses on speech technologies such as automatic speech recognition (ASR) and text-to-speech (TTS), and on controlled natural language processing for machine-aided translation. The CSIR’s TTS offering, known as Qfrency TTS (www.qfrency.com), is the only commercial TTS product catering for all of the South African official languages. The Qfrency TTS suite consists of 17 TTS voices, male and female, covering all of the official languages, as well as a boy voice.
The Node’s TTS research focuses on improving the naturalness of its TTS voices, with a particular emphasis on tone and prosody in the African languages, and on building TTS voices using state-of-the-art techniques. Its ASR research focuses on semi-supervised harvesting of the audio data required to develop speech recognition systems on a par with international offerings, and on ASR system development for the local languages using state-of-the-art techniques. On the machine translation side, the Speech Node follows a grammar-based (rule-based) approach rather than a more data-driven (statistical) one; the latter approach is followed by the Text Node, known as CTexT. The CSIR uses Grammatical Framework (GF), a formalism that enables the application of rule-based machine translation in contexts that require high levels of accuracy in limited domains.
There are many applications for the speech and language technologies being developed by the Speech Node, including transcription services, keyword spotting, computer-assisted language learning, literacy development, augmented ebooks and TTS as a service. The flagship project of the Speech Node, however, is AwezaMed (www.aweza.co.za), a speech-to-speech translation application that combines speech recognition, GF-based machine-aided translation and TTS to enable health care practitioners to communicate with patients when language barriers exist between them.
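The three-stage flow described above (speech recognition, limited-domain translation, speech synthesis) can be pictured with a minimal sketch. This is an illustration only, not AwezaMed's implementation: every name here (`SpeechToSpeechPipeline`, `recognise`, `translate`, `synthesise`, `PHRASES`) is hypothetical, the stages are stand-in stubs, and the phrase table holds placeholder strings rather than real translations.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-domain phrase table. The target strings are placeholders,
# not real translations.
PHRASES = {
    "where does it hurt": "<target-language phrase 1>",
    "please sit down": "<target-language phrase 2>",
}


def recognise(audio: bytes) -> str:
    """Stand-in for ASR: treats the bytes as UTF-8 text and normalises it."""
    return audio.decode("utf-8").strip().lower()


def translate(text: str) -> str:
    """Stand-in for limited-domain translation: rejects unknown phrases."""
    if text not in PHRASES:
        raise ValueError(f"out of domain: {text!r}")
    return PHRASES[text]


def synthesise(text: str) -> bytes:
    """Stand-in for TTS: returns the text as bytes in place of audio."""
    return text.encode("utf-8")


@dataclass
class SpeechToSpeechPipeline:
    """Chains the three stages; each stage can be swapped independently."""
    recognise: Callable[[bytes], str]
    translate: Callable[[str], str]
    synthesise: Callable[[str], bytes]

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognise(audio)
        target_text = self.translate(source_text)
        return self.synthesise(target_text)


pipeline = SpeechToSpeechPipeline(recognise, translate, synthesise)
```

In this sketch the translation stage fails loudly on input outside its phrase table instead of guessing, which mirrors the stated preference for high accuracy in a limited domain.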
The Speech Node is developing a second product under the working title Augmented Ebooks. This product will combine audio (synthesised or human-narrated) and text into a single title/book, and provide for highlighted reading at word, sentence or paragraph level.
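Highlighted reading of this kind generally depends on knowing which stretch of text corresponds to the current audio position. The sketch below shows one way this could work, assuming word-level timings are available; it is not the Speech Node's design, and the names `Span`, `active_index` and `group_spans`, as well as the sample timings, are invented for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds; the span is active while start <= t < end


def active_index(spans: List[Span], t: float) -> Optional[int]:
    """Return the index of the span to highlight at time t, or None in a gap."""
    starts = [s.start for s in spans]  # spans are assumed sorted by start time
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < spans[i].end:
        return i
    return None


def group_spans(words: List[Span], sizes: List[int]) -> List[Span]:
    """Merge consecutive word spans into larger units (sentences, paragraphs)."""
    grouped, pos = [], 0
    for size in sizes:
        chunk = words[pos:pos + size]
        grouped.append(Span(" ".join(w.text for w in chunk),
                            chunk[0].start, chunk[-1].end))
        pos += size
    return grouped


# Invented sample timings for two short sentences.
words = [
    Span("The", 0.0, 0.3), Span("cat", 0.3, 0.6), Span("sat.", 0.6, 1.0),
    Span("It", 1.2, 1.5), Span("slept.", 1.5, 2.0),
]
sentences = group_spans(words, [3, 2])
```

The same lookup serves every granularity: word-level highlighting queries `words`, while sentence- or paragraph-level highlighting queries the grouped spans.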
Speech or voice interfaces are becoming more prevalent as a form of human-computer interaction (HCI) in modern (often AI-driven) technology, with applications such as Amazon’s Alexa, Apple’s Siri and Google’s Home assistant dominating this new field of applications and interaction. Speech interfaces have the potential to replace many applications, as they are faster and more natural to use than traditional keyboard-based interfaces and they reduce the mental load on the user. Speech interfaces also democratise access to the connected world by replacing complicated visual applications or interfaces with a natural, human-oriented interface, thereby increasing access by lowering the complexity of interaction. Speech is also convenient: it is hands-free, eyes-free and keyboard-free. People speak about 150 words per minute on average, compared with about 40 when typing. Speech interaction can be mastered quickly by young people, older people, people with disabilities and people with low literacy. It can also be applied in situations and on devices where common forms of interaction are challenging, such as while driving, in the dark, or on extremely small wearables. These advantages make speech an increasingly popular medium for devices and applications.
Speech interfaces can play a useful role in a wide range of application environments, including educational settings, kiosks and embedded devices. Significant benefits can be envisioned if information is provided through speech in domains such as agriculture, health care and government services. Gartner estimated that 30% of HCI would take place through conversations with smart machines by the end of 2018.
The importance of the role SADiLaR plays in enabling the development of the resources required to build localised speech and language technologies cannot be overstated. SADiLaR is core to ensuring that our local languages can fulfil their vital role in the Fourth Industrial Revolution.
Contact person:
Dr Karen Calteaux – kcalteaux@csir.co.za