SADiLaR Projects

African Wordnet and Linguistic Terminology

The African Wordnet (AWN) and Linguistic Terminology project is two-pronged and concerns the development of language resources in the form of wordnets for a variety of African languages as well as the development of linguistics terminology for all official African languages. The project comprises two work packages: Work Package 1 deals with expanding the scope of the existing African Wordnet while Work Package 2 involves the expansion of the Open Educational Resource Term Bank (OERTB) with newly extracted linguistic terminology.

Node project which started on: 1 October 2017
Status: Project completed - Finalising


African Wordnet and Linguistic Terminology - Extension of Phase 1 

The African Wordnet (AWN) and Linguistic Terminology project is two-pronged and concerns the development of language resources in the form of wordnets for a variety of African languages as well as the development of linguistics terminology for all official African languages. In the initial project, development of the AWN was limited to 7 of the indigenous South African languages. With this expansion, we plan to add the remaining 2 languages so that all 9 indigenous South African languages are represented in the African Wordnet. This will stimulate further development for all languages and create a platform for further wordnet development in any of them.

The extended project comprises two work packages: Work Package 1 deals with expanding the scope of the existing African Wordnet, including the use of AWN data for language learning, while Work Package 2 involves a wrap-up workshop for the Open Educational Resource Term Bank (OERTB) with newly extracted linguistic terminology.

Project which started on: 1 October 2017
Status: Project completed - Finalising


African Wordnet Development: Phase 2

The African Wordnet (AWN) project concerns the development of language resources in the form of wordnets for 9 African languages.

Node project which started on: 15 June 2019
Status: Project in progress


Communicative Development Inventories for all eleven of South Africa’s official languages – Phase Two

Communicative Development Inventories for six languages: Sesotho sa Leboa, Tshivenda, isiNdebele, Siswati, isiZulu and South African English.

Node project which started on: 1 January 2021
Status: Project in progress


Communicative Development Inventories for five South African languages

The aim of this project is to collect and digitize data on children's language development from 8 to 30 months and, from these data, to construct and validate Communicative Development Inventories (CDIs). CDIs are parent-completed questionnaires (for infants aged 8-18 months and toddlers aged 16-30 months) about children's vocabulary, gesture and grammatical abilities. The inventories are planned for all official South African languages, beginning with five: Setswana, Sesotho, isiXhosa, Xitsonga and Afrikaans.

Open Call project which started on: 1 November 2018
Status: Project completed

 


Corpus and system development for automatic captioning of official speeches

The primary aim of the proposed project is to create a corpus of automatically transcribed government speeches. The CSIR proposes to start with the current president (Mr Cyril Ramaphosa) and then expand the corpus with speeches made by previous presidents and/or other members of parliament. A secondary aim is to initiate the development of an automatic speech recognition system that could serve as a first step towards addressing the need for automatic captioning expressed by the Department of Government Communications.

Node project which started on: 1 April 2020
Status: Project in progress


Development of a multi-level, multi-genre learner corpus of academic writing

Development of a multi-genre, multi-level learner corpus of academic writing in order to develop, refine and implement an online academic writing tool.

Node project which started on: 1 March 2017
Status: Project completed


Enabling Localised language technology applications: A Computational Wide coverage resource grammar for isiZulu

The aim of this project is to develop a wide-coverage resource grammar for isiZulu, and to make it available to the research community in a variety of ways.

Node project which started on: 1 April 2020
Status: Project in progress


Escalator

Development and execution of a Digital Humanities Champions programme.

HUB project which started on: 1 December 2020
Status: Project in progress


Expansion and further refinement of a multi-level, multi-genre learner corpus of academic writing

Collect additional data from various universities in South Africa to grow the multi-level, multi-genre learner corpus of academic writing.

Node project which started on: 1 January 2020
Status: Project in progress


Exploring fair and unbiased testing

Creation of a protocol for fair and unbiased testing.

Node project which started on: 1 March 2017
Status: Project completed


Extended digitization of Language resources-B

Generation of clear protocols relating to the digitization project managed by the UP node that will enable the UP node to build language resources for the official South African languages through the digitization of relevant analogue text, graphic, audio and video data.

Node project which started on: 1 April 2020
Status: Project completed


Extended digitization of Language resources-C

Building language resources for the indigenous South African languages through digitization of language and language-related text, audio, online and video data. This project continues the mass digitization of material in all 11 official languages of South Africa. Digitization will also include digital resources for specific needs and projects.

Node project which started on: 1 October 2020
Status: Project in progress


Extension of the Expansion and further refinement of a multi-level, multi-genre learner corpus of academic writing: Translation of writing Resources in African languages

The project aims to translate the writing resources developed as part of the ICELDA node project "Expansion and further refinement of a multi-level, multi-genre learner corpus of academic writing" and to make them available in all official languages of South Africa, including South African Sign Language, for open-source release (CC BY 4.0).

Node project which started on: 1 August 2020
Status: Project in progress


Harvesting existing sources of speech data for HLT development in South Africa

The aim of the proposed project is to explore different possibilities for the (semi-)automatic harvesting of existing sources of speech data to create resources that can be used to develop new speech technologies and improve existing ones. Ultimately, the aim of the project is to enlarge the existing speech corpora for all of South Africa's official languages.

Node project which started on: 1 April 2018
Status: Project completed - Finalising


Health Resources in the South African Languages

A systematic review of the health resources available for the South African languages, culminating in an index of health resources. A wide range of resources form part of the index, including screening questionnaires, diagnostic assessments, and intervention programmes designed for health professionals.

Open Call project which started on: 1 November 2018
Status: Project completed - Finalising


Human Language Technologies Audit 2017/2018

This project aims to provide information on the current state of HLT R&D in South Africa. Specifically, it aims to replicate the HLT audit completed in 2009 and to update the information on the various HLT tools, resources and applications identified in that audit. The tools, resources and applications developed since 2009 will be identified and categorised using an updated version of the technology matrix previously employed.

Node project which started on: 1 July 2017
Status: Project completed


Linguistic corpus enrichment for conjunctively written South African languages - Extension

Extension of enriched corpora for the four official South African languages with a conjunctive orthography, i.e. isiNdebele (NR), isiXhosa (XH), isiZulu (ZU), and Siswati (SS). The corpora will consist of approximately 50,000 tokens, parallel on sentence level, with English as source language, for each language. Each language's corpus will be annotated on two levels (in addition to the annotation already underway), namely: 

  •  Part of speech (POS); and
  •  Lemmatisation.

Node project which started on: 1 July 2019
Status: Project in progress


Linguistic corpus enrichment for conjunctively written South African languages (isiNdebele, isiXhosa, isiZulu and Siswati)

The aim of this project is to extend morpho-syntactically enriched corpora for the conjunctively written South African languages. Manually enriched bilingual parallel corpora of approximately 40,000 tokens for isiNdebele, isiXhosa, isiZulu and Siswati will be morphologically annotated.

Node project which started on: 1 October 2017
Status: Project completed


Mobile Dictionary application framework

The project aims to develop an open-source hybrid mobile application framework that will allow for online access to a TMS and dictionary API, managed through a TMS API manager (TAM), and offline access to local dictionary content. The framework will create a shared codebase supporting the deployment of both Android and iOS apps through their respective marketplaces.

This framework will expand access to dictionaries to allow users to not only gain online access to dictionaries but also provide users with an option to store dictionary content in a local database on mobile devices.

Node project which started on: 1 August 2020
Status: Project in progress 


Multimedia Digital Corpus of siPhûthî

A multilingual digital corpus of siPhûthî as spoken in South Africa and Lesotho.

Open Call project which started on: 1 August 2019
Status: Project in progress

 


Parallel corpora for English into isiXhosa

Development of parallel data sets for English and isiXhosa.

Node project which started on: 1 July 2019
Status: Project completed 


Parallel corpora for English-Siswati 

This project entails the collection and processing of bilingual data for the development of an English–Siswati parallel corpus. 

Node project which started on: 1 July 2019
Status: Project in progress


Phase 2: Harvesting existing sources of speech data for HLT development in South Africa

The aim of the project is to develop speech resources in a (semi-)automatic manner (for L3, L4, L5 and L6) based on the Phase 1 feasibility study. This will entail the collection of appropriate speech and text data for L3, L4, L5 and L6, enabling the development of baseline ASR systems, followed by the development and release of automatically transcribed speech data and updated harvesting procedures for the remaining languages (L7 to L11).

Node project which started on: 1 April 2021
Status: Project in progress 


Project Expansion and further refinement of a multi-level, multi-genre learner corpus of academic writing

Redevelopment of the Write‐it Course in SADiLaR’s Moodle environment. 

Node project which started on: 1 October 2020
Status: Project completed - Finalising


Spoken data corpus-Afrikaans, Setswana, Sesotho sa Leboa

The phonetics and phonology of Coloured Afrikaans have as yet barely received any serious attention. This is largely due to the lack of adequate spoken data corpora, without which no complete and reliable acoustic descriptions are possible and satisfactory sociolinguistic studies are unlikely. The main aim of this project is to fill this gap. The first phase of the project will focus on Coloured Afrikaans. Subsequent projects are planned for Setswana and Sesotho sa Leboa.

Open Call project which started on: 1 January 2020
Status: Project completed


Through the lens Ex Machina: using NLP and statistical learning methods to model eyewitness statements and choosing behavior

The primary aim of this project is to develop and put to trial a new, innovative way of analysing and using eyewitness statements and descriptions to predict eyewitness identification performance. This has not been done before with natural language processing or machine learning methods, and this could solve the current difficulty of analysing large quantities of verbal data.


Open Call project which started on: 1 November 2018
Status: Project completed


Towards multilingual academic literacy testing for Secondary and Higher Education

Develop a translation protocol for academic literacy tests (this protocol will also consider the possibility of bias in translations) and translate and refine academic literacy tests for the following languages: English, Afrikaans, isiXhosa, isiZulu, Setswana and Sesotho.

Node project which started on: 1 January 2020
Status: Project in progress


VC Daghregister transcription project: Phase 2

The project focuses exclusively on the digitisation and transcription of a number of VOC Journals (VC series) of the Cape of Good Hope, held in the Western Cape Archives and Record Services, Cape Town. Its main purpose is to make historical linguistic material (in particular Afrikaans and Dutch) available in the public domain; these documents of the Dutch East India Company (VOC) offer numerous examples of diachronic and synchronic importance. Of the entire period (1651-1795), the years 1671 to 1679 will be completed during this phase of the project.


Open Call project which started on: 1 October 2018
Status: Project completed
