Digital humanities (DH) is what happens at the intersection of computing or digital technologies and the disciplines of the humanities. In a way, DH is the humanities’ response to the digital world. On the one hand, DH entails using digital and computational methodologies in the broad field of humanities. For instance, computational approaches allow for the analysis of large amounts of text, which may be practically infeasible to do manually. On the other hand, DH encompasses the humanities-oriented investigation into the digital world. For instance, researchers investigating the behaviour of people on social media are working in DH. But DH can also be applied to research long predating present technologies. As an example, we can consider the digitisation of cultural artefacts and historical resources.
One of the most important features of DH is the two-way relationship between the humanities and the digital. DH means both the use of technology to conduct humanities research in a way that was never before possible and humanities-oriented research into the technology that is so much a part of our everyday lives.
Both of these applications are critical when looking at our past, present and future. In order to fully appreciate the impact of DH, it is necessary to digitise our past. Digitised information allows for computational analyses. Additionally, it forces the re-documentation which is sorely needed for the decolonisation of knowledge. Having access to large amounts of documentation allows us to interrogate our past and our present on a wide scale. To safeguard our future, an understanding of the impact of the increasing importance of artificial intelligence (AI) and machine learning on all functions of life is essential. Asking relevant societal questions and investigating these is part of the research field of DH.
Simply having access to the material is, however, not enough. To be able to properly access the information that is contained in these large collections, novel means of analysis are needed. DH investigates, develops, and applies tools such as optical character recognition (which enables searching in text collections that were previously only available as images) and textual analysis (for instance, for identifying people and places in texts).
Interrogating our present
In the last few decades the world has changed irrevocably. Mobile phones and easy Internet access have led to an increasing number of social interactions online. Combined with the impact of the COVID-19 global pandemic that had the world in lockdown for most of 2020 and in some places even beyond, this has led to a major change in communication. For instance, grandparents are interacting with their grandchildren on video calls and families are staying in contact through Facebook and WhatsApp groups. How does that change human dynamics, and are these changes for better or worse?
Now, as researchers based in Africa, we must ask ourselves, who is interrogating Africa’s past, present and future? Who is studying the unique effects of social media on the children of your own countries on this continent and who is working to adapt those platforms to protect them? What is the unique context of Africa and how can we take this specific context into account in such research?
These are in fact already real-world examples and we don’t know what may lie in the future. But what we do know is that AI is only as good as the data and algorithms it uses, and this has an impact on its use in Africa. Less medical data has been gathered on African populations than on any other population in the world, which may have major negative consequences for AI-based diagnostics and treatments in medicine in Africa, for instance. Already the use of AI in predictive policing has been shown to be racist because of biased algorithms, and disinformation and hate speech in Africa are being shared unchecked through social media platforms, in part because YouTube’s algorithms are not effective in detecting toxic content in non-English-speaking countries in the Global South.
It is here that it is pivotal not only to apply the humanistic interrogation and questioning taught in the Humanities to these tools and algorithms, but also for humanities researchers to develop both the vocabulary to speak to the computer scientists working on these technologies and the tools, such as machine translation tools for our indigenous languages, to moderate the AI and make it work for our own societies and communities.
Building our own DH communities to solve our own community’s needs
It is obvious that with each community, each language group, each ethnic minority having their own unique needs, quirks of culture and belief, languages and even DNA, we need to build a strong network of DH practitioners across Africa, from the granular level up, to guard against blind spots and biases and ensure communities do not end up on the wrong side of an algorithm. Such a network of DH practitioners also enables people in Africa to take control over the practical use of AI and related techniques.
With this information in hand, we will know what to focus on, how to bring people with shared interests together, where our country’s strengths lie and what challenges need to be addressed. In order to build these skills and practically applicable resources, we need a strong community of practice. DH is a radically different way of doing research compared to traditional approaches. It mixes the digital with the humanities, which requires interdisciplinary collaborations. These collaborations require practitioners to get together. As such, building communities of practice is essential for DH to flourish, as flourish it must, and to ensure Africa is ready for what lies ahead.
Last month four SADiLaR researchers attended the 23rd Biennial International Conference of the African Language Association of Southern Africa, organised by the African Language Association of Southern Africa (ALASA) in collaboration with the Pan South African Language Board, the Department of African Language Studies and the Centre for Advanced Studies of African Society of the University of the Western Cape’s (UWC) Faculty of Arts and Humanities. The conference, themed African languages in practice in the 21st century, took place in Stellenbosch from 21 to 24 September 2022.
Image caption: Four SADiLaR researchers attending the ALASA 2022 conference. From left to right: Andiswa Bukula, Rooweither Mabuya, Benito Trollip, Muzi Matfunjwa
Raising awareness of SADiLaR’s contribution to the field
Rooweither Mabuya, Andiswa Bukula, Muzi Matfunjwa and Benito Trollip from SADiLaR all attended and presented papers, but, say the four researchers, the value of SADiLaR staff attending conferences goes far beyond presenting research.
“These conferences give us a great opportunity to speak to researchers in the field about what SADiLaR, as a national research infrastructure, does, and what we can do for them,” says Bukula, digital humanities (DH) researcher in isiXhosa. “Researchers are very interested in finding out about our open call for funding, how they can access our repository and what possibilities and support are available for researchers interested in enhancing their research using computational tools.”
Matfunjwa presented on The efficacy of Siswati part of speech tagger, assessing the efficacy of the part-of-speech tagger for Siswati, the problems observed in incorrectly tagged words and possible solutions to improve the accuracy of the tagger. He noted the interest from other researchers in how he used digital tools in his analysis of Siswati, the language he specialises in as a DH researcher.
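For readers unfamiliar with how a tagger's efficacy is typically measured, the following is a minimal sketch in Python. It is not the method used in the paper, which is not described here. It assumes a hand-checked "gold" tagging is compared token by token with the tagger's output, and the tokens and tag names are placeholders rather than real Siswati data.

```python
from collections import Counter


def evaluate_tagger(gold, predicted):
    """Return token-level accuracy and a count of (gold tag, predicted tag) errors."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token for token")
    confusions = Counter()
    correct = 0
    for (gold_word, gold_tag), (pred_word, pred_tag) in zip(gold, predicted):
        if gold_word != pred_word:
            raise ValueError("token mismatch between gold and predicted data")
        if gold_tag == pred_tag:
            correct += 1
        else:
            # Record which gold tag was mistaken for which predicted tag.
            confusions[(gold_tag, pred_tag)] += 1
    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusions


# Placeholder tokens and tags (assumptions for illustration only).
gold = [("w1", "NOUN"), ("w2", "VERB"), ("w3", "NOUN"), ("w4", "ADJ")]
predicted = [("w1", "NOUN"), ("w2", "NOUN"), ("w3", "NOUN"), ("w4", "ADJ")]

accuracy, confusions = evaluate_tagger(gold, predicted)
print(accuracy)
print(confusions.most_common())
```

The confusion counts point to the tag pairs that are most often mixed up, which is usually where work on improving a tagger starts.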
“Meeting in person at a conference, and hearing researchers present is a great opportunity to show our relevance to researchers in the field,” says Trollip, who specialises in Afrikaans. “The ALASA conference, for instance, had strong presentations in forensic linguistics, and I met a research fellow from UWC who wants to build a corpus of specific court judgments. I found this out by listening to her presentation, and we connected thereafter as SADiLaR has the expertise to assist her with this.”
Celebrating South Africa’s indigenous languages
For all four researchers who attended the conference the highlight was the presentations given by researchers about their languages, in their languages, with no interpretation services.
“For me, this was mind-blowing,” says Mabuya, whose research speciality is isiZulu. “Through this the ALASA conference made a very clear and powerful statement advocating for the upliftment and development of our indigenous languages.”
Trollip, who attended several presentations in South African languages he does not speak, agrees: “It is just wonderful watching people express themselves in their own languages. As an Afrikaans researcher and mother tongue speaker I purposely target Afrikaans conferences for the joy of being able to present in my own tongue. It is wonderful to see this practice taking off with our indigenous languages.”
Bukula says she also really appreciated the diversity of both the speakers and the presentations, and the high quality of presentations in general.
“Presentations ranged from those on literature and poetry to forensic linguistics to the more technical and computational work. It was great listening to presentations in our different languages and the social events were excellent,” she says.
“Congratulations to the organisers for an excellent and well-organised conference.”
Linguistics is the scientific study of language. In a multilingual society like South Africa, we need linguists who can provide the skills, insights and expertise to ensure all our South African languages remain relevant in our fast-changing world. To build up the field of linguistics in nine of South Africa’s official languages the University of South Africa (UNISA) node of SADiLaR, which focuses on language resource development, has created a linguistics termbank which is now freely available online. Users need only register and create a profile for themselves in the Lexonomy web portal, navigate to the Multilingual Linguistics Termbank and from there they can search common linguistics terms in Setswana, isiZulu, isiXhosa, Sesotho sa Leboa, Sesotho, Siswati, Xitsonga, Tshivenḓa or isiNdebele with English as the pivot language.
“The goal of the Multilingual Linguistics Termbank is twofold,” says Marissa Griesel, project manager and specialist researcher for the UNISA node of SADiLaR. “The first is to provide an open-access resource to linguistics students as a multilingual classroom support tool. The second is to begin to standardise linguistics terms in the languages taught in the Department of African Languages at Unisa and at other higher education institutions, thereby strengthening these languages of scholarship in the field of linguistics.”
The team who worked on the project included linguists from various education, government, and private institutions.
An open educational resource for linguistics
Because UNISA is a comprehensive open distance e-learning institution it was a priority to build a resource that also gives back to their own community. The termbank is accessible to anybody with Internet access and it offers definitions for 500 linguistics terms in the nine languages.
It is therefore possible for an isiXhosa linguistics student to input a term they are learning about and get back a definition for that term in their mother tongue, as well as a usage example to contextualise the term, and cross-reference the data from a related language such as isiZulu.
“We hope that by enabling students to access complex terminology in the South African languages they are better able to engage with the subject matter and enhance their understanding of this field of specialisation,” says Professor Mampaka Lydia Mojapelo, node co-manager and associate professor in UNISA’s Department of African Languages.
Building the termbank
Both founding members of this project, Professor Mampaka Lydia Mojapelo and Professor Rose Masubelele, are linguists in Sesotho sa Leboa and isiZulu respectively and taught various linguistics courses in UNISA’s Department of African Languages. Their research into the availability of language resources was the driving force behind the project, which began by setting up initial termbanks for these two languages. Linguists from the other languages were then invited to use the pilot lists to expand the resource to their languages.
“The first step in the project was to collect existing terms that our academic predecessors worked so hard to establish in the linguistics classroom; these were mainly contained in old resources that are now out of print,” says Mojapelo. “We referenced old study guides, dictionaries and other linguistics textbooks used in the department to identify key terms in Sesotho sa Leboa and isiZulu. It was important to also include the English equivalent as a binding or pivot language so that cross-referencing could be done.”
After compiling this initial list of terms in the pilot project, linguists from the other languages were invited to join the project and expand it to include seven additional languages.
“This was not just a translation exercise. Each language has its own grammar, structure and linguistic phenomena that do not necessarily occur in the others,” says Griesel. “The team that formed was made up of linguists with strong teaching experience. Their experience and knowledge of the specific languages they were working on were critical to the success of the project.”
“As we embarked on the project,” says Griesel, “we realised that for many of the languages different terms were used for the same linguistic concepts across institutions.”
The final step was a workshop where linguistics teachers and stakeholders who were not initially part of the project and from different institutions could hash out the terms in the various languages and work towards standardising them.
“The original workshop date had to be delayed because of COVID-19 and lockdowns,” says Griesel. “But eventually, in July 2022, we managed to make it happen.”
Team members invited at least two linguists from each one of the nine indigenous languages. The total number of delegates was 42 and for two days the group discussed the definitions, usage examples and what problems might be common across the different languages.
The final product: a freely available Multilingual Linguistics Termbank for South African languages
“While the product is now available there is always more work to be done, either in expanding the term list or improving the quality. Our next focus will be to add terms from the literary domain,” says Griesel. “Users are welcome to send comments and suggestions for improvement to the project team via the project website. Future projects and research outputs are also shared on this platform.”
The termbank is also available on the SADiLaR repository as an XML file for anyone to download and use for resource development.
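As an illustration of how such an XML file might be used for resource development, here is a minimal sketch in Python. The element and attribute names (termbank, entry, term, lang) and the sample terms are assumptions made for this example; the actual structure of the downloadable file may differ and should be checked against the file itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure and placeholder terms, not the real termbank schema.
SAMPLE_XML = """
<termbank>
  <entry>
    <term lang="en">noun</term>
    <term lang="zu">zu-placeholder-noun</term>
    <term lang="xh">xh-placeholder-noun</term>
  </entry>
</termbank>
"""


def build_index(xml_text, pivot="en"):
    """Index each entry by its pivot-language term (English in the termbank)."""
    root = ET.fromstring(xml_text.strip())
    index = {}
    for entry in root.iter("entry"):
        terms = {t.get("lang"): (t.text or "").strip() for t in entry.iter("term")}
        pivot_term = terms.get(pivot)
        if pivot_term:
            index[pivot_term.lower()] = terms
    return index


def lookup(index, english_term, lang):
    """Return the equivalent of an English term in another language, or None."""
    return index.get(english_term.lower(), {}).get(lang)


index = build_index(SAMPLE_XML)
print(lookup(index, "Noun", "zu"))
```

Indexing on English mirrors its role as the pivot language: a term in one language can be cross-referenced to a related language by going through the shared English entry.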
“This is a concerted effort to create a large-scale shared open-educational resource for linguistics in the official South African languages. The terms have always existed in landmark textbooks, some of which are out of print, and this resource aims at preserving the knowledge captured. This is also at the core of the National Development Plan 2030, which states that ‘Quality education encourages technology shifts and innovation that are necessary to solve present-day challenges’,” says Mojapelo.
Griesel adds: “But more than that we hope it will be a first step to growing the study and development of our indigenous languages to support a truly multilingual society in the fourth industrial revolution.”
The acquisition of language is a critical part of a child’s cognitive and social development. Language allows a child to communicate and express themselves, to develop and maintain relationships and to learn. Delayed language acquisition has been linked with learning disabilities, anxiety in children, behavioural problems and other social difficulties. Early identification and intervention in delayed language acquisition is critical to help children fulfill their potential. However, language acquisition happens differently in different languages. In order for speech-language therapists to be able to accurately identify delays in language development they need to have access to the norms of child language acquisition in a specific language, and in Africa, this information is not available.
SADiLaR’s Child Language Development node, based at Stellenbosch University, is working to fill this enormous gap in our knowledge around the development of language in Southern Africa, beginning with an inter-university collaboration focusing on development of Communicative Development Inventories (CDIs) for all South African languages.
“Different languages have completely different structures which result in children acquiring those languages in different ways,” explains Mikateko Ndhambi, a lecturer in speech-language pathology at Sefako Makgatho Health Sciences University who is currently undertaking a PhD at the University of Cape Town in child language acquisition in Xitsonga, her mother tongue. “Even in South Africa’s Bantu languages, we see wide variation in what words and what grammatical elements children first acquire.”
CDIs are effectively a parent report tool that can be used to measure a child’s language development. The CDI can be used to gather a large sample of children’s language that provides information on the average child’s language acquisition in a specific language. CDIs measure language development from 8 to 30 months and are good overall indicators of communicative development.
“If a speech-language therapist does not have the correct information for typical development in the language of the child they are trying to assess, they are likely to over-diagnose or under-diagnose the child, and interventions will not be appropriate,” explains Associate Professor Heather Brookes, a linguist and director of the Child Language Development Node.
There are CDIs for over 100 languages worldwide, but none for any of the languages commonly spoken in southern Africa, including the better-resourced languages of English and Afrikaans.
“There is,” explains Brookes, “a different CDI for American English as opposed to British English. This means neither of these will be applicable to South African English.”
Addressing this gap, to develop the norms which will serve as the basis of the linguistically and culturally appropriate CDIs for our 11 official languages, is a herculean task which involves researchers from over ten institutions, with a dedicated principal investigator for each language. It is also a multidisciplinary effort in that the work requires the expertise of speech therapists, linguists and developmental psychologists.
“Each field has a very important contribution to this project,” says Ndhambi. “The developmental psychologists provide the bigger developmental picture, because language acquisition does not work in isolation. The linguists provide the understanding of the fundamentals of a language, particularly the structures that will influence a child’s language acquisition. And then of course the speech-language therapist brings their knowledge of child speech and language, what are the common disorders, issues or delays.”
Much work has already been done in South Africa towards the development of these CDIs. The group has successfully harmonised the South African languages to cover the same areas for measurement where appropriate. They have developed a family background questionnaire for the different languages to assess how the child’s environment might impact their language development. The next step then is to begin to conduct field research where approximately 2300 families in each language will be interviewed to ascertain typical development norms.
“While we still have a long way to go, we are pleased with how far we have come,” says Brookes. “It was a huge team effort driven largely by the speech-language therapists. Their commitment to this project and ensuring the practical, on the ground, impact of it, has been fantastic.”
Building the southern African network
As well as the South African CDIs, the group is also committed to expanding this work across the southern African region. This is partly why, in June 2022, the node organised a workshop and meeting at the Southern African Linguistics and Applied Linguistics Society (SALALS) conference which took place in Potchefstroom.
Speakers at the workshop addressed the various challenges and opportunities with regard to child language development, in Africa in particular. This included a presentation on the development of CDIs in Senegal, the experiences of a speech therapist working in the multilingual and multicultural environments so common in Africa, and the challenges children face when exposed to multiple languages. One key presentation, by Anne Baker of Stellenbosch University and the University of Amsterdam, addressed the consequences of unidentified and unmet language needs and the need for accessible tools for language assessment at schools.
“This workshop,” says Ndhambi, who was instrumental in organising the event, “was to bring together those conducting research in early child language development to see what is being done, what needs to be done, and where there is space for future collaboration.”
“The reality is that developing the necessary norms and CDIs for child language acquisition to be properly assessed in Africa is such an enormous task that researchers cannot do it on their own. As we have learned, it requires large teams of experts,” says Brookes.
“The goal of the workshop was to encourage researchers in this field to not only join our collaboration, which we would love, but also to connect researchers to form their own collaborations, so we can team up to begin to address the gaps in research to better understand and respond to early childhood language development, with significant positive impacts on the socio-economic status of the region.”
A collaboration between the CSIR speech node, a private speech technology company, SAIGEN, and SADiLaR has resulted in an automatic speech caption technology, to be used by the Government Communication and Information System, to ensure greater accessibility of government communications to South Africans.
Video has become ubiquitous in society. Be it news, marketing, entertainment or even educational material, we currently live in a video-first world. But video, as an audio-visual medium, is not accessible to everyone. For the hard-of-hearing or deaf community and second-language English speakers it can be extremely challenging to engage with important messages delivered by video. Speech captions are the obvious and elegant solution to this problem. They are the hard-of-hearing community’s preferred option for communication, and there is also much evidence showing that second-language speakers retain much more information from a video if it is accompanied by subtitles or speech captions.
“But speech captions do not simply fall out of a hat,” says Jaco Badenhorst, a researcher at the Council for Scientific and Industrial Research (CSIR). “It is a very labour-intensive and time-consuming task for a video editor.”
During the COVID-19 pandemic and accompanying lockdowns, government communications, particularly President Cyril Ramaphosa’s ‘family meeting’ addresses to the South African nation, were critical. It was a time of crisis and it was imperative that all South African households were able to watch and engage with what the President was saying. It was therefore no coincidence that it was during this period that the Government Communication and Information System (GCIS), the government department responsible for communication, reached out to the CSIR for an automated speech recognition system that could, with near-perfect accuracy, generate automatic speech captions for government speeches.
“To be as effective as possible, this needed to be a South African made system built on South African accents and dialects,” explains Badenhorst. “While there are a number of commercial options abroad which do this, their software will not accurately pick up the many variations of South African accents and this will mean much more work down the line to edit and correct the automated transcript.”
The SADiLaR connection
The CSIR is home to the SADiLaR Speech Node with an existing focus on speech technologies including automatic speech recognition and text-to-speech technology. Although they had existing capabilities for this kind of technology development, the project was daunting.
Knowing this would be an expensive but very worthwhile endeavour, the team at the CSIR wrote a proposal for funding to SADiLaR.
“SADiLaR is an enabling infrastructure,” says Juan Steyn, project manager at SADiLaR. “So our mandate is to support the development and distribution of technology which will build our South African language resources and contribute to the good of our multilingual society.”
Funding a system to generate automatic speech captions for government communications was therefore a natural fit for SADiLaR, especially because it meant this technology would then be available locally for other uses.
The SAIGEN connection
Fortunately for all players involved, a chance encounter between the team at the CSIR and a private company, SAIGEN, in the very early days of the project, revealed that SAIGEN was working on a very similar product which could form the base of the GCIS tool.
SAIGEN was formed back in 2017 by two former academics, Dr Charl van Heerden and Professor Etienne Barnard, who specialised in building automatic speech recognition technology for under-resourced languages. Van Heerden used to be part of the CSIR and worked with Badenhorst, and it was through this connection that the two groups became aware of the overlap.
“SAIGEN offers a number of commercial speech recognition products, including a call centre speech analytics product and a speech recognition product for media monitoring companies,” says van Heerden. “When we first became aware of the GCIS request to CSIR we had already developed an automatic speech recognition prototype, which we planned to license for those who needed automated transcription services.”
As it became clear that many of the base components that would be needed for the GCIS tool formed part of the SAIGEN prototype, the CSIR reached out to SADiLaR to adjust the funding proposal and strategy.
“Recognising that this was funded with public money, and that reinventing the wheel would be wasteful expenditure, all parties happily agreed to this collaboration,” says van Heerden.
The first element of the project was the generation of automated speech captions for the speeches of the President, Deputy President and cabinet ministers. This product will allow the GCIS to release the videos of government speeches with accurate captions in minimal time.
“The mandate was to make a system that would generate captions to be as near accurate as possible,” says Badenhorst. “Recognising of course that no system can do this perfectly. The product includes an interface in which the captions appear alongside speech, with those words the system is less confident about highlighted. The GCIS staff can then quickly and easily correct the captions where necessary.”
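The idea of flagging low-confidence words for an editor can be sketched in a few lines of Python. This is an illustration only, not the CSIR or SAIGEN implementation: the per-word confidence scores, the 0.8 threshold and the bracket notation are all assumptions made for the example.

```python
def mark_uncertain(words, threshold=0.8):
    """Join recognised words into a caption, flagging low-confidence words.

    `words` is a list of (word, confidence) pairs, with confidence in [0, 1].
    The threshold and the "[word?]" marking are assumptions for illustration.
    """
    marked = []
    for word, confidence in words:
        if confidence < threshold:
            marked.append(f"[{word}?]")  # flag for a human editor to check
        else:
            marked.append(word)
    return " ".join(marked)


# Hypothetical recogniser output for a short phrase.
recognised = [("fellow", 0.97), ("South", 0.95), ("Africans", 0.62)]
caption = mark_uncertain(recognised)
print(caption)  # fellow South [Africans?]
```

An editor then only needs to review the flagged words rather than re-reading the whole transcript against the audio.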
The goal also is that the system be customised to individual speakers through a speaker recognition system. This means the software would be trained on the speech of individual speakers for even greater accuracy of captions.
Another deliverable, this one for SADiLaR, is a text and speech corpus (a language resource) that could be added to the SADiLaR repository and used in the development of other language technologies or for research purposes. The GCIS has an archive of videos and speeches, and it was agreed with them that this data would be accessed to compile such a corpus, to be made available in the public domain.
The corpus will be delivered to SADiLaR by the end of September 2022.
The final product and the way forward
The project is now in its third and final year. One pilot was held with the GCIS, in which four GCIS staff members, including a manager, media specialist, video editor and intern, tested the tool. The feedback, fortunately, was very positive, with all reporting that they could quickly and easily verify and correct the captions.
A second pilot is planned for later in the year in which the GCIS will test additional features they have requested, such as altering the appearance of the captions according to the medium on which the video is published.
“The second pilot will be the opportunity to finalise the offering to ensure it is fit for client needs,” says Badenhorst.
Once the tool is being used by the GCIS, the hope is that there will be commercial demand for the software.
“Beyond the accessibility needs, video captions are widely used. Our busy lives mean people will often watch a video in a noisy environment, such as on public transport or in a queue, or somewhere like an office where it is simply not appropriate to have sound on. In these environments captions are critical,” says van Heerden.
“And in our multilingual society where a minority are first language English speakers, captions go a long way towards ensuring comprehension and retention of the subject or speech delivered.”