Corpus and system development for automatic captioning of official speeches

Project Type: SADiLaR Node – CSIR Speech Node
Project Start Date: 1 April 2020 
Project Status: In progress

Project Aims: 

The primary aim of the proposed project is to create a corpus of automatically transcribed government speeches. The CSIR proposes to start with the current president (Mr Cyril Ramaphosa) and then expand the corpus with speeches made by previous presidents and/or other members of parliament. A secondary aim is to initiate the development of an automatic speech recognition (ASR) system that could serve as a first step towards addressing the need for automatic captioning expressed by GCIS.

Project Deliverables:

  • Resources transferred from the GCIS archive will be used to produce the following:
    • Evaluation data set (5 hours in total) 
    • Report on ASR performance evaluation
  • Corpus and related documentation transferred to SADiLaR. Depending on the availability of speeches from the GCIS archive, the project will provide approximately 10 hours of speech per year spanning a 7-year period, which should yield approximately 100 hours of speech in total. This corpus will be released under a non-commercial, non-exclusive research license, as GCIS is the proprietary owner thereof
  • Research outputs describing the released corpus, the acoustic analysis and findings
  • The baseline captioning system