27th International Conference on Text, Speech and Dialogue
TSD 2024, Brno, Czech Republic, September 9–13, 2024
 
TSD 2024 Paper Abstracts

#1289: A Paradigm for Interpreting Metrics and Measuring Error Severity in Automatic Speech Recognition

Thibault Bañeras-Roux, Mickael Rouvier, Jane Wottawa and Richard Dufour

The evaluation of automatic speech transcriptions relies heavily on metrics such as Word Error Rate (WER) and Character Error Rate (CER). However, these metrics have been criticized for their limited correlation with human perception and their inability to accurately capture linguistic and semantic nuances. Despite the introduction of embedding-based metrics to approximate human perception, their interpretability remains challenging compared to traditional metrics. In this article, we introduce a novel paradigm aimed at addressing these limitations. Our approach integrates a chosen metric to derive the minimum Edit Distance (minED), which serves as an indicator of the rate of serious errors in automatic speech transcriptions. Unlike conventional metrics, minED offers a more nuanced understanding of errors, accounting for both linguistic complexities and human perception. Furthermore, our paradigm facilitates the measurement of error severity from both intrinsic and extrinsic perspectives.
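
As an illustration of the general idea (not the paper's exact formulation), a severity-aware edit distance can be obtained by running the usual Levenshtein dynamic program with a substitution cost supplied by a chosen metric; the function name and the toy severity metric below are hypothetical.

    # Levenshtein-style DP whose substitution cost comes from a plug-in metric,
    # so "serious" errors cost more than benign ones.
    def weighted_edit_distance(ref, hyp, sub_cost, ins_cost=1.0, del_cost=1.0):
        """Minimum edit distance between token lists, with metric-based substitutions."""
        n, m = len(ref), len(hyp)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + del_cost
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + ins_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0.0 if ref[i - 1] == hyp[j - 1] else sub_cost(ref[i - 1], hyp[j - 1])
                d[i][j] = min(d[i - 1][j] + del_cost,      # deletion
                              d[i][j - 1] + ins_cost,      # insertion
                              d[i - 1][j - 1] + sub)       # (weighted) substitution
        return d[n][m]

    # Example: character overlap as a crude severity proxy (Jaccard distance).
    severity = lambda a, b: 1.0 - len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)
    print(weighted_edit_distance("the cat sat".split(), "the bat sad".split(), severity))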


#1205: Adapting Audiovisual Speech Synthesis to Estonian

Sven Aller and Mark Fishel

Audiovisual speech synthesis is an important topic from several points of view: visualization helps users understand speech more easily in noisy environments, makes conversation more natural, and improves clarity for hearing-impaired as well as other users. At the same time, it is available for a much narrower selection of languages than speech-only synthesis, and language-independent methods of adding the visual part have not been thoroughly tested for most languages. This paper presents the development of two methods of adapting audiovisual speech synthesis to Estonian. We reuse an existing neural speech synthesis model and adapt a speech-driven and a text-driven approach to adding the visual part. We contrast the two developed solutions with pure audio under different noise levels and evaluate the clarity, naturalness, and pleasantness of the test samples via MOS scores. We also compare how computationally expensive these methods are. Our results show that while speech-driven generation of the visual counterpart is deemed more natural, the text-driven approach is computationally less demanding and can be used for real-time audiovisual speech synthesis. According to the results, all the presented models help to improve the clarity of synthesized speech in noisy conditions.


#1225: Analyzing Biases in Popular Answer Selection Datasets on Neural-based QA Models

Chang Nian Chuy, Cherie Ding and Qinmin Vivian Hu

The amount of information available on the internet has increased exponentially over the past decade. This digitization creates a need for automated answering systems that can extract useful information from different sources. To satisfy this need, many large-scale QA datasets and QA models have been introduced to the field. In this work, we aim to explore and shed light upon the composition of the most popular QA datasets by comparing them through statistical distribution analyses and their biases. We collect multiple open QA datasets that cover different aspects of QA features, and highlight the differences and biases of each QA dataset by comparing its effect on multiple baseline neural QA models. Our goal is to provide a clear understanding of the relationship between QA datasets and QA models, and to offer a solid foundation for future research to enhance this growing field.


#1273: Anonymizing Dysarthric Speech: Investigating the Effects of Voice Conversion on Pathological Information Preservation

Abner Hernandez, Paula Andrea Perez-Toro, Tomas Arias-Vergara, Juan Camilo Vasquez-Correa, Seung Hee Yang, Juan Rafael Orozco-Arroyave, and Andreas Maier

Acquiring speech data is a crucial step in the development of speech recognition systems and related speech-based machine learning models. However, protecting privacy is an increasing concern that must be addressed. This study investigates voice conversion (VC) as a strategy for anonymizing the speech of individuals with dysarthria. We specifically focus on training a variety of VC models using self-supervised speech representations such as Wav2Vec2.0 and its multilingual variant XLSR. The converted voices maintain a word error rate within 1% of the original recordings. The Equal Error Rate (EER) showed a significant increase, from 1.52% to 41.18% on the LibriSpeech test set, and from 3.75% to 42.19% on speakers from the VCTK corpus, indicating a substantial decrease in speaker verification performance. A similar trend is observed with dysarthric speech, where the EER varied from 16.45% to 43.46%. Additionally, our study includes classification experiments on dysarthric vs. healthy speech data to demonstrate that anonymized voices can still yield speech features essential for distinguishing between healthy and pathological speech. The impact of voice conversion is investigated across aspects such as articulation, prosody, phonation, and phonology.


#1219: Attention to Phonetics: A Visually Informed Explanation of Speech Transformers

Erfan A. Shams and Julie Carson-Berndsen

Self-supervised learning based on the transformer architecture has greatly improved the performance of Automatic Speech Recognition (ASR) systems in recent years, while the interpretability of such transformer-based models has, by design, received less attention. Considering this, we investigate post-hoc explainability methodologies to explore the types of phonetic information encoded within the black box of transformer-based ASR models. We propose an exploratory visual environment based on the encoded parameters in the self-attention (SA) component of the models as a first step towards explaining transformer-based ASR via interactive exploration of the SA heads. The visualisations reveal clues about the functionality of specific SA heads that, in turn, support the choice of a suitable domain-informed post-hoc explainability method for deeper analysis. We apply this method to identify the impact of certain SA heads in encoding sub-phonetic information in the model embeddings and demonstrate that specialised SA heads can be identified via combined visualisation and post-hoc analysis.
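
For readers who want to reproduce the starting point of such an analysis, the sketch below pulls per-head self-attention maps from a pretrained wav2vec 2.0 encoder via the HuggingFace transformers library; the model name, layer, and head are arbitrary choices, and the paper's own visual environment is not shown here.

    import torch
    import matplotlib.pyplot as plt
    from transformers import Wav2Vec2Model

    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
    model.eval()

    waveform = torch.randn(1, 16000)   # 1 s of dummy 16 kHz audio; use real speech in practice
    with torch.no_grad():
        out = model(waveform, output_attentions=True)

    # out.attentions: one (batch, heads, frames, frames) tensor per encoder layer
    layer, head = 3, 0
    att = out.attentions[layer][0, head]            # (frames, frames) attention map
    plt.imshow(att.numpy(), origin="lower", aspect="auto")
    plt.xlabel("key frame"); plt.ylabel("query frame")
    plt.title(f"Layer {layer}, head {head} self-attention")
    plt.show()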


#1271: Automatic Classification of Parkinson's Disease Using Wav2vec Embeddings at Phoneme, Syllable, and Word Levels

Jeferson David Gallo-Aristizábal, Daniel Escobar-Grisales, Cristian David Ríos-Urrego, Elmar Nöth and Juan Rafael Orozco-Arroyave

Parkinson’s disease (PD) is a neurological condition that produces several speech deficits, typically known as hypokinetic dysarthria. PD involves motor impairments and muscle dysfunction in the phonatory apparatus, producing anomalies in oral communication. Speech signals have been used as a biomarker for the diagnosis and monitoring of PD. In this work, we discriminate between PD patients and healthy controls based on patterns extracted from speech signals collected from Colombian Spanish speakers, considering three granularity levels: phoneme, syllable, and word. The Wav2vec 2.0 model is used to obtain frame-level representations of each utterance. These representations are grouped according to each granularity level using different statistical functionals. Each granularity level was evaluated independently, obtaining accuracies of 86%, 80%, and 83% for phonemes, syllables, and words, respectively. In addition, we identified the phonological classes with the best discrimination capability: nasals, approximants, and plosives were the three most accurate. We believe this work constitutes a step forward in the development of automatic systems that support speech and language therapy for PD patients. For future work, we plan to model co-articulation information in words and syllables.
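
The grouping step can be illustrated with a small sketch: pool frame-level representations into one vector per unit using statistical functionals. The segment boundaries and the choice of functionals below are assumptions, not the paper's exact setup.

    import numpy as np

    def pool_segments(frames, boundaries):
        """frames: (T, D) array; boundaries: list of (start, end) frame indices."""
        pooled = []
        for start, end in boundaries:
            seg = frames[start:end]
            pooled.append(np.concatenate([seg.mean(0), seg.std(0),
                                          seg.min(0), seg.max(0)]))
        return np.stack(pooled)   # (num_units, 4 * D)

    T, D = 200, 768                        # e.g., wav2vec 2.0 base hidden size
    frames = np.random.randn(T, D)         # stand-in for real frame representations
    phoneme_bounds = [(0, 12), (12, 30), (30, 55)]   # e.g., from a forced aligner
    X = pool_segments(frames, phoneme_bounds)        # features for a PD classifier
    print(X.shape)   # (3, 3072)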


#1252: Automatic Ellipsis Reconstruction in Coordinated German Sentences Based on Text-To-Text Transfer Transformers

Marisa Schmidt, Karin Harbusch and Denis Memmesheimer

Ellipsis reconstruction, i.e., revealing omitted syntactically obligatory words in a sentence, is still a challenging task in Natural Language Processing (NLP), even though this information is essential for advanced human-computer dialogues. Corpora for the training of omitted word reconstruction are increasingly appearing in the literature. Here, we focus on ellipsis phenomena in coordinated sentences---also called Clausal Coordinate Ellipsis (CCE)---in German. We report results with a Unified Text-To-Text Transfer Transformer (T5) model (transfer learning). A pre-trained model of written German is fine-tuned with a parallel corpus of pairs containing a reduced sentence and its canonical form, in which all omitted elements are explicitly listed. We compare the results for two parallel CCE corpora, both of which are extracted from existing treebanks of German newspaper articles. We achieve a BLEU score of 0.8196 when testing on the parallel TüBa-D/Z CCE corpus and 0.6093 when testing on a pre-release of the new parallel TIGER CCE corpus. The results improve to 0.8349 and 0.7543, respectively, when training on the two CCE corpora together.


#1256: Bella Turca: A Large-Scale Dataset of Diverse Text Sources for Turkish Language Modeling

Duygu Altinok

In recent studies, it has been demonstrated that incorporating diverse training datasets enhances the overall knowledge and generalization capabilities of large-scale language models, especially in cross-domain scenarios. In line with this, we introduce Bella Turca: a comprehensive Turkish text corpus, totaling 265GB, specifically curated for training language models. Bella Turca encompasses 25 distinct subsets across 4 genres, carefully chosen to ensure diversity and high quality. While Turkish is widely spoken across three continents, it suffers from a dearth of robust data resources for language modelling. Existing transformers and language models have primarily relied on repetitive corpora such as OSCAR and/or Wiki, which lack the desired diversity. Our work aims to break free from this monotony by introducing a fresh perspective to Turkish corpora resources. To the best of our knowledge, this release marks the first instance of such a vast and diverse dataset tailored for the Turkish language. Additionally, we contribute to the community by providing the code used in the dataset's construction and cleaning, fostering collaboration and knowledge sharing.


#1253: Better Low-Resource Machine Translation with Smaller Vocabularies

Edoardo Signoroni and Pavel Rychlý

Data scarcity is still a major challenge in machine translation. The performance of state-of-the-art deep learning architectures, such as the Transformer, for under-resourced languages is well below that for high-resource languages. This precludes access to information for millions of speakers across the globe. Previous research has shown that the Transformer is highly sensitive to hyperparameters in low-resource conditions. One such hyperparameter is the size of the model's subword vocabulary. In this paper, we show that using smaller vocabularies, as small as 1k tokens instead of the default value of 32k, is preferable in a diverse array of low-resource conditions. We experiment with different sizes on English-Akkadian, Lower Sorbian-German, and English-Manipuri to obtain models that are faster to train, smaller, and better performing than the default setting. These models achieve improvements of up to 322% in ChrF score, while being up to 66% smaller and up to 17% faster to train.
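
A minimal sketch of the vocabulary-size experiment, assuming SentencePiece BPE as the subword model (the paper's exact tooling may differ); file names are placeholders. Smaller vocabularies split words into more pieces, so each subword type is seen more often in a tiny corpus.

    import sentencepiece as spm

    for vocab_size in (1000, 32000):
        spm.SentencePieceTrainer.train(
            input="train.src",               # one sentence per line
            model_prefix=f"bpe{vocab_size}",
            vocab_size=vocab_size,
            model_type="bpe",
            character_coverage=1.0,
        )

    sp = spm.SentencePieceProcessor(model_file="bpe1000.model")
    print(sp.encode("a low-resource sentence", out_type=str))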


#1195: Bilingual Lexicon Induction From Comparable and Parallel Data: A Comparative Analysis

Michaela Denisová and Pavel Rychlý

Bilingual lexicon induction (BLI) from comparable data has become a common way of evaluating cross-lingual word embeddings (CWEs). These models have drawn much attention, mainly due to their availability for rare and low-resource language pairs. An alternative is offered by systems exploiting parallel data, such as popular neural machine translation systems (NMTSs), which are effective and yield state-of-the-art results. Despite the significant advancements in NMTSs, their effectiveness on the BLI task compared to models using comparable data remains underexplored. In this paper, we provide a comparative study of NMTS and CWE models evaluated on the BLI task and demonstrate the results across three diverse language pairs: a distant pair (Estonian-English), a close pair (Estonian-Finnish), and a pair with different scripts (Estonian-Russian). Our study reveals the differences, strengths, and limitations of both approaches. We show that while NMTSs achieve impressive results for languages with a great amount of training data available, CWEs emerge as the better option when fewer resources are available.


#1242: Capturing Task-Related Information for Text-Based Grasp Classification Using Fine-Tuned Embeddings

Niko Kleer, Leon Weyand, Michael Feld and Klaus Berberich

Manipulating objects with a robotic hand or gripper is a challenging task that can be supported by knowledge about the object, such as textual descriptions. Even with such knowledge, there remain numerous possibilities for applying an appropriate grasping gesture. This ambiguity can be reduced by providing information about the intended task, helping robots make the choice of a suitable grasp less arbitrary and more robust. This work investigates using word embeddings in the context of grasp classification for multi-fingered robots. Instead of predicting grasping gestures without specifying the intended task, our work combines a description of the properties of an object with task-related information. We demonstrate that a systematically generated dataset and fine-tuned context embeddings can compete with existing models that do not consider object manipulation. Our best model achieves a micro F1 score of 0.774 and a macro F1 score of 0.731 while distinguishing between over 40 tasks.


#1210: CoastTerm: a Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature

Julien Delaunay, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Georgeta Bordea, Mathilde Ducos, Nicolas Sidere, Antoine Doucet, Senja Pollak, and Olivier De Viron

The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, which focuses on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automated term extraction and an F1 score of 70% for extracting terms together with their labels. These findings are promising and represent an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.


#1202: Continual Learning Under Language Shift

Evangelia Gogoulou, Timothée Lesort, Magnus Boman and Joakim Nivre

The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. We study the pros and cons of updating a language model when new data come from new languages -- the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic and Norwegian to investigate how forward and backward transfer effects depend on the pre-training order and the characteristics of the languages, for models with 126M, 356M and 1.3B parameters. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of the new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.


#1244: Data Alignment and Duration Modelling in VITS

Zdeněk Hanzlíček

The paper analyses data alignment and duration modelling in the modern end-to-end speech synthesis model VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). The standard version of VITS utilizes the Monotonic Alignment Search (MAS) procedure to align the input text/phones with the corresponding speech during training; the alignment is also used to obtain phoneme durations for training the stochastic duration predictor. This study analyzes the resulting MAS alignment and compares it with a reference alignment obtained by an LSTM-based phonetic segmentation system. We also examine the performance of VITS when the reference phonetic segmentation replaces the default MAS alignment. The comparison shows that while the original VITS is still slightly preferred in terms of quality, it provides a less interpretable data alignment. Duration modelling is more transparent in the modified version, allowing better duration control and modification. The analysis was carried out on two Czech voices.
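
For reference, MAS itself is a simple dynamic program. The sketch below is a simplified re-implementation, not the paper's code: it finds the monotonic path through a (phones x frames) log-likelihood matrix that maximizes the total score, then reads per-phone durations off the path.

    import numpy as np

    def mas(log_lik):
        """log_lik: (num_phones, num_frames) log-likelihoods; returns frame durations."""
        P, T = log_lik.shape
        Q = np.full((P, T), -np.inf)
        Q[0] = np.cumsum(log_lik[0])          # phone 0 spans frames 0..j
        for i in range(1, P):
            for j in range(i, T):             # each phone needs at least one frame
                Q[i, j] = log_lik[i, j] + max(Q[i - 1, j - 1], Q[i, j - 1])
        # Backtrack: step down a phone whenever that predecessor scored higher.
        path = np.zeros((P, T), dtype=int)
        i = P - 1
        for j in range(T - 1, -1, -1):
            path[i, j] = 1
            if i > 0 and (j == i or Q[i - 1, j - 1] >= Q[i, j - 1]):
                i -= 1
        return path.sum(axis=1)               # duration (in frames) per phone

    durations = mas(np.log(np.random.rand(4, 12)))
    print(durations, durations.sum())         # 4 phone durations summing to 12 frames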


#1215: Deep Speaker Embeddings for Speaker Verification of Children

Mohammed Hamzah Abed and Dávid Sztahó

Currently, deep speaker embedding models are the most advanced feature extraction methods for speaker verification. However, their effectiveness in identifying children's voices has not been thoroughly researched. While various methods have been proposed in recent years, most of them concentrate on adult speakers, with fewer researchers focusing on children. This study examines three deep learning-based speaker embedding methods and their ability to differentiate between child speakers in speaker verification. We evaluated the X-vector, ECAPA-TDNN, and RESNET-TDNN methods for forensic voice comparison, using pre-trained models and fine-tuning them on children's speech samples. The likelihood-ratio framework was used for evaluation, with likelihood-ratio scores calculated on children's voices. The workflow was evaluated on the Samromur Children dataset, which comprises 131 hours of speech from 3175 speakers of both sexes aged between 4 and 17. The results indicate that RESNET-TDNN has the lowest EER and Cllrmin values (10.8% and 0.368, respectively) without fine-tuning of the embedding models. With fine-tuning, ECAPA-TDNN performs best (EER and Cllrmin of 2.9% and 0.111, respectively). No difference was found between the sexes of the speakers. When the results were analysed by the age range of the speakers (4--10, 11--15, and 16--17), varying levels of performance were observed. Younger speakers were less accurately identified using the original pre-trained models; after fine-tuning, this tendency changed slightly. The results indicate that the models could be used in real-life investigation cases and that fine-tuning helps mitigate the performance degradation for young speakers.
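
The reported EER can be computed from verification scores as the operating point where the false-acceptance and false-rejection rates meet; a standard sketch with synthetic scores follows (the score distributions are placeholders).

    import numpy as np
    from sklearn.metrics import roc_curve

    def eer(labels, scores):
        """labels: 1 = same speaker, 0 = different; scores: similarity values."""
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))   # point where FAR ~= FRR
        return (fpr[idx] + fnr[idx]) / 2

    labels = np.array([1] * 500 + [0] * 500)
    scores = np.concatenate([np.random.normal(1.0, 0.5, 500),    # genuine trials
                             np.random.normal(0.0, 0.5, 500)])   # impostor trials
    print(f"EER = {eer(labels, scores):.3%}")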


#1206: Dysphonia Diagnosis Using Self-Supervised Speech Models in Mono- and Cross-Lingual Settings

Dosti Aziz and Dávid Sztahó

Voice disorders like dysphonia can significantly impact a person's quality of life, so proper diagnostic methods are crucial. Previous approaches have primarily used datasets of a single language without considering language independence. This study investigates the effectiveness of self-supervised (SS) speech representation models in both mono- and cross-lingual settings to determine their ability to perform language-independent dysphonia detection. Four recent SS models, namely Wav2vec2.0, WavLM, HuBERT, and Data2vec, in their large and base variants, were examined for their ability to capture speech features related to dysphonia. The findings suggest that larger variants of SS models generally outperform smaller ones, with the HuBERT and WavLM large models achieving accuracies of 93.06% and 91.67% in mono-lingual experiments, respectively. Additionally, the study explored cross-lingual capabilities and found that, except for Wav2vec2.0, the base variants of the SS models exhibited higher accuracies. The highest accuracy achieved in the cross-lingual case was 88.33%, by the Wav2vec2.0 model trained on Hungarian samples and tested on Dutch. These results highlight the potential of SS models for both language-dependent and language-independent dysphonia detection.


#1234: Effects of Training Strategies and the Amount of Speech Data on the Quality of Speech Synthesis

Lukáš Vladař and Jindřich Matoušek

During the development of a speech synthesizer, we often face a lack of training data. This paper investigates how the amount of data used to train a speech synthesizer affects the quality of the final synthetic speech. To answer this question, we trained multiple VITS synthesizers using different amounts of training data and compared them using listening tests and the MCD objective measure. Furthermore, we compared three training strategies: training a speech synthesizer from scratch, fine-tuning a single-speaker model, and fine-tuning a multi-speaker model.


#1291: Enhancing Speech Emotion Recognition Using Transfer Learning From Speaker Embeddings

Maroš Jakubec, Roman Jarina, Eva Lieskovská, Peter Kasák and Michal Spišiak

Understanding and identifying emotions from speech is a key challenge in automatic Speech Emotion Recognition (SER). Speech carries a variety of information about the speaker's emotional state or contextual emotions, but the lack of large and diverse emotional datasets makes it hard to apply advanced deep learning models to the development of reliable and robust SER systems. Our study introduces a methodology that uses transfer learning and data augmentation to improve SER systems' ability to classify emotional states accurately. Specifically, we focus on enhancing and assessing the performance of x-vector and r-vector speaker embedding models on the SER task by pretraining the models on a large amount of speaker-labeled data, followed by fine-tuning on a downstream emotional dataset. Testing the proposed approach on the IEMOCAP and CREMA-D datasets shows a notable increase in SER accuracy and thus the usefulness of such cross-task transfer learning. Using transfer learning and data augmentation, our approach achieved above 74% and 80% accuracy on the IEMOCAP and CREMA-D datasets, respectively.


#1262: Evaluation Metrics in LLM Code Generation

Kai Hartung, Sambit Mallick, Sören Gröttrup and Munir Georges

The advanced capabilities of large language models can also be seen in their increasing use for the automatic generation of programming code. Although models are steadily improving, there are very few metrics that can be used to evaluate the quality of the generated code. In particular, evaluation becomes challenging without good reference data in the form of tests or alternative solutions. In this paper, we explore both existing and new approaches to evaluating generated Python code. These approaches can be classified into two categories: similarity-based and reference-independent. The similarity-based approaches examine the code's syntax-tree structure and embeddings and compare them to reference code from the dataset. The reference-independent approaches, on the other hand, utilize static code analysis metrics used to assess human-written code, such as maintainability and adherence to style guidelines. We examine these metrics on several state-of-the-art code generation models to test their validity. Based on our results, the reference-independent metrics seem to be the most promising approaches for future research.
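
A small sketch of the reference-independent idea, scoring generated Python with static metrics via the radon library; the snippet being scored is an arbitrary example, and the paper's full metric set is not reproduced here.

    from radon.complexity import cc_visit
    from radon.metrics import mi_visit

    generated = '''
    def fib(n):
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)
    '''

    # Maintainability index (higher is better) needs no reference solution.
    print("maintainability index:", mi_visit(generated, multi=True))
    # Cyclomatic complexity per function block (lower is simpler).
    for block in cc_visit(generated):
        print(block.name, "cyclomatic complexity:", block.complexity)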


#1217: Explainable Multimodal Fusion for Dementia Detection From Text and Speech

Duygu Altinok

Alzheimer's dementia (AD) has significant negative impacts on patients, their families, and society as a whole, both psychologically and economically. Recent research has explored combining speech and transcript modalities to leverage linguistic and acoustic features. However, many existing multimodal studies simply combine speech and text representations, use majority voting, or average predictions from separately trained text and speech models. To overcome these limitations, our article focuses on explainability and investigates the fusion of speech and text modalities using cross-attention. We convert audio to Log-Mel spectrograms and utilize text and image transformers (RoBERTa and ViT) for processing transcripts and spectrograms, respectively. By incorporating a cross-attention layer, we analyze the impact on accuracy. Our multimodal fusion model achieves 90.01% accuracy on the ADReSS Challenge dataset. Additionally, we explore the explainability of both modalities through transformer visualization techniques and an analysis of the vocabulary used by dementia and non-dementia classes.
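
A minimal sketch of such a fusion layer (not the paper's exact architecture): text tokens attend over spectrogram tokens via cross-attention before classification. The dimensions and the pooling choice are assumptions.

    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        def __init__(self, dim=768, heads=8, num_classes=2):
            super().__init__()
            self.cross_att = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, text_tokens, image_tokens):
            # Text tokens query the spectrogram tokens (RoBERTa / ViT outputs).
            fused, _ = self.cross_att(query=text_tokens,
                                      key=image_tokens, value=image_tokens)
            return self.classifier(fused.mean(dim=1))   # pool over tokens

    text = torch.randn(4, 128, 768)   # e.g., RoBERTa last hidden states
    spec = torch.randn(4, 197, 768)   # e.g., ViT patch embeddings of a Log-Mel image
    logits = CrossAttentionFusion()(text, spec)
    print(logits.shape)               # torch.Size([4, 2])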


#1196: Explaining Metaphors in the French Language by Solving Analogies using a Knowledge Graph

Jérémie Roux, Hani Guenoune, Mathieu Lafourcade and Richard Moot

An analogy is a relation between two pairs of terms representing two distant domains. It operates by transferring meaning from a concept that is known to another that one would like to clarify or define. In this report, we address analogy both from the aspect of modeling it and of explaining it automatically. We then propose a system for solving analogical equations expressed as symbol strings. The model, based on the common-sense knowledge base JeuxDeMots (a semantic network), operates by generating a list of potential candidates from which it chooses the most suitable solution. We conclude by evaluating our model on a collection of equations and reflecting upon future work.


#1261: Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder

David Porteš and Aleš Horák

Language models operating on discrete audio representations are increasingly becoming the go-to framework for many speech-processing tasks. Recently, discrete embeddings of the fundamental frequency (F0) have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions, i.e., regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters in order to achieve the highest possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and in total comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the HuggingFace website.
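
The reported metric, F0 Frame Error, counts frames with either a voicing-decision error or a pitch deviation above 20%; a compact sketch with toy values:

    import numpy as np

    def ffe(ref, est, tol=0.2):
        """ref, est: per-frame F0 arrays, 0 where unvoiced."""
        ref_v, est_v = ref > 0, est > 0
        voicing_err = ref_v != est_v
        both = ref_v & est_v
        pitch_err = np.zeros_like(voicing_err)
        pitch_err[both] = np.abs(est[both] - ref[both]) / ref[both] > tol
        return (voicing_err | pitch_err).mean()

    ref = np.array([0, 120.0, 122.0, 0, 200.0])
    est = np.array([0, 118.0, 160.0, 90.0, 201.0])
    print(f"FFE = {ffe(ref, est):.2%}")   # 2 of 5 frames in error -> 40%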


#1218: Improved Alignment for Score Combination of RNN-T and CTC Decoder for Online Decoding

Chin Yuen Kwok, Jia Qi Yip and Eng Siong Chng

There has been growing interest in utilizing ensembles of CTC and RNN-T models to improve online ASR performance, as diverse model architectures can provide additional information and predictions for both training and decoding. To combine the CTC and RNN-T model predictions, previous works use shallow fusion to combine their frame-level scores during prefix beam search. However, this approach gives inferior results when the scores are not aligned, because the RNN-T and CTC models may each emit the same text token after a different number of frames. This misalignment means scores are wrongly combined at frames where one model outputs a text token but the other outputs a blank token. To address this, this paper proposes aligning the RNN-T and CTC outputs using a sliding-window algorithm that performs text-to-text matching and avoids wrongly combining the scores at text-blank outputs. On AISHELL-1 and the Singapore National Speech Corpus, our method consistently reduces the character and word error rates, from 4.52% to 4.38% and from 21.29% to 20.03%, respectively.
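
The core matching idea can be sketched as follows (a simplified illustration, not the paper's exact algorithm): non-blank tokens emitted by the two decoders are paired within a sliding window of frames, so scores are combined token-to-token rather than frame-to-frame.

    def align_emissions(ctc, rnnt, window=5):
        """ctc, rnnt: lists of (frame, token) for non-blank emissions."""
        pairs, used = [], set()
        for frame_c, tok_c in ctc:
            for j, (frame_r, tok_r) in enumerate(rnnt):
                if j in used:
                    continue
                if tok_r == tok_c and abs(frame_r - frame_c) <= window:
                    pairs.append((tok_c, frame_c, frame_r))  # combine scores here
                    used.add(j)
                    break
        return pairs

    ctc_out = [(3, "h"), (7, "i")]
    rnnt_out = [(5, "h"), (10, "i")]          # same tokens, emitted later
    print(align_emissions(ctc_out, rnnt_out))  # [('h', 3, 5), ('i', 7, 10)]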


#1216: Improving and Understanding Clarifying Question Generation in Conversational Search

Daniel Ortega, Steven Söhnel and Ngoc Thang Vu

Conversational information-seeking systems (CISs), such as chatbots and virtual personal assistants, encounter difficulty in processing ambiguous user requests (URs) and generating an accurate response, especially when multiple search results match the given request. As a result, machine-generated clarifying questions (CQs) can be used to refine the user's intent and provide a more precise answer to the initial request. In this paper, we introduce a CIS that can identify the need for clarification in URs and, when necessary, generate appropriate CQs based on the most relevant answers obtained from web search results, leveraging our novel modular approach; otherwise, the system directly provides the most likely answer to the user. Experimental results on our enhanced version of the ClariQ dataset show the effectiveness of generating relevant and varied CQs, as evaluated by automatic metrics for fluency and informativeness, BLEURT and Distinct-N. Additionally, our experimental results are comparable to or outperform previous approaches in terms of traditional NLG evaluation metrics such as BLEU, ROUGE, and METEOR. Finally, we conducted a user study assessing the CQs on five key aspects: grammaticality, being on-topic, specificity, new information, and narrowing down, which confirmed their adequacy. Moreover, our comprehensive analysis identified correlations among these quality aspects.


#1269: Introducing LCC's NavProc 1.0 Corpus: Annotated Procedural Texts in the Naval Domain

Michael Mohler, Sandra Lee, Mary Brunson and David Bracewell

In this work, we introduce the NavProc 1.0 Corpus -- a medium-scale, annotated corpus of procedural texts within the naval domain -- for use as a first step in modeling procedural structures derived from real-world data sources. In particular, we have rigorously produced annotations of frame semantics (i.e., PropBank-inspired trigger/role links) across verbal, nominal, and adjectival frames. Furthermore, we have annotated 21 distinct types of semantic markers and structural links between textual elements (e.g., frame triggers, entities, modifiers) which, taken together, result in a text-focused graph of semantic elements. Such a graph can be used to derive a more complex procedure structure for use in personnel training, simulation, or collaborative procedure execution. Altogether, this annotation effort has encompassed 158 procedural units composed of 2,316 sentences, 44,459 tokens, and 48,137 distinct span annotations. Furthermore, we describe and report LLM-based extraction scores for use as a baseline in future research using this dataset.


#1203: Investigating Low-Cost LLM Annotation for Spoken Dialogue Understanding Datasets

Lucas Druart, Valentin Vielzeuf and Yannick Estève

In spoken Task-Oriented Dialogue (TOD) systems, the choice of the semantic representation describing the users' requests is key to a smooth interaction. Indeed, the system uses this representation to reason over a database and its domain knowledge to choose its next action. The dialogue course thus depends on the information provided by this semantic representation. While textual datasets provide fine-grained semantic representations, spoken dialogue datasets fall behind. This paper provides insights into the automatic enhancement of the semantic representations of spoken dialogue datasets. Our contributions are threefold: (1) we assess the relevance of Large Language Model fine-tuning, (2) we evaluate the knowledge captured by the produced annotations, and (3) we highlight the implications of semi-automatic annotation.


#1192: Is Prompting What Term Extraction Needs?

Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Julien Delaunay, Antoine Doucet and Senja Pollak

Automatic term extraction (ATE) is a natural language processing (NLP) task that reduces the effort of manually identifying terms from domain-specific corpora by providing a list of candidate terms. This paper summarizes our research on the applicability of open and closed-source large language models (LLMs) to the ATE task, compared to two benchmarks that treat ATE as a sequence-labeling (iobATE) and a seq2seq ranking (templATE) task, respectively. We propose three forms of prompting design: (1) sequence-labeling response; (2) text-extractive response; and (3) text-generative response, which fills the gap between the other two types. We conduct experiments on the ACTER corpora in three languages and four domains with two different gold standards: one including only terms (ANN) and the other covering both terms and entities (NES). Our empirical inquiry unveils that, among all the prompting formats, text-extractive and text-generative responses exhibit the greater ability in few-shot setups where training data are scarce, and they surpass the performance of the templATE classifier in all scenarios. The performance of LLMs is close to that of fully supervised sequence-labeling models, offering a valuable trade-off by reducing the need for extensive data annotation. This demonstrates LLMs' potential for pragmatic, real-world applications characterized by the constricted availability of labeled examples.


#1238: Joint-Average Mean and Variance Feature Matching (JAMVFM) Semi-supervised GAN with Additional-Objective Training Function for Intent Detection

Ankit Kumar and Munir Georges

Intent detection, a crucial task in spoken language understanding (SLU) systems, often faces challenges due to the requirement for extensive labeled training data. However, the process of collecting such data is both resource-intensive and time-consuming. To mitigate these challenges, leveraging Semi-Supervised Generative Adversarial Networks (SS-GANs) presents a promising strategy. By employing SS-GANs, it becomes possible to fine-tune pre-trained transformer models like BERT using unlabeled data, thereby improving intent detection performance without the need for extensive labeled datasets. This article introduces a novel approach called Joint-Average Mean and Variance Feature Matching GAN (JAMVFM-GAN) with an additional training objective to improve SS-GAN learning. By incorporating information about both the mean and the variance during latent feature learning, JAMVFM-GAN aims to capture the underlying data manifold more accurately. In addition to JAMVFM, we propose an additional loss function for the discriminator to enhance its discriminative capabilities. Experimental results demonstrate that JAMVFM-GAN with the additional objective function outperforms the traditional SS-GAN on intent detection tasks. The results indicate maximum relative improvements of 3.84%, 3.85%, and 1.04% over the baseline on the ATIS, SLURP, and SNIPS datasets, respectively.
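
The central loss can be sketched as matching both the mean and the variance of intermediate discriminator features between real and generated batches; the equal weighting below is an assumption, not the paper's exact objective.

    import torch

    def joint_mean_variance_fm(real_feats, fake_feats, alpha=0.5):
        """real_feats, fake_feats: (batch, dim) discriminator features."""
        mean_term = torch.norm(real_feats.mean(0) - fake_feats.mean(0), p=2) ** 2
        var_term = torch.norm(real_feats.var(0) - fake_feats.var(0), p=2) ** 2
        return alpha * mean_term + (1 - alpha) * var_term

    real = torch.randn(32, 768)   # e.g., BERT-based discriminator features
    fake = torch.randn(32, 768)   # generator outputs
    print(joint_mean_variance_fm(real, fake))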


#1264: Kernel Least Squares Transformations for Cross-lingual Semantic Spaces

Adam Mištera and Tomáš Brychcín

The rapid development in the field of natural language processing (NLP) and the increasing complexity of linguistic tasks demand the use of efficient and effective methods. Cross-lingual linear transformations between semantic spaces play a crucial role in this domain. However, compared to more advanced models such as transformers, linear transformations often fall short, especially in terms of accuracy. It is thus necessary to employ innovative approaches that not only enhance performance but also maintain low computational complexity. In this study, we propose Kernel Least Squares (KLS) for linear transformation between semantic spaces. In our comprehensive analysis involving three intrinsic and two extrinsic experiments across six languages from three different language families and a comparative evaluation with nine different linear transformation methods, we demonstrate the superior performance of KLS. Our results show that the proposed method significantly improves word translation accuracy, thereby standing out as the most efficient method for transforming only the source semantic space.
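
As a rough illustration of the idea, scikit-learn's kernel ridge regression (a regularized kernel least squares) can learn a non-linear map from source-language vectors onto the target space; the paper's exact KLS formulation may differ, and the data below are random placeholders for aligned translation pairs.

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge

    d, n_pairs = 300, 2000
    X_src = np.random.randn(n_pairs, d)    # source-language word vectors
    Y_tgt = np.random.randn(n_pairs, d)    # target vectors of translation pairs

    kls = KernelRidge(kernel="rbf", alpha=1.0, gamma=1e-3)
    kls.fit(X_src, Y_tgt)                  # learn the transformation

    mapped = kls.predict(X_src[:10])       # project new source words, then
    print(mapped.shape)                    # retrieve nearest target neighbours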


#1233: Leveraging Conceptual Similarities to Enhance Modeling of Factors Affecting Adolescents' Well-Being

Ondřej Sotolář, Jaromír Plhák and David Šmahel

While large language models consistently outperform their smaller transformer-based counterparts, there are constraints on their deployment. Model size becomes a critical limiting factor in cases involving sensitive data, particularly when the imperative is to execute inference on edge devices such as smartphones. We explore the possibility of detecting common positive and negative influence factors that impact adolescents' well-being in instant messenger communication using a newly annotated dataset. We show that by leveraging the similarities between the concepts, we can produce classifiers with a small ELECTRA-based model with 14M parameters that can run on resource-limited edge devices. Our findings can be used to advance intervention and parental control software, creating a safer digital environment for children and adolescents.


#1293: Mistrík's Readability Metric -- an Online Library

Mária Pappová and Matúš Valko

The term "readability" describes how simple it is for a reader to understand a written text. This can be measured with a variety of readability metrics. While some tools exist for assessing the readability of Slovak texts, no free or open-source tools currently offer this functionality. This article presents an online Python library that uses Mistrík's readability metric for the Slovak language. We developed an open-source library for measuring the readability score of Slovak texts and evaluated the findings from Mistrik's initial investigation approach.


#1282: Models and Strategies for Russian Word Sense Disambiguation: A Comparative Analysis

Anastasiia Aleksandrova and Joakim Nivre

Word sense disambiguation (WSD) is a core task in computational linguistics that involves interpreting polysemous words in context by identifying senses from a predefined sense inventory. Despite the dominance of BERT and its derivatives in WSD evaluation benchmarks, their effectiveness in encoding and retrieving word senses, especially in languages other than English, remains relatively unexplored. This paper provides a detailed quantitative analysis, comparing various BERT-based models for Russian, and examines two primary WSD strategies: fine-tuning and feature-based nearest-neighbor classification. The best results are obtained with the ruBERT model coupled with the feature-based nearest neighbor strategy. This approach adeptly captures even fine-grained meanings with limited data and diverse sense distributions.


#1247: Multiword Expressions Resources for Italian: Presenting a Manually Annotated Spoken Corpus

Ilaria Manfredi

Multiword expressions (MWEs) are word combinations that behave like a unit by showing some type of semantic, syntactic or functional idiosyncrasy. MWEs are a pervasive phenomenon in language, and the importance of their computational treatment in many Natural Language Processing tasks has long been recognized. However, the ambiguous nature and unpredictable behavior of MWEs make the creation of resources such as lexicons, dictionaries, and annotated corpora difficult and time-consuming. The available resources often focus only on specific kinds of MWEs, leaving out the others. This paper presents a corpus of spoken Italian annotated with MWEs by five annotators who are native speakers of Italian and experts in linguistics. The annotation was done in context and comprehensively, without focusing on specific types of MWEs. The corpus contains approximately 50,000 words and 1050 MWE forms, corresponding to 269 MWE types. The corpus is a new and unique resource for spoken Italian and MWEs, which can be used as a gold standard for MWE identification systems and other linguistic tasks. It can also prompt more research on MWEs in spoken varieties of languages.


#1246: Named Entity Linking in English-Czech Parallel Corpus

Zuzana Nevěřilová and Hana Žižková

We present a procedure for quickly building new resources with annotated named entities and their links to Wikidata. First, we applied state-of-the-art models for named entity recognition to a sentence-aligned parallel English-Czech corpus. We selected the most common entity classes: person, location, organization, and miscellaneous. Second, we manually checked the corpus in a suitably configured annotation application. Third, we used a state-of-the-art tool for named entity linking and enhanced the candidate ranking using sentence embeddings obtained from sentence transformers. We then manually checked whether the linking to the knowledge base was correct. As a result, we added two annotation layers to an existing parallel corpus: one with the named entities and one with links to Wikidata. The corpus contains 14,881 parallel Czech-English sentences and 3,769 links to Wikidata. It can be used for training more robust named entity recognition and named entity linking models and for linguistic research on parallel news texts.
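
The re-ranking step can be sketched with the sentence-transformers library: score knowledge-base candidates by the similarity of their descriptions to the mention's sentence context. The model name and candidate descriptions below are placeholders, not the paper's configuration.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    context = "Washington signed the bill into law on Tuesday."
    candidates = {
        "Q61":   "Washington, D.C., capital city of the United States",
        "Q23":   "George Washington, first president of the United States",
        "Q1223": "Washington, state of the United States",
    }

    ctx_emb = model.encode(context, convert_to_tensor=True)
    for qid, description in candidates.items():
        score = util.cos_sim(ctx_emb, model.encode(description, convert_to_tensor=True))
        print(qid, float(score))   # pick the highest-scoring Wikidata entry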


#1209: Neural Spell-Checker: Beyond Words with Synthetic Data Generation

Matej Klemen, Martin Božič, Špela Arhar Holdt, and Marko Robnik-Šikonja

Spell-checkers are valuable tools that enhance communication by identifying misspelled words in written texts. Recent improvements in deep learning, and in particular in large language models, have opened new opportunities to improve traditional spell-checkers with new functionalities that assess not only spelling correctness but also the suitability of a word for a given context. In our work, we present and compare two new spell-checkers and evaluate them on synthetic, learner, and more general-domain Slovene datasets. The first spell-checker is a traditional, fast, word-based approach built on a morphological lexicon with a significantly larger word list than existing spell-checkers. The second uses a language model trained on a large corpus with synthetically inserted errors. We present the training data construction strategies, which turn out to be a crucial component of neural spell-checkers. Furthermore, the proposed neural model significantly outperforms all existing spell-checkers for Slovene in both precision and recall.
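
One possible training-data construction strategy of the kind described, injecting synthetic character-level errors into correct text, can be sketched as follows; the error types and rates are illustrative assumptions, not the paper's recipe.

    import random

    ALPHABET = "abcdefghijklmnopqrstuvwxyzčšž"

    def corrupt(word, p=0.3):
        """Return the word unchanged, or with one swap/drop/replace error."""
        if len(word) < 3 or random.random() > p:
            return word
        i = random.randrange(len(word))
        op = random.choice(("swap", "drop", "replace"))
        if op == "swap" and i < len(word) - 1:
            return word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if op == "drop":
            return word[:i] + word[i + 1:]
        return word[:i] + random.choice(ALPHABET) + word[i + 1:]

    sentence = "danes je lepo vreme".split()
    pairs = [(w, corrupt(w)) for w in sentence]   # (correct, possibly corrupted)
    print(pairs)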


#1211: New Human-Annotated Dataset of Czech Health Records for Training Medical Concept Recognition Models

Anetta Krištof and Aleš Horák

Following the widespread success of leveraging recent large language models (LLMs) in various NLP tasks, this paper focuses on understanding the content of medical texts. Adapting a foundational LLM to the medical domain requires a special kind of dataset in which core medical concepts are accurately annotated. This paper addresses the need for better medical concept recognition in free-text electronic health records in low-resourced Slavic languages and introduces CSEHR, a new human-annotated dataset of Czech oncology health records. It describes the dataset's inception, management, considerations, and processing, and finally presents baseline concept recognition results. XLM-RoBERTa models trained on the dataset using 5-fold cross-validation achieved an average weighted F1 score of 0.672 for exact and 0.777 for partial medical concept recognition, ranging from 0.335 to 0.857 across concept classes. The paper then describes future plans to bootstrap larger annotated corpora from the CSEHR dataset and to make the dataset publicly available. This endeavor is unique in the realm of Slavic languages and already at this stage represents a major step in the field of Slavic medical concept recognition.


#1287: Open-Source Web Service with Morphological Dictionary--Supplemented Deep Learning for Morphosyntactic Analysis of Czech

Milan Straka and Jana Straková

We present an open-source web service for Czech morphosyntactic analysis. The system combines a deep learning model with rescoring by a high-precision morphological dictionary at inference time. We show that our hybrid method surpasses two competitive baselines: the deep learning model ensures generalization for out-of-vocabulary words and better disambiguation, improving over the existing morphological analyser MorphoDiTa, while at the same time benefiting from the inference-time guidance of a manually curated morphological dictionary. We achieve a 50% error reduction in lemmatization and a 58% error reduction in POS tagging over MorphoDiTa, while also offering dependency parsing. The model is trained on one of the currently largest Czech morphosyntactic corpora, the PDT-C 1.0, with the trained models available at https://hdl.handle.net/11234/1-5293. We provide the tool as a web service deployed at https://lindat.mff.cuni.cz/services/udpipe/. The source code is available on GitHub (https://github.com/ufal/udpipe/tree/udpipe-2), along with a Python client for simple use. The documentation for the models can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model.
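
A short usage sketch of the REST endpoint with the Python requests library; the model identifier below is an assumption, so consult the service documentation linked above for the exact parameter values.

    import requests

    r = requests.get(
        "https://lindat.mff.cuni.cz/services/udpipe/api/process",
        params={
            "model": "czech-pdtc1.0",    # assumed model identifier
            "tokenizer": "", "tagger": "", "parser": "",
            "data": "Děti pily čistou vodu.",
        },
    )
    print(r.json()["result"])   # CoNLL-U with lemmas, tags, and dependencies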


#1212: PiCo-VITS: Leveraging Pitch Contours for Fine-grained Emotional Speech Synthesis

Kwan-yeung Wong and Fu-lai Chung

Text-to-speech (TTS) research has made significant progress towards human-like synthesized speech. Yet a noticeable research gap persists regarding emotional expressiveness in synthesized speech. While some existing studies address this through techniques such as voice conversion and textual context modeling, precise control over the emotion composition of synthesized speech remains challenging. Moreover, those techniques typically operate at the sentence level, precluding fine-grained control over emotion transitions. Emotion composition in utterances is multifaceted, dictated by a multitude of linguistic and acoustic features; among these, pitch plays a crucial role, with distinct pitch contours often signaling specific emotions at different points in a spoken sentence. Aiming for more controllable emotional speech synthesis, we propose PiCo-VITS, an end-to-end TTS model architecture that leverages pitch contours in conjunction with latent features. Experimental results demonstrate the efficacy of the proposed model in synthesizing speech that conveys mixed emotions. Notably, the model allows both the desired emotions and the emotion transition patterns to be specified, while maintaining intelligibility comparable to state-of-the-art techniques.


#1200: Retrieval Augmented Spoken Language Generation for Transport Domain

Gokul Srinivasagan and Munir Georges

Retrieval-augmented generation (RAG) based models have gained significant attention in recent times, mainly due to their ability to address key challenges such as mitigating hallucination, incorporating knowledge from external sources, and providing traceability in the reasoning process. While numerous works in the textual domain leverage additional knowledge to enhance performance, the adaptability of RAG-based models to the speech domain remains largely unexplored. This approach is particularly well-suited for transport applications, where schedules change constantly and the model needs to be aware of these changes to provide users with up-to-date information. Datasets for such tasks are lacking, and the applicability of language models in the transport domain remains underexplored. In this work, we address these problems by exploiting pretrained large language models to generate a synthetic dataset for transport applications. We also utilize pretrained language models to evaluate the performance of our cascaded RAG system. The experimental results revealed that our approach is less prone to hallucination and can generate grammatically correct responses to user queries.


#1222: Robust Classification of Parkinson’s Speech: an Approximation to a Scenario With Non-controlled Acoustic Conditions

Diego Alexander Lopez-Santander, Cristian David Rios-Urrego, Christian Bergler, Elmar Nöth and Juan Rafael Orozco-Arroyave

Several studies have shown that Parkinson's disease (PD) can be detected from speech signals. However, most of them focus on clean speech recorded in controlled noise environments with standardized equipment, which may limit their accessibility and application in realistic scenarios. In this study, we analyze the performance of PD detection models when a modified version of the ORCA-CLEAN denoiser is applied. The denoiser was re-trained on human speech to clean noisy pathological speech signals before the classification stage. The residual signals were explored to determine whether the denoising process effectively removes unwanted noise while preserving the essential speech features related to the disease and therefore relevant to PD detection. The experiments were conducted using recordings from the PC-GITA database along with replicas of it created by adding different levels of artificial noise. The results demonstrate remarkable robustness in classification accuracy despite high levels of added noise. These findings suggest that integrating denoising techniques into the PD classification pipeline can yield reliable and accurate results even in non-ideal environments, potentially leading to more accessible technology with applications in real-world scenarios.


#1208: Sentences vs Phrases in Neural Speech Synthesis

Daniel Tihelka, Jindřich Matoušek, Zdeněk Hanzlíček and Lukáš Vladař

Neural network-based TTS models are usually trained and run on whole sentences or, in general, on longer chunks of speech. However, this may negatively affect the responsiveness of the TTS system in cases where latency should be kept as low as possible. We present experiments using smaller chunk lengths, namely phrases, and their impact on speech quality when various chunk-length combinations are used for training and inference in the VITS synthesizer.


#1191: SeqCondenser: Inductive Representation Learning of Sequences by Sampling Characteristic Functions

Maixent Chenebaux and Tristan Cazenave

In this work, we introduce SeqCondenser, a neural network layer that compresses a variable-length input sequence into a fixed-size vector representation. The SeqCondenser layer samples the empirical characteristic function and its derivatives for each input dimension, and uses an attention mechanism to determine the associated probability distribution. We argue that the features extracted through this process effectively represent the entire sequence and that the SeqCondenser layer is particularly well-suited for inductive sequence classification tasks, such as text and time series classification. Our experiments show that SCoMo, a SeqCondenser-based architecture, outperforms the state-of-the-art inductive methods on nearly all examined text classification datasets and also outperforms the current best transductive method on one dataset.
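
For intuition, the statistic underlying the layer is the empirical characteristic function phi(t) = mean(exp(i*t*x)), sampled at a few points t per input dimension; the sketch below shows how this yields a fixed-size vector from a variable-length sequence. The sampling points are chosen arbitrarily here, and the paper's attention mechanism is not reproduced.

    import numpy as np

    def empirical_cf(x, ts):
        """x: (T,) one input dimension of a sequence; ts: sample points."""
        return np.array([np.exp(1j * t * x).mean() for t in ts])

    x = np.random.randn(500)                  # variable-length sequence, one dimension
    ts = np.linspace(0.1, 2.0, 8)             # where to sample the ECF
    phi = empirical_cf(x, ts)
    features = np.concatenate([phi.real, phi.imag])   # fixed-size representation
    print(features.shape)                     # (16,) regardless of sequence length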


#1268: StepDP: A Step Towards Expressive and Pervasive Dialogue Platforms

Julian Wolter, Niko Kleer and Michael Feld

The advancement of human-machine interaction and the expectation of interacting with devices in a natural way necessitate the development of multi-modal dialogue systems that can process and respond to various forms of input modalities. Despite effective recognition methods for the different modalities, their integration into reusable frameworks often falls short, impeding rapid development and the long-term maintenance of systems that incorporate multiple modalities. This paper introduces stepDP, a novel open-source dialogue platform designed to overcome these challenges by facilitating the quick and efficient implementation of multi-modal dialogue systems following a model-driven design paradigm. Well-researched concepts and algorithms were integrated into a framework and fine-tuned to work together seamlessly. One core concept is that the dialogue logic is abstracted from the actual input modalities, allowing for the generalisation of dialogue behavior and seamless integration across domains. Emphasizing modularity and flexibility, stepDP allows for the easy integration of new features, for example the combination of traditional NLU techniques with innovative LLMs, without requiring extensive system modifications. Our platform not only accelerates the development process but also promotes the exploration of new concepts and techniques in human-machine interaction.


#1241: Stream-Based Active Learning for Speech Emotion Recognition via Hybrid Data Selection and Continuous Learning

Santiago A. Moreno-Aceved, Juan Camilo Vasquez-Correa, Juan M. Martín-Doñas and Aitor Álvarez

This work proposes a novel stream-based Active Learning (AL) approach applied to Speech Emotion Recognition (SER) in real-life scenarios where new data are generated from different domains. The goal is to address major challenges in this field, including the lack of large labeled datasets, the difficulty of annotation, and the retrieval of representative emotional data. AL addresses these problems by selecting/querying a small and valuable subset to be annotated with optimized labeling effort and minimum resources. To this end, we consider a stream-based AL methodology leveraging MLOps principles and human-in-the-loop methods to continuously adapt previously trained deep learning models, ensuring both challenging and diverse audio samples and reducing the performance gap related to data diversity, cross-domain contexts, and continuous data ingestion. The pipeline was tested across several domains within three distinct scenarios, including both non-stream- and stream-based approaches, as well as a pocket-stream alternative that only updates the previously trained models when significant improvements are obtained. The experimental results show that our proposed method achieves competitive performance following the AL pocket-stream strategy with just 20% of the original training data. This ensures good performance with a low annotation budget and continuous adaptation for practical, real-world environments.


#1251: TamSiPara: A Tamil -- Sinhala Parallel Corpus

Randil Pushpananda, Chamila Liyanage, Ashmari Pramodya and Ruvan Weerasinghe

This paper presents the development of a Sinhala-Tamil bilingual parallel corpus with sentence-level alignment. The corpus comprises source-language text from contemporary writings, with all sentences translated manually by two teams of professional translators. Active learning methods were employed to select sentences, ensuring that representative language structures of both languages are covered. The corpus is divided into two parts: one translated in the Sinhala-to-Tamil direction (25k parallel sentences) and the other in the Tamil-to-Sinhala direction (22k parallel sentences). The resulting final version of TamSiPara, the Tamil-Sinhala bilingual parallel corpus, consists of 47k parallel sentences in total.


#1197: The Aranea Corpora Family: Ten+ Years of Processing Web-Crawled Data

Vladimír Benko

Aranea is a project creating a family of web-crawled corpora for languages taught at Slovak universities. Since 2013, more than two dozen languages have been added to the project; they are often represented by Gigaword+-sized corpora and sometimes have subcorpora for their territorial varieties. Our paper summarizes the development of the Aranea project over the past decade. We describe the step-by-step optimization of the processing pipeline, highlight existing issues, and discuss the linguistic rationale behind some engineering decisions associated with the idiosyncrasies of individual languages.


#1267: Unsupervised Extraction of Morphological Categories for Morphemes

Abishek Stephen, Vojtěch John and Zdeněk Žabokrtský

Words in natural language can be assigned to specific morphological categories. For example, the English word 'apples' can be described using morphological labels like N;PL. The conditional probabilities of such word forms given the labels would reveal, for English, that the morpheme 's' is present almost always when the label N;PL appears. This indicates that the morphological properties of a word can be traced to its morphemes. However, no existing data resource associates morphemes with morphological categories. We use the UniMorph schema and datasets for universal morphological annotation as a source of morphological categories and morpheme segmentation, and align morphemes (or exponents) with the corresponding morphological categories for 12 languages. Given the multilingual nature of the task, we utilize unsupervised methods based on the ΔP measure and IBM Models, testing the effectiveness of alignment methods used in statistical machine translation. Our results indicate that IBM Models accurately capture the alignment asymmetries between morphemes and morphological categories under non-trivial alignment settings.
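
The ΔP measure itself is a simple asymmetric association score: for a morpheme m and a category label c, ΔP = P(c | m) − P(c | ¬m), computable from a 2×2 contingency table. A minimal sketch with made-up counts:

```python
def delta_p(a, b, c, d):
    """Delta P from a 2x2 contingency table of (morpheme, category) counts:
    a = morpheme present, category present;  b = morpheme present, category absent;
    c = morpheme absent,  category present;  d = morpheme absent,  category absent.
    Returns P(category | morpheme) - P(category | no morpheme)."""
    return a / (a + b) - c / (c + d)

# Toy counts for the English plural suffix 's' vs. the N;PL label.
print(delta_p(980, 20, 5, 995))  # 0.975 -> strong, nearly deterministic association
```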


#1231: Using Neural Coherence Models to Assess Discourse Coherence

Lilia Azrou, Houda Oufaida, Philippe Blache and Israa Hamdine

Discourse coherence is an important characteristic of well-written texts and coherent speech. It is observed at several levels of discourse analysis: lexical, syntactic, semantic, and pragmatic. Recent work on discourse coherence uses deep neural network architectures to model coherence; however, most of these architectures are not linguistically explainable. In this paper, we propose a fine-tuned Large Language Model (LLM) and three interpretable approaches for modeling discourse coherence that target different levels of discourse analysis: contextual information, semantic relatedness between adjacent sentences and paragraphs, and syntactic patterns of coherent texts. We investigate whether these explainable approaches achieve competitive results compared to the fine-tuned LLM. The architectures are evaluated on the multi-domain Grammarly Corpus of Discourse Coherence (GCDC) and compared to state-of-the-art (SOTA) and recent models. Our experiments show that syntactic patterns combined with semantic relatedness are a good indicator of overall coherence, and they highlight the importance of the number of training examples for the model's ability to exploit the syntactic patterns. Furthermore, the contextual information captured by the transformer-based model significantly outperforms all other models, showing that a fine-tuned LLM is now typically the best-performing approach, despite being less interpretable than the other methods.
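
One of the interpretable signals, semantic relatedness between adjacent sentences, is straightforward to approximate with any sentence encoder. A minimal sketch (the encoder name is an arbitrary choice, not the one used in the paper):

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def adjacent_relatedness(sentences):
    """Mean cosine similarity between embeddings of adjacent sentences --
    a simple proxy for local semantic coherence."""
    emb = encoder.encode(sentences)
    sims = [float(cos_sim(emb[i], emb[i + 1])) for i in range(len(emb) - 1)]
    return sum(sims) / len(sims)

text = ["The cat sat on the mat.", "It purred quietly.", "Stock prices fell sharply."]
print(adjacent_relatedness(text))  # incoherent topic jumps lower the score
```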


#1285: X-vector-based Speaker Diarization Using Bi-LSTM and Interim Voting-driven Post-processing

J. B. Mala, Alex Raj S. M. and Rajeev Rajan

In this work, we propose a voting-driven post-processing strategy for enhancing the efficacy of supervised speaker diarization models. Speaker embeddings (x-vectors) are used to train deep learning architectures such as convolutional neural networks (CNN), bi-directional gated recurrent units (Bi-GRU), and bi-directional long short-term memory (Bi-LSTM). A state-of-the-art unsupervised diarization baseline, implemented with agglomerative hierarchical clustering (AHC) and a cosine affinity measure, obtains a DER of 26.07%. Among the supervised frameworks, Bi-LSTM achieves the lowest diarization error rate (DER) of 18.42% on the CallHome dataset. To further enhance the performance of the supervised diarization models, we introduce an interim voting-driven post-processing strategy using dynamic time warping (DTW) and Euclidean distance (ED) on the predicted speaker labels. This interim voting and centroid-distance framework reassigns mispredicted speakers to the most probable speaker's feature space, leading to a notable reduction in DER. The experiments demonstrate that integrating the proposed approach with Bi-LSTM reduces the DER to 10.26%, an absolute improvement of 8.16 percentage points over the non-voting Bi-LSTM framework.
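
For reference, the unsupervised AHC baseline is a few lines with SciPy: cluster per-segment x-vectors under cosine distance and cut the dendrogram at a threshold, treating each cluster as one speaker. The linkage method and threshold below are assumptions, not the paper's exact configuration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def ahc_cosine(xvectors, threshold=0.3):
    """Agglomerative hierarchical clustering of per-segment x-vectors
    using cosine distance; each resulting cluster is one speaker."""
    dists = pdist(xvectors, metric="cosine")   # condensed distance matrix
    tree = linkage(dists, method="average")
    return fcluster(tree, t=threshold, criterion="distance")

segments = np.random.randn(100, 512)           # 100 segments, 512-d x-vectors
speaker_labels = ahc_cosine(segments)
```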


#1213: Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model

Jan Lehečka, Zdeněk Hanzlíček, Jindřich Matoušek and Daniel Tihelka

In this paper, we experiment with a SpeechT5 model pre-trained on large-scale datasets: we pre-trained the foundation model from scratch and fine-tuned it on a large-scale, robust multi-speaker text-to-speech (TTS) task. We tested the model's capabilities in zero- and few-shot scenarios. Based on two listening tests, we evaluated the quality of the synthetic audio and how closely the synthetic voices resemble the real ones. Our results show that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data. We demonstrate the high quality and similarity of our synthetic voices on publicly known Czech politicians and celebrities.
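
The zero-shot mechanism can be illustrated with the public (English) SpeechT5 checkpoints on Hugging Face, which expose the interface this line of work builds on: synthesis is conditioned on a speaker x-vector, so supplying an embedding computed from roughly one minute of a new speaker's audio yields that speaker's voice. The model names below are the public English ones, standing in for the authors' Czech model:

```python
# Requires: pip install transformers torch
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, world.", return_tensors="pt")
# 512-d x-vector of the target speaker; in the zero-shot setting it would be
# computed from the target speaker's audio instead of sampled randomly.
speaker_embedding = torch.randn(1, 512)
waveform = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
```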


#1270: A Pipeline for Automatic Construction and Applications of College Curriculum Knowledge Graph

Di Wu and Munir Georges

With the rapidly increasing number of specialized courses and fragmented learning materials, knowledge graph (KG) technologies have become a powerful visualization tool for offering learners and tutors educational guidance. In this paper, we automatically construct knowledge concept graphs based on an analysis of educational ontologies and demonstrate their application in personalized learning-path recommendation. We refine the extracted concept entities and relations with a validation module and enhance knowledge correctness with LLMs. Focusing on STEM majors, our work has proven effective in helping students systematically organize multilingual knowledge and better achieve different learning goals.
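
As an illustration of the learning-path application, a prerequisite-style KG can be queried recursively to order concepts before a learning goal. Everything below (triple layout, relation name, concepts) is a hypothetical mini-example, not the paper's pipeline:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str

kg = [
    Triple("linear_algebra", "prerequisite_of", "machine_learning"),
    Triple("calculus", "prerequisite_of", "machine_learning"),
    Triple("machine_learning", "prerequisite_of", "deep_learning"),
]

def learning_path(goal, kg):
    """Collect prerequisites depth-first to build an ordered study path."""
    path = []
    for t in kg:
        if t.relation == "prerequisite_of" and t.tail == goal:
            for c in learning_path(t.head, kg) + [t.head]:
                if c not in path:
                    path.append(c)
    return path

print(learning_path("deep_learning", kg) + ["deep_learning"])
# ['linear_algebra', 'calculus', 'machine_learning', 'deep_learning']
```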


#1280: Greet the speaking book: Pupil Identification on Personal Primer Prototypes 1, 2, 4, 8

DigiEduBerlin Team

In 2022, the 0th prototype of a post-smartphone educational instrument known as the "Personal Primer" was presented to the TSD audience. In this next TSD presentation, four further prototypes of this book-like artefact will be shown. The demonstration will focus on the voice-identification aspects of the Primer, which are based on: 1) greeting the speaking book, 2) computing ECAPA-TDNN embeddings of the greeting, and 3) retrieving the closest matches from a vector database. During a live demonstration, it will be shown that a single "Ahoj" is enough to correctly identify at least 2 out of 3 TSD attendees.
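
Steps 2 and 3 correspond to a standard speaker-identification recipe: embed the greeting with a pretrained ECAPA-TDNN model and find the nearest enrolled speaker by cosine similarity. A minimal sketch using SpeechBrain's public VoxCeleb model (a plausible stand-in, not necessarily the Primer's exact setup; the import path is speechbrain.pretrained in older releases):

```python
# Requires: pip install speechbrain torchaudio
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(wav_path):
    signal, _sr = torchaudio.load(wav_path)          # expects 16 kHz mono
    return encoder.encode_batch(signal).squeeze()    # 192-d ECAPA embedding

def closest_speaker(query_emb, enrolled):
    """enrolled: dict mapping pupil name -> previously stored embedding."""
    sims = {name: torch.nn.functional.cosine_similarity(query_emb, emb, dim=0).item()
            for name, emb in enrolled.items()}
    return max(sims, key=sims.get)
```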


#1300: Opravidlo: From Beta version to Opravidlo 2.0

Hana Žižková

Opravidlo is an online proofreading tool for Czech texts, accessible at www.opravidlo.cz. The beta version of Opravidlo was released in June 2022, and the tool is available as a freely accessible web interface where users can correct texts pasted or written directly into it. The individual correction suggestions are based on formal rules that identify mistakes in spelling (punctuation, capitalization, common spelling mistakes), grammar (commas between clauses, grammatical agreement, ungrammatical sentence structures), and typography (according to the ČSN 01 6910 standard). However, the rule-based system has its limits, and the development of neural networks and machine learning has challenged us to combine both approaches and create an online proofreader for Czech texts, Opravidlo 2.0.


#1301: A Leaderboard for BenCzechMark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism

Martin Fajcik

We present the leaderboard for BenCzechMark (BCM), the first multitask and multimetric Czech language benchmark for large language models, with a unique scoring system grounded in statistical significance testing. The benchmark covers 54 challenging, mostly native Czech tasks spanning 11 categories and diverse domains, such as historical Czech, pupil and language-learner essays, and spoken word. We release and maintain a leaderboard where new model submissions can be made at https://huggingface.co/spaces/CZLC/BenCzechMark.
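
The duel idea can be sketched as a pairwise significance test: model A beats model B only if its per-example win rate is significantly above chance. The exact BCM mechanism may differ; this is a generic, hedged illustration using a one-sided binomial test:

```python
from scipy.stats import binomtest

def duel(scores_a, scores_b, alpha=0.05):
    """Declare a winner only when the per-example win rate exceeds 50%
    with statistical significance (tied examples are discarded)."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    decisive = sum(a != b for a, b in zip(scores_a, scores_b))
    if decisive == 0:
        return "tie"
    p = binomtest(wins_a, decisive, 0.5, alternative="greater").pvalue
    return "A wins" if p < alpha else "tie"

print(duel([1, 1, 1, 0, 1, 1, 1, 1], [0, 0, 1, 0, 0, 0, 0, 0]))  # "A wins"
```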


#1302: OneClick Terms: Bilingual Terminology Extraction

Michal Cukr

This demo presents OneClick Terms, an innovative tool for automatic terminology extraction, based on the technologies which make Sketch Engine tick. Tailored for linguists, translators, and researchers, the tool offers seamless generation of termbases from multilingual data, either from uploaded documents or URLs. Attendees will gain insights into how this tool optimizes terminology work, facilitates translation, and supports language research through automated processes and intuitive data visualization.


#1303: Language Services

Zuzana Nevěřilová

Language Services is an aggregator API offering access to various natural language processing services. The demo will present the API frontend, accessible services, administration, and usage statistics over the past five years.


#1304: Unsupervised Sense Classification For Word Sketches

Ondřej Herman

We present a tool that enriches word sketch data with word-sense annotation. A word sketch is a technology that describes the collocational behavior of words in large corpora based on morphosyntactic rules. Using unsupervised sense classification, the tool disambiguates word senses in word sketches, allowing linguists and lexicographers to explore more accurate word associations. By clustering similar meanings and providing contextually relevant collocations, it significantly improves the understanding of polysemous words in corpora.
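
A minimal illustration of the clustering step: embed the contexts in which a polysemous headword occurs and cluster them, so each cluster approximates one sense under which its collocations can be grouped. The encoder and cluster count are assumptions:

```python
# Requires: pip install scikit-learn sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_senses(contexts, n_senses=2):
    """Assign each occurrence context of a headword to a sense cluster."""
    emb = encoder.encode(contexts)
    return KMeans(n_clusters=n_senses, n_init=10).fit_predict(emb)

print(cluster_senses([
    "The bank approved the loan.",
    "We sat on the grassy bank of the river.",
    "She deposited cash at the bank.",
    "Fish swam near the muddy bank.",
]))  # e.g., [0, 1, 0, 1] -- financial vs. riverside senses
```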


TSD 2023 | TSD 2022 | TSD 2021