TSD 2020

Traditionally, there has been a disconnect between custom-built applications used to solve real-world information extraction problems in industry, and automated learning-based approaches developed in academia. Despite approaches such as transfer-based learning, adapting these to more customised solutions where the task and data may be different, and where training data may be largely unavailable, is still hugely problematic, with the result that many systems still need to be custom-built using expert hand-crafted knowledge, and do not scale. In the legal domain, a traditional slow adopter of technology, black box machine learning-based systems are too untrustworthy to be widely used. In industrial settings, the fine-grained highly specialised knowledge of human experts is still critical, and it is not obvious how to integrate this into automated classification systems. In this paper, we examine two case studies from recent work combining this expert human knowledge with automated NLP technologies.

#102: Multilingual Dependency Parsing from Universal Dependencies to Sesame Street

Joakim Nivre

Research on dependency parsing has always had a strong multilingual orientation, but the lack of standardized annotations for a long time made it difficult both to meaningfully compare results across languages and to develop truly multilingual systems. The Universal Dependencies project has during the last five years tried to overcome this obstacle by developing cross-linguistically consistent morphosyntactic annotation for many languages. During the same period, dependency parsing (like the rest of NLP) has been transformed by the adoption of continuous vector representations and neural network techniques. In this paper, I will introduce the framework and resources of Universal Dependencies, and discuss advances in dependency parsing enabled by these resources in combination with deep learning techniques, ranging from traditional word and character embeddings to deep contextualized word representations like ELMo and BERT.

#103: Multimodal Fake News Detection with Textual, Visual and Semantic Information

Anastasia Giachanou, Guobiao Zhang, Paolo Rosso

Recent years have seen a rapid growth in the amount of fake news posted online. Fake news detection is very challenging since fake news is usually created to contain a mixture of false and real information, together with manipulated images that confuse readers. In this paper, we propose a multimodal system that aims to differentiate between fake and real posts. Our system is based on a neural network and combines textual, visual and semantic information. The textual information is extracted from the content of the post, the visual information from the image associated with the post, and the semantic information refers to the similarity between the image and the text of the post. We conduct our experiments on three standard real-world collections and show the importance of these features for detecting fake news.

Behind the scenes of building a socialbot

Amazon Alexa Group

What happens when you replace a human participant in a conversation with a socialbot? What does it mean to have an engaging conversation with an AI assistant? How can that kind of conversation prove to be valuable, and can it provide its own kind of connection? The participants in this year’s Alexa Prize contest — an academic collaboration that empowers university students to innovate in the field of conversational AI by creating a socialbot using Alexa tools and capabilities — were driven by those questions.

Tune in to hear directly from the three winning teams of the Alexa Prize Socialbot Grand Challenge 3. Representatives from each team will touch on the challenges they set out to solve, the research that went into creating their socialbot, and the science behind their final award-winning results!

Creating AI driven voice experiences with Alexa Conversations

Amazon Alexa Group

This July, Amazon Alexa announced the public beta launch of Alexa Conversations dialogue management. With the launch, Alexa developers can now leverage a state-of-the-art dialogue manager powered by deep learning to create complex, nonlinear experiences — conversations that go well beyond today's typical one-shot interactions, such as "Alexa, what's the weather forecast for today?" or "Alexa, set a ten-minute pasta timer".

In this session, learn more about the science behind Alexa Conversations, and watch a demo of the capability. You’ll get insight into how Alexa Conversations enables customers to interact with Alexa in a natural and conversational manner, while simultaneously relieving developers of the effort they would typically need to expend in authoring complex dialogue management rules.

#1035: A Semantic Grammar for Augmentative and Alternative Communication Systems

Jayr Pereira, Natália Franco, Robson Fidalgo

The authoring of meaningful sentences is an essential requirement for AAC systems aimed at the education of children with complex communication needs. Some studies propose the use of linguistic knowledge databases to meet that requirement. In this paper, we propose and present a Semantic Grammar (SG) for AAC systems based on visual and semantic clues. The proposed SG was acquired using an automatic process based on Natural Language Processing (NLP) techniques for the extraction of semantic relations from text samples. We assessed the precision of the SG in suggesting the correct words when reconstructing telegraphic sentences and obtained an average precision of 90%.

#1023: A Systematic Study of Open Source and Commercial Text-to-Speech (TTS) Engines

Jordan Hosier, Jordan Kalfen, Nikhita Sharma, Vijay K. Gurbani

The widespread availability of open source and commercial text-to-speech (TTS) engines allows for the rapid creation of telephony services that require a TTS component. However, there exists neither a standard corpus nor common metrics to objectively evaluate TTS engines. Listening tests are a prominent method of evaluation in the domain where the primary goal is to produce speech targeted at human listeners. Nonetheless, subjective evaluation can be problematic and expensive. Objective evaluation metrics, such as word accuracy and contextual disambiguation (is "Dr." rendered as Doctor or Drive?), have the benefit of being both inexpensive and unbiased. In this paper, we study seven TTS engines, four open source engines and three commercial ones. We systematically evaluate each TTS engine on two axes: (1) contextual word accuracy (includes support for numbers, homographs, foreign words, acronyms, and directional abbreviations); and (2) naturalness (how natural the TTS sounds to human listeners). Our results indicate that commercial engines may have an edge over open source TTS engines.

#989: A Twitter Political Corpus of the 2019 10N Spanish Election

Javier Sánchez-Junquera, Simone Paolo Ponzetto, Paolo Rosso

We present a corpus of Spanish tweets from 15 Twitter accounts of politicians of the five main parties (PSOE, PP, Cs, UP and VOX), covering the campaign of the Spanish election of 10th November 2019 (10N Spanish Election). We perform a semi-automatic annotation of domain-specific topics using a mixture of keyword-based and supervised techniques. In this preliminary study we extracted the tweets of a few politicians of each party with the aim of analysing their official communication strategy. Moreover, we analyse the sentiments and emotions employed in the tweets. Despite the limited size of the Twitter corpus, which is due to the very short time span, we hope to provide some first insights into the communication dynamics of the social network accounts of these five Spanish political parties.

#1000: Acoustic Characteristics of VOT in Plosive Consonants Produced by Parkinson's Patients

Patricia Argüello-Vélez, Tomas Arias-Vergara, María Claudia González-Rátiva, Juan Rafael Orozco-Arroyave, Elmar Nöth, Maria Elke Schuster

Voice Onset Time (VOT) has been used as an acoustic measure for a better understanding of the impact of different motor speech disorders on speech production. The purpose of our paper is to present a methodology for the manual measurement of VOT in voiceless plosive sounds and to analyze its suitability for detecting specific articulation problems in Parkinson's disease (PD) patients. The experiments are performed with recordings of the diadochokinetic evaluation, which consists of the rapid repetition of the syllables /pa-ta-ka/. A total of 50 PD patients and 50 healthy speakers (HC) participated in this study. Manual measurements include VOT values and also the duration of the closure phase, the duration of the consonant, and the maximum spectral energy during the burst phase. Results indicate that the methodology is consistent and allows automatic classification of PD patients and healthy speakers with accuracies of up to 77%.

#1018: Adjusting BERT's Pooling Layer for Large-scale Multi-label Text Classification

Jan Lehečka, Jan Švec, Pavel Ircing, Luboš Šmídl

In this paper, we present our experiments with BERT models in the task of Large-scale Multi-label Text Classification (LMTC). In the LMTC task, each text document can have multiple class labels, while the total number of classes is in the order of thousands. We propose a pooling layer architecture on top of BERT models, which improves the quality of classification by using information from the standard [CLS] token in combination with pooled sequence output. We demonstrate the improvements on Wikipedia datasets in three different languages using public pre-trained BERT models.
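
The idea of combining the [CLS] vector with pooled sequence output can be illustrated with a minimal sketch. The abstract does not say which pooling operation or output layer is used, so mean pooling, concatenation, and a per-class sigmoid are assumptions here, and the function names `mean_pool`, `combined_features` and `multilabel_probs` are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors of one sequence, dimension by dimension."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def combined_features(cls_vector, token_vectors):
    """Concatenate the [CLS] vector with the mean-pooled token vectors.

    Assumption: concatenation is only one possible way to combine the two.
    """
    return list(cls_vector) + mean_pool(token_vectors)


def multilabel_probs(features, weights, biases):
    """One independent sigmoid per class, so several labels can be active."""
    probs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, features)) + b
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs


# Toy example: 2-dimensional hidden states, two tokens, one class.
features = combined_features([1.0, 2.0], [[0.0, 2.0], [2.0, 4.0]])
probs = multilabel_probs(features, [[0.0, 0.0, 0.0, 0.0]], [0.0])
```

In a large-scale multi-label setting the weight matrix would have one row per class (thousands of rows), and each probability would be thresholded independently.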

#1001: Assessing Unintended Memorization in Neural Discriminative Sequence Models

Mossad Helali, Thomas Kleinbauer, Dietrich Klakow

Despite their success in a multitude of tasks, neural models trained on natural language have been shown to memorize the intricacies of their training data, posing a potential privacy threat. In this work, we propose a metric to quantify unintended memorization in neural discriminative sequence models. The proposed metric, named d-exposure (discriminative exposure), utilizes language ambiguity and classification confidence to elicit the model's propensity to memorization. Through experimental work on a named entity recognition task, we show the validity of d-exposure to measure memorization. In addition, we show that d-exposure is not a measure of overfitting as it does not increase when the model overfits.

#1002: Assessing the Dysarthria Level of Parkinson's Disease Patients with GMM-UBM Supervectors Using Phonological Posteriors and Diadochokinetic Exercises

Gabriel F. Miller, Juan Camilo Vásquez-Correa, Elmar Nöth

Parkinson's disease (PD) is a neuro-degenerative disorder that produces symptoms such as tremor, slowed movement, and a lack of coordination. One of the earliest indicators is a combination of different speech impairments called hypokinetic dysarthria. Some indicators that are prevalent in the speech of Parkinson's patients include imprecise production of stop consonants, vowel articulation impairment and reduced loudness. In this paper, we examine those features using phonological posterior probabilities obtained via parallel bidirectional recurrent neural networks. We also utilize information such as the velocity and acceleration curve of the signal envelope, and the peak amplitude slope and variance to model the quality of pronunciation for a given speaker. With our feature set, we train Gaussian Mixture Model based Universal Background Models for a set of training speakers and adapt a model for each individual speaker using a form of Bayesian adaptation. With the parameters describing each speaker model, we train SVM and Random Forest classifiers to discriminate between PD patients and Healthy Controls (HC), and to determine the severity of dysarthria for each speaker compared with ratings assessed by expert phoneticians.

#991: At Home with Alexa: A Tale of Two Conversational Agents

Jennifer Ureta, Celina Iris Brito, Jilyan Bianca Dy, Kyle-Althea Santos, Winfred Villaluna, Ethel Ong

Voice assistants in mobile devices and smart speakers offer the potential of conversational agents as storytelling peers of children, especially those who may have limited proficiency in spelling and grammar. Despite their prevalence, however, the built-in automatic speech recognition features of voice interfaces have been shown to perform poorly on children’s speech, which may affect child-agent interaction. In this paper, we describe our experiments in deploying a conversational storytelling agent on two popular commercial voice interfaces - Google Assistant and Amazon Alexa. Through post-validation feedback from children and analysis of the captured conversation logs, we compare the challenges encountered by children when sharing their stories with these voice assistants. We also used the Bilingual Evaluation Understudy to provide a quantitative assessment of the text-to-speech transcription quality. We found that voice assistants’ short waiting time and the frequent yet misplaced interruptions during pauses disrupt the thinking process of children. Furthermore, disfluencies and grammatical errors that naturally occur in children’s speech affected the transcription quality.

#1029: Attention to Emotions: Detecting Mental Disorders in Social Media

Mario Ezra Aragón, A. Pastor López-Monroy, Luis C. González, Manuel Montes-y-Gómez

Different mental disorders affect millions of people around the world, causing significant distress and interference with their daily lives. Currently, the increased usage of social media platforms, where people share personal information about their day and problems, opens up new opportunities to actively detect these problems. We present a new approach for the detection of depression and anorexia, inspired by the modeling of fine-grained emotions expressed by the users and by deep learning architectures with attention mechanisms. With this approach, we improved the results over traditional and deep learning techniques. The use of attention mechanisms helps to capture the important sequences of fine-grained emotions that represent users with mental disorders.

#1084: Authorship Verification with Personalized Language Models

Milton King, Paul Cook

Malicious posts from a social media account by an unauthorized user could have severe effects for the account holder, such as the loss of a job or damage to their reputation. In this work, we consider an authorship verification task to detect unauthorized malicious social media posts. We propose a novel approach for authorship verification based on personalized, i.e., user-tailored, language models. We evaluate our proposed approach against a previous approach based on word embeddings and a one-class SVM. A large amount of text might not necessarily be available for an individual social media user. We therefore demonstrate that our proposed approach out-performs previous approaches, while requiring orders of magnitude less user-specific training text.

#1045: Automatic Correction of i/y Spelling in Czech ASR Output

Jan Švec, Jan Lehečka, Luboš Šmídl, Pavel Ircing

This paper concentrates on the design and evaluation of a method that automatically corrects the spelling of i/y in Czech words at the output of the ASR decoder. After an analysis of both the Czech grammar rules and the data, we decided to deal only with endings consisting of the consonants b/f/l/m/p/s/v/z followed by i/y in both short and long forms. The correction is framed as a classification task where the word can belong to the "i" class, the "y" class or the "empty" class. Using the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) architecture, we were able to substantially improve the correctness of the i/y spelling on both the simulated and the real ASR output. Since the misspelling of i/y in Czech texts is seen by the majority of native Czech speakers as a blatant error, the corrected output greatly improves the perceived quality of the ASR system.

#1047: Combining Cross-Lingual and Cross-Task Supervision for Zero-shot Learning

Matúš Pikuliak, Marián Šimko

In this work we combine cross-lingual and cross-task supervision for zero-shot learning. Our main contribution is the finding that coupling models, i.e. models that share neither a task nor a language with the zero-shot target model, can improve the results significantly. Coupling models serve as a regularization for the other auxiliary models that provide direct cross-lingual and cross-task supervision. We conducted a series of experiments with four Indo-European languages and four tasks (dependency parsing, language modeling, named entity recognition and part-of-speech tagging) in various settings. We were able to achieve a 32% error reduction compared to using cross-lingual supervision only.

#1085: Complexity of the TDNN Acoustic Model with Respect to the HMM Topology

Josef V. Psutka, Jan Vaněk, Aleš Pražák

In this paper, we discuss some of the properties of training acoustic models using a lattice-free version of the maximum mutual information criterion (LF-MMI). Currently, the LF-MMI method achieves state-of-the-art results on many speech recognition tasks. Some of the key features of the LF-MMI approach are: training a DNN without initialization from a cross-entropy system, the use of a 3-fold reduced frame rate and the use of a simpler HMM topology. In a typical LF-MMI training procedure, the conventional 3-state HMM topology is replaced with a special 1-stage HMM topology that has different pdfs on the self-loop and forward transitions. We discuss both the different types of HMM topologies (conventional 1-, 2- and 3-state HMM topology) and the advantages of using biphone context modeling over the original triphone or a simpler monophone context. We also discuss the impact of the subsampling factor on WER.

#1005: ConfNet2Seq: Full Length Answer Generation from Spoken Questions

Vaishali Pal, Manish Shrivastava, Laurent Besacier

Conversational and task-oriented dialogue systems aim to interact with the user using natural responses through multi-modal interfaces, such as text or speech. These desired responses are in the form of full-length natural answers generated over facts retrieved from a knowledge source. While the task of generating natural answers to questions from an answer span has been widely studied, there has been little research on natural sentence generation over spoken content. We propose a novel system to generate full-length natural language answers from spoken questions and factoid answers. The spoken sequence is compactly represented as a confusion network extracted from a pre-trained Automatic Speech Recognizer. To the best of our knowledge, this is the first attempt at generating full-length natural answers from a graph input (confusion network). We release a large-scale dataset of 259,788 samples of spoken questions, their factoid answers and corresponding full-length textual answers. Following our proposed approach, we achieve performance comparable to that of the best ASR hypothesis.

#1071: Context-Aware XGBoost for Glottal Closure Instant Detection in Speech Signal

Jindřich Matoušek, Michal Vraštil

In this paper, we continue to investigate the use of classifiers for the automatic detection of glottal closure instants (GCIs) in the speech signal. We introduce context to extreme gradient boosting (XGBoost) and show that the context-aware XGBoost outperforms its context-free version. The proposed context-aware XGBoost is also shown to outperform traditionally used GCI detection algorithms on publicly available databases.

#1077: ConversIAmo: Improving Italian Question Answering Exploiting IBM Watson Services

Chiara Leoni, Ilaria Torre, Gianni Vercelli

Chatbots, conversational interfaces and NLP have achieved considerable improvements and are spreading more and more in everyday applications. Solutions on the market allow them to be implemented easily in different languages, but the offerings for the Italian language are not as effective as the English ones. This paper introduces ConversIAmo, the prototype of a conversational agent which implements a question answering system in Italian on a closed domain concerning artificial intelligence, taking the answers from online articles. This system integrates IBM services (Watson Assistant, Discovery and Natural Language Understanding) with functions developed within ConversIAmo and Tint, an open-source tool for the analysis of the Italian language. Our QA pipeline gave better results than those obtained using the Watson Discovery service on its own in terms of precision, F1-score and correct answer ranking (on average +12%, +21% and +20% respectively). Our main contribution is to address the need for an effective but easy-to-apply method for improving the performance of IBM Watson services for the Italian language. In addition, the AI domain is a new one for an Italian conversational agent.

#1075: Costra 1.1: An Inquiry into Geometric Properties of Sentence Spaces

Petra Barančíková, Ondřej Bojar

In this paper, we present a new dataset for testing geometric properties of sentence embedding spaces. In particular, we concentrate on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense or generalization. The dataset is a direct expansion of Costra 1.0, which we extended with more sentences and sentence comparisons. We show that available off-the-shelf embeddings do not possess essential attributes such as having synonymous sentences embedded closer to each other than sentences with a significantly different meaning. On the other hand, some embeddings appear to capture the linear order of sentence aspects such as style (formality and simplicity of the language) or time (past to future).

#1058: Cross-Lingual Transfer for Hindi Discourse Relation Identification

Anirudh Dahiya, Dr. Manish Shrivastava, Dr. Dipti Misra Sharma

Discourse relations between two textual spans in a document attempt to capture the coherent structure which emerges in language use. Automatic classification of these relations remains a challenging task, especially in the case of implicit discourse relations, where there is no explicit textual cue which marks the discourse relation. In low-resource languages, this motivates the exploration of transfer learning approaches, and in particular of cross-lingual techniques for discourse relation classification. In this work, we explore various cross-lingual transfer techniques on the Hindi Discourse Relation Bank (HDRB), a Penn Discourse Treebank styled dataset for discourse analysis in Hindi, and observe performance gains in both zero-shot and finetuning settings on the Hindi Discourse Relation Classification task. To the best of our knowledge, this is the first effort towards exploring transfer learning for Hindi discourse relation classification.

#1038: Developing Resources for te reo Maori Text To Speech Synthesis System

Jesin James, Isabella Shields, Rebekah Berriman, Peter J. Keegan, and Catherine I. Watson

Te reo Māori (the Māori language of New Zealand) is an under-resourced language in terms of the availability of speech corpora and the resources needed to develop robust speech technology. Māori is an endangered indigenous language which has been subject to internationally well-known revitalisation efforts since the late 1970s. The Māori community recognises the need for developing speech technology tools for the language, which will improve its study and usage in wider and more digital contexts. This paper describes the development of speech resources in Māori to build one of the first Text To Speech synthesis systems for the language. A speech corpus, an extended dictionary and a parametric speech synthesiser are the main contributions of the study. To develop these resources, text processing, segmentation and alignment, and the creation of letter-to-sound rules were also carried out using existing resources that were modified for Māori. The acoustic similarity of synthesised speech and natural speech was measured to evaluate the speech synthesis system statistically. The future work required is described.

#1027: Diversification of Serbian-French-English-Spanish Parallel Corpus ParCoLab with Spoken Language Data

Dušica Terzić, Saša Marjanović, Dejan Stosic, Aleksandra Miletic

In this paper we present the efforts to diversify the Serbian-French-English-Spanish corpus ParCoLab. ParCoLab is a project led by the CLLE research unit (UMR 5263 CNRS) at the University of Toulouse, France, and the Romance Department at the University of Belgrade, Serbia. The main goal of the project is to create a freely searchable and widely applicable multilingual resource with Serbian as the pivot language. Initially, the majority of the corpus texts represented written language. Since diversity of text types contributes to the usefulness and applicability of a parallel corpus, a great deal of effort has been made to include spoken language data in the ParCoLab database. Transcripts and translations of TED talks, films and cartoons have been included so far, along with transcripts of original Serbian films. Thus, the 17.6M-word database of mainly literary texts has been extended with spoken language data and it now contains 32.9M words.

#1048: EPIE Dataset: A Corpus for Possible Idiomatic Expressions

Prateek Saxena, Soma Paul

Idiomatic expressions have always been a bottleneck for language comprehension and natural language understanding, specifically for tasks like Machine Translation (MT). MT systems predominantly produce literal translations of idiomatic expressions, as these expressions do not exhibit generic and linguistically deterministic patterns which can be exploited for comprehension of their non-compositional meaning. These expressions occur in parallel corpora used for training, but due to the comparatively high occurrence of the constituent words of idiomatic expressions in literal contexts, the idiomatic meaning gets overpowered by the compositional meaning of the expression. State-of-the-art Metaphor Detection Systems are able to detect non-compositional usage at word level but miss out on idiosyncratic phrasal idiomatic expressions. This creates a dire need for a dataset with a wider coverage and higher occurrence of commonly occurring idiomatic expressions, the spans of which can be used for Metaphor Detection. With this in mind, we present our English Possible Idiomatic Expressions (EPIE) corpus containing 25,206 sentences labelled with lexical instances of 717 idiomatic expressions. These spans also cover literal usages for the given set of idiomatic expressions. We also demonstrate the utility of our dataset by using it to train a sequence labelling module and testing it on three independent datasets with high accuracy, precision and recall scores.

#1072: Employing Sentence Context in Czech Answer Selection

Marek Medveď, Radoslav Sabol, Aleš Horák

Question answering (QA) for non-mainstream languages requires specific adaptations of the current methods, which have been tested primarily with very large English resources. In this paper, we present the results of improving the QA answer selection task by extending the input candidate sentence with selected information from the preceding sentence context. The described model represents the best published answer selection model for the Czech language as an example of a morphologically rich language. The text contains a thorough evaluation of the new method, including model hyperparameter combinations and a detailed error discussion. The winning models have improved the previous best results by 4%, reaching a mean average precision of 82.91%.
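
For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean average precision is commonly computed for answer selection. It assumes each question comes with a ranked list of candidate sentences marked as relevant (1) or not (0); the function names and toy data are hypothetical and not taken from the paper.

```python
def average_precision(relevance):
    """Average precision for one ranked candidate list.

    `relevance` holds 1 for a correct answer sentence and 0 otherwise,
    in the order in which the model ranked the candidates.
    """
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """Mean of the per-question average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy questions: the correct sentence is ranked 1st and 2nd respectively.
score = mean_average_precision([[1, 0, 0], [0, 1]])
```

With these toy rankings the per-question values are 1.0 and 0.5, so `score` is 0.75.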

#999: Evaluating a Multi-sense Definition Generation Model for Multiple Languages

Arman Kabiri, Paul Cook

Most prior work on definition modelling has not accounted for polysemy, or has done so by considering definition modelling for a target word in a given context. In contrast, in this study, we propose a context-agnostic approach to definition modelling, based on multi-sense word embeddings, that is capable of generating multiple definitions for a target word. In further contrast to most prior work, which has primarily focused on English, we evaluate our proposed approach on fifteen different datasets covering nine languages from several language families. To evaluate our approach we consider several variations of BLEU. Our results demonstrate that our proposed multi-sense model outperforms a single-sense model on all fifteen datasets.

#1026: Experimenting with Different Machine Translation Models in Medium-Resource Settings

Haukur Páll Jónsson, Haukur Barri Símonarson, Vésteinn Snæbjarnarson, Steinþór Steingrímsson, and Hrafn Loftsson

State-of-the-art machine translation (MT) systems rely on the availability of large parallel corpora, containing millions of sentence pairs. For the Icelandic language, the parallel corpus ParIce exists, consisting of about 3.6 million English-Icelandic sentence pairs. Given that parallel corpora for low-resource languages typically contain sentence pairs in the tens or hundreds of thousands, we classify Icelandic as a medium-resource language for MT purposes. In this paper, we present on-going experiments with different MT models, both statistical and neural, for translating English to Icelandic based on ParIce. We describe the corpus and the filtering process used for removing noisy segments, the different models used for training, and the preliminary automatic and human evaluation. We find that, while using an aggressive filtering approach, the most recent neural MT system (Transformer) performs best, obtaining the highest BLEU score and the highest fluency and adequacy scores from human evaluation for in-domain translation. Our work could be beneficial to other languages for which a similar amount of parallel data is available.

#1060: FinEst BERT and CroSloEngual BERT: Less is More in Multilingual Models

Matej Ulčar, Marko Robnik-Šikonja

Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. Research has mostly focused on the English language, though. While massively multilingual models exist, studies have shown that monolingual models produce much better results. We train two trilingual BERT-like models, one for Finnish, Estonian, and English, the other for Croatian, Slovenian, and English. We evaluate their performance on several downstream tasks (NER, POS-tagging, and dependency parsing), using multilingual BERT and XLM-R as baselines. The newly created FinEst BERT and CroSloEngual BERT improve the results on all tasks in most monolingual and cross-lingual situations.

#1074: Grammatical Parallelism of Russian Prepositional Localization and Temporal Constructions

Victor Zakharov, Irina Azarova

In this paper we present a part of a corpus-driven semantico-grammatical ontological description of Russian prepositional constructions. The main problem of a prepositional ontology is its inner controversy: the ontological structure presupposes logical analysis of concepts, whereas prepositions are usually interpreted as non-lexical grammatical language elements. In our understanding, this is an ontology of lexico-grammatical relations that are implemented in prepositional constructions. We demonstrate the ontological structure for semantic rubrics of temporal and locative syntaxemes extracted through the elaborated technique for processing corpus statistics of prepositional constructions in modern Russian texts. Common and contrastive traits between these two topmost semantic domains are shown.

#1055: Graph Convolutional Networks for Student Answers Assessment

Nisrine Ait Khayi, Vasile Rus

Graph Convolutional Networks have achieved impressive results in multiple NLP tasks such as text classification. However, this approach has not yet been explored for the student answer assessment task. In this work, we propose to use Graph Convolutional Networks to automatically assess freely generated student answers within the context of dialogue-based intelligent tutoring systems. We convert this task to a node classification task. First, we build a DTGrade graph where each node represents the concatenation of the student answer and its corresponding reference answer, whereas the edges represent the relatedness between nodes. Second, the DTGrade graph is fed to two layers of Graph Convolutional Networks. Finally, the output of the second layer is fed to a softmax layer. The empirical results showed that our model reached state-of-the-art results by obtaining an accuracy of 73%.
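
The pipeline described above (graph, two graph convolutional layers, softmax over node classes) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the widely used symmetric normalization of the adjacency matrix with self-loops and a ReLU after the first layer, neither of which is stated in the abstract, and all function names, weights and the toy graph are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(m):
    return [[max(0.0, x) for x in row] for row in m]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_forward(adj, features, w1, w2):
    """Two graph convolutional layers followed by a softmax over classes."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(matmul(a_norm, features), w1))
    logits = matmul(matmul(a_norm, hidden), w2)
    return softmax_rows(logits)


# Toy graph: 3 nodes (answer pairs), 2 input features, 2 output classes.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w1 = [[0.5, -0.2], [0.1, 0.3]]
w2 = [[0.4, -0.1], [-0.3, 0.2]]
probs = gcn_forward(adj, features, w1, w2)
```

Each row of `probs` is a probability distribution over the classes for one node; in practice the weights would be learned from labelled nodes rather than fixed by hand.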

#1022: Inserting Punctuation in a Real-Time Production Environment

Pavel Hlubík, Martin Španěl, Marek Boháč, Lenka Weingartová

The output of a speech recognition system is a continuous stream of words that has to be post-processed in various ways, out of which punctuation insertion is an essential step. Punctuated text is far more comprehensible to the reader, can be used for subtitling, and is necessary for further NLP processing, such as machine translation. In this article, we describe how state-of-the-art results in the field of punctuation restoration can be utilized in a production-ready business environment in the Czech language. A recurrent neural network based on long short-term memory is employed, making use of various features: textual based on pre-trained word embeddings, prosodic (mainly temporal), morphological, noise information, and speaker diarization. All the features except morphological tags were found to improve our baseline system. As we work in a real-time setup, it is not possible to employ information from the future of the word stream, yet we achieve significant improvements using LSTM. The usage of RNN also allows the model to learn longer dependencies than any n-gram-based language model can, which we find essential for the insertion of question marks. The deployment of an RNN-based model thus leads to a relative 22.6% decrease in punctuation errors and improvement in all metrics but one.

#1040: Interpreting Word Embeddings Using a Distribution Agnostic Approach Employing Hellinger Distance

Tamás Ficsor, Gábor Berend

Word embeddings can encode semantic and syntactic features and have achieved many recent successes in solving NLP tasks. Despite their successes, it is not trivial to directly extract lexical information out of them. In this paper, we propose a transformation of the embedding space to a more interpretable one using the Hellinger distance. We additionally suggest a distribution-agnostic approach using Kernel Density Estimation. A method is introduced to measure the interpretability of the word embeddings. Our results suggest that Hellinger-based calculation gives a 1.35% improvement on average over the Bhattacharyya distance in terms of interpretability and adapts better to unknown words.
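
For readers unfamiliar with the two distances compared here, the sketch below gives their standard definitions for discrete probability distributions. It does not reproduce the paper's embedding transformation or its kernel density estimation step; the function names and example distributions are illustrative only.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Overlap of two discrete distributions; 1.0 means they are identical."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return min(1.0, bc)  # guard against floating-point overshoot

def hellinger_distance(p, q):
    """Hellinger distance, always between 0 and 1."""
    return math.sqrt(1.0 - bhattacharyya_coefficient(p, q))

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance; infinite when the distributions do not overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(hellinger_distance(p, q), bhattacharyya_distance(p, q))
```

Both are derived from the same overlap coefficient. The practical difference is that the Hellinger distance is bounded, while the Bhattacharyya distance grows without limit as the overlap shrinks.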

#986: Introduction of Semantic Model to Help Speech Recognition

Stephane Level, Irina Illina, Dominique Fohr

Current Automatic Speech Recognition (ASR) systems mainly take into account acoustic, lexical and local syntactic information. Long-term semantic relations are not used. The performance of ASR systems decreases significantly when the training and testing conditions differ, for example due to noise. In this case the acoustic information can be less reliable. To help a noisy ASR system, we propose to supplement it with a semantic module. This module re-evaluates the N-best speech recognition hypothesis list and can be seen as a form of adaptation in the context of noise. For the words in the processed sentence that could have been poorly recognized, this module chooses words that correspond better to the semantic context of the sentence. To achieve this, we introduced the notions of a context part and possibility zones that measure the similarity between the semantic context of the document and the corresponding possible hypothesis. The proposed methodology uses two continuous representations of words: word2vec and FastText. We conduct experiments on the publicly available TED conferences dataset (TED-LIUM) mixed with real noise. The proposed method achieves a significant improvement of the word error rate (WER) over the ASR system without semantic information.

#983: Investigating the Corpus Independence of the Bag-of-Audio-Words Approach

Mercedes Vetráb, Gábor Gosztolya

In this paper, we analyze the general use of the Bag-of-Audio-Words (BoAW) feature extraction method. This technique allows us to handle the problem of varying length recordings. The first step of the BoAW method is to define cluster centers (called codewords) over our feature set with an unsupervised training method (such as k-means clustering or even random sampling). This step is normally performed on the training set of the actual database, but this approach has its own drawbacks: we have to create new codewords for each data set and this increases the computing time and it can lead to over-fitting. Here, we analyse how much the codebook depends on the given corpus. In our experiments, we work with three databases: a Hungarian emotion database, a German emotion database and a general Hungarian speech database. We experiment with constructing a set of codewords on each of these databases, and examine how the classification accuracy scores vary on the Hungarian emotion database. According to our results, the classification performance was similar in each case, which suggests that the Bag-of-Audio-Words codebook is practically corpus-independent. This corpus-independence allows us to reuse codebooks created on different datasets, which can make it easier to use the BoAW method in practice.
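
A minimal sketch of the two BoAW steps described above, using random sampling to build the codebook (one of the options the abstract mentions). The frame values, codebook size and function names are assumptions made for illustration. The point is that recordings of different lengths map to histograms of the same size, and that the codebook can come from any corpus.

```python
import random

def sample_codebook(frames, size, seed=0):
    """Pick `size` frames at random to serve as codewords."""
    return random.Random(seed).sample(frames, size)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def bag_of_audio_words(frames, codebook):
    """Assign every frame to its nearest codeword; return a normalised histogram."""
    if not frames:
        raise ValueError("recording has no frames")
    counts = [0] * len(codebook)
    for frame in frames:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(frame, codebook[k]))
        counts[nearest] += 1
    return [c / len(frames) for c in counts]

# Hypothetical 2-dimensional frame-level features.
other_corpus = [(0.0, 0.0), (10.0, 10.0), (0.2, 0.1), (9.8, 10.3)]
codebook = sample_codebook(other_corpus, size=2)

short_recording = [(0.1, 0.2), (9.5, 10.1)]
long_recording = [(0.1, 0.2), (0.3, 0.1), (0.0, 0.4), (9.9, 9.7)]
print(bag_of_audio_words(short_recording, codebook))
print(bag_of_audio_words(long_recording, codebook))
```

The resulting fixed-length vectors can be passed to any standard classifier, whichever corpus supplied the codewords.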

#1064: Investigating the Impact of Pre-trained Word Embeddings on Memorization in Neural Networks

Aleena Thomas, David Ifeoluwa Adelani, Ali Davody, Aditya Mogadala, Dietrich Klakow

The sensitive information present in the training data poses a privacy concern for applications, as its unintended memorization during training can make models susceptible to membership inference and attribute inference attacks. In this paper, we investigate this problem in various pre-trained word embeddings (GloVe, ELMo and BERT) with the help of language models built on top of them. In particular, sequences containing sensitive information, such as a single-word disease and a 4-digit PIN, are first randomly inserted into the training data; then a language model is trained using word vectors as input features, and memorization is measured with a metric termed exposure. The embedding dimension, the number of training epochs, and the length of the secret information were observed to affect memorization in pre-trained embeddings. Finally, to address the problem, differentially private language models were trained to reduce the exposure of sensitive information.
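
The abstract does not define exposure. The sketch below assumes the rank-based definition commonly used in work on unintended memorization: the base-2 logarithm of the size of the candidate space minus the base-2 logarithm of the inserted secret's rank, with candidates ranked by model score. Whether the paper uses exactly this formulation is an assumption, and the scores here are synthetic.

```python
import math

def exposure(secret_score, candidate_scores):
    """Rank-based exposure of an inserted secret.

    `candidate_scores` holds a model score (for example log-perplexity,
    lower meaning more likely) for every candidate in the secret space,
    including the secret itself.
    """
    rank = 1 + sum(1 for s in candidate_scores if s < secret_score)
    return math.log2(len(candidate_scores)) - math.log2(rank)

# Synthetic scores for the 10,000 possible 4-digit PINs.
scores = [float(i) for i in range(10_000)]
print(exposure(0.0, scores))      # secret is the model's top candidate
print(exposure(9_999.0, scores))  # secret is the least likely candidate
```

Under this definition, a fully memorized 4-digit PIN reaches the maximum exposure of log2(10,000), roughly 13.3 bits, while a secret the model ranks last has exposure 0.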

#1073: LSTM-Based Speech Segmentation Trained on Different Foreign Languages

Zdeněk Hanzlíček, Jakub Vít

This paper describes experiments on speech segmentation using bidirectional LSTM neural networks. The networks were trained on various languages (English, German, Russian and Czech); segmentation experiments were performed on 4 Czech professional voices. To be able to use various combinations of foreign languages, we defined a reduced phonetic alphabet based on IPA notation. It consists of 26 phones, all of which occur in all of the languages. To increase the segmentation accuracy, we applied an iterative procedure based on detection of improperly segmented data and retraining of the network. Experiments confirmed the convergence of the procedure. A comparison with a reference HMM-based segmentation with additional manual corrections was performed.

#1042: Labeling Explicit Discourse Relations using Pre-trained Language Models

Murathan Kurfalı

Labeling explicit discourse relations is one of the most challenging sub-tasks of shallow discourse parsing, where the goal is to identify the discourse connectives and the boundaries of their arguments. The state-of-the-art models achieve an F-score slightly above 45% by using hand-crafted features. The current paper investigates the efficacy of pre-trained language models in this task. We find that the pre-trained language models, when finetuned, are powerful enough to replace the linguistic features. We evaluate our model on PDTB 2.0 and report state-of-the-art results in extraction of the full relation. This is the first time a model outperforms the knowledge-intensive models without employing any linguistic features.

#1009: Leyzer: A Dataset for Multilingual Virtual Assistants

Marcin Sowanski, Artur Janicki

In this article we present the Leyzer dataset, a multilingual text corpus designed to study multilingual and cross-lingual natural language understanding (NLU) models and the strategies of localization of virtual assistants. The proposed corpus consists of 20 domains across three languages: English, Spanish and Polish, with 186 intents and a wide range of samples, ranging from 1 to 672 sentences per intent. We describe the data generation process, including creation of grammars and forced parallelization. We present a detailed analysis of the created corpus. Finally, we report the results for two localization strategies: train-on-target and zero-shot learning using multilingual BERT models.

#1059: Measuring Memorization Effect in Word-Level Neural Networks Probing

Rudolf Rosa, Tomáš Musil, David Mareček

Multiple studies have probed representations emerging in neural networks trained for end-to-end NLP tasks and examined what word-level linguistic information may be encoded in the representations. In classical probing, a classifier is trained on the representations to extract the target linguistic information. However, there is a threat of the classifier simply memorizing the linguistic labels for individual words, instead of extracting the linguistic abstractions from the representations, thus reporting false positive results. While considerable efforts have been made to minimize the memorization problem, the task of actually measuring the amount of memorization happening in the classifier has been understudied so far. In our work, we propose a simple general method for measuring the memorization effect, based on a symmetric selection of comparable sets of test words seen versus unseen in training. Our method can be used to explicitly quantify the amount of memorization happening in a probing setup, so that an adequate setup can be chosen and the results of the probing can be interpreted with a reliability estimate. We exemplify this by showcasing our method on a case study of probing for part of speech in a trained neural machine translation encoder.

#1024: Mining Local Discourse Annotation for Features of Global Discourse Structure

Lucie Poláková, Jiří Mírovský

Descriptive approaches to discourse (text) structure and coherence typically proceed either in a bottom-up or a top-down analytic way. The former ones analyze how the smallest discourse units (clauses, sentences) are connected in their closest neighbourhood, locally, in a linear way. The latter ones postulate a hierarchical organization of smaller and larger units, sometimes also represent the whole text as a tree-like graph. In the present study, we mine a Czech corpus of 50k sentences annotated in the local coherence fashion (Penn Discourse Treebank style) for indices signalling higher discourse structure. We analyze patterns of overlapping discourse relations and look into hierarchies they form. The types and distributions of the detected patterns correspond to the results for English local annotation, with patterns not complying with the tree-like interpretation at very low numbers. We also detect hierarchical organization of local discourse relations of up to 5 levels in the Czech data.

#1003: Modification of Pitch Parameters in Speech Coding for Information Hiding

Adrian Radej, Artur Janicki

The article presents a method of using the F0 parameter in speech coding to transmit hidden information. It is an improved approach, which uses interpolation of pitch parameters instead of transmitting exact original values. Using the example of the Speex codec, we describe six variants of this method, originally named HideF0, and we compare them by analyzing the capacity of the hidden channels, their detectability and the decrease in quality introduced by pitch manipulation. In particular, we perform listening tests with 20 participants to verify how perceptible the pitch manipulations are. The results are presented and discussed. We prove that minor modifications of pitch parameters are hardly perceptible, which can be used to create hidden transmission channels. One of the best proposed variants, called HideF0-FM, is shown to enable hidden transmission at a bitrate of over 120 bps with no speech quality degradation at all. Higher bitrates are also possible, with only minor quality degradation and limited detectability.

#1078: Next Step in Online Querying and Visualization of Word-Formation Networks

Jonáš Vidra, Zdeněk Žabokrtský

In this paper, we introduce a new and improved version of DeriSearch, a search engine and visualizer for word-formation networks. Word-formation networks are datasets that express derivational, compounding and other word-formation relations between words. They are usually expressed as directed graphs, in which nodes correspond to words and edges to the relations between them. Some networks also add other linguistic information, such as morphological segmentation of the words or identification of the processes expressed by the relations. Networks for morphologically rich languages with productive derivation or compounding have large connected components, which are difficult to visualize. For example, in the network for Czech, DeriNet 2.0, connected components over 500 words large contain 1/8 of the vocabulary, including its most common parts. In the network for Latin, Word Formation Latin, over 10 000 words (1/3 of the vocabulary) are in a single connected component. With the recent release of the Universal Derivations collection of word-formation networks for several languages, there is a need for a searching and visualization tool that would allow browsing such complex data.
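
The component sizes quoted above are a plain graph statistic. As a sketch, the code below treats each word-formation relation as an undirected edge and measures connected components with a breadth-first search. The toy English edges are invented for illustration and are not taken from DeriNet or Word Formation Latin.

```python
from collections import defaultdict, deque

def connected_component_sizes(edges):
    """Return component sizes, largest first, ignoring edge direction."""
    neighbours = defaultdict(set)
    for base, derived in edges:
        neighbours[base].add(derived)
        neighbours[derived].add(base)

    seen = set()
    sizes = []
    for start in list(neighbours):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            word = queue.popleft()
            size += 1
            for other in neighbours[word]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical (base, derived) pairs.
edges = [
    ("read", "reader"),
    ("read", "reading"),
    ("reader", "readership"),
    ("write", "writer"),
]
print(connected_component_sizes(edges))
```

Words that take part in no relation do not appear in the edge list, so they would have to be counted separately as components of size one.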

#1054: On the Effectiveness of Neural Text Generation based Data Augmentation for Recognition of Morphologically Rich Speech

Balázs Tarján, György Szaszák, Tibor Fegyó, Péter Mihajlik

Advanced neural network models have penetrated Automatic Speech Recognition (ASR) in recent years. In language modeling, however, many systems still rely on traditional Back-off N-gram Language Models (BNLM) partly or entirely. The reasons for this are the high cost and complexity of training and using neural language models, which is mostly possible only by adding a second decoding pass (rescoring). In our recent work we have significantly improved the online performance of a conversational speech transcription system by transferring knowledge from a Recurrent Neural Network Language Model (RNNLM) to the single pass BNLM with text generation based data augmentation. In the present paper we analyze the amount of transferable knowledge and demonstrate that the neural augmented LM (RNN-BNLM) can help to capture almost 50% of the knowledge of the RNNLM while dropping the second decoding pass and making the system real-time capable. We also systematically compare word and subword LMs and show that subword-based neural text augmentation can be especially beneficial in under-resourced conditions. In addition, we show that by using the RNN-BNLM in the first pass followed by a neural second pass, offline ASR results can be improved even further.

#1015: Perceived Length of Czech High Vowels in Relation to Formant Frequencies Evaluated by Automatic Speech Recognition

Tomáš Bořil, Jitka Veroňková

Recent studies measured significant differences in formant values in the production of short and long high vowel pairs in the Czech language. Perceptual impacts of such findings were confirmed by listening tests, which showed that perceived vowel length is influenced by formant values related to tongue position. Non-native speakers of Czech may experience difficulties in communication when they interchange the vowel length in words, which may lead to a completely different meaning of the message. This paper analyses the perception of two-syllable words with manipulated duration and formant frequencies of high vowels i/i: or u/u: in the first syllable using an automatic speech recognition (ASR) system. Such a procedure makes it possible to set a fine resolution in the range of examined factors. Our study confirms that the formant values have a substantial impact on the perception of high vowels' length by ASR, comparable to mean values obtained from listening tests performed on a group of human participants.

#1014: Phonetic Attrition in Vowels' Quality in L1 Speech of Late Czech-French Bilinguals

Marie Hévrová, Tomáš Bořil, Barbara Koepke

This study examines phonetic attrition of the first language (L1) affected by the second language (L2) in Czech speakers living in Toulouse (late Czech-French bilinguals -- CF). We compared the production of vowels by 13 CF and 13 Czech monolinguals living in the Central Bohemian Region (C). CF had been living in France for at least one year and started to learn French when they were more than 6 years old. Both C and CF were speakers of Common Czech. We recorded their production in a reading task and in semi-spontaneous speech and performed measurements of vowel formants. Results show a statistically significant difference between F1 of CF [a:] and F1 of C [a:], and between F3 of CF [i:] and F3 of C [i:]. These findings are discussed in relation to the perceptual approach suggesting that several vowels can be perceived as different in C and CF production.

#1030: Quantitative Analysis of the Morphological Complexity of Malayalam Language

Kavya Manohar, A. R. Jayan, Rajeev Rajan

This paper presents a quantitative analysis of the morphological complexity of the Malayalam language. Malayalam is a Dravidian language spoken in India, predominantly in the state of Kerala, with about 38 million native speakers. Malayalam words undergo inflection, derivation and compounding, leading to an infinitely extending lexicon. In this work, the morphological complexity of Malayalam is quantitatively analysed on a text corpus containing 8 million words. The analysis is based on the parameters type-token growth rate (TTGR), type-token ratio (TTR) and moving average type-token ratio (MATTR). The values of the parameters obtained in the current study are compared to those of other morphologically complex languages.
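
Two of the measures named above have simple standard definitions, sketched here on a toy token list. TTR is the number of distinct word forms divided by the number of tokens, and MATTR averages the TTR over every window of a fixed length. The window size and the sample tokens are assumptions for illustration. TTGR is left out because the abstract does not say how it is computed.

```python
def type_token_ratio(tokens):
    """Distinct word forms (types) divided by running words (tokens)."""
    if not tokens:
        raise ValueError("empty token list")
    return len(set(tokens)) / len(tokens)

def moving_average_ttr(tokens, window=500):
    """Mean TTR over all windows of `window` consecutive tokens."""
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

tokens = "a b a c a b".split()
print(type_token_ratio(tokens))
print(moving_average_ttr(tokens, window=3))
```

Plain TTR falls as a text gets longer because frequent words keep recurring. Averaging over fixed-size windows keeps MATTR comparable across corpora of different sizes.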

#1080: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer

Kateřina Macková, Milan Straka

Reading comprehension is a well-studied task, with huge training datasets in English. This work focuses on building reading comprehension systems for Czech, without requiring any manually annotated Czech training data. First of all, we automatically translated the SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data, which we release at http://hdl.handle.net/11234/1-3249. We then trained and evaluated several BERT and XLM-RoBERTa baseline models. However, our main focus lies in cross-lingual transfer models. We report that an XLM-RoBERTa model trained on English data and evaluated on Czech achieves very competitive performance, only approximately 2 percentage points worse than a model trained on the translated Czech data. This result is extremely good, considering the fact that the model has not seen any Czech data during training. The cross-lingual transfer approach is very flexible and provides reading comprehension in any language for which we have enough monolingual raw texts.

#1021: Recognizing Preferred Grammatical Gender in Russian Anonymous Online Confessions

Anton Alekseev, Sergey Nikolenko

We present annotation results for a dataset of public anonymous online confessions in Russian ("Overheard/Podslushano" group in VKontakte, posts tagged #family). Unlike many other cases with online social network data, intentionally anonymous posts do not contain any explicit metadata such as age or gender. We consider the problem of predicting the author's preferred grammatical gender for self-reference, a problem that proved to be surprisingly hard and not reducible to simple morphological analysis. We describe an expert labeling of a dataset for this problem, show the findings of predictive analysis, and introduce rule-based and machine learning approaches.

#1044: Registering Historical Context for Question Answering in a Blocks World Dialogue System

Benjamin Kane, Georgiy Platonov, Lenhart Schubert

Task-oriented dialogue-based spatial reasoning systems need to maintain history of the world/discourse states in order to convey that the dialogue agent is mentally present and engaged with the task, as well as to be able to refer to earlier states, which may be crucial in collaborative planning (e.g., for diagnosing a past misstep). We approach the problem of spatial memory in a multi-modal spoken dialogue system capable of answering questions about interaction history in a physical blocks world setting. We employ a pipeline consisting of a vision system, speech I/O mediated by an animated avatar, a dialogue system that robustly interprets queries, and a constraint solver that derives answers based on 3D spatial modelling. The contributions of this work include a semantic parser competent in this domain and a symbolic dialogue context allowing for interpreting and answering free-form historical questions using world and discourse history.

#1082: Semi-supervised Induction of Morpheme Boundaries in Czech using a Word-formation Network

Jan Bodnár, Zdeněk Žabokrtský, Magda Ševčíková

This paper deals with automatic morphological segmentation of Czech lemmas contained in the word-formation network DeriNet. Capturing derivational relations between base and derived lemmas, and segmenting lemmas into sequences of morphemes are two closely related formal models of how words come into existence. Thus we propose a novel segmentation method that benefits from the existence of the network; our solution constitutes a new state of the art for the Czech language.

#1070: Speaker-Dependent BiLSTM-Based Phrasing

Phrase boundary detection is an important part of text-to-speech systems since it ensures more natural speech synthesis outputs. However, the problem of phrasing is ambiguous, especially per speaker and per style. This is the reason why this paper focuses on speaker-dependent phrasing for the purposes of speech synthesis, using a neural network model with a speaker code. We also describe results of a listening test focused on incorrectly detected breaks because it turned out that some mistakes could be actually fine, not wrong.

#1012: Synthesising Expressive Speech: Which Synthesiser for VOCAs?

Jan-Oliver Wülfing, Chi Tai Dang, Elisabeth André

In the context of people with complex communication needs who depend on Voice Output Communication Aids, the ability of speech synthesisers to convey not only sentences, but also emotions would be a great enrichment. The latter is essential and very natural in interpersonal speech communication. Hence, we are interested in the expressiveness of speech synthesisers and their perception. We present the results of a study in which 82 participants listened to different synthesised sentences with different emotional contours from three synthesisers. We found that participants' ratings on expressiveness and naturalness indicate that the synthesiser CereVoice performs better than the other synthesisers.

#993: Towards Automated Assessment of Stuttering and Stuttering Therapy

Sebastian P. Bayerl, Florian Hönig, Joëlle Reister, Korbinian Riedhammer

Stuttering is a complex speech disorder that can be identified by repetitions, prolongations of sounds, syllables or words and blocks while speaking. Severity assessment is usually done by a speech therapist. Although attempts at automated assessment have been made, it is rarely used in therapy. Common methods for the assessment of stuttering severity include percent stuttered syllables (% SS), the average of the three longest stuttering symptoms during a speech task or the recently introduced Speech Efficiency Score (SES). This paper introduces the Speech Control Index (SCI), a new method to evaluate the severity of stuttering. Unlike SES, it can also be used to assess therapy success for fluency shaping. We evaluate both SES and SCI on a new comprehensively labeled dataset containing stuttered German speech of clients prior to, during and after undergoing stuttering therapy. Phone alignments of an automatic speech recognition system are statistically evaluated in relation to their relative position to labeled stuttering events. The results indicate that phone length distributions differ with respect to their position in and around labeled stuttering events.

#997: Transfer Learning to Detect Parkinson's Disease from Speech in Different Languages Using Convolutional Neural Networks with Layer Freezing

Cristian David Rios-Urrego, Juan Camilo Vásquez-Correa, Juan Rafael Orozco-Arroyave, Elmar Nöth

Parkinson's Disease is a neurodegenerative disorder characterized by motor symptoms such as resting tremor, bradykinesia, rigidity and freezing of gait. The most common symptom in speech is called hypokinetic dysarthria, where speech is characterized by monotone intensity, low pitch variability and poor prosody that tends to fade at the end of the utterance. This study proposes the classification of patients with Parkinson's Disease and healthy controls in three different languages (Spanish, German, and Czech) using a transfer learning strategy. The process is further improved by freezing different consecutive layers of the architecture. We hypothesize that some convolutional layers characterize the disease and others the language. Therefore, when fine-tuning is performed in the transfer learning, it is possible to find the topology that best adapts to the target language and allows an accurate detection of Parkinson's Disease. The proposed methodology uses Convolutional Neural Networks trained with Mel-scale spectrograms. Results indicate that fine-tuning of the whole neural network does not provide good performance in all languages, while fine-tuning of individual layers improves the accuracy by up to 7%. In addition, the results show that transfer learning among languages improves the performance by up to 18% when compared to a base model used to initialize the weights of the network.

#1061: Verb Focused Answering from CORD-19

Elizabeth Jasmi George

At this time of a pandemic turning into an infodemic, it is important to be able to answer questions about the related research. This paper discusses a method of answering questions that leverages the syntactic structure of sentences to find the verb of action in the context corresponding to the action in the question. This method generates correct answers for many factoid questions on descriptive context passages. The proposed method finds all the sentences in the passage that have the same verb as the question or a synonym of it, processes the dependencies of the verbs obtained from the dependency parser and proceeds with further rule-based filtering for matching the other attributes of the answer span. We demonstrate this method on CORD-19 data evaluated with free form natural language questions.

#1046: Very Fast Keyword Spotting System with Real Time Factor below 0.01

Jan Nouza, Petr Červa, Jindřich Žďánský

In this paper we present an architecture of a keyword spotting (KWS) system that is based on modern neural networks, yields good performance on various types of speech data and can run very fast. We focus mainly on the last aspect and propose optimizations for all the steps required in a KWS design: signal processing and likelihood computation, Viterbi decoding, spot candidate detection and confidence calculation. We present time and memory efficient modelling by bidirectional feedforward sequential memory networks (an alternative to recurrent nets) using either standard triphones or so-called quasi-monophones, and an entirely forward decoding of speech frames (with minimal need for look back). Several variants of the proposed scheme are evaluated on 3 large Czech datasets (broadcast, internet and telephone, 17 hours in total) and their performance is compared by Detection Error Tradeoff (DET) diagrams and real-time (RT) factors. We demonstrate that the complete system can run in a single pass with a RT factor close to 0.001 if all optimizations (including a GPU for likelihood computation) are applied.
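
To make the RT factors concrete: the real-time factor is processing time divided by audio duration, so a value below 1 means faster than real time. The short sketch below applies that definition to the 17 hours of test audio mentioned above. The helper names are illustrative and the timings are derived arithmetic, not measurements reported by the authors.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration."""
    return processing_seconds / audio_seconds

def processing_time(audio_seconds, rt_factor):
    """Time needed to process audio of the given length at a given RT factor."""
    return audio_seconds * rt_factor

audio_seconds = 17 * 3600  # 17 hours of speech
for rt_factor in (0.01, 0.001):
    seconds = processing_time(audio_seconds, rt_factor)
    print(f"RT factor {rt_factor}: about {seconds:.0f} s")
```

At an RT factor of 0.01 the 17 hours would take roughly ten minutes to process, and at 0.001 roughly one minute.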

#985: Voice-Activity and Overlapped Speech Detection Using x-Vectors

J. Málek, J. Žďánský

The x-vectors are features extracted from speech signals using pretrained deep neural networks, such that they discriminate well among different speakers. Their main application lies in speaker identification and verification. This manuscript studies which other properties are encoded in x-vectors. The focus lies on distinguishing between speech signals and noise, and between utterances of a single speaker and overlapped speech. We attempt to show that the x-vector network is capable of extracting multi-purpose features, which can be used by several simple back-end classifiers. This means a common feature-extracting front-end for the tasks of voice-activity/overlapped speech detection and speaker identification. Compared to the alternative strategy, that is, training independent classifiers including feature-extracting layers for each of the tasks, the common front-end saves computational time during both the training and test phases.

#968: Automated Legal Research for German Law

Thejeswi Nagendra Kamatchi, Jelena Mitrović, Siegfried Handschuh

This demonstration will be based on a system for performing automated legal research of Civil Law. The dataset has the legal text organized according to legal code, sections, paragraphs and sentence numbers. Relevant links connecting related laws are present. Supporting information such as POS tags, parse trees, synonyms and similar words (found using Wikipedia word embeddings) is used to enrich the dataset with features. The user can input one or more simple sentences describing the case, according to which the legal case is classified to a specific part of the law. Interactive fact collection is then performed. Once enough facts are collected and particular legal texts can be matched with sufficient confidence, judgment prediction is performed. All the collected facts, matching legal text with justification and predictions are compiled into a report for the user. Future work on this system will include an argument mining system based on rhetorical relations and figures in law text.

#969: Computer model of the Tibetan language morphology

Aleksei Dobrov, Anastasia Dobrova, Pavel Grokhovskiy, Maria Smirnova, Nikolay Soms

The research describes the development of a computer model of Tibetan morphology which can be used to explain the phenomena of positional morpheme interchange in the Tibetan language. The work included the following main stages: development of a faceted classification of observed interchanges according to the type of variation, types of initials and finals of morphemes and other possible reasons; development of an object-oriented model reflecting the created classification and making it possible to automate the application of the observed rules of gradation; and development of a system of automatic regression testing of the model, which makes it possible to guarantee its compliance with linguistic material. The created computer model of Tibetan morphology was evaluated using the regression testing system, which ensures that the model conforms to the observed morphological phenomena.

#970: SMACC - text analyzer for legal assistance

Basile Audard, Elena Manishina, Joao Pedro Campello

In our daily life we regularly come across various contracts and agreements: real estate lease, electricity or mobile phone plan, insurance contract, general conditions of sale, etc. These contracts, in paper or online versions, may contain thousands of lines; reading and understanding these lines is a real challenge for many people, the language in the legal domain texts being notoriously hard to digest for non-professionals. The SMACC (Smart Contract Checker) textual analyzer was developed to address those issues. SMACC is a tool that offers legal assistance to users who are faced with a binding contract and its consequent obligations and who would like to get a better overview of the document at hand as well as an idea of its legality vis-à-vis the existing legislation in the specific legal domain.

#971: Text Embeddings Based on Synonyms

Magdalena Wiercioch

Searching for a text representation is one of the main tasks in the information retrieval domain. The choice of model has an impact on sentiment analysis, also known as opinion mining. Take, for instance, book review sentiment studies. The goal is to assess people’s opinions or emotions towards the book. Obviously, it may be applied in various fields such as recommendation systems. However, the quality of the text representation affects the performance of this type of task.

#972: An interface between the Czech valency lexicon PDT-Vallex and corpus manager KonText

Kira Droganova, Eva Fučíková, Anša Vernerová

We present a user interface between the Czech valency lexicon, PDT-Vallex, and KonText -- a web application for querying corpora available within the LINDAT/CLARIN project.
