17th International Conference on Text, Speech and Dialogue
TSD 2014, Brno, Czech Republic, September 8–12, 2014
 
TSD 2014 Paper Abstracts

Bibliography of the TSD 2014 Proceedings

PDF, BibTeX

#101: Multi-Lingual Text Leveling

Salim Roukos, Jerome Quin, Todd Ward

Determining the language proficiency level required to understand a given text is a key requirement in vetting documents for use in second language learning. In this work, we describe our approach for developing an automatic text analytic to estimate the text difficulty level using the Interagency Language Roundtable (ILR) proficiency scale. The approach we take is to use machine translation to translate a non-English document into English and then use an English language trained ILR level detector. We achieve good results in predicting ILR levels with both human and machine translation of Farsi documents. We also report results on text leveling prediction on human translations into English of documents from 54 languages.

preprint PDF


#102: An Information Extraction Customizer

Ralph Grishman, Yifan He

When an information extraction system is applied to a new task or domain, we must specify the classes of entities and relations to be extracted. This is best done by a subject matter expert, who may have little training in NLP. To meet this need, we have developed a toolset which is able to analyze a corpus and aid the user in building the specifications of the entity and relation types.

preprint PDF


#103: Entailment Graphs for Text Analytics in the Excitement Project

Bernardo Magnini, Ido Dagan, Günter Neumann, Sebastian Pado

In recent years, a significant line of research in Natural Language Processing has focused on detecting semantic relations among portions of text, including entailment, similarity, temporal relations and, to a lesser degree, causality. The attention on such semantic relations has raised the demand to move towards more informative meaning representations, which express properties of concepts and relations among them. This demand triggered research on "statement entailment graphs", where nodes are natural language statements (propositions) consisting of predicates with their arguments and modifiers, while edges represent entailment relations between nodes. We report initial research that defines the properties of entailment graphs and their potential applications. In particular, we show how entailment graphs are profitably used in the context of the European project EXCITEMENT, where they are applied to the analysis of customer interactions across multiple channels, including speech, email, chat and social media, and multiple languages (English, German, Italian).

preprint PDF


#627: A Factored Discriminative Spoken Language Understanding for Spoken Dialogue Systems

Filip Jurčíček, Ondřej Dušek, Ondřej Plátek

This paper describes a factored discriminative spoken language understanding method suitable for real-time parsing of recognised speech. It is based on a set of logistic regression classifiers, which are used to map input utterances into dialogue acts. The proposed method is evaluated on a corpus of spoken utterances from the Public Transport Information (PTI) domain. In PTI, users can interact with a dialogue system over the phone to find intra- and inter-city public transport connections and ask for the weather forecast in a desired city. The results show that in adverse speech recognition conditions, the statistical parser yields significantly better results than the well-tuned handcrafted baseline parser.

preprint PDF


#695: A Method for Parallel Non-Negative Sparse Large Matrix Factorization

Anatoly Anisimov, Oleksandr Marchenko, Emil Nasirov, Stepan Palamarchuk

This paper proposes parallel methods for non-negative sparse factorization of large matrices. The described methods are tested and compared on the processing of large matrices.

preprint PDF


#596: A Topic Model Scoring Approach for Personalized QA Systems

Hamidreza Chinaei, Luc Lamontagne, François Laviolette, Richard Khoury

To support the personalization of Question Answering (QA) systems, we propose a new probabilistic scoring approach based on the topics of the question and candidate answers. First, a set of topics of interest to the user is learned with a topic modeling approach such as Latent Dirichlet Allocation. Then, the similarity between a question asked by the user and the candidate answers returned by the search engine is estimated by calculating the probability of each candidate answer given the question. This similarity is used to re-rank the answers returned by the search engine. Our preliminary experiments show that the reranking substantially improves the performance of the QA system as measured by accuracy and MRR (mean reciprocal rank).
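A minimal sketch of the kind of topic-based rescoring described above (not the authors' implementation; the topic distributions are assumed to have been inferred with LDA beforehand, and all names and numbers are illustrative):

    import numpy as np

    def topic_score(question_topics, answer_topics):
        """Score a candidate answer by how well its topic mixture matches the
        question's topic mixture: sum over topics of P(t|question) * P(t|answer)."""
        q = np.asarray(question_topics, dtype=float)
        a = np.asarray(answer_topics, dtype=float)
        q /= q.sum()
        a /= a.sum()
        return float(np.dot(q, a))

    def rerank(question_topics, candidates):
        """Re-rank (answer_id, answer_topics) pairs returned by the search engine."""
        scored = [(aid, topic_score(question_topics, topics)) for aid, topics in candidates]
        return sorted(scored, key=lambda x: x[1], reverse=True)

    # Toy example with 4 LDA topics (all distributions are made up).
    question = [0.70, 0.10, 0.10, 0.10]
    candidates = [
        ("a1", [0.65, 0.15, 0.10, 0.10]),   # topically close to the question
        ("a2", [0.05, 0.05, 0.45, 0.45]),   # topically distant from the question
    ]
    print(rerank(question, candidates))     # "a1" is ranked first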

preprint PDF


#628: Alex: a Statistical Dialogue Systems Framework

Filip Jurčíček, Ondřej Dušek, Ondřej Plátek, Lukáš Žilka

This paper describes the Alex Dialogue Systems Framework (ADSF). The ADSF currently includes mature components for public telephone network connectivity, voice activity detection, automatic speech recognition, statistical spoken language understanding, and probabilistic belief tracking. The ADSF is used in a real-world deployment within the Public Transport Information (PTI) domain. In PTI, users can interact with a dialogue system over the phone to find intra- and inter-city public transport connections and ask for the weather forecast in a desired city. Based on user responses, the vast majority of the system's users are satisfied with its performance.

preprint PDF


#677: An Experiment with Theme--Rheme Identification

Karel Pala, Ondřej Svoboda

In this paper we start from the theory of Functional Sentence Perspective developed primarily by Firbas and Svoboda and later also by Sgall et al. We make an attempt to formulate and implement a procedure for Czech that automatically recognizes which sentence constituents carry information that is contextually dependent and thus known to the addressee (theme), which constituents contain new information (rheme), and which constituents bear non-thematic and non-rhematic information (transition). The experimental implementation of the procedure uses tools developed in the NLP Centre, FI MU, particularly the morphological analyzer Majka, the disambiguator DESAMB and the parser SET. As a starting data resource we use a small corpus of 120 Czech sentences, which at the moment does not include free continuous text. This is motivated by the fact that we do not use syntactically pre-tagged text but perform syntactic analysis directly using the parser SET. Thus, we offer only a very basic evaluation, which captures the main FSP phenomena and shows that the task is feasible. The toolset developed for the experiment consists of two parts: first, a chunker, which determines word-order positions from the parse tree of a sentence; second, an FSP tagger, which implements the procedure. It labels the chunks with tags of what are further called functional elements (e.g. theme proper, transition, rheme proper). An experimental version is available at http://nlp.fi.muni.cz/~xsvobo15/fsp/fsp.html.

preprint PDF


#629: An MLU Estimation Method for Hungarian Transcripts

György Orosz, Kinga Mátyus

Mean length of utterance (MLU) is an important indicator for measuring complexity in child language. A generally employed method for calculating MLU is to use the CLAN toolkit, which includes modules that enable the measurement of utterance length in morphemes. However, these methods are based on rules which are available for only a few languages, not including Hungarian. Therefore, in order to automatically analyze and measure Hungarian transcripts, adequate methods need to be developed. In this paper we describe a new toolkit which is able to estimate MLU counts (in morphemes) while providing morphosyntactic tagging as well. Its components are based on existing resources; however, many of them were adapted to the language of the transcripts. The tool-chain performs the annotation task with high precision and its MLU estimates correlate well with those of human experts.

preprint PDF


#642: Anti-Models: An Alternative Way to Discriminative Training

Jan Vaněk, Josef Psutka

Traditional discriminative training methods modify Hidden Markov Model (HMM) parameters obtained via a Maximum Likelihood (ML) criterion based estimator. In this paper, anti-models are introduced instead. The anti-models are used in tandem with ML models to incorporate discriminative information from the training data set and to modify the HMM output likelihood in a discriminative way. Traditional discriminative training methods are prone to over-fitting and require extra stabilization. Also, convergence is not ensured and usually "a proper" number of iterations is performed. In the proposed anti-model concept, both parts, the positive model and the anti-model, are trained via the ML criterion. Therefore, convergence and stability are ensured.

preprint PDF


#672: Aranea: Yet Another Family of (Comparable) Web Corpora

Vladimír Benko

Our paper deals with an on-going project in the framework of which, by means of open-source and free tools, a family of web corpora is being created that would (to a large extent) deserve the designation of being "comparable". A summary of results after the first stage of the project is given, and experiences with the tools are discussed.

preprint PDF


#647: Audio-Video Speaker Diarization for Unsupervised Speaker and Face Model Creation

Pavel Campr, Marie Kunešová, Jan Vaněk, Jan Čech, Josef Psutka

Our goal is to create speaker models in audio domain and face models in video domain from a set of videos in an unsupervised manner. Such models can be used later for speaker identification in audio domain (answering the question "Who was speaking and when") and/or for face recognition ("Who was seen and when") for given videos that contain speaking persons. The proposed system is based on an audio-video diarization system that tries to resolve the disadvantages of the individual modalities. Experiments on broadcasts of Czech parliament meetings show that the proposed combination of individual audio and video diarization systems yields an improvement of the diarization error rate (DER).

preprint PDF


#575: Automatic Adaptation of Author’s Stylometric Features to Document Types

Jan Rygl

Many Internet users face the problem of anonymous documents and texts with counterfeit authorship. The number of questionable documents exceeds the capacity of human experts, therefore a universal automated authorship identification system supporting all types of documents is needed. In this paper, five predominant document types are analysed in the context of authorship verification: books, blogs, discussions, comments and tweets. A method for the automatic selection of authors' stylometric features using double-layer machine learning is proposed and evaluated. Experiments are conducted on ten disjoint training and test sets, and a method for the efficient training of a large number of machine learning models is introduced (163,700 models were trained).

preprint PDF


#662: Automatic Speech Recognition Texts Clustering

Svetlana Popova, Ivan Khodyrev, Irina Ponomareva, Tatiana Krivosheeva

This paper deals with the clustering task for Russian texts obtained using automatic speech recognition (ASR). The inputs for processing are recognition results for phone call recordings and manual text transcripts of these calls. We present a comparative analysis of clustering results for recognized texts and manual transcripts, evaluate how recognition quality affects clustering, and explore approaches to increasing clustering quality by using stop words and Latent Semantic Indexing (LSI).

preprint PDF


#646: BFQA: A Bengali Factoid Question Answering System

Somnath Banerjee, Sudip Kumar Naskar, Sivaji Bandyopadhyay

Question Answering (QA) research for factoid questions has recently achieved great success. Presently, QA systems developed for European, Middle Eastern and Asian languages are capable of providing answers with reasonable accuracy. However, although Bengali is among the most widely spoken languages in the world, no factoid question answering system has been available for Bengali to date. This paper describes the first attempt at building a factoid question answering system for the Bengali language. The challenges in developing a question answering system for Bengali are discussed, and methods for the extraction and ranking of relevant sentences are proposed. An extraction strategy for obtaining the ranked answers from the relevant sentences is also suggested for the Bengali question answering system.

preprint PDF


#608: Bengali Named Entity Recognition using Margin Infused Relaxed Algorithm

Somnath Banerjee, Sudip Kumar Naskar, Sivaji Bandyopadhyay

The present work describes the automatic recognition of named entities based on language-independent and language-dependent features. The Margin Infused Relaxed Algorithm is applied for the first time to learn named entities for the Bengali language. We used an openly available annotated corpus with the twelve-tag tagset defined in the IJCNLP-08 NERSSEAL shared task and obtained 91.23% precision, 87.29% recall and 89.69% F-measure. The proposed work outperforms the existing models by a satisfactory margin.

preprint PDF


#671: Building an Arabic Linguistic Resource from a Treebank: the Case of Property Grammar

Raja Bensalem Bahloul, Marwa Elkarwi, Kais Haddar, Philippe Blache

This paper presents a survey of Arabic treebanks to facilitate their reuse for building new linguistic resources. In our case, we created an automatically induced Property Grammar (GP) from a treebank. We therefore discuss the characteristics of these treebanks in order to choose the appropriate one. To build our resource, we adopted an automatic technique, first acquiring a context-free grammar (CFG) from the chosen treebank and then inducing a GP by generating relations between the grammatical units described in the CFG.

preprint PDF


#679: Captioning of Live TV Commentaries from the Olympic Games in Sochi: Some Interesting Insights

Josef V. Psutka, Aleš Pražák, Josef Psutka, Vlasta Radová

In this paper, we describe our effort and some interesting insights obtained during the captioning of more than 70 hours of live TV broadcasts from the Olympic Games in Sochi. The closed captioning was prepared for ČT Sport, the sports channel of the public service broadcaster in the Czech Republic. We briefly discuss our solution for a distributed captioning architecture for live TV programs using the re-speaking approach, several modifications of the existing live captioning application (especially the LVCSR system), and also the way a real TV commentary is re-spoken for individual sports. We show that, after hard training, a re-speaker can achieve an accuracy (more than 98%) and readability of captions which clearly outperform the accuracy of captions created by automatic recognition of the TV soundtrack.

preprint PDF


#685: Clustering in a News Corpus

Richard Elling Moe

We adapt the Suffix Tree Clustering method for application within a corpus of Norwegian news articles. Specifically, suffixes are replaced with n-grams and we propose a new measure for cluster similarity as well as a scoring-function for base-clusters. These modifications lead to substantial improvements in effectiveness and efficiency compared to the original algorithm.

preprint PDF


#562: Comparative Study Concerning the Role of Surface Morphological Features in the Induction of Part-of-Speech Categories

Daniel Devatman Hromada

Being based on the English language, existing systems of part-of-speech induction prioritize the contextual and distributional features "external" to the word and attribute somewhat secondary importance to features derived from a word's "internal" morphological and orthotactic regularities. Here we present some preliminary empirical results supporting the statement that simple "internal" features derived from frequencies of occurrence of character n-grams can substantially increase the V-measure of POS categories obtained by repeated-bisection k-way clustering of tokens contained in the Multext-East corpora. The obtained data indicate that information contained in suffix features can furnish c(l)ues strong enough to outperform some much more complex probabilistic or HMM-based POS induction models, and that this can especially be the case for Western Slavic languages.

preprint PDF


#618: Continuous Distributed Representations of Words as Input of LSTM LM

Daniel Soutner, Luděk Müller

The continuous skip-gram model is an efficient algorithm for learning quality distributed vector representations that are able to capture a large number of syntactic and semantic word relationships. Artificial neural networks have become the state of the art in the task of language modelling, and Long Short-Term Memory (LSTM) networks appear to be an efficient architecture for this task. In this paper, we carry out experiments with a combination of these powerful models: continuous distributed representations of words are trained with the skip-gram method on a large corpus and are used as the input of an LSTM language model instead of the traditional 1-of-N coding. The possibilities of this approach are shown in perplexity experiments on the Wikipedia and Penn Treebank corpora.
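Purely as an illustration, the idea of feeding fixed pretrained skip-gram vectors to an LSTM language model instead of 1-of-N coded inputs can be sketched in present-day PyTorch as follows (this is not the authors' setup, and the "pretrained" vectors here are random placeholders):

    import torch
    import torch.nn as nn

    class SkipGramLSTMLM(nn.Module):
        """Language model whose input layer is a frozen pretrained skip-gram
        embedding matrix instead of a trainable 1-of-N projection."""
        def __init__(self, pretrained_vectors, hidden_size, vocab_size):
            super().__init__()
            # pretrained_vectors: (vocab_size, embedding_dim) matrix of skip-gram vectors
            self.embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
            self.lstm = nn.LSTM(pretrained_vectors.size(1), hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, vocab_size)

        def forward(self, word_ids, state=None):
            x = self.embed(word_ids)          # continuous word representations
            h, state = self.lstm(x, state)
            return self.out(h), state         # logits over the vocabulary

    # Toy usage: 1000-word vocabulary, 100-dimensional vectors (random placeholders).
    vectors = torch.randn(1000, 100)
    model = SkipGramLSTMLM(vectors, hidden_size=256, vocab_size=1000)
    logits, _ = model(torch.randint(0, 1000, (2, 20)))
    print(logits.shape)                       # torch.Size([2, 20, 1000])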

preprint PDF


#588: Detecting Commas in Slovak Legal Texts

Róbert Sabo, Štefan Beňuš

This paper reports on initial experiments with automatic comma recovery in legal texts. In deciding whether or not to insert a comma, we propose to use the probability of the bigram of two words without a comma and of the trigram of the same words with the comma. The probabilities are determined by a language model trained on sentences in which commas are labeled as separate words. In the training database, one sentence corresponds to one line. The bigram and trigram probability thresholds were determined experimentally to achieve the best balance of precision and recall. The advantage of the proposed method is its high precision (95%) at a relatively satisfactory recall (49%). For judges, as potential users of an ASR system with an automatic comma insertion function, precision is particularly important.
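The decision rule can be sketched roughly as follows (a toy illustration only; the language-model lookup and the threshold values are hypothetical placeholders, not taken from the paper):

    def insert_commas(words, lm_prob, bigram_thr=1e-4, trigram_thr=1e-6):
        """Insert a comma between w1 and w2 when the language model prefers the
        comma variant: the trigram 'w1 , w2' is likely enough while the plain
        bigram 'w1 w2' is unlikely enough. Thresholds are illustrative."""
        out = []
        for w1, w2 in zip(words, words[1:]):
            out.append(w1)
            p_with = lm_prob((w1, ",", w2))     # trigram with the comma as a token
            p_without = lm_prob((w1, w2))       # plain bigram without the comma
            if p_with > trigram_thr and p_without < bigram_thr:
                out.append(",")
        out.append(words[-1])
        return out

    # Toy language model: probabilities for a few n-grams (values are made up).
    toy_lm = {
        ("rozhodl", ",", "že"): 1e-3,
        ("rozhodl", "že"): 1e-7,
    }
    lm_prob = lambda ngram: toy_lm.get(ngram, 1e-9)

    print(" ".join(insert_commas(["soud", "rozhodl", "že", "ano"], lm_prob)))
    # -> soud rozhodl , že ano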

preprint PDF


#591: Detection and Classification of Events in Hungarian Natural Language Texts

Zoltán Subecz

The detection and analysis of events in natural language texts plays an important role in several NLP applications such as summarization and question answering. In this study we introduce a machine learning-based approach that can detect and classify verbal and infinitival events in Hungarian texts. First we identify the multiword noun + verb and noun + infinitive expressions. Then the events are detected and the identified events are classified. For each problem, we applied binary classifiers based on rich feature sets. The models were also expanded with rule-based methods. In this study we introduce new methods for this application area. To the best of our knowledge, ours is the first result on the detection and classification of verbal and infinitival events in Hungarian natural language texts. Evaluated on test databases, our algorithms achieved results competitive with the current results for English.

preprint PDF


#611: Development of a Large Spontaneous Speech Database of Agglutinative Hungarian Language

Tilda Neuberger, Dorottya Gyarmathy, Tekla Etelka Gráczi, Viktória Horváth, Mária Gósy, András Beke

In this paper, a large Hungarian spoken language database is introduced. This phonetically-based multi-purpose database contains various types of spontaneous and read speech from 333 monolingual speakers (about 50 minutes of speech per speaker). This study presents the background and motivation of the development of the BEA Hungarian database, describes its protocol and the transcription procedure, and also presents existing and proposed research using this database. Due to its recording protocol and transcription, it provides challenging material for various comparisons of the segmental structures of speech, also across languages.

preprint PDF


#692: Development of a Semantic and Syntactic Model of Natural Language by Means of Non-Negative Matrix and Tensor Factorization

Anatoly Anisimov, Oleksandr Marchenko, Volodymyr Taranukha, Taras Vozniuk

A method for developing a structural model of natural language syntax and semantics is proposed. Syntactic and semantic relations between parts of a sentence are presented in the form of a recursive structure called a control space. Numerical characteristics of these data are stored in multidimensional arrays. After factorization, the arrays serve as the basis for the development of procedures for analyses of natural language semantics and syntax.

preprint PDF


#652: Dictionary-Based Problem Phrase Extraction from User Reviews

Valery Solovyev, Vladimir Ivanov

This paper describes a system for problem phrase extraction from texts that contain users' reviews of products. In contrast to recent works, this system is based on dictionaries and heuristics, not machine learning algorithms. We explored two approaches to dictionary construction: manual and automatic. We evaluated the system on a dataset constructed using Amazon Mechanical Turk. Performance values are compared to a machine learning baseline.

preprint PDF


#617: Disambiguation of Japanese Onomatopoeias using Nouns and Verbs

Hironori Fukushima, Kenji Araki, Yuzu Uchida

Japanese onomatopoeias are very difficult for machines to recognize and translate into other languages due to their uniqueness. In particular, onomatopoeias that convey several meanings are very confusing for machine translation systems to distinguish and translate correctly. In this paper, we discuss which features are helpful for automatically disambiguating the meaning of onomatopoeias that have two different meanings. We used nouns, adjectives, and verbs extracted from sentences as features, then carried out a machine learning classification analysis and compared how accurately these features differentiate the two meanings of ambiguous onomatopoeias. As a result, we discovered that employing machine learning with nouns and verbs as features achieved an accuracy above 80 points. In addition, we were able to improve the accuracy by excluding pronouns and proper nouns and also by limiting the verbs to those that are modified by onomatopoeias. In the future, we plan to concentrate on the dependency between verbs that are modified by onomatopoeias and nouns, as we believe that this approach will help machine translation systems translate Japanese onomatopoeias correctly.

preprint PDF


#632: Divergences in the Usage of Discourse Markers in English and Mandarin Chinese

David Steele, Lucia Specia

Statistical machine translation (SMT) has, in recent years, improved the accuracy of automated translations. However, SMT systems often fail to deliver human quality translations especially with complex sentences and distant language pairs. Current SMT systems often focus on translating single sentences with clauses being treated in isolation, leading to a loss of contextual information. Discourse markers (DMs) are vital contextual links between discourse segments and this paper examines the divergences in their usage across English and Mandarin Chinese. We highlight important structural differences in composite sentences extracted from a number of parallel corpora, and show examples of how these cases are dealt with by popular SMT systems. Numerous significant divergences, such as contextual omissions, were observed, which can lead to incoherent automatic translations. Our objective is to use these findings to guide a framework proposal to address divergences in DM usage in order to improve SMT output quality.

preprint PDF


#600: Document Classification with Deep Rectifier Neural Networks and Probabilistic Sampling

Tamás Grósz, István Nagy T.

Deep learning is regarded by some as one of the most important technological breakthroughs of this decade. In recent years it has been shown that using rectified neurons, one can match or surpass the performance achieved using hyperbolic tangent or sigmoid neurons, especially in deep networks. With rectified neurons we can readily create sparse representations, which seems especially suitable for naturally sparse data like the bag of words representation of documents. To test this, here we study the performance of deep rectifier networks in the document classification task. Like most machine learning algorithms, deep rectifier nets are sensitive to class imbalances, which is quite common in document classification. To remedy this situation we will examine the training scheme called probabilistic sampling, and show that it can improve the performance of deep rectifier networks. Our results demonstrate that deep rectifier networks generally outperform other typical learning algorithms in the task of document classification.

preprint PDF


#561: Empiric Introduction to Light Stochastic Binarization

Daniel Devatman Hromada

We introduce a novel method for the transformation of texts into short binary vectors which can subsequently be compared by means of Hamming distance measurement. Similarly to other semantic hashing approaches, the objective is to perform radical dimensionality reduction by putting texts with similar meaning into the same or similar buckets while putting texts with dissimilar meaning into different and distant buckets. First, the method transforms the texts into a complete TF-IDF space, then applies Reflective Random Indexing in order to fold both the term and document spaces into a low-dimensional space. Subsequently, every dimension of the resulting low-dimensional space is simply thresholded along its 50th percentile, so that every individual bit of the resulting hash cuts the whole input dataset into two equally cardinal subsets. Without implementing any parameter-tuning training phase whatsoever, the method attains, especially in the high-precision/low-recall region of the 20newsgroups text classification task, results which are comparable to those obtained by much more complex deep learning techniques.
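A rough sketch of the hashing pipeline under simplifying assumptions (a plain Gaussian random projection stands in for Reflective Random Indexing, and the TF-IDF matrix is assumed to be precomputed):

    import numpy as np

    def binarize(tfidf, n_bits=64, seed=0):
        """Project TF-IDF vectors into n_bits dimensions and threshold every
        dimension at its 50th percentile, so that each bit splits the whole
        collection into two equally cardinal halves."""
        rng = np.random.default_rng(seed)
        projection = rng.standard_normal((tfidf.shape[1], n_bits))
        low_dim = tfidf @ projection
        medians = np.percentile(low_dim, 50, axis=0)
        return (low_dim > medians).astype(np.uint8)

    def hamming(a, b):
        """Hamming distance between two binary hashes."""
        return int(np.count_nonzero(a != b))

    # Toy collection: 6 documents over a 10-term vocabulary (random TF-IDF weights).
    docs = np.abs(np.random.default_rng(1).standard_normal((6, 10)))
    hashes = binarize(docs, n_bits=16)
    print(hamming(hashes[0], hashes[1]))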

preprint PDF


#598: Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches

Jurgita Kapočiūtė-Dzikienė, Andrius Utka, Ligita Šarkutė

This paper reports the first authorship attribution results based on automatic computational methods for the Lithuanian language. Using supervised machine learning techniques, we experimentally investigated the influence of different feature types (lexical, character, and syntactic), focusing on a few authors within three datasets containing transcripts of parliamentary speeches and debates. Aiming to keep as many interfering factors as possible to a minimum, all datasets were composed by selecting candidates with the same political views (avoiding ideology-based classification) from overlapping parliamentary terms (avoiding a topic classification task). Experiments revealed that content-based features are more useful than function words or part-of-speech tags; moreover, lemma n-grams (sometimes used in concatenation with morphological information) outperform word or document-level character n-grams. Since Lithuanian is highly inflective and rich both morphologically and in vocabulary, and since we were dealing with normative language, morphological tools were maximally helpful.

preprint PDF


#574: GMM Classification of TTS Synthesis: Identification of Original Speaker's Voice

Jiří Přibil, Anna Přibilová, Jindřich Matoušek

This paper describes two experiments. The first one deals with the evaluation of synthetic speech quality by reverse identification of the original speakers whose voices had been used for several Czech text-to-speech (TTS) systems. The second experiment was aimed at evaluating the influence of voice transformation on the recognition of the original speaker. The paper further describes an analysis of the influence of the initial settings for the creation and training of the Gaussian mixture models (GMM), and of the influence of the different types of speech features used (spectral and/or supra-segmental), on the correctness of GMM identification. The stability of the identification process with respect to the duration of the tested sentence (number of processed frames) was analysed, too.

preprint PDF


#594: Generating Underspecified Descriptions of Landmark Objects

Ivandré Paraboni, Alan K. Yamasaki, Adriano S. R. da Silva, Caio V. M. Teixeira

We present an experiment to collect referring expressions produced by human speakers under conditions that favour landmark underspecification. The experiment shows that underspecified landmark descriptions are not only common but, under certain conditions, may be largely preferred over minimally and fully-specified descriptions alike.

preprint PDF


#666: Impact of Irregular Pronunciation on Phonetic Segmentation of Nijmegen Corpus of Casual Czech

Petr Mizera, Petr Pollak, Alice Kolman, Mirjam Ernestus

This paper describes a pilot study of phonetic segmentation applied to the Nijmegen Corpus of Casual Czech (NCCCz). This corpus contains informal speech of a strongly spontaneous nature, which influences the character of the produced speech at various levels. This work is part of wider research related to the analysis of pronunciation reduction in such informal speech. We present an analysis of the accuracy of phonetic segmentation when canonical or reduced pronunciation is used. The achieved accuracy of the phonetic segmentation provides information about the general accuracy of the acoustic modelling that is supposed to be applied in spontaneous speech recognition. As a byproduct of the presented spontaneous speech segmentation, this paper also describes the created lexicon with canonical pronunciations of words in NCCCz, a tool supporting pronunciation checks of lexicon items, and finally a mini-database of selected utterances from NCCCz manually labelled at the phonetic level, suitable for evaluation purposes.

preprint PDF


#648: Improving a Long Audio Aligner through Phone-relatedness Matrices for English, Spanish and Basque

Aitor Álvarez, Pablo Ruiz, Haritz Arzelus

A multilingual long audio alignment system is presented in the automatic subtitling domain, supporting English, Spanish and Basque. Pre-recorded contents are recognized at phoneme level through language-dependent triphone-based decoders. In addition, the transcripts are phonetically translated using grapheme-to-phoneme transcriptors. An optimized version of Hirschberg's algorithm performs an alignment between both phoneme sequences to find matches. The correctly aligned phonemes and their time-codes obtained in the recognition step are used as the reference to obtain near-perfectly aligned subtitles. The performance of the alignment algorithm is evaluated using different non-binary scoring matrices based on phone confusion-pairs from each decoder, on phonological similarity and on human perception errors. This system is an evolution of our previous successful system for long audio alignment.

preprint PDF


#637: Incorporating Language Patterns and Domain Knowledge into Feature-opinion Extraction

Erqiang Zhou, Xi Luo, Zhiguang Qin

We present a hybrid method for aspect-based sentiment analysis of Chinese restaurant reviews. Two main components are employed to extract feature-opinion pairs in the proposed method: domain-independent language patterns found in Chinese and a lexical base built for restaurant reviews. The language patterns capture general knowledge that is implicitly contained in Chinese and can thus be used directly in other domains without any modification. The lexical base, on the other hand, targets the particular characteristics of a given domain and acts as a plug-in part of our prototype system, so it does not affect portability when applying the proposed approach in practice. Empirical evaluation shows that our method performs well and that the results improve progressively as each component is brought into effect.

preprint PDF


#657: Initial Experiments on Automatic Correction of Prosodic Annotation of Large Speech Corpora

Zdeněk Hanzlíček, Martin Grůber

Most modern speech synthesis systems utilize large speech corpora to learn new voices. These speech corpora usually contain several hours of speech spoken by talented speakers who are able to record such an amount of speech data in sufficient quality. Appropriate phonetic and prosodic annotation of the recorded utterances is necessary for a high quality of synthesized speech. For many languages, the pitch shape within the last prosodic word of a phrase is characteristic of particular types of sentences and of the phrase structure of compound/complex sentences. However, in real data this formal convention can be breached and a different pitch shape than expected can be present. This can be a source of prosody inconsistency in synthesized speech. This article presents some experiments on the automatic detection of prosodic mismatches in recorded utterances. A simple classifier based on GMMs was proposed for this task. Experiments were performed on 5 large speech corpora. The classification results were successfully verified by listening tests.

preprint PDF


#578: Integration of an on-line Kaldi Speech Recogniser to the Alex Dialogue Systems Framework

Ondřej Plátek, Filip Jurčíček

This paper describes the integration of an on-line Kaldi speech recogniser into the Alex Dialogue Systems Framework (ADSF). As the Kaldi OnlineLatgenRecogniser is written in C++, we first developed a Python wrapper for the recogniser so that the ADSF, written in Python, could interface with it. Training scripts for acoustic and language modelling were developed and integrated into the ADSF, and acoustic and language models were built. Finally, optimal recogniser parameters were determined and evaluated. The Alex dialogue system with the new speech recogniser is evaluated on the Public Transport Information (PTI) domain.

preprint PDF


#702: Intelligibility Assessment of the De-Identified Speech Obtained Using Phoneme Recognition and Speech Synthesis Systems

Tadej Justin, France Mihelič, Simon Dobrišek

The paper presents and evaluates a speaker de-identification technique using speech recognition and two speech synthesis techniques. The phoneme recognition system is built using HMM-based acoustical models of context-dependent diphone speech units, and two different speech synthesis systems (diphone TD-PSOLA-based and HMM-based) are employed for re-synthesizing the recognized sequences of speech units. Since the acoustical models of the two speech synthesis systems are assumed to be completely independent of the input speaker's voice, the highest level of input speaker de-identification is ensured. The proposed de-identification system is considered to be language dependent but vocabulary and speaker independent, since it is based mainly on acoustical modelling of the selected diphone speech units. Due to the relatively simple computing methods, the whole de-identification procedure runs in real time. The speech outputs are compared and assessed by testing the intelligibility of the re-synthesized speech from different points of view. The assessment results show interesting variability of the evaluators' transcriptions depending on the input speaker, the synthesis method applied and the evaluators' capabilities. However, in spite of the relatively high phoneme recognition error rate (approx. 19%), the re-synthesized speech is in many cases still fully intelligible.

preprint PDF


#590: Inter-Annotator Agreement on Spontaneous Czech Language

Tomáš Valenta, Luboš Šmídl, Jan Švec, Daniel Soutner

The goal of this article is to show that for some tasks in automatic speech recognition (ASR), especially for recognition of spontaneous telephony speech, the reference annotation differs substantially among human annotators and thus sets the upper bound of the ASR accuracy. In this paper, we focus on the evaluation of the inter-annotator agreement (IAA) and ASR accuracy in the context of imperfect IAA. We evaluated it using a part of our Czech Switchboard-like spontaneous speech corpus called Toll-free calls. This data set was annotated by three different annotators rendering three parallel transcriptions. The results give us additional insights for understanding the ASR accuracy.

preprint PDF


#640: LIUM and CRIM ASR System Combination for the REPERE Evaluation Campaign

Anthony Rousseau, Gilles Boulianne, Paul Deléglise, Yannick Esteve, Vishwa Gupta, Sylvain Meignier

This paper describes the ASR system proposed by the SODA consortium for participation in the ASR task of the French REPERE evaluation campaign. The official REPERE test corpus is composed of TV shows. The entire ASR system was produced by combining two ASR systems built by two members of the consortium. Each ASR system has some specificities: one uses i-vector-based speaker adaptation of deep neural networks for acoustic modeling, while the other rescores word lattices with continuous-space language models. The combined ASR system won the REPERE evaluation campaign on the ASR task. On the REPERE test corpus, this composite ASR system reaches a word error rate of 13.5%.

preprint PDF


#603: Language Independent Evaluation of Translation Style and Consistency: Comparing Human and Machine Translations of Camus' Novel "The Stranger"

Mahmoud El-Haj, Paul Rayson, David Hall

We present quantitative and qualitative results of automatic and manual comparisons of translations of the originally French novel "The Stranger" (French: L'Étranger). We provide a novel approach to evaluating translation performance across languages without the need for reference translations or comparable corpora. Our approach examines the consistency of the translation at various document levels including chapters, parts and sentences. In our experiments we analyse four expert translations of the French novel. We also used Google's machine translation output as a baseline. We analyse the translations by using readability metrics, rank correlation comparisons and Word Error Rate (WER).

preprint PDF


#699: Language Resources and Evaluation for the Support of the Greek Language in the MARY TtS

Pepi Stavropoulou, Dimitrios Tsonos, Georgios Kouroupetroglou

The paper outlines the process of creating a new voice in the MARY Text-to-Speech Platform, evaluating and proposing extensions to the existing tools and methodology. It particularly focuses on the development of the phoneme set, the Grapheme to Phone (GtP) conversion module and the subsequent process of generating a corpus for building the new voice. The work presented in this paper was carried out as part of the process of supporting the Greek language in the MARY TTS system; however, the outlined methodology should be applicable to other languages as well.

preprint PDF


#601: Minimum Text Corpus Selection for Limited Domain Speech Synthesis

Markéta Jůzová, Daniel Tihelka

This paper concerns a limited domain TTS system based on the concatenative method, and presents an algorithm capable of extracting a minimal domain-oriented text corpus from real data of the given domain while still reaching maximum coverage of the domain. The proposed approach ensures that the least amount of text is extracted, containing the most common phrases and (possibly) all the words from the domain. At the same time, it ensures that appropriate phrase overlapping is kept, allowing smooth concatenations to be found in the overlapped regions and thus high quality synthesized speech. In addition, several recommendations allowing a speaker to record the corpus more fluently and comfortably are presented and discussed. The corpus building is tested and evaluated on several domains differing in size and nature, and the authors present the results of the algorithm and demonstrate the advantages of using the domain-oriented corpus for speech synthesis.
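The coverage-driven part of such a selection can be approximated by a greedy routine like the following sketch (illustrative only; the paper's algorithm additionally maintains phrase overlapping, which is omitted here):

    def select_corpus(sentences, required_words):
        """Greedy sketch: repeatedly pick the sentence covering the most still
        uncovered domain words until every required word is covered (or no
        remaining sentence adds anything new)."""
        uncovered = set(required_words)
        selected = []
        pool = [(s, set(s.lower().split())) for s in sentences]
        while uncovered and pool:
            best = max(pool, key=lambda p: len(p[1] & uncovered))
            gain = best[1] & uncovered
            if not gain:
                break
            selected.append(best[0])
            uncovered -= gain
            pool.remove(best)
        return selected

    # Toy domain data (made-up utterances from a transport-information domain).
    sentences = [
        "kdy jede vlak do Brna",
        "kolik stojí jízdenka do Brna",
        "kdy jede autobus do Prahy",
    ]
    print(select_corpus(sentences, {"vlak", "autobus", "jízdenka", "brna", "prahy"}))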

preprint PDF


#643: Modelling F0 Dynamics in Unit Selection Based Speech Synthesis

Daniel Tihelka, Jindřich Matoušek, Zdeněk Hanzlíček

In common unit selection implementations, F0 continuity is measured as one of the concatenation cost features, with the expectation that a smooth transition between units (regarding speech melody) is ensured when the difference in F0 is low enough. This measure generally uses a static F0 value computed at the unit boundary. In the present paper we show, however, that the use of static F0 values is not enough for smooth speech unit concatenation, and that the dynamic nature of the F0 contour must be taken into account. Two schemes of dynamic F0 handling are presented, and speech generated by both schemes is compared by means of listening tests on specially selected phrases which are known to carry unnatural artefacts. Advantages and disadvantages of the individual schemes are also discussed.
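One possible way to add a dynamic term to the F0 part of the concatenation cost, shown purely for illustration (this is not either of the two schemes evaluated in the paper):

    def f0_join_cost(left_f0, right_f0, w_static=1.0, w_delta=1.0):
        """Concatenation sub-cost penalizing both the difference of F0 values at
        the join point and the difference of local F0 slopes (dynamics).
        left_f0 / right_f0: short F0 contours (Hz) ending / starting at the join."""
        static = abs(left_f0[-1] - right_f0[0])
        delta_left = left_f0[-1] - left_f0[-2]      # slope at the end of the left unit
        delta_right = right_f0[1] - right_f0[0]     # slope at the start of the right unit
        dynamic = abs(delta_left - delta_right)
        return w_static * static + w_delta * dynamic

    left = [118.0, 121.0, 124.0]     # rising contour at the end of the left unit
    right = [124.0, 120.0, 116.0]    # falling contour at the start of the right unit
    print(f0_join_cost(left, right)) # static difference is 0, yet the dynamics mismatch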

preprint PDF


#619: NERC-fr: Supervised Named Entity Recognition for French

Andoni Azpeitia, Montse Cuadros, Seán Gaines, German Rigau

Currently there are only a few language resources available for French. Additionally, there is a lack of available language models for tasks such as Named Entity Recognition and Classification (NERC), which makes it difficult to build natural language processing systems for this language. This paper presents a new publicly available supervised Apache OpenNLP NERC model that has been trained and tested under a maximum entropy approach. This new model achieves state-of-the-art results for French when compared with other systems. Finally, we have also extended the Apache OpenNLP libraries to support a part-of-speech feature extraction component, which has been used in our experiments.

preprint PDF


#676: Named Entity Recognition for Highly Inflectional Languages: Effects of Various Lemmatization and Stemming Approaches

Michal Konkol, Miloslav Konopík

In this paper, we study the effects of various lemmatization and stemming approaches on the named entity recognition (NER) task for Czech, a highly inflectional language. Lemmatizers are seen as a necessary component of Czech NER systems and they have been used in all papers on Czech NER published so far. Thus, it is of utmost importance to explore their benefits, limits and the differences between simple and complex methods. Our experiments are evaluated on the standard Czech Named Entity Corpus 1.1 as well as the newly created 2.0 version.

preprint PDF


#625: Ontology Based Strategies for Supporting Communication within Social Networks

Ivan Kopeček, Radek Ošlejšek, Jaromír Plhák

In this paper, ontology based dialogue strategies are presented in connection with the concept of communicative images. Communicative images are graphical objects integrated with a dialogue interface and linked to an associated knowledge database which stores the semantics of the objects depicted. The relevant pieces of information can be linked to the external knowledge distributed in a social network. Exploiting a formal ontology approach facilitates the process of deriving information from relevant texts that can be found in the social network and it simultaneously forms a suitable framework for supporting dialogue communication in natural language. This approach is discussed and illustrated with various examples in this paper.

preprint PDF


#670: Parametric Speech Coding Framework for Voice Conversion Based on Mixed Excitation Model

Michał Lenarczyk

An adaptation of the mixed-excitation linear predictive (MELP) model for application in voice conversion is presented. The adapted model features only numerical parameters, which can be used for phonetic space transformation from the source to the target speaker using machine learning methods. The validity of the model was demonstrated by applying the transformation to both the pitch and the spectral envelope of the voice.

preprint PDF


#684: Paraphrase and Textual Entailment Generation

Zuzana Nevěřilová

One particular piece of information can be conveyed by many different sentences. This variety concerns the choice of vocabulary and style as well as the level of detail (from laconism or succinctness to total verbosity). Although verbosity in written texts is considered bad style, generated verbosity can help natural language processing (NLP) systems fill in implicit knowledge. The paper presents a rule-based system for paraphrasing and textual entailment generation in Czech. The inner representation of the input text is transformed syntactically or lexically in order to produce two types of new sentences: paraphrases (sentences with similar meaning) and entailments (sentences that humans will infer from the input text). The transformations make use of several language resources as well as a natural language generation (NLG) subsystem. The paraphrases and entailments are annotated by one or more annotators. So far, we have annotated 3,321 paraphrases and textual entailments, of which 1,563 (47.1%) were judged correct, 1,238 (37.3%) were judged incorrect entailments, and 520 (15.6%) were judged nonsense. Paraphrasing and textual entailment can be employed in chatbots, text summarization or question answering systems. The results can encourage application-driven creation of new language resources or improvement of the current ones.

preprint PDF


#686: Partial Grammar Checking for Czech Using the SET Parser

Vojtěch Kovář

Checking people’s writing for correctness is one of the prominent language technology applications. In the Czech language, punctuation errors and mistakes in subject-predicate agreement belong to the most severe and most frequent errors people make, as there are complex and non-intuitive rules for both of these phenomena. At the same time, they include numerous syntactic, semantic and pragmatic aspects which makes them very difficult to be formalized for automatic checking. In this paper, we present an automatic method for fixing errors in commas and subject-predicate agreement, using pattern-matching rule-based syntactic analysis provided by the SET parsing system. We explain the method and present first evaluation of the overall accuracy.

preprint PDF


#693: Partial Measure of Semantic Relatedness based on the Local Feature Selection

Maciej Piasecki, Michał Wendelberger

A corpus-based Measure of Semantic Relatedness can be calculated for every pair of words occurring in the corpus, but it can produce erroneous results for many word pairs due to accidental associations derived on the basis of several context features. We propose the novel idea of a partial measure that assigns relatedness values only to word pairs that are well enough supported by corpus data. Three simple implementations of this idea are presented and evaluated on large corpora and wordnets for two languages. Partial Measures of Semantic Relatedness are shown to perform better in tasks focused on wordnet development than a state-of-the-art "full" Measure of Semantic Relatedness. A comparison of the partial measure with a globally filtered measure is also presented.
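A naive illustration of the abstention idea behind a partial measure (the paper's three implementations are not reproduced here; the feature weights and the support threshold below are made up):

    import math

    def partial_relatedness(features_a, features_b, min_shared=3):
        """Return a relatedness value only when the two words share enough context
        features to support it; otherwise abstain (return None).
        features_*: dicts mapping context features to association weights."""
        shared = set(features_a) & set(features_b)
        if len(shared) < min_shared:
            return None                     # abstain: not enough corpus support
        num = sum(features_a[f] * features_b[f] for f in shared)
        den = (math.sqrt(sum(features_a[f] ** 2 for f in shared))
               * math.sqrt(sum(features_b[f] ** 2 for f in shared)))
        return num / den if den else 0.0

    dog = {"bark": 2.1, "leash": 1.3, "tail": 0.8, "walk": 0.5}
    cat = {"tail": 1.1, "purr": 2.0, "walk": 0.4}
    print(partial_relatedness(dog, cat))                  # only 2 shared features: abstain
    print(partial_relatedness(dog, cat, min_shared=2))    # enough support: a value is returned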

preprint PDF


#582: Phonation and Articulation Analysis of Spanish Vowels for Automatic Detection of Parkinson's Disease

Juan Rafael Orozco-Arroyave, Elkyn Alexander Belalcázar-Bolaños, Julián David Arias-Londoño, Jesús Francisco Vargas-Bonilla, Tino Haderlein, Elmar Noeth

Parkinson's disease (PD) is a chronic neurodegenerative disorder of the central nervous system and it can affect the communication skills of patients. There is interest in the research community in developing computer-aided tools for the analysis of the speech of people with PD for detection and monitoring. In this paper, three new acoustic measures for the simultaneous analysis of the phonation and articulation of patients with PD are presented. These new measures, along with other classical articulation and perturbation features, are objectively evaluated with a discriminant criterion. According to the results, the speech of people with PD can be detected with an accuracy of 81% when phonation and articulation features are combined.

preprint PDF


#599: Processing of Quantitative Expressions with Measurement Units in the Nominative, Genitive, and Accusative Cases for Belarusian and Russian

This paper outlines an approach to the stage-by-stage solution of the computer-linguistic problem of the processing of quantitative expressions with measurement units by means of the linguistic processor NooJ. The focus is put on the nominative, genitive, and accusative cases for Belarusian and Russian. The paper gives a general analysis of the problem providing examples not only for Belarusian and Russian, but also for English.

preprint PDF


#595: Referring Expression Generation: Taking Speakers' Preferences into Account

Thiago Castro Ferreira, Ivandré Paraboni

We describe a classification-based approach to referring expression generation (REG) making use of standard context-related features, and an extension that adds speaker-related features. Results show that taking speakers' preferences into account outperforms the standard REG model in four test corpora of definite descriptions.

preprint PDF


#668: RelANE: Discovering Relations between Arabic Named Entities

Ines Boujelben, Salma Jamoussi, Abdelmajid Ben Hamadou

In this paper, we describe the first tool that detects semantic relations between Arabic named entities, henceforth RelANE. We use various supervised learning techniques to predict the word or the sequence of terms that can highlight one or more semantic relationships between two Arabic named entities. For each word in the sentence, we use its morphological and contextual features and the semantic features of the entity types. We do not use a predefined set of relation classes, in order to cover more relations that can be present in sentences. Given that free Arabic corpora for this task are not available, we built our own corpus annotated with the required information. Plenty of experiments were conducted, and the preliminary results proved the effectiveness of our process for extracting semantic relations between Arabic NEs. We obtained promising results in terms of F-score when applied to our corpus.

preprint PDF


#690: Russian Learner Translator Corpus: Design, Research Potential and Applications

Andrey Kutuzov, Maria Kunilovskaya

The project we present, the Russian Learner Translator Corpus (RusLTC), is a multiple learner translator corpus which stores Russian students' translations both out of English and into it. The project is being developed by a cross-functional team of translator trainers and computational linguists in Russia. Translations are collected from several Russian universities; all translations are made as part of routine and exam assignments or as submissions for translation contests by students majoring in translation. As of March 2014, RusLTC contains a total of nearly 1.2 million word tokens, 258 source texts, and 1,795 translations. The paper gives a brief overview of the related research and describes the corpus structure and the corpus-building technologies used; it also covers the query tool features and our error annotation solutions. In the final part we summarize RusLTC-based research and its current practical applications, and suggest research prospects and possibilities.

preprint PDF


#612: Score Normalization Methods Applied to Topic Identification

Lucie Skorkovská, Zbyněk Zajíc

Multi-label classification plays a key role in modern categorization systems. Its goal is to find the set of labels belonging to each data item. In multi-label document classification, unlike in multi-class classification where only the best topic is chosen, the classifier must decide whether or not a document belongs to each topic from the predefined topic set. We use a generative classifier to tackle this task, but the problem with this approach is that a threshold for positive classification must be set. This threshold can vary for each document depending on its content (the words used, the length of the document, ...). In this paper we use Unconstrained Cohort Normalization, originally proposed for the speaker identification/verification task, for robustly finding the threshold defining the boundary between the correct and the incorrect topics of a document. In our former experiments we proposed a method for finding this threshold inspired by another normalization technique called World Model score normalization. A comparison of these normalization methods has shown that better results can be achieved with Unconstrained Cohort Normalization.
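A simplified illustration of cohort-style score normalization for topic acceptance (the cohort definition and the acceptance threshold here are illustrative and may differ from the method used in the paper):

    import numpy as np

    def cohort_normalize(scores, cohort_size=5):
        """Normalize each topic score by the mean and standard deviation of its
        cohort (here simply the N best competing scores for the same document)."""
        scores = np.asarray(scores, dtype=float)
        normalized = np.empty_like(scores)
        for i, s in enumerate(scores):
            competitors = np.sort(np.delete(scores, i))[::-1][:cohort_size]
            normalized[i] = (s - competitors.mean()) / (competitors.std() + 1e-9)
        return normalized

    # Made-up generative-classifier scores of one document against six topics.
    doc_scores = [12.1, 11.8, 7.3, 7.1, 6.9, 6.5]
    print(cohort_normalize(doc_scores) > 1.0)   # topics above a single global threshold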

preprint PDF


#681: Self Training Wrapper Induction with Linked Data

Anna Lisa Gentile, Ziqi Zhang, Fabio Ciravegna

This work explores the usage of Linked Data for Web-scale Information Extraction, with a focus on the task of Wrapper Induction. We show how to effectively use Linked Data to automatically generate training material and build a self-trained Wrapper Induction method. Experiments on a publicly available dataset demonstrate that for the covered domains, our method can achieve an F-measure of 0.85, which is a competitive result compared against a supervised solution.

preprint PDF


#621: Semantic Classes and Relevant Domains on WSD

Rubén Izquierdo, Sonia Vázquez, Andrés Montoyo

Language ambiguities are a problem in various fields. For example, in Machine Translation the major cause of errors is ambiguity. Moreover, ambiguous words can be confusing for Information Extraction algorithms. Our purpose in this work is to provide a new approach to solving semantic ambiguities by dealing with the problem of the fine granularity of sense inventories. Our goal is to replace word senses with Semantic Classes that share properties, features and meanings. Another semantic resource, Relevant Domains, is also used to extract semantic information and enrich the process. The results obtained are evaluated in the Evaluation Exercises for the Semantic Analysis of Text (SensEval) framework.

preprint PDF


#633: Sentence Similarity by Combining Explicit Semantic Analysis and Overlapping N-grams

Hai Hieu Vu, Jeanne Villaneau, Farida Said, Pierre-François Marteau

We propose a similarity measure between sentences which combines a knowledge-based measure, a lighter version of ESA (Explicit Semantic Analysis), and a distributional measure, ROUGE. We used this hybrid measure with two French domain-oriented corpora collected from the Web and we compared its similarity scores to those of human judges. In both domains, ESA and ROUGE perform better when they are mixed than they do individually. Besides, using the whole Wikipedia base in ESA did not prove necessary, since the best results were obtained with a low number of well-selected concepts.
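A toy sketch of such a hybrid score (the ESA concept vectors are placeholders here; in the paper they would be built from a reduced set of Wikipedia concepts):

    def ngram_overlap(s1, s2, n=2):
        """ROUGE-style recall of s1's word n-grams in s2 (toy version)."""
        grams = lambda s: {tuple(s.split()[i:i + n]) for i in range(len(s.split()) - n + 1)}
        g1, g2 = grams(s1), grams(s2)
        return len(g1 & g2) / len(g1) if g1 else 0.0

    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
        return num / den if den else 0.0

    def hybrid_similarity(s1, s2, esa1, esa2, alpha=0.5):
        """Mix the knowledge-based (ESA concept vector) score with the
        distributional (n-gram overlap) score; alpha is a tuning weight."""
        return alpha * cosine(esa1, esa2) + (1 - alpha) * ngram_overlap(s1, s2)

    s1 = "le moteur ne démarre plus"
    s2 = "le moteur refuse de démarrer"
    print(hybrid_similarity(s1, s2, esa1=[0.9, 0.1, 0.3], esa2=[0.8, 0.2, 0.4]))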

preprint PDF


#586: Speaker Identification by Combining Various Vocal Tract and Vocal Source Features

Yuta Kawakami, Longbiao Wang, Atsuhiko Kai, Seiichi Nakagawa

Previously, we proposed a speaker recognition system using a combination of MFCC-based vocal tract features and phase information, which includes rich vocal source information. In this paper, we investigate the efficiency of combinations of various vocal tract features (MFCC and LPCC) and vocal source features (phase and LPC residual) for normal-duration and short-duration utterances. The Japanese Newspaper Article Sentence (JNAS) database was used to evaluate our proposed method. The combination of various vocal tract and vocal source features achieved a remarkable improvement over the conventional MFCC-based vocal tract feature for both normal-duration and short-duration utterances.

preprint PDF


#673: Speech Synthesis and Uncanny Valley

Jan Romportl

The paper discusses a hypothesis relating high quality text-to-speech (TTS) synthesis in spoken dialogue systems with the concept of "uncanny valley". It introduces a "Wizard-of-Oz" experiment with 30 volunteers engaged in conversations with two synthetic voices of different naturalness. The results of the experiment are summarized and interpreted, leading to the conclusion that the TTS uncanny valley effect in dialogue systems can probably be superseded and inverted by a positive attitude of the systems' users toward new technologies.

preprint PDF


#605: Study on Phrases Used for Semi-Automatic Text-Based Speakers’ Names Extraction in the Czech Radio Broadcasts News

Michaela Kuchařová, Svatava Škodová, Ladislav Šeps, Marek Boháč

In this paper we introduce a methodology for extending the speakers' database used in the automatic transcription of spoken documents stored in the largest Czech Radio audio archive. We address one issue in the conversion of speech to written text: the automatic detection of speakers and their names. We work with a subset of the archive consisting of 8,020 hours of broadcast news and 58,914,179 words from the years 1968--2011. Thousands of speakers' names occur over this period, which makes their automatic or semi-automatic identification necessary. Another investigated route to extending the speakers' database is the co-occurrence of a speaker's name in a specific phrase in the text transcription linked to a speaker change in the audio recording.

preprint PDF


#556: SuMACC Project's Corpus: a Topic-based Query Extension Approach to Retrieve Multimedia Documents

Mohamed Morchid, Richard Dufour, Usman Niaz, Francis Bouvier, Clément de Groc, Claude de Loupy, Georges Linarès, Bernard Merialdo, Bertrand Peralta

The SuMACC project aims at automatically tracking new multimodal entities on the Internet. The goal of the project is to propose robust multimedia methods that define relevant patterns allowing these entities to be retrieved automatically. This paper describes the SuMACC corpus, collected from video-sharing platforms using word queries. Since concepts are limited to one or a few words, querying video-sharing platforms with the concept alone can easily introduce irrelevant videos into the collection. We therefore propose to use an extended query obtained by mapping the initial concept into a topic space built with a Latent Dirichlet Allocation (LDA) algorithm. This topic-based query extension approach retrieves videos related to the targeted concept more accurately. As a result, a corpus of 7,517 videos, extracted using the simple (i.e. concept-only) and the extended queries for 47 concepts, was obtained. Results show the effectiveness of the proposed thematic querying approach compared to the simple concept query in terms of relevance (+21%) and ambiguity (-4%). The annotation process as well as the corpus statistics are detailed in the paper.
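
A minimal sketch of the topic-based extension step, assuming gensim and an already tokenized background corpus; the topic count and the number of expansion words are illustrative parameters, not the values used in SuMACC.

    # Illustrative LDA-based query extension (assumed corpus and parameters).
    from gensim import corpora, models

    def build_lda(tokenized_docs, num_topics=50):
        dictionary = corpora.Dictionary(tokenized_docs)
        bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
        return models.LdaModel(bow_corpus, num_topics=num_topics, id2word=dictionary), dictionary

    def extend_query(concept, lda, dictionary, n_topics=2, n_words=5):
        bow = dictionary.doc2bow(concept.lower().split())
        dominant = sorted(lda[bow], key=lambda t: -t[1])[:n_topics]
        expansion = [w for tid, _ in dominant for w, _ in lda.show_topic(tid, topn=n_words)]
        return concept.split() + expansion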

preprint PDF


#675: Towards a Unified Exploitation of Electronic Dialectal Corpora: Problems and Perspectives

Nikitas N. Karanikolas, Eleni Galiotou, Angela Ralli

In this paper, we deal with the problem of storing and retrieving dialectal data in a unified framework. In particular, we discuss issues concerning the design and implementation of a multimedia database which will contain written and oral data from three Greek dialects in Asia Minor. First, we describe the overall architecture of a system aiming at providing the user with the possibility to store audio recordings, text transcripts, and other annotations. Then we discuss the possibilities and limitations of a retrieval module aiming at combining different linguistic levels for a unified exploitation of oral and written corpora.

preprint PDF


#602: Tuning Limited Domain Speech Synthesis Using General TTS System

Markéta Jůzová, Daniel Tihelka

The subject of the present paper is the building of a limited-domain speech synthesis system, in which longer units, such as words and phrases, can naturally be concatenated. However, instead of building a single-purpose domain-oriented engine working with longer units, we show that a general-purpose TTS system can serve as a good emulation tool to ensure that a real domain-oriented engine will work correctly. Since the current general speech synthesis system with embedded unit selection concatenates short speech units (diphones), the selection algorithm has been modified to emulate the concatenation of words or even whole phrases, while still concatenating diphones internally. The behaviour of the system is tested on two limited domains and its output is compared to the output of the general (unmodified) version of the same TTS system. The results are clear encouragement for building the "real" domain-oriented engine.

preprint PDF


#616: Two-layer Semantic Entity Detection and Utterance Validation for Spoken Dialogue Systems

Adam Chýlek, Jan Švec, Luboš Šmídl

In this paper we present a novel method for semantic entity detection in a limited domain for spoken language understanding. The target domain of this method is a dialogue system for the interactive training of air traffic controllers (ATC). The method comprises two layers of detection. The first layer uses a previously proposed method for semantic entity detection to extract a domain-dependent set of semantic entities. These semantic entities are modelled using context-free grammars. The second layer of semantic entity detection is used to detect mispronounced words or words that do not comply with the ATC radio-telephony rules; together with that, it assigns a semantic meaning to the utterance. We also discuss the possibility of using this approach for semantic-based correction of an utterance. The experiments were performed on transcribed data as well as on the output of a speech recognizer.
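
To picture how a context-free grammar can model one semantic entity, here is a toy example (invented grammar and vocabulary, not the grammars used in the paper) that recognizes a flight-level command with NLTK's chart parser.

    # Toy CFG for one ATC-style semantic entity; illustrative only.
    import nltk

    FLIGHT_LEVEL_GRAMMAR = nltk.CFG.fromstring("""
      ENTITY -> CMD LEVEL
      CMD -> 'climb' | 'descend' | 'maintain'
      LEVEL -> 'flight' 'level' DIGIT DIGIT DIGIT
      DIGIT -> 'zero' | 'one' | 'two' | 'three' | 'four' | 'five'
    """)
    parser = nltk.ChartParser(FLIGHT_LEVEL_GRAMMAR)

    def detect_entity(tokens):
        # Return the first parse tree if the tokens form a valid entity, else None
        for tree in parser.parse(tokens):
            return tree
        return None

    tree = detect_entity("climb flight level three one zero".split())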

preprint PDF


#614: Unit Selection Cost Function Exploration Using an A* based Text-to-Speech System

David Guennec, Damien Lolive

Speech synthesis systems usually use the Viterbi algorithm as the basis for unit selection, although it is not the only possible choice. In this paper, we study a speech synthesis system relying on the A* algorithm, a general pathfinding strategy that develops a graph rather than a lattice. Using state-of-the-art techniques, we propose and analyze different selection strategies and evaluate them with a subjective evaluation on the N-best paths returned. The best strategy achieves a MOS score of 3.29 (± 0.18). More interestingly, the proposed system enables an in-depth analysis of unit selection.
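
A generic sketch of the A*-based selection loop, under an assumed data model (not the authors' system): a node is (target position, candidate unit), the step cost is a target cost plus a join cost, and the heuristic is an admissible per-position lower bound on the remaining cost.

    # Generic A* unit selection sketch (assumed cost functions and bounds).
    import heapq
    from itertools import count

    def a_star_select(candidates, target_cost, join_cost, remaining_bound):
        # candidates[i]: list of candidate units for target position i
        # remaining_bound[i]: admissible lower bound on the cost of positions i..n-1
        n = len(candidates)
        tie = count()                                    # tie-breaker for the heap
        frontier = [(remaining_bound[0], next(tie), 0.0, -1, None, [])]
        while frontier:
            _f, _, g, i, unit, path = heapq.heappop(frontier)
            if i == n - 1:
                return path                              # cheapest full unit sequence
            for nxt in candidates[i + 1]:
                step = target_cost(i + 1, nxt) + (join_cost(unit, nxt) if unit is not None else 0.0)
                h = remaining_bound[i + 2] if i + 2 < n else 0.0
                heapq.heappush(frontier, (g + step + h, next(tie), g + step, i + 1, nxt, path + [nxt]))
        return None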

preprint PDF


#697: Using Graph Transformation Algorithms to Generate Natural Language Equivalents of Icons Expressing Medical Concepts

Pascal Vaillant, Jean-Baptiste Lamy

A graphical language addresses the need to communicate medical information in a synthetic way. Medical concepts are expressed by icons conveying fast visual information about patients' current state or about the known effects of drugs. In order to increase the visual language's acceptance and usability, a natural language generation interface is currently being developed. In this context, this paper describes the use of a computational method -- graph transformation -- to prepare data consisting of concepts in an OWL-DL ontology for use in a natural language generation component. The OWL concept may be considered a star-shaped graph with a central node. The method transforms it into a graph representing the deep semantic structure of a natural language phrase. This work may be of future use in other contexts where ontology concepts have to be mapped to semi-formalized natural language expressions.

preprint PDF


#606: Using Suprasegmental Information in Recognized Speech Punctuation Completion

Marek Boháč, Karel Blavka

We propose a scheme to determine the punctuation of text produced by an automatic speech recognizer. We deal with the addition of commas based on the recognized text, and we propose a full-stop detection scheme using both textual and prosodic information. We also propose an expanded scheme which utilizes enriched audio document information (e.g. speaker diarization, language detection, etc.) to improve sentence boundary detection. We compare the above-mentioned schemes, their accuracy in terms of (in)correctly estimated punctuation marks, and their ability to mark the positions of sentence boundaries. We thereby show that it is better to incorporate all the relevant information sources into one scheme than to split document processing into independent layers. The proposed schemes are evaluated on a set of recordings from Czech (and Czechoslovak) radio broadcasts.

preprint PDF


#630: Using Verb-Noun Patterns to Detect Process Inputs

Munshi Asadullah, Damien Nouvel, Patrick Paroubek

We present the preliminary results of ongoing work aimed at using morpho-syntactic patterns to extract information from process descriptions in a semi-supervised manner. The experiments have been designed for generic information extraction tasks and are evaluated on detecting ingredients in French cooking recipes using a large gold-standard corpus. The proposed method uses bi-lexical, dependency-oriented syntactic analysis of the text and extracts relevant morpho-syntactic patterns. Those patterns are then used as features for different machine learning methods to acquire the final ingredient list. Furthermore, this approach may easily be adapted to similar tasks, since it relies on mining generic morpho-syntactic patterns from the documents automatically. The method itself is language-independent, provided that language-specific parsers are used. The performance of our method on the DEFT 2013 data set is nevertheless satisfactory, since it significantly outperforms the best system from the original challenge (0.75 vs 0.66 MAP).
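
The pattern-as-feature idea can be sketched as follows (a hypothetical simplification, not the DEFT 2013 system): given bi-lexical dependencies from any parser, verb-to-noun relations are counted and handed to a downstream classifier as features.

    # Illustrative verb-noun pattern features from dependency triples.
    from collections import Counter

    def verb_noun_features(dependencies):
        # dependencies: iterable of (gov_lemma, gov_pos, relation, dep_lemma, dep_pos)
        feats = Counter()
        for gov, gov_pos, rel, dep, dep_pos in dependencies:
            if gov_pos.startswith("V") and dep_pos.startswith("N"):
                feats[f"{gov}:{rel}"] += 1   # e.g. "ajouter:obj" in a recipe
        return feats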

preprint PDF


#604: Visualization of Intelligibility Measured by Language-Independent Features

Tino Haderlein, Catherine Middag, Andreas Maier, Jean-Pierre Martens, Michael Doellinger, Elmar Noeth

Automatic intelligibility assessment using automatic speech recognition is usually language-specific. In this study, a language-independent approach based on alignment-free phonological and phonemic features is proposed. It utilizes models trained on Flemish speech and is applied to assess dysphonic German speakers. In order to visualize the results, two techniques were tested: a plain selection of the most relevant features emerging from Ensemble Linear Regression with feature selection, and a Sammon transform of all the features into a 3-D space. The test data comprised recordings of 73 hoarse persons (48.3 ± 16.8 years) who read the German version of the text "The North Wind and the Sun". The reference evaluation was obtained from five speech therapists and physicians who rated intelligibility on a 5-point Likert scale. In the 3-D visualization, the different levels of intelligibility were clearly separated. This could form the basis of objective support for diagnostics in voice and speech rehabilitation.

preprint PDF


#623: Research potential and design of Russian Learner Translator Corpus

Andrey Kutuzov, Maria Kunilovskaya

The project we present – the Russian Learner Translator Corpus (RusLTC) – is a multiple learner translator corpus which stores Russian students' translations from English into Russian (L1) and from Russian into English (L2). The project was initiated in 2011 and is being developed by a cross-functional team of translator trainers and computational linguists from the National Research University Higher School of Economics (Moscow, Russia) and Tyumen State University (Tyumen, Russia). Translations are collected from 10 Russian universities which offer specialist, BA and MA translation programs. All translations are made as part of routine and exam assignments, or as submissions for translation contests, by students majoring in translation. RusLTC is not designed with a specific research purpose in mind, but for a broad research agenda including descriptive translation studies, translation variability, and research into the didactics of translation. It is also intended for direct use in the classroom and as a resource for teaching-material design. As of March 2014, RusLTC contains a total of nearly 1.2 million word tokens and 258 source texts (41 Russian sources with their 589 translations; 217 English sources with their 1,206 translations). The number of translations per source varies from 1 to more than 60. We adhere to an open-knowledge philosophy in choosing corpus-building technology and are happy to make the Corpus available online under a Creative Commons license. To the best of our knowledge, RusLTC is the third multiple learner translator corpus available online, after ENTRAD and MeLLANGE LTC. The Corpus is aligned at the sentence level, and the query interface (available at http://rus-ltc.org) supports lexical search for both sources and targets and returns all occurrences of the query item in the respective texts along with their targets/sources. The paper gives a brief overview of the related research, describes the corpus structure and corpus-building technologies used, and covers the query tool features and our error annotation solutions. In the final part, we summarize RusLTC-based research and its current practical applications, and suggest research prospects.


#709: Multilingual Aspects of the Khresmoi project

Hlaváčová Jaroslava

The demonstration will show a working prototype from the European project Khresmoi, which concentrates on multilingual information retrieval and machine translation in the field of biomedicine.


#711: PMSE - The PetaMem Scripting Environment

Richard Jelínek, Jiří Mácha, Jiří Václavík

PetaMem Scripting Environment (PMSE) is a software suite that allows the user to perform virtually any task related to corpus linguistics. PMSE has been designed as a comprehensive toolchain that provides a very generic way of working with text corpora - starting with data acquisition, continuing with modifications (such as format conversion) and thorough statistical analysis, and ending with data visualization.


#712: A collaborative NLP development environment

Tobias Kortkamp, Martin Ring, Mathias Soeken, Rolf Drechsler

Many powerful tools and frameworks exist for the development of natural language processing (NLP) applications. However, visualization capabilities are often not provided, or only in a very limited manner. This slows down the development of new applications, since (i) it is often difficult to understand the data structures and results of NLP algorithms, and (ii) a user-friendly interface is cumbersome to implement. By adding a back-end for NLP to the development environment Clide, we provide solutions to both problems.


#713: QCRI Advanced Transcription System (QATS)

Ahmed Ali, Yifan Zhang, Stephan Vogel

The QCRI Advanced Transcription System (QATS) continuously monitors Aljazeera.net and fetches any video files labeled by journalists for transcription. The current system is deployed on the Azure platform; it fetches video files from the high-quality video streaming server and returns subtitle transcriptions in Distribution Format Exchange Profile (DFXP) format at less than 1.5 times real time (RT), with a minimum turn-around time (TAT) of 17 minutes. TAT is the exact time from downloading the video until the DFXP file is ready. QATS is currently being deployed on the Aljazeera video archive to enable retrieval based on the video content as well as the metadata.


#714: Finding Terms in Corpora for Many Languages

Vit Suchomel

Term candidates for a domain, in a language, can be found by comparing a corpus for the domain to a reference corpus for the language. The items with the highest relative frequency are the top term candidates. The steps to produce the candidates are not unusual or innovative for NLP. However, it is far from trivial to implement them all, for numerous languages, in an environment that makes it easy for non-programmers to find the terms in a domain. This is what we have done in the Sketch Engine and what we will demonstrate.
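
A minimal sketch of the relative-frequency step, using one common smoothed variant of the score (an assumption for illustration, not necessarily the exact formula used in the Sketch Engine): per-million frequencies in the domain corpus and the reference corpus are compared, with additive smoothing so that items absent from the reference still rank.

    # Illustrative relative-frequency term candidate extraction.
    from collections import Counter

    def term_candidates(domain_tokens, reference_tokens, smoothing=1.0, top=20):
        dom, ref = Counter(domain_tokens), Counter(reference_tokens)
        dom_total, ref_total = sum(dom.values()), sum(ref.values())
        per_million = lambda c, total: 1e6 * c / total if total else 0.0
        score = {w: (per_million(c, dom_total) + smoothing) /
                    (per_million(ref[w], ref_total) + smoothing)
                 for w, c in dom.items()}
        return sorted(score, key=score.get, reverse=True)[:top]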

