Schedule

The speakers will provide pre-recorded lectures. Free slots in the schedule below allow you to watch those lectures before the corresponding Q&A sessions. The schedule will soon be available on this page as an .ics calendar.

| CET   | Monday 16/01                  | Tuesday 17/01                 | Wednesday 18/01                               | Thursday 19/01                | Friday 20/01                  |
|-------|-------------------------------|-------------------------------|-----------------------------------------------|-------------------------------|-------------------------------|
| 8-9   |                               |                               |                                               |                               |                               |
| 9-10  |                               |                               |                                               |                               |                               |
| 10-11 |                               |                               |                                               |                               |                               |
| 11-12 |                               | Gather Town: Poster Session 1 | Slack Lab Session: Neural Machine Translation |                               | Gather Town: Social Session 2 |
| 12-13 |                               |                               |                                               |                               |                               |
| 13-14 |                               |                               |                                               |                               |                               |
| 14-15 | Zoom Q&A: François Yvon       |                               | Slack Lab Session: Prompt Engineering         | Zoom Q&A: Dirk Hovy           |                               |
| 15-16 |                               |                               |                                               | Gather Town: Poster Session 2 |                               |
| 16-17 |                               |                               |                                               |                               |                               |
| 17-18 | Zoom Q&A: Kyunghyun Cho       | Zoom Q&A: Michael Auli        |                                               | Zoom Q&A: Colin Raffel        |                               |
| 18-19 | Gather Town: Social Session 1 |                               |                                               |                               | Zoom Q&A: Yejin Choi          |

Social Sessions

Social Session 1

  • Titouan Parcollet – Speech and open source toolkits
  • Maha Elbayad – Large Scale Multilingual Machine Translation
  • Sara Hooker – Tips and tricks for living your best life as a researcher (from attending conferences to producing research you are proud of)
  • Jesse Dodge – Efficient and Reproducible NLP
  • Vlad Niculae – Structured prediction
  • Ange Richard – Applying NLP to social science

Social Session 2

  • William Havard – Handling an interdisciplinary PhD topic
  • Shachar Mirkin – Being a Remote Scientist
  • Diane Larlus – Text and image multimodal models
  • Sebastian Ruder – Collaboration in research
  • Fabien Ringeval – Academic careers in France
  • Wilker Aziz – Text generation (decoding) under uncertainty

Poster Sessions

Session 1

  • A 1 Seongmin Mun
    How do Transformer-Architecture Models Address Polysemy of Korean Adverbial Postpositions?
    [Abstract] Several studies have used contextualized word-embedding models to reveal the functions of Korean postpositions. To add further interpretation, we devise a classification model employing BERT and GPT-2 and introduce a computational simulation that interactively demonstrates how these transformer-architecture models simulate human interpretation of the Korean adverbial postpositions -ey, -eyse, and -(u)lo. Results reveal that (i) the BERT model performs better than the GPT-2 model at classifying the intended function of postpositions, (ii) there is an inverse relationship between classification performance and the number of functions that each postposition manifests, (iii) the models’ performance gradually improves as training epochs proceed, and (iv) the models are affected by the scarcity of input and/or semantic closeness between the items.
  • A 2 Kanishka Silva
    Self Attention Generative Adversarial Network based Authorship Attribution in Historical Texts
    [Abstract] Authorship attribution is the task of identifying the author of a text of unknown origin. Given the numerous disputed texts and the use of pen names in the humanities, especially in historical and literary texts, we aim to follow a deep-learning-based approach to perform authorship attribution in historical texts. In this study, we address the short-text challenge and authorship obfuscation. We propose a novel Self-Attentive Generative Adversarial Network model, demonstrating the use of deep learning for the digital humanities. We empirically evaluate various conventional and modern baselines, including the newly proposed self-attentive GAN model, using a newly created dataset of long 19th-century English novels. Later modifications of the proposed model will include applying it to modern applications.
  • A 3 Thomas Stroehle
    How can pre-trained language models support idea evaluation?
    [Abstract] In crowdsourced idea competitions, the use of deep language models is a promising way to analyze the large amount of text-based ideas, overcoming cognitive limitations and reducing the amount of work required for evaluation. Idea evaluation is usually biased by text length, which can be remedied by standardizing ideas through text summarization. But how does text summarization change the quality of ideas? Pre-trained language models can also be used to structure idea spaces. Can semantic similarity between ideas and prior knowledge be used to identify the more novel ideas?
  • A 4 Elise Lincker
    Multi-modal data extraction and enrichment: textbooks as a use case
    [Abstract] In order to foster inclusive education, the MALIN research project aims to build a computational system that automatically adapts textbooks to make them accessible to children with disabilities. More specifically, it will help pupils with motor or visual impairments, learning disorders or autism. Although making books accessible seems achievable, textbooks are more challenging due to the diversity of their content and formats, and remain an understudied type of data. The first objective is to develop approaches to extract structure (lessons, exercises [divided into instructions, statements, examples…], memos…) and multi-modal content (text, pictures, drawings, graphs, equations…) from digital textbooks. The second objective is to analyse the content of each extracted block in order to categorize them into educational activities.
  • A 5 Romain Meunier
    NLP for crisis management
    [Abstract] The use of social networks is pervasive in our daily life. All areas are concerned, including civil security and crisis management. Recently, Twitter has been widely used to generate valuable information in crisis situations, showing that traditional means of communication between the population and rescue services are clearly suboptimal. Our work focuses on classifying tweets for crisis management.
  • A 6 Alkis Koudounas
    Transformer-based Non-Verbal Emotion Recognition
    [Abstract] Recognizing emotions in non-verbal audio tracks requires a deep understanding of their underlying features. Traditional classifiers relying on excitation, prosodic, and vocal tract features are not always capable of generalizing effectively across speakers’ genders. We explore the use of a Transformer architecture trained on contrastive audio examples. We leverage augmented data to learn robust non-verbal emotion classifiers. We also investigate the impact of different audio transformations, including neural voice conversion, on the classifier’s capability to generalize across speakers’ genders. The empirical findings indicate that neural voice conversion is beneficial in the pre-training phase, yielding improved model generality, whereas it is harmful at the fine-tuning stage, as it hinders model specialization for the task of non-verbal emotion recognition.
  • A 7 Aleksey Dorkin
    Comparison of Current Approaches to Lemmatization: A Case Study in Estonian
    [Abstract] We experiment with two recently proposed neural network based methods of lemmatization on Estonian data. The first method is a pattern-based token classification model that predicts transformation rules based on contextual representations; these rules are then applied to word forms to obtain lemmas (a toy illustration of rule application follows this session’s list). The second approach is based on a generative encoder-decoder model that uses morphological features to represent context. We compare these two approaches on two Estonian datasets from different domains and discuss the advantages and disadvantages of each.
  • A 8 Till Überrück-Fries
    Linguse — Identifying Multi-Word Expressions for Language Learners
    [Abstract] Linguse is a reading app for language learners with the mission to keep learners engaged in what they are reading and provide at all times exactly the assistance needed. The app automatically breaks down each text into lexical units and keeps track of each unit read or marked by the learner throughout all texts. Multi-word expressions and their inherent idiomaticity pose a particular challenge for language learners. Linguse aims to identify MWEs and point learners to the most helpful information to understand the MWE in context. During the poster session, I will present Linguse’s approach to identifying MWEs and look forward to a more general discussion about NLP for language learning.
  • A 9 Emil Kalbaliyev
    Narrative Why-Question Answering: A Review of Challenges and Datasets
    [Abstract] Narrative Why-Question Answering is an important task to assess the causal reasoning ability of models in narrative settings. Further progress in this domain requires clear identification of challenges that question answering models need to address. Since Narrative Why-Question Answering combines the characteristics of both narrative understanding and why-question answering, we review the challenges related to these two domains. Furthermore, we identify suitable datasets for Narrative Why-Question Answering and outline both data-specific and task-specific challenges that can be utilized to test the performance of models. We also discuss some issues that can pose problems in benchmarking Narrative Why-Question Answering systems.
  • A 10 Deepak Kumar
    Parameter-efficient On-demand Bias Mitigation via AdapterFusion
    [Abstract] Large pre-trained language models contain societal biases and carry these biases over to downstream tasks. Current in-processing bias mitigation approaches (like adversarial training) impose debiasing by updating a model’s parameters, in effect transferring the model to a new, non-restorable debiased state. In this work, we propose a novel approach to create stand-alone debiasing functionalities separate from the model, which can be integrated into the model on demand while keeping the core model untouched (a minimal sketch of this idea follows this session’s list). Drawing from the concept of AdapterFusion in multi-task learning, we introduce OPTIMA (Optional bias Mitigation via AdapterFusion) – a debiasing approach that first encapsulates arbitrary bias mitigation functionalities into separate adapters and then adds them to the model on demand to deliver fairness qualities. We conduct a large set of experiments on three classification tasks with gender, race, and age as protected attributes. Our results show that OPTIMA improves or maintains the effectiveness of bias mitigation, avoids catastrophic forgetting in a multi-attribute scenario, and maintains on-par task performance, while granting parameter efficiency and easy switching between the original and debiased models.
  • A 11 Ronald Cardenas
    Cognitive Structures of Content for Controlled Summarization of Long Documents
    [Abstract] Summarization systems face the core challenge of identifying and selecting important information. However, despite advancements in fluency, summaries are still overall poorly structured and, if not accounted for, highly redundant. In this work, we tackle the problem of summary redundancy and coherence in abstractive summarization of long, highly-redundant scientific articles. We introduce a system inspired by a psycholinguistic model of text comprehension and production that theorizes on how humans organize and manipulate information units in short-term memory (STM). The system simulates STM that dynamically selects relevant information, makes sure selected content connects to previously generated content, and adds details without being redundant. Extensive experiments demonstrate that the summaries produced by our system exhibit better lexical cohesiveness and significantly lower redundancy.
  • A 12 Harshita Diddee
    Unsupervised Estimation of Quality of Data
    [Abstract] Estimating the quality of data is a critical requirement for training robust models on high-quality, representative and unbiased data. Usually such estimation relies on human validation, which is cost- and effort-intensive, in addition to being susceptible to artefacts of human subjectivity. This work presents a preliminary investigation of estimating the quality of data in an unsupervised manner. We discuss sequential approaches derived from utilizing referenceless language generation metrics, unsupervised translation quality estimation, and data-driven attributes that might act as weak proxies for data quality (one such proxy is sketched after this session’s list).
  • A 13 Tanise Ceron
    Optimizing text representations to capture (dis)similarity between political parties
    [Abstract] Even though fine-tuned neural language models have been pivotal in enabling “deep” automatic text analysis, optimizing text representations for specific applications remains a crucial bottleneck. In this study, we look at this problem in the context of a task from computational social science: modeling pairwise similarities between political parties. Our research question is what level of structural information is necessary to create robust text representations, contrasting a strongly informed approach (which uses both claim span and claim category annotations) with approaches that forgo one or both types of annotation in favour of document structure-based heuristics. Evaluating our models on the manifestos of German parties for the 2021 federal election, we find that heuristics that maximize within-party over between-party similarity, along with a normalization step, lead to reliable party similarity prediction without the need for manual annotation (a toy version of this similarity check follows this session’s list).
  • A 14 Amit Sah
    ADA: An Attention-Based Data Augmentation Approach to Handle Imbalanced Textual Datasets
    [Abstract] This paper presents an Attention-based Data Augmentation (ADA) approach which (i) utilizes a vector similarity-based keyword extraction mechanism to identify keywords in the minority-class data points (a toy version of this step follows this session’s list), (ii) exploits the keywords to extract important contextual words from minority-class documents using an attention mechanism, and finally (iii) utilizes the important contextual words to enrich the minority-class dataset. We oversample the minority class by generating new documents based on the important contextual words and augmenting them to the minority-class dataset. We evaluate both the oversampled and original versions of the datasets on the classification task. We also compare ADA on the augmented datasets with two popular state-of-the-art text data augmentation techniques.
  • A 15 Sumanth Doddapaneni
    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
    [Abstract] We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences (a sketch of steps (c) and (d) follows this session’s list). Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at this https URL and we hope they will help advance research in NMT and multilingual NLP for Indic languages.
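
A toy illustration of the rule-application step from poster A 7 (Dorkin): the classifier predicts a string-transformation rule for each token, and the lemma is obtained by applying that rule to the word form. The rule encoding below is our simplification, not the poster's.

```python
def apply_rule(form: str, rule: tuple[int, str]) -> str:
    """rule = (number of characters to strip from the end, suffix to append)."""
    strip, suffix = rule
    return (form[:-strip] if strip else form) + suffix

# e.g. Estonian plural 'raamatud' ('books') -> lemma 'raamat': strip 'ud'
print(apply_rule("raamatud", (2, "")))  # raamat
```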
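
A minimal sketch of the on-demand idea from poster A 10 (Kumar), not the OPTIMA implementation: a frozen backbone with a detachable bottleneck adapter that can be switched on (debiased model) or off (original model) without retraining.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: the only trainable part."""
    def __init__(self, d: int, r: int = 16):
        super().__init__()
        self.down, self.up = nn.Linear(d, r), nn.Linear(r, d)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

class DebiasableEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, d: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # core model stays untouched
            p.requires_grad = False
        self.adapter = Adapter(d)
        self.use_debias = False  # toggled on demand

    def forward(self, x):
        h = self.backbone(x)
        return self.adapter(h) if self.use_debias else h

enc = DebiasableEncoder(nn.Linear(32, 32), d=32)  # stand-in backbone
enc.use_debias = True  # switch to the debiased model at inference time
print(enc(torch.randn(4, 32)).shape)  # torch.Size([4, 32])
```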
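
One weak, referenceless proxy of the kind mentioned in poster A 12 (Diddee) is language-model perplexity; the sketch below (the model choice is our assumption) scores fluency without any reference text.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return float(torch.exp(loss))

print(perplexity("The cat sat on the mat."))     # fluent: lower perplexity
print(perplexity("mat the on sat cat The the"))  # garbled: higher perplexity
```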
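
A hedged sketch of the evaluation heuristic from poster A 13 (Ceron): embed claim-like sentences and check that within-party similarity exceeds between-party similarity. The model name and the toy sentences are our assumptions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
party_a = ["Taxes on high incomes should rise.", "We will expand public housing."]
party_b = ["Corporate taxes must be cut.", "We will reduce state spending."]

emb_a, emb_b = model.encode(party_a), model.encode(party_b)
within = cosine_similarity(emb_a).mean()          # includes self-similarity
between = cosine_similarity(emb_a, emb_b).mean()
print(f"within-party {within:.3f} vs between-party {between:.3f}")
```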
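
A toy version of step (i) from poster A 14 (Sah), vector-similarity keyword extraction: score each minority-class word by cosine similarity to the class centroid. The random stand-in embeddings are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["refund", "delay", "broken", "thanks", "great"]
vectors = {w: rng.normal(size=8) for w in vocab}  # stand-in word embeddings
minority_docs = [["refund", "broken"], ["delay", "refund"]]

centroid = np.mean([vectors[w] for doc in minority_docs for w in doc], axis=0)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Top-scoring words become keywords for generating new minority documents.
keywords = sorted(vocab, key=lambda w: cos(vectors[w], centroid), reverse=True)[:3]
print(keywords)
```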
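
A hedged sketch of mining steps (c) and (d) from poster A 15 (Doddapaneni): embed sentences with a multilingual encoder and find translation candidates by nearest-neighbour search. LaBSE and FAISS stand in for the exact tools used.

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
en = ["The weather is nice today.", "I am reading a book."]
hi = ["मैं एक किताब पढ़ रहा हूँ।", "आज मौसम अच्छा है।"]

en_emb = model.encode(en, normalize_embeddings=True)
hi_emb = model.encode(hi, normalize_embeddings=True)

index = faiss.IndexFlatIP(hi_emb.shape[1])  # inner product = cosine (normalised)
index.add(hi_emb)
scores, ids = index.search(en_emb, k=1)     # best Hindi match per English sentence
for src, (j,), (s,) in zip(en, ids, scores):
    print(f"{src} -> {hi[j]}  ({s:.2f})")
```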

Session 2

  • B 1 Florian Mai
    HyperMixer: An MLP-based Low Cost Alternative to Transformers
    [Abstract] Transformer-based architectures are the model of choice for natural language understanding, but they come at a significant cost, as they have quadratic complexity in the input length, require a lot of training data, and can be difficult to tune. In the pursuit of lower costs, we investigate simple MLP-based architectures. We find that existing architectures such as MLPMixer, which achieves token mixing through a static MLP applied to each feature independently, are too detached from the inductive biases required for natural language understanding. In this paper, we propose a simple variant, HyperMixer, which forms the token mixing MLP dynamically using hypernetworks (a minimal sketch follows this session’s list). Empirically, we demonstrate that our model performs better than alternative MLP-based models, and on par with Transformers. In contrast to Transformers, HyperMixer achieves these results at substantially lower costs in terms of processing time, training data, and hyperparameter tuning.
  • B 2 Shu Okabe
    Towards Automatic Gloss Generation for Computational Language Documentation
    [Abstract] One step of the language documentation process consists in annotating utterances collected in the field—once transcribed—at the morpheme level. For each minimal unit segmented in the input stream, the annotation process associates either one (or, in rare cases, several) morpho-syntactic tag(s) or one conceptual label represented by the corresponding English lemma. With the goal of automating this annotation task, we report the results of a preliminary study in which it is viewed as a sequence labelling task. Comparing the obtained results with a standard PoS tagging task on the same data allows us to assess the difficulty of the process, especially in a context where training resources are limited.
  • B 3 Febe de Wet
    Voices of Mzansi: Localising the Mozilla Common Voice platform for South Africa’s official languages
    [Abstract] Despite many attempts to address the situation, South Africa’s official languages remain under-resourced in terms of the text and speech data required to implement state-of-the-art language technology. To ensure that no language is left behind, resource development should remain a priority until a strong digital presence has been established for all indigenous languages. The poster introduces an ongoing initiative to launch South Africa’s languages on the Mozilla Common Voice platform.
  • B 4 Atilla Kaan Alkan
    TDAC: the First Time-Domain Astrophysics Corpus
    [Abstract] The increased interest in time-domain astronomy over the last decades has resulted in a substantial increase in the publication of observation reports, saturating astrophysicists’ capacity to read, analyze and classify information. Due to the short life span of the detected astronomical events, information related to the characterization of new phenomena has to be communicated and analyzed very rapidly to allow other observatories to react and conduct their follow-up observations. We introduce TDAC: a Time-Domain Astrophysics Corpus. TDAC is the first corpus based on astrophysical observation reports for Named Entity Recognition.
  • B 5 Lila Kim
    Automatic classification of nasal vowels for a characterization of speakers’ voice quality by convolutional neural network
    [Abstract] Voice quality is considered in the literature to have implications for speaker characterization. It can be a permanent part of a speaker’s voice due to physiological particularities or a speaker’s habits, but it is also subject to intra-speaker variability. Acoustic analyses of nasality are complex due to the acoustic effects produced by the coupling of the nasal and oral cavities. The present study assesses the ability of a convolutional neural network (CNN) to discriminate between oral and nasal vowels. In addition, we apply a visualization technique and examine the factors behind misclassification cases. Finally, we compare human perceptual performance with that of the CNN on English data involving nasal coarticulation.
  • B 6 Carolina Biliotti
    Breaking Down the Lockdown: The Causal Effect of Stay-At-Home Mandates on Uncertainty and Sentiments during the COVID-19 Pandemic
    [Abstract] We study the causal effects of lockdown measures on uncertainty and sentiment on Twitter. Exploiting the quasi-experimental setting induced by the first Western COVID-19 lockdown — the unexpected lockdown implemented in Northern Italy in February 2020 — we measure changes in public uncertainty and sentiment expressed in daily pre- and post-lockdown tweets geolocalized inside and in the proximity of the lockdown areas. Using natural language processing, including dictionary-based methods and deep learning models, we classify each tweet into four categories — economics, health, politics and lockdown policy — to identify where uncertainty and sentiment concentrate. Using a Difference-in-Differences analysis (a toy version follows this session’s list), we show that areas under lockdown exhibit lower uncertainty around the economy and the stay-at-home mandate itself. This surprising result likely stems from an informational asymmetry channel, whereby treated individuals adjust their expectations once the policy is in place, while uncertainty persists among the untreated. However, we also find that the lockdown comes at a cost, as political sentiment worsens.
  • B 7 Anna Aksenova
    Challenges of Information Extraction from Clinical Texts
    [Abstract] The poster presents challenges faced by Named Entity Recognition in the clinical domain that we encounter while working on EU projects (EXAMODE, RES-Q+). As one of the largest problems in medical NLP is data, we discuss the language resources used in previous projects to tackle the lack of annotated, open-access data. In addition, we address the issue of multilinguality, as all of the data is provided by hospitals in their domestic languages. Finally, we discuss the challenge posed by the extremely large number of entity and relation tags. As this is work in progress, any feedback will be highly appreciated.
  • B 8 Léa-Marie Lam-Yee-Mui
    Multilingual features for speech recognition for low-resource languages
    [Abstract] To alleviate the lack of transcribed data for speech recognition in low-resource languages, cross-lingual transfer learning from multilingual models trained on thousands of hours of other languages allows the acoustic model to be initialized with non-random weights. With end-to-end models, similar approaches stack linear layers on top of a multilingual self-supervised pre-trained model; the resulting model is then trained with a standard supervised loss. In this work, we study whether we can instead use multilingual models as feature extractors for ASR in low-resource languages (see the sketch after this session’s list). We experiment with features extracted from models trained in a supervised or unsupervised manner and compare their performance with transfer learning methods.
  • B 9 Xu Yizhou
    Prompt Engineering-Based Text Anomaly Detection
    [Abstract] Text anomaly detection is an important text mining task. Many outlier identification methods have been applied in this field; however, these approaches have hardly benefited from recent Natural Language Processing (NLP) advances. The advent of pre-trained language models like BERT and GPT-2 has given rise to a new machine learning paradigm called prompt engineering, which has shown good performance on many NLP tasks. This article presents an exploratory work examining the possibility of detecting text anomalies using prompt engineering (a minimal example follows this session’s list). In our experiments, we examine the performance of different prompt templates. The results show that prompt engineering is a promising method for text anomaly detection.
  • B 10 Riccardo Pozzi
    Evaluation of Incremental Entity Extraction with Background Knowledge and Entity Linking
    [Abstract] Named entity extraction is a crucial task to support the population of Knowledge Bases (KBs) from documents written in natural language. However, in many application domains, these documents must be collected and processed incrementally, to update the KB as more data are ingested. In some cases, quality concerns may even require human validation mechanisms along the process. While very recent work in the NLP community has discussed the importance of evaluating and benchmarking continuous entity extraction, it has proposed methods and datasets that avoid Named Entity Linking (NEL) as a component of the extraction process. In this paper, we advocate for batch-based incremental entity extraction methods that can exploit NEL with a background KB, detect mentions of entities that are not present in the KB yet (NIL mentions), and update the KB with the novel entities. Based on this assumption, we present a methodology to evaluate NEL-based incremental entity extraction, which can turn a “static” dataset for evaluating NEL into a dataset for evaluating incremental entity extraction. We apply this methodology to an existing benchmark for evaluating NEL algorithms, and evaluate an incremental extraction pipeline that orchestrates strong state-of-the-art and baseline algorithms for the tasks involved in the extraction process, namely NEL, NIL prediction, and NIL clustering. In presenting our experiments, we demonstrate the increased difficulty of the information extraction task in incremental settings and discuss the strengths of the available solutions as well as open challenges.
  • B 11 Nalin Kumar
    Exploring joint approaches to RDF triple parsing
    [Abstract] Recent work on natural language understanding and natural language generation has shown a lot of progress based on pretrained language models. In addition, it has been shown that these tasks often benefit from a joint solution. The 2020 WebNLG shared task (Castro Ferreira et al., 2020), which provides a dataset of RDF triples and corresponding natural language descriptions, has been an important benchmark for these kinds of experiments. State-of-the-art approaches still have significant problems, especially regarding the accuracy of both understanding and generation (Ji et al., 2022a; Ji et al., 2022b). The aim of this poster is to explore joint approaches to RDF triple parsing, specifically focusing on improving their accuracy. This work uses additional training tasks, such as named entity recognition, to regularize the trained models, and incorporates other well-established techniques from the field. The results show an improvement over the baseline, which was trained similarly to the text-to-text tasks for T5 models.
  • B 12 Rakia Saidi
    BERT model for Arabic multi-tasks
    [Abstract] The Arabic language is described as difficult to analyze automatically due to its properties, morphological structure, and syntactic features. Deep neural network (DNN) models have recently displayed previously unheard-of capabilities, changing the field of artificial intelligence (AI). This poster presents an Arabic BERT model applied to several Arabic NLP tasks: sentiment analysis (SA), text semantic similarity (TSS) and word sense disambiguation (WSD).
  • B 13 Fabio Fehr
    A VAE for Transformers with Nonparametric Variational Information Bottleneck
    [Abstract] We propose a Variational AutoEncoder (VAE) for Transformers by using a Variational Information Bottleneck (VIB) regulariser for Transformer embeddings. The Transformer encoder’s embedding space is formalised as a mixture distribution, and we use Bayesian nonparametrics to develop a Nonparametric VIB (NVIB) for such attention-based representations. The variable number of mixture components supported by nonparametric methods captures the variable number of vectors supported by attention, and the exchangeable distributions from nonparametric methods capture the permutation invariance of attention. We propose our Transformer VAE (NVAE), which uses NVIB to regularise the information passing from the Transformer encoder to the Transformer decoder through cross-attention. Evaluations of an NVAE trained on natural language text demonstrate that NVIB can regularise the number of mixture components in the induced embedding whilst maintaining generation quality and reconstruction capacity.
  • B 14 Mukund Rungta
    Geographic Citation Gaps in NLP Research
    [Abstract] In a fair world, people have equitable opportunities to education, to conduct scientific research, to publish, and to get credit for their work, regardless of where they live. However, it is common knowledge among researchers that a vast number of papers accepted at top NLP venues come from a handful of Western countries and (lately) China, whereas very few papers from Africa and South America get published. Similar disparities are also believed to exist for paper citation counts. In the spirit of “what we do not measure, we cannot improve”, this work asks a series of questions on the relationship between geographical location and publication success (acceptance at top NLP venues and citation impact). We first created a dataset of 70,000 papers from the ACL Anthology, extracted their meta-information, and generated their citation network. We then show that not only are there substantial geographical disparities in paper acceptance and citation, but also that these disparities persist even when controlling for a number of variables such as venue of publication and sub-field of NLP. Further, despite some steps taken by the NLP community to improve geographical diversity, we show that the disparity in publication metrics across locations has been on an increasing trend since the early 2000s.
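
A minimal sketch of the token-mixing idea from poster B 1 (Mai), not the authors' code: a hypernetwork generates the token-mixing MLP's weights from the token representations themselves, so mixing adapts to the input and to variable sequence lengths.

```python
import torch
import torch.nn as nn

class HyperTokenMixer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Hypernetworks emitting per-token rows of the mixing MLP's weights.
        self.w1_gen = nn.Linear(d_model, d_hidden)
        self.w2_gen = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        W1 = self.w1_gen(x)                       # (batch, seq_len, d_hidden)
        W2 = self.w2_gen(x)                       # (batch, seq_len, d_hidden)
        mixed = self.act(W1.transpose(1, 2) @ x)  # mix across the sequence axis
        return W2 @ mixed                         # (batch, seq_len, d_model)

x = torch.randn(2, 16, 64)                # toy batch: 2 sequences of 16 tokens
print(HyperTokenMixer(64, 128)(x).shape)  # torch.Size([2, 16, 64])
```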
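
A toy version of the Difference-in-Differences design from poster B 6 (Biliotti), on simulated data rather than the study's tweets: regress the outcome on treatment, period, and their interaction; the interaction coefficient is the DiD estimate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),  # inside the lockdown area
    "post": rng.integers(0, 2, n),     # after the mandate
})
# Simulate a -0.5 drop in uncertainty for treated units after the policy.
df["uncertainty"] = 1.0 - 0.5 * df.treated * df.post + rng.normal(scale=0.3, size=n)

model = smf.ols("uncertainty ~ treated * post", data=df).fit()
print(model.params["treated:post"])    # ~= -0.5, the DiD effect
```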
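
A sketch of the feature-extraction setup from poster B 8 (Lam-Yee-Mui); the specific checkpoint is our assumption, not necessarily the one used in the work: a frozen multilingual self-supervised model produces frame-level features for a downstream acoustic model.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-xls-r-300m"  # multilingual pre-trained model
processor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)          # 1 s of dummy 16 kHz audio
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = model(**inputs).last_hidden_state  # (1, frames, hidden)
print(feats.shape)  # frozen features to feed a low-resource ASR model
```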
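
A minimal sketch of prompt-based anomaly scoring in the spirit of poster B 9 (Xu Yizhou); the template and verbalizer words below are our assumptions, not the templates evaluated in the work.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def anomaly_score(text: str) -> float:
    """Compare MLM probabilities of two verbalizer words in a prompt template."""
    preds = fill(f"{text} This text is [MASK].", targets=["normal", "strange"])
    probs = {p["token_str"]: p["score"] for p in preds}
    # Higher score when the model prefers the anomalous label word.
    return probs["strange"] / (probs["normal"] + probs["strange"])

print(anomaly_score("Transformers are widely used for text classification."))
```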