Multimodal datasets have been used in a range of NLP tasks, such as image captioning, machine translation, visual question answering, and text-to-image generation. Such datasets vary in the kinds of (meta)data they include and in how those data are interconnected. In this tutorial, we present a novel architecture for multimodal datasets that extends the FrameNet model to the annotation of images correlated with text. We describe a pipeline for importing multimodal data for annotation, as well as tools, annotation setups, and guidelines for annotating frames and frame elements in corpora of static and dynamic images accompanied by written captions or spoken audio-description. We conclude the tutorial by exploring the different kinds of correlations that can be extracted between modalities, as well as those between the categories used in FrameNet and the ones typically deployed by computer vision algorithms.
This tutorial will provide a brief overview of how to semantically annotate multimodal datasets for Semantic Frames and Frame Elements. We will discuss the different types of multimodal annotation made possible so far by Charon, the multimodal annotation tool developed by FrameNet Brasil:
• films comprising dynamic images, audio, and subtitles;
• films adapted for assistive technologies, comprising dynamic images, audio, subtitles, accessible subtitles, and audio-description;
• image-caption pairs, such as those found in traditional multimodal datasets like Flickr30k (Young et al., 2014) and its extensions (Elliott et al., 2016).
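To make the last setup concrete, a frame-annotated image-caption pair can be represented roughly as follows. This is a minimal sketch: the frame and frame-element names follow Berkeley FrameNet conventions, but the data structures and field names are hypothetical illustrations, not Charon's actual data model.

```python
# Hypothetical sketch of a frame-annotated image-caption pair.
# Frame ("Self_motion") and frame elements ("Self_mover", "Path")
# follow Berkeley FrameNet; the dataclasses and fields below are
# invented for exposition, not Charon's actual schema.
from dataclasses import dataclass, field

@dataclass
class FrameElementSpan:
    name: str    # frame element label, e.g. "Self_mover"
    start: int   # character offset in the caption (inclusive)
    end: int     # character offset in the caption (exclusive)

@dataclass
class FrameAnnotation:
    frame: str                 # frame evoked, e.g. "Self_motion"
    target: str                # lexical unit evoking the frame
    elements: list = field(default_factory=list)

@dataclass
class ImageCaptionPair:
    image_id: str
    caption: str
    text_annotations: list = field(default_factory=list)
    # bounding boxes (x1, y1, x2, y2) linking image regions to FEs
    region_annotations: dict = field(default_factory=dict)

pair = ImageCaptionPair(
    image_id="example_0001",
    caption="A man is running across the street",
    text_annotations=[
        FrameAnnotation(
            frame="Self_motion",
            target="running",
            elements=[
                FrameElementSpan("Self_mover", 0, 5),    # "A man"
                FrameElementSpan("Path", 17, 34),        # "across the street"
            ],
        )
    ],
    region_annotations={"Self_mover": (120, 40, 260, 310)},
)
```

The same span-plus-region pattern generalizes to the video setups above, where spans index into subtitles or transcriptions and regions index into individual frames of the video.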
We will also cover best practices for creating and using multimodal datasets.
FrameNet basics: this section presents the foundations of the FrameNet model and discusses its suitability for building semantic representations of multimodal objects.
Charon – The FrameNet Brasil Multimodal Annotation Tool: this section presents the functionalities of the tool, the corpus import pipeline, and possible data reports.
Static image annotation: this section presents the methodology for annotating static images, the specificities of this module, and its current use in research.
Dynamic image annotation: this section presents the methodology for building a multimodal video-based corpus and for annotating images, audio transcriptions, and subtitles, as well as the specificities of this module and its current use in research.
Assistive technology output annotation: this section presents the methodology for building a multimodal accessible video-based corpus and for annotating audio descriptions and closed captions, as well as the specificities of this module and its current use in research.
Multimodal FrameNet – image-text correlations: this section presents examples of results and new theoretical statements made possible by the use of FrameNet in a multimodal approach. It discusses how FrameNet categories can be correlated with other label sets, such as the ontologies used by computer vision algorithms.
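The kind of correlation between FrameNet categories and computer vision label sets can be illustrated with a small sketch. The mapping below, from frame-element slots to COCO object categories, is invented for exposition; it is not a resource presented in the tutorial.

```python
# Hypothetical illustration: correlating FrameNet frame elements with
# labels from a computer-vision ontology (here, COCO object categories).
# The specific mapping is invented for exposition only.
FRAME_ELEMENT_TO_COCO = {
    ("Self_motion", "Self_mover"): {"person", "dog", "horse"},
    ("Operate_vehicle", "Vehicle"): {"car", "bus", "bicycle", "truck"},
    ("Ingestion", "Ingestibles"): {"banana", "pizza", "sandwich"},
}

def compatible(frame: str, element: str, detected_label: str) -> bool:
    """Check whether a detected object label can plausibly fill
    the given frame element of the given frame."""
    return detected_label in FRAME_ELEMENT_TO_COCO.get((frame, element), set())

print(compatible("Self_motion", "Self_mover", "person"))  # True
print(compatible("Self_motion", "Self_mover", "pizza"))   # False
```

Correlations of this sort allow object detections in the image channel to be checked against, or to suggest, frame-element annotations in the text channel.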
The tutorial is meant to be a guided walkthrough covering the use of the FrameNet Brasil web tool for multimodal annotation.
Through examples, participants will gain a deeper understanding of how to annotate multimodal datasets following the FrameNet guidelines and methodology.
The tutorial will be offered in hybrid mode and can thus be attended either in person or virtually. It will consist of a demonstrative slide presentation.
This tutorial is of interest to those working with multimodal datasets and their application to downstream tasks combining Computer Vision and Natural Language Processing. Given its practical approach, no knowledge of the fundamentals of Frame Semantics or previous experience with annotation tasks is necessary. Nonetheless, additional references will be provided for those who wish to read further.
The tutorial will be offered by the ReINVenTA (Research and Innovation Network on Visual and Text Analysis of Multimodal Objects) Team:
Head of the FrameNet Brasil Computational Linguistics Lab, PI of ReINVenTA and Professor of the Graduate Program in Linguistics at the Federal University of Juiz de Fora, Brazil. Research Productivity Grantee of the Brazilian National Council for Scientific and Technological Development (CNPq). Guest Professor at the Department of Swedish, Multilingualism and Language Technology at the University of Gothenburg between Aug 2022 and Jan 2023.
Professor in the Graduate Program in Linguistics and Applied Linguistics at Federal University of Minas Gerais, Brazil. Research Productivity Grantee of the Brazilian National Council for Scientific and Technological Development (CNPq).
Ph.D. in Linguistics, currently conducting Postdoctoral Research in Frame Semantics theory and Accessible Audiovisual Translation at the Federal University of Minas Gerais, Brazil.
Ph.D. in Linguistics, Researcher at the FrameNet Brasil Lab working on audiovisual interaction within the framework of Frame Semantics. Former Visiting Researcher at Case Western Reserve University and Professor in the Communications Department at UniAcademia (Centro Universitário Academia).
Ph.D. candidate in Linguistics at the Federal University of Juiz de Fora, working on a new approach within Frame Semantics focused on the automatic processing of combined image-text data, with the goal of proposing a model for the development of Multimodal Machine Translation systems.
Master's student in the Language Technology Program at the University of Gothenburg.
Ph.D. in Cognitive Linguistics and Master's in Computational Modeling, researcher at the FrameNet Brasil Computational Linguistics Lab and member of the Graduate Program in Linguistics at the Federal University of Juiz de Fora, Brazil.
Ph.D. candidate in Linguistics at the Federal University of Juiz de Fora, currently assessing the feasibility of Construction Grammar-based theories and approaches for Natural Language Generation, with previous experience in software development and data science.