
GSoC 2021

The FrameNet Brasil Computational Linguistics Lab at the Federal University of Juiz de Fora, Brazil, has been accepted as a mentor organization for Google Summer of Code 2021. This page is the main reference point for students submitting project proposals that address the ideas listed below.

 

1. FrameNet 101

A framenet is a semantically oriented computational resource in which language material (words, multi-word expressions and grammatical constructions) is linked to a network of frames that help define its meaning. In the context of Frame Semantics, a frame is a scene: a system of interrelated concepts that defines the participants in the scene, the props they use, and the way they interact. The key notion in a framenet is that the meaning of words, as well as the meaning of other levels of linguistic structure, depends on the frames associated with them; that is, words may evoke frames. Take the verb tour, for example. In order to understand this word, a speaker of English recruits the Touring frame, in which there are three core participants: the Tourist, an Attraction and a Place. All three elements must be cognitively present for the idea of touring to be interpreted: if any one of them is missing, there is no tourism. Additionally, frames are connected to one another via a series of relations, providing a cognitive semantic structure against which meaning is defined.

In FrameNet Brasil we apply this kind of semantically oriented structure to tackle important issues in Natural Language Understanding. In the ideas list below, we explain those issues further.

To learn more about FrameNet Brasil, consider the following papers:

TORRENT, T. T.; MATOS, E.; LAGE, L.; LAVIOLA, A.; TAVARES, T.; ALMEIDA, V. G.; SIGILIANO, N. (2018). Towards continuity between the lexicon and the constructicon in FrameNet Brasil. In: LYNGFELT, B.; BORIN, L.; OHARA, K. H.; TORRENT, T. T. (Orgs.). Constructional Approaches to Language. Amsterdam: John Benjamins Publishing Company.

DINIZ DA COSTA, A.; GAMONAL, M. A.; PAIVA, V. M. R. L.; MARÇÃO, N. D.; PERON-CORRÊA, S.; ALMEIDA, V. G.; MATOS, E. E. S.; TORRENT, T. T. (2018). FrameNet-Based Modeling of the Domains of Tourism and Sports for the Development of a Personal Travel Assistant Application. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan: ELRA, p. 6-12.

There are framenets under development for several languages (English, Japanese, German, Swedish, Brazilian Portuguese, Chinese, among others) and also a global initiative to connect them all and develop shared tasks based on framenet data. To learn more about this initiative, visit the Global FrameNet website.

 

2. How to Apply

Successful applicants will submit project proposals that address the issues in the ideas list, bringing together the kind of structured data FN-Br has been developing over the past decade with the computational techniques they find best suited to achieving the proposed goals. Please note that FN-Br is not only about big data, machine learning, or purely statistical approaches to language. The work in FN-Br is model-based as well as data-driven. The kinds of issues that will drive the mentoring process in GSoC 2021 cannot be solved solely by training an algorithm on large amounts of raw data. With that in mind, applicants should follow the steps below to submit their applications:

  1. Read the papers listed in section 1 of this page;
  2. Look at the data reports available on this website and familiarize yourself with the kind of structure FN-Br builds;
  3. If you have questions or need further clarification on any of the ideas, feel free to email us at projeto.framenetbr@ufjf.edu.br. There is also a Slack workspace for GSoC; if you would like to join, send us an email and we will invite you;
  4. Write a pre-proposal of 1 to 3 pages. If you would like feedback from FN-Br before the official submission, send it to projeto.framenetbr@ufjf.edu.br one week before the official submission deadline;
  5. Use the feedback provided by our team to improve your proposal.

 

3. Ideas List

Two aspects are important to keep in mind while reading the ideas described below:

  1. The core of the FN-Br data model includes three entities: Frames, Frame Elements and Lexical Units (LUs). Frames (representing the meaning of a scene or event) are composed of Frame Elements (participants in the scene or event). Lexical Units represent the association, or pairing, of a word or multiword expression with a Frame.
  2. The basic annotation process for a sentence consists of choosing a specific word in the sentence (the target Lexical Unit) and associating semantic labels (Frame Elements) with the other words or expressions in the sentence that depend on the target. Originally, only sentences (i.e. texts) were annotated in FN-Br. Since GSoC 2019, however, the FN-Br Web Annotation Tool has also featured a module for annotating multimodal corpora, that is, a video plus the transcript of its audio or the subtitles superimposed on it. The set formed by an annotated target (an LU or image element) plus its Frame Elements is called an AnnotationSet. Because many words and/or image fragments can be chosen as targets, many AnnotationSets may be associated with one sentence or video fragment.
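The two points above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the FN-Br database schema: the class names, the token-span representation and the example sentence are all invented here, while the Touring frame and its three core Frame Elements come from section 1.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Frame:
    """A scene or event, together with the names of its Frame Elements."""
    name: str
    frame_elements: tuple[str, ...]


@dataclass(frozen=True)
class LexicalUnit:
    """The pairing of a word (or multiword expression) with a Frame."""
    lemma: str
    frame: Frame


@dataclass
class AnnotationSet:
    """One target LU plus the Frame Element labels attached to other spans.

    Spans are (start, end) token indices, end exclusive. This layout is an
    assumption made for the sketch.
    """
    target: LexicalUnit
    target_span: tuple[int, int]
    labels: dict[tuple[int, int], str] = field(default_factory=dict)

    def label(self, span: tuple[int, int], frame_element: str) -> None:
        # Only Frame Elements of the frame evoked by the target are allowed.
        if frame_element not in self.target.frame.frame_elements:
            raise ValueError(
                f"{frame_element!r} is not a Frame Element of "
                f"{self.target.frame.name}"
            )
        self.labels[span] = frame_element


touring = Frame("Touring", ("Tourist", "Attraction", "Place"))
tour_v = LexicalUnit("tour.v", touring)

# Invented example sentence, tokenized on whitespace.
tokens = "We toured the museum in Lisbon".split()

aset = AnnotationSet(target=tour_v, target_span=(1, 2))
aset.label((0, 1), "Tourist")      # "We"
aset.label((2, 4), "Attraction")   # "the museum"
aset.label((5, 6), "Place")        # "Lisbon"
```

A second AnnotationSet with a different target could be attached to the same sentence, which is how one sentence ends up with many AnnotationSets.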

 

3.1. Automatic Generation of Qualia Relations between Lexical Units in FrameNet

Mentors: Alexandre Diniz da Costa (FN-Br | UFJF) |  Ely Matos (FN-Br | UFJF) | Tiago Torrent (FN-Br | UFJF) 

 

General Context:

FN-Br has been implementing qualia relations in its database. These relations derive from the idea of qualia structure, proposed by Pustejovsky (1995) in the Generative Lexicon (GL) theory. Basically, GL assumes that the meaning of words is structured on four generative factors, called qualia roles. Each role captures how humans understand objects and the relationships between these objects in the world, trying to provide some explanation for the linguistic behavior of lexical items. Pustejovsky (1995, p. 85) defines four qualia roles:

  1. Formal: values that establish what differentiates a given object within its semantic domain; it is typically the description of its basic category.
  2. Constitutive: values that express the relationship between a given object and its constituents or parts, such as material, weight, or characteristic parts.
  3. Telic: values related to information about the object’s function or purpose, such as the intention of an agent performing a given action or the object’s intrinsic function.
  4. Agentive: values that determine the origin of the object, such as its creator, type of origin (natural or artificial), or its initial cause.

At FN-Br, this idea was implemented by creating relations between Lexical Units (LUs). Each relation has specific semantics (e.g. part_of, used_for, used_by, is_a, etc.) and is associated with one of the four qualia roles. Because of its specific semantics, each relation is also associated with a background frame that refines and explains the semantics of the relation. Each LU in a relation is associated with a Frame Element of this background frame. Taken together, the qualia relations constitute a “lexical ontology”, which is used in several processes, such as the disambiguation of LUs and parsing.
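As a rough illustration of the structure just described, a qualia relation could be represented along the following lines. This is a sketch under assumptions, not the FN-Br schema: the class and field names are invented, and the background frame, Frame Element names and example LUs are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class QualiaRole(Enum):
    FORMAL = "formal"
    CONSTITUTIVE = "constitutive"
    TELIC = "telic"
    AGENTIVE = "agentive"


@dataclass(frozen=True)
class QualiaRelation:
    """A relation type such as part_of, tied to one qualia role."""
    name: str
    role: QualiaRole
    background_frame: str   # hypothetical frame name
    source_fe: str          # Frame Element filled by the source LU
    target_fe: str          # Frame Element filled by the target LU


@dataclass(frozen=True)
class QualiaLink:
    """One instance of a relation holding between two Lexical Units."""
    source_lu: str
    relation: QualiaRelation
    target_lu: str


# Hypothetical background frame and Frame Element names.
part_of = QualiaRelation(
    name="part_of",
    role=QualiaRole.CONSTITUTIVE,
    background_frame="Part_whole",
    source_fe="Part",
    target_fe="Whole",
)

links = [QualiaLink("wheel.n", part_of, "car.n")]


def related_lus(lu: str, links: list[QualiaLink]) -> set[str]:
    """Return every LU connected to `lu` by some qualia link."""
    found = set()
    for link in links:
        if link.source_lu == lu:
            found.add(link.target_lu)
        elif link.target_lu == lu:
            found.add(link.source_lu)
    return found
```

A lookup such as `related_lus("car.n", links)` is the kind of query a disambiguation step might run over the lexical ontology.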

 

The Idea:

FN-Br currently has more than 25,000 qualia relations for LUs in English and Brazilian Portuguese. However, this is a relatively small number given the total number of LUs in the database (more than 27,000, including Portuguese, English and Spanish). These relations were created manually, which is a time-consuming and costly process.

This idea proposes the development of a pipeline that takes advantage of databases available on the Internet (lexical resources, semantic networks, ontologies, etc.) to automatically create new qualia relations. Resources that can be used include VerbNet, ConceptNet, BabelNet and Framester, among others. These resources provide semantic relations between lexical items.

A successful project should implement a human-in-the-loop solution for (semi-)automatically extracting qualia relations between words from existing databases and incorporating them into FN-Br.
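One possible shape for such a pipeline is sketched below, purely as a starting point. It assumes the external resource has already been exported as (head, relation, tail) triples. The relation labels follow ConceptNet's naming, but the mapping to FN-Br relation names and qualia roles, the `created_by` relation, and all function names are assumptions made for this example.

```python
from dataclasses import dataclass

# Assumed mapping: external relation label -> (relation name, qualia role).
RELATION_MAP = {
    "IsA": ("is_a", "formal"),
    "PartOf": ("part_of", "constitutive"),
    "UsedFor": ("used_for", "telic"),
    "CreatedBy": ("created_by", "agentive"),  # hypothetical relation name
}


@dataclass
class Candidate:
    source: str
    relation: str
    role: str
    target: str
    status: str = "pending"


def propose(triples, known_lus):
    """Turn external triples into candidate qualia relations.

    Triples with an unmapped relation, or with a word that is not already
    an LU in the database, are skipped.
    """
    candidates = []
    for head, rel, tail in triples:
        if rel not in RELATION_MAP:
            continue
        if head not in known_lus or tail not in known_lus:
            continue
        name, role = RELATION_MAP[rel]
        candidates.append(Candidate(head, name, role, tail))
    return candidates


def review(candidates, decide):
    """Human-in-the-loop step: `decide` stands in for an annotator's verdict."""
    accepted = []
    for candidate in candidates:
        candidate.status = "accepted" if decide(candidate) else "rejected"
        if candidate.status == "accepted":
            accepted.append(candidate)
    return accepted


# Toy data invented for the example.
triples = [
    ("wheel", "PartOf", "car"),
    ("knife", "UsedFor", "cut"),
    ("car", "RelatedTo", "road"),      # unmapped relation, skipped
    ("wheel", "PartOf", "unicycle"),   # "unicycle" is not a known LU, skipped
]
known_lus = {"wheel", "car", "knife", "cut", "road"}

candidates = propose(triples, known_lus)
accepted = review(candidates, decide=lambda c: c.relation == "part_of")
```

In a real tool the `decide` callback would be replaced by an annotation interface, and only accepted candidates would be written to the database.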

 

Why this Idea is Innovative:

The innovation presented by FN-Br lies in the use of two complementary theories of lexical semantics. Frame Semantics allows the meaning of a lexical item to be analyzed within the context in which it is used (that is, which frame is evoked by that lexical item). The qualia relations from GL allow a more precise specification of the meaning of the item, relating it to other items based not on the linguistic context but on a conceptualization of common-sense knowledge, thus forming a lexical ontology. However, applying these theories effectively in computational applications (and evaluating that application) requires a more complete database, which is the object of this idea.

 

3.2. Enhancing FrameNet Data Compatibility and Visualization Features

Mentors: Ely Matos (FN-Br | UFJF) | Collin Baker (FrameNet | ICSI) | Marcelo Viridiano (FN-Br | UFJF)

 

General Context:

As the FrameNet Brasil Web Annotation Tool has been used for other projects, as well as in the Global FrameNet Shared Annotation Task, new data compatibility features have been demanded by the community.

The Ideas:

This idea is split into two sub-ideas:

 

3.2.1. New Data Compatibility Features for FrameNet Data

This idea revolves around implementing features for importing data from, and exporting data to, formats used by other projects and tools. The Berkeley FN XML standard, the Universal Dependencies CoNLL-U format and the WebAnno standards should be considered.
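To give a sense of what an export feature involves, here is a minimal sketch that writes one sentence in the ten-column, tab-separated CoNLL-U layout, with `_` for unspecified fields. Storing frame information as `Frame=...` and `FE=...` in the MISC column is an assumption made for this example, not an established convention, and the function name and input layout are hypothetical.

```python
def to_conllu(sent_id, tokens, frame_labels):
    """Serialize one sentence as a CoNLL-U block.

    tokens: list of word forms.
    frame_labels: dict mapping a 0-based token index to a MISC string,
    e.g. "Frame=Touring" (assumed encoding).
    """
    lines = [f"# sent_id = {sent_id}", f"# text = {' '.join(tokens)}"]
    for index, form in enumerate(tokens):
        misc = frame_labels.get(index, "_")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        columns = [str(index + 1), form, "_", "_", "_", "_", "_", "_", "_", misc]
        lines.append("\t".join(columns))
    # A blank line terminates each sentence.
    return "\n".join(lines) + "\n\n"


example = to_conllu(
    "s1",
    ["We", "toured", "the", "museum"],
    {0: "FE=Tourist", 1: "Frame=Touring"},
)
```

A full implementation would also need to handle multi-token Frame Elements, several AnnotationSets per sentence, and the syntactic columns left empty here.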

 

3.2.2. Graph-based Data Visualization

This idea involves a partial migration of the FN database to a graph database. The project can include common graph traversals and analyses, such as frame groups, the shortest path between two frames, and frame families for polysemic LUs. The visualization tool can be built outside the FrameNet Brasil WebTool, as a new web tool that ideally applies to other complex lexical databases as well as FrameNet.
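As an example of one traversal mentioned above, the shortest path between two frames can be found with a breadth-first search over an adjacency list. The sketch below uses plain Python rather than a graph database, treats frame-to-frame relations as undirected, and runs on a toy graph whose frame names and edges are invented for illustration.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency list from (frame, frame) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of frame names, or None."""
    if start == goal:
        return [start]
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back through the predecessors to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Toy frame graph; the edges are hypothetical.
edges = [
    ("Touring", "Travel"),
    ("Travel", "Motion"),
    ("Motion", "Self_motion"),
    ("Touring", "Visiting"),
]
graph = build_graph(edges)
path = shortest_path(graph, "Visiting", "Motion")
```

In a graph database the same query would typically be a built-in path operation; the sketch only shows what the traversal computes.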

 

Why this Idea is Innovative:

FrameNet data is rich and dense. This richness, together with the network-based structure of FrameNet, makes traditional list- and table-based data visualization inadequate. However, no suitable data visualization tool has yet been built to illuminate the whole complex structure of Frame Semantic data. Such tools would help meet the urgent need to link the fine-grained semantic representations of FrameNet with other computational tools.

 

3.3. Detecting Joint Meaning Construal by Language and Gesture

(This idea will be developed in a co-mentorship project with the Red Hen Lab. Applicants may choose whether they will apply to FrameNet Brasil or Red Hen. However, if the student gets accepted, mentors from both labs will be involved in the mentorship.)

 

Mentors: Francis Steen (Red Hen) | Fred Belcavello (FN-Br | UFJF) | Mark Turner (Red Hen) | Tiago Torrent (FN-Br | UFJF)

 

General Context:

Both FrameNet Brasil and Red Hen have been investigating how meaning is construed in multimodal communication. While Red Hen has been focusing more on the relation between speech and co-speech gestures, FN-Br has been looking into how frames are evoked by different modalities, especially audio and video. 

In both cases, however, research interest revolves around how different modalities interact for meaning production. 

 

The Idea:

For this idea, we expect projects focused on identifying joint meaning construal patterns. Recent work has defined a non-exhaustive list of construal dimensions, which could be used for inspiration. Also, Red Hen has a collection of multimodal corpora already annotated for the kind of co-speech gesture that accompanies speech. A good example of such gestures is air quotes, which can accompany very different types of speech with very different functions.

 

Why this Idea is Innovative:

Although research in multimodal communication has advanced greatly in the past decade, practitioners in the field still lack ways of analyzing how meaning is construed from the interaction between modalities in large amounts of data. A successful implementation of this idea would allow for human-in-the-loop solutions for annotating patterns of joint meaning construal in multimodal communication.