The functions of the IEML database

The IEML database records USLs (IEML expressions) together with their metadata.

In this blog post I will describe the different roles of this database.

Centralize the language’s knowledge

The main function of the IEML database is to collect past interpretations of IEML expressions, in order to document them, to keep interpretations consistent, and to build statistically significant training sets.

An example of a USL: **“to fly off”** or **“to go up in the air freely”**

Record the described USLs

The database entries are described USLs, i.e. a USL (an IEML expression) with a set of metadata called descriptors. Descriptors can be given in any language (French and English for the moment, identified by their ISO 639 codes) and come in several distinct types:

A set of translations: verbatim examples of how the concept of the USL is expressed in a particular language. The same translation can be shared by several USLs when there are homonyms. In this case, we disambiguate the translation with an indication of context (in brackets). Ex: “to fly off (movement)”
A definition/comment: a defining sentence that helps the reader find the meaning of the USL. This field must be unique in the whole database: there cannot be two IEML expressions with the same meaning. It is used to document the USLs and to justify their particular construction. Ex: “to fly off” : “to go up in the air freely”
A set of tags: used to organize IEML expressions in the database editor.

All the entries in the database are described USLs. However, the list of descriptor types is open: new metadata types can be added, such as Wikidata (or Wiktionary) URIs, links to resources (e.g. definition by example, photo), or other resources that help ground the meaning of the USLs.
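To make the structure of a described USL concrete, here is a minimal Python sketch. It is not the schema of the actual database or of the IEML library: the class names, the field names, the `describe` helper, the placeholder USL string and the "movement" tag are all assumptions made for illustration, as is the choice of storing a single definition per language.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Descriptors:
    """Descriptors of one USL in one language (hypothetical layout)."""
    translations: Set[str] = field(default_factory=set)
    comment: str = ""  # the definition/comment of the USL
    tags: Set[str] = field(default_factory=set)


@dataclass
class DescribedUSL:
    """A USL together with its descriptors, keyed by ISO 639 language code."""
    usl: str
    descriptors: Dict[str, Descriptors] = field(default_factory=dict)

    def describe(self, language: str) -> Descriptors:
        # Create the descriptor set for this language on first access.
        return self.descriptors.setdefault(language, Descriptors())


# Placeholder string: a real entry would hold an actual IEML expression.
entry = DescribedUSL(usl="<IEML expression>")
english = entry.describe("en")
english.translations.add("to fly off (movement)")
english.comment = "to go up in the air freely"
english.tags.add("movement")  # made-up tag, for illustration only
```

Keying the descriptors by language code keeps the USL itself language-independent, and a new descriptor type (for example a Wikidata URI) would simply be one more field on `Descriptors`.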

Maintaining language coherence

The meaning of a text, its interpretation, can vary from one person to another, depending on many factors, such as culture and native language. The same is true for a USL, which is interpreted as a small text and is therefore subject to the interpretative variation of different individuals. This problem brings us to the second advantage of gathering all these USLs in the same database with their descriptors: having a unified repository on which consistency tests can be performed.

Indeed, in IEML, one can easily compute semantic relations between concepts from their scripts, and thus represent the database as a graph of semantic relationships. It then becomes possible to perform computations on the set of concepts as a system (a language from a synchronic point of view).

The coherence of the database comes from the alignment between the network of semantic relationships of the USLs and the observed network of semantic relationships of their descriptors. Since this property is not always automatically computable, we have designed a consistent set of tests to verify the alignment properties. These properties, if checked together, give the best guarantee of the consistency of the database. Briefly, these properties relate to:

The match between the semantic proximity (in natural language) of the translations and the proximity between the USLs. Syntactically close USLs must be semantically close.
The correct use of the IEML grammar (respecting the composition, transitivity and commutativity of operations …) as well as of the meanings of the morphemes. The same IEML constructions must have the same meaning in different contexts.
The guarantee that none of the USLs have the same meaning, as there are no synonyms or homonyms in IEML.

These consistency checks are an integral part of the database editing process. I will detail this editing process, as well as the properties and the consistency tests, in a future blog post.
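As an illustration of what such a test can look like, the sketch below approximates the third property (no two USLs with the same meaning) by flagging USLs whose definitions are textually identical within a language. It is not the actual test suite: the function name, the input layout, the placeholder USLs and the third sample definition are assumptions, and a textual comparison only catches exact duplicates, not definitions that mean the same thing in different words.

```python
from collections import defaultdict


def find_duplicate_definitions(entries):
    """Return the definitions that are shared by more than one USL.

    `entries` is an iterable of (usl, language, definition) tuples.
    The result maps (language, normalized definition) to the sorted
    list of USLs that carry it.
    """
    usls_by_definition = defaultdict(set)
    for usl, language, definition in entries:
        # Normalize case and whitespace so trivial variants still collide.
        normalized = " ".join(definition.lower().split())
        usls_by_definition[(language, normalized)].add(usl)
    return {
        key: sorted(usls)
        for key, usls in usls_by_definition.items()
        if len(usls) > 1
    }


# Placeholder USLs and sample definitions, for illustration only.
entries = [
    ("<usl-A>", "en", "to go up in the air freely"),
    ("<usl-B>", "en", "To go up in  the air freely"),
    ("<usl-C>", "en", "to move through water"),
]
duplicates = find_duplicate_definitions(entries)
# duplicates == {("en", "to go up in the air freely"): ["<usl-A>", "<usl-B>"]}
```

A check of this kind could run before each modification is accepted, so that a duplicate definition is reported to the editor instead of entering the database.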

Enabling the evolution of the language

The uses of a particular language change with people and times. To give IEML a chance to become part of the collective life, mechanisms for flexible editing of the database must be put in place.

This is why we chose to support the database with a versioning tool that promotes collaboration: git. This tool saves the different versions of the language and maintains editable copies of the database (branches), so that the conceptual structure can be modified without impacting all of the users.

Recording successive language versions

As mentioned earlier, git records all the intermediate states of the database and thus gives a diachronic view of the language. Each new state of the database is named by a “hash” (e.g., 9af91324f0adf…) and can therefore be referenced in applications as a particular state in the history of the database’s conceptual graph.
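An application that wants to pin a particular state of the database can do so with standard git commands. The sketch below wraps two of them with Python's `subprocess` module. The helper names and the idea of a local clone directory are assumptions for illustration; only the git commands themselves (`git -C`, `rev-parse HEAD`, `checkout --detach`) are standard.

```python
import subprocess


def git_command(repo_dir, *args):
    """Build a git command line that runs inside `repo_dir`."""
    return ["git", "-C", str(repo_dir), *args]


def current_version(repo_dir):
    """Return the commit hash of the database state currently checked out."""
    result = subprocess.run(
        git_command(repo_dir, "rev-parse", "HEAD"),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def checkout_version(repo_dir, commit_hash):
    """Switch the local clone to the database state named by `commit_hash`."""
    subprocess.run(
        git_command(repo_dir, "checkout", "--detach", commit_hash),
        check=True,
    )
```

Storing the value returned by `current_version` next to any data annotated with USLs would let an application later restore, with `checkout_version`, exactly the state of the language against which those annotations were made.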

Example of a database modification graph. Each circle represents a modification, and therefore also a state of the database; each modification is identified by a commit hash, a message and an author. The branch mechanism allows users to modify the database without impacting the other users.

Enable collaborative editing

The database is hosted on the GitHub platform, allowing users to take advantage of its ecosystem of collaborative tools. For example, errors can be reported through “issues” (a forum to describe, discuss and find solutions to bugs), and the database can be “forked” by a GitHub user (who gets a personal copy to modify), with the changes then proposed as a pull request (a request to merge them back).

The editorial process integrates all the steps of database editing via GitHub.

Disseminate the language

A language lives only when it is used. To encourage the adoption of IEML, we organized the database as a tool that supports the collaborative learning and dissemination of the language.

Produce a learning and research resource

The database, designed to contain hundreds of thousands of IEML concepts, is open-source and easily navigable with the Intlekt editor. Novices and autodidacts alike, when learning the IEML grammar, can draw on a large number of examples here.

Encouraging the use of IEML in a production context

The USL database is intended to serve as an open semantic repository for diverse applications. The format is very simple (scv: space-separated values), and its content is easily downloadable and readable with any programming language. Moreover, a Python API is available to manipulate the database.
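To show how little is needed to read space-separated values, here is a sketch that uses only Python's standard `csv` module, not the IEML Python API. The column layout (USL, language, descriptor type, value), the quoting convention and the placeholder USL are assumptions for illustration; the actual files may be organized differently.

```python
import csv
import io

# Made-up sample in an assumed layout: usl, language, descriptor type, value.
SAMPLE = '''"<usl-A>" en translation "to fly off (movement)"
"<usl-A>" en comment "to go up in the air freely"
'''


def read_descriptors(stream):
    """Parse space-separated descriptor rows into a list of dictionaries."""
    reader = csv.reader(stream, delimiter=" ", quotechar='"')
    rows = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        usl, language, kind, value = record
        rows.append(
            {"usl": usl, "language": language, "type": kind, "value": value}
        )
    return rows


rows = read_descriptors(io.StringIO(SAMPLE))
for row in rows:
    print(row["usl"], row["language"], row["type"], row["value"])
```

Quoting the values lets a translation such as “to fly off (movement)” contain spaces without being split into several columns.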

I have just described the main functions of the database. In a future post I will talk about its editing, through the process of modification by a user, as well as consistency testing. In this regard, a future version of the IEML library will integrate an error correction mechanism that will transparently correct the consistency errors made by the end user.