
UTX (universal terminology exchange) common format for user dictionary



Quick Guide

A two-page A4 sheet that presents the essentials of UTX at a glance, with illustrations, common errors, and an explanation of how to create glossary information. See the brochure for more details.


 

Brochures

This brochure contains the essentials of UTX.


 

 

The content is identical to the above trifold version, but the format is A4.

Introduction to UTX

Background

In order to use MT (machine translation) systems such as translation software effectively, it is essential to use user dictionaries. When we use commercial high-end translation software in a CAT (computer-aided translation) workflow, specialized terminology, names of persons, and place names in the target document are often not included in basic system dictionaries, and they are not translated as well as one would expect. However, if these terms are registered into a user dictionary, the accuracy of the MT system can be improved.

Unfortunately, user dictionaries are not always compatible between different MT systems, so the effort spent creating a dictionary for one system is often wasted when moving to another. To address this issue, AAMT (Asia-Pacific Association for Machine Translation) has undertaken to establish a set of specifications for sharable dictionaries, which can be used across different MT systems. AAMT created its first specification, UPF (Universal PlatForm), with support from IPA (Information-technology Promotion Agency, an institute in Japan) in 1995. In 2006, AAMT started to create new specifications to reflect and incorporate the subsequent advancement of technology and the changing usage of MT. In 2007, the new format received a new name, "UTX," short for universal terminology eXchange. In 2009, AAMT established UTX-Simple, the simple version of UTX. In 2011, UTX-Simple was renamed "UTX," dropping "-Simple."

We are currently focusing on the following tasks: production and collection of dictionary data; and creation of a user community for generating, sharing, and accumulating user dictionaries in a sustainable way. We will select some domains for which translation is needed, and build and collect actual dictionary data from real-world documents in accordance with the UTX specification. We will then carry out translation with these dictionaries and use the feedback to improve the UTX specification further.

Why use UTX?

Simplicity

  • "Dictionary for the user" - simple and easy to use

    • UTX (tab-delimited text format) requires only three types of information: a source word, its translation, and the part of speech of the source word. Anyone can therefore build a dictionary easily.
    • UTX accommodates many kinds of languages.
  • Inclusion of the information to support managing and sharing dictionaries

    • A UTX dictionary includes a creator name and creation timestamp.
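The three required fields can be illustrated with a short parsing sketch. It assumes that lines starting with "#" carry header information and that the columns appear in the order source term, translation, part of speech. The sample data, the `Entry` type, and the `parse_utx` name are hypothetical, and the UTX specification remains the authority on the exact header syntax.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    source: str
    target: str
    pos: str


def parse_utx(text: str) -> list[Entry]:
    """Parse tab-delimited dictionary text into entries.

    Assumption: lines beginning with '#' are header lines and are
    skipped; every other non-empty line holds at least three
    tab-separated fields (source term, translation, part of speech).
    """
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 fields, got: {line!r}")
        entries.append(Entry(fields[0], fields[1], fields[2]))
    return entries


# Hypothetical sample; the header lines are illustrative only.
SAMPLE = (
    "# header line (contents are illustrative only)\n"
    "#src\ttgt\tsrc:pos\n"
    "machine translation\t機械翻訳\tnoun\n"
    "user dictionary\tユーザー辞書\tnoun\n"
)

entries = parse_utx(SAMPLE)
for entry in entries:
    print(entry.source, "->", entry.target, f"({entry.pos})")
```

Because each entry is a single line of plain text, such a file can also be edited in a spreadsheet or text editor without special tools.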

Entry as a "technical term"

  • Clarification of domains

  • One word, one meaning

    • In principle, one term has one meaning (i.e. one translation) in a specific domain.
    • An entry must be a unique term within an applicable domain.
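The "one word, one meaning" principle lends itself to an automatic check. The following is a minimal sketch, not part of any official UTX tool: the `find_conflicts` name and the sample entries are hypothetical, and it assumes all entries belong to a single domain.

```python
from collections import defaultdict


def find_conflicts(pairs):
    """Return source terms that map to more than one translation.

    `pairs` is an iterable of (source_term, translation) tuples taken
    from a single-domain dictionary.
    """
    translations = defaultdict(set)
    for source, target in pairs:
        translations[source].add(target)
    return {s: sorted(t) for s, t in translations.items() if len(t) > 1}


# Hypothetical entries: "format" has two competing translations.
pairs = [
    ("format", "書式"),
    ("format", "フォーマット"),
    ("glossary", "用語集"),
]
conflicts = find_conflicts(pairs)
for term, candidates in conflicts.items():
    print(f"{term}: {len(candidates)} translations found")
```

A dictionary maintainer could run such a check before sharing a dictionary and then decide which translation to keep for the domain.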

Benefits of using UTX

  • For users

    • Improvement in translation accuracy in each specific domain.
    • Sharing and reuse of dictionaries is possible through user communities, either locally or over the Internet.
  • For manufacturers of MT systems

    • The entire MT market will be enlivened by promotion of user dictionaries.
    • New demands for and applications of MT can be explored.
    • UTX-XML format retains entry properties that are proprietary to manufacturers, and no data is lost during the conversion to/from UTX-XML.

A community for shared dictionaries - "Open dictionaries for everyone"

  • AAMT will establish two types of dictionary communities for producing, exchanging, sharing, and accumulating dictionaries, and a framework for distribution.
     
  • The official dictionary community (managed by AAMT or its delegate) offers supervised dictionaries with guaranteed quality.
  • The open dictionary community offers free dictionaries with open source license and promotes mutual exchange. AAMT or its delegate provides hosting service only, but no management or guarantee.
  • AAMT will collaborate with Oki's community-oriented machine translation site  Yakushite-net.

Examples of application

  • Open source localization

 

  • Problem 1: Terminology and its translation (including terms for user interface) vary among applications, thus reducing the translator's efficiency and confusing end users. For example, the term "format" can be translated into several Japanese words.
  • Solution by using UTX: Terminology used in various applications can be standardized. End users can enjoy more organized, reduced terminology.
  • Problem 2: Translation assets are not effectively recycled when translating a similar application. Since there is no accumulation of translation, whenever starting a new translation project, it is necessary to start from scratch.
  • Solution by using UTX: Translation becomes more efficient when translation assets, such as user dictionaries and glossaries, are accumulated, shared, and reused.
  • Problem 3: Multilingualization is not an easy task.
  • Solution by using UTX: UTX-XML accommodates multiple languages. Since the translation assets (user dictionaries and glossaries) are centralized and reusable through UTX-XML, multilingualization is significantly easier.

  

  • As technical-term dictionary and in-company glossary

  • Problem: The glossary for in-house documents and the glossary for translation are often separate, incoherent, disorganized, and difficult to manage centrally. Accumulation of technical knowledge of a specific domain in the company is desirable but difficult, since the sources are separate and they are all in different formats.

  • Solution by using UTX: It can also be used as a monolingual glossary. Exchange of data with various tools is easy through the standardized specification. Import from existing glossaries is also straightforward.

  • Supporting intercultural communication between individuals

  • Problem: Correct translations of proper nouns are often difficult to find, for example, when a fan wishes to write a fan letter to an overseas writer or a movie star. When a user would like to chat with an overseas friend about sports or online games, the translation of sports players' names or the terminology of the online game may not be available in system dictionaries.

  • Solution by using UTX: Niche glossaries which are not included in commercial specialized dictionaries can be shared to be used for machine translation.

 

  • Translation assistance for developing countries

  • Problem 1: NPOs are always in need of more human resources and funds.
    Solution by using UTX: Translation assistance and automatic translation can be carried out at low cost.

  • Problem 2: Some minor languages only have limited bilingual glossaries or dictionaries. Some domains may have a high priority, such as medical science.
    Solution by using UTX: Since UTX dictionaries can be accumulated in a dictionary community, dictionaries can be gradually compiled in a language with limited bilingual glossaries.

Development and use of tools

  • Term extraction and dictionary building tools

Building a dictionary by adding one word at a time requires a huge effort. We need tools to analyze multiple documents, extract terms, and add them to an existing user dictionary or make a new one instantly.

  • UTX converters (including parsers)

We will need tools that convert from a format unique to a translation application or a translation site to UTX format, and vice versa. A parser that verifies conformity to the UTX specification must also be included.

  • utx-conv converter

    utx-conv is a Perl-based converter. It converts UTX to and from the formats of MT tools. It was originally developed by Francis Bond. AAMT distributes it under the Simplified BSD License. Please let us know if you are interested in continuing its development.

  • convert_glossary converter

    convert_glossary (the link at the bottom of the page) is a converter developed by Translation, Training/education, and Testing to convert between TBX-Glossary (*.tbx), UTX (*.utx), GlossML (*.gml), and OLIF (*.olf) glossary formats. For an output, name a file (without a path/folder) with a file extension for the file format that you wish to convert to. For example, type "test.tbx" if you wish to convert from one of the four formats above to a tbx file.

  • Dictionary search tool (glossary search tool)

We will need tools that search a dictionary or glossary directly to show the translation of a word.
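A direct search of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: entries are already available as (source term, translation) pairs, matching is a case-insensitive substring test, and the `build_index` and `lookup` names and the sample rows are hypothetical.

```python
def build_index(rows):
    """Map each lower-cased source term to its (term, translation) pair."""
    return {source.lower(): (source, target) for source, target in rows}


def lookup(index, query):
    """Return entries whose source term contains the query (case-insensitive)."""
    q = query.lower()
    return [pair for key, pair in index.items() if q in key]


# Hypothetical rows taken from a small English-to-Japanese dictionary.
rows = [
    ("machine translation", "機械翻訳"),
    ("user dictionary", "ユーザー辞書"),
    ("glossary", "用語集"),
]
index = build_index(rows)
for source, target in lookup(index, "Dictionary"):
    print(source, "->", target)
```

A real tool would likely add exact-match ranking and search in both directions, but the data structure can stay this simple because each entry is a flat record.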

The following tools can use UTX even though they do not officially support it.

  • OmegaT can use a UTX dictionary as a glossary.
  • ApSIC Xbench can use a UTX dictionary as a glossary. Select it as "Tab-delimited Text File."

Download

UTX 1.11 Specification

Download from here (PDF).

Anyone can create UTX dictionaries and tools that use them using this specification.

Please also take a look at our brochures.

 

Changes in UTX-Simple 1.10:

  • Term status (provisional, non-standard, approved, and forbidden)
  • Dictionary administrator and contributors
  • Concept ID and dictionary ID

Changes in UTX 1.11:

  • UTX-Simple is now simply called "UTX"
  • Accordingly, the version identifier on the first line changed from UTX-S to UTX
  • Clarification of some sections
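A tool that reads both older and newer files may need to accept either identifier. The sketch below assumes that the first line starts with "#" followed by the identifier and a version number; that exact layout, the `read_version` name, and the sample lines are assumptions made for illustration, so check the specification before relying on them.

```python
import re

# Assumed layout of the first line: "#UTX-S 1.10; ..." or "#UTX 1.11; ..."
_HEADER = re.compile(r"^#(UTX-S|UTX)\s+(\d+\.\d+)")


def read_version(first_line):
    """Return (identifier, version) from a first line, or None if absent."""
    match = _HEADER.match(first_line)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Hypothetical first lines; everything after the version is illustrative.
for line in ("#UTX 1.11; en/ja", "#UTX-S 1.10; en/ja", "src\ttgt"):
    print(repr(line), "->", read_version(line))
```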

Dictionaries

  • More dictionaries will be available. If you wish to include your UTX dictionaries in this list, please contact us.
Name and Domain | Direction of translation | Author | License | Dictionary Version | Number of entries
Computational Linguistics Term List | English to Japanese | Francis Bond | Creative Commons 3.0, Attribution (CC-BY) | - | 4092
Computational Linguistics Term List | Japanese to English | Francis Bond | Creative Commons 3.0, Attribution (CC-BY) | - | 4123
Medical Glossary (Department of Medical Informatics, Kitazato University) | English to Japanese | Medical Informatics, School of Allied Health Sciences, Kitazato University | Creative Commons 3.0, Attribution (CC-BY) | 1.01 | 27126
Japanese-English Standard Dictionary of Legal Terms | English to Japanese (the direction has been changed.) | Japan | Creative Commons 3.0, Attribution No Derivatives (CC-BY-ND) | 1.01 | 5451
 
  • AAMT terminology dictionary (sample)

     

Articles and Presentations

  • "Introduction to UTX, a Specification for a Shared User Dictionary," (PDF) a paper submitted to the Association of Natural Language Processing, Japan (13th annual meeting). Originally written in Japanese, translated into English.
    This paper refers to UTX-Simple 0.91.
  • A presentation was delivered at the LISA China Focus (November 2009).
  • A presentation was delivered at the LISA Open Standards Summit (February 2011).
  • A presentation was delivered at the annual meeting of the Association of Natural Language Processing, Japan (March 2011).
  • An article on UTX 1.11 in the June issue of MultiLingual (May 2011)
    (The first page of the article is available for free. The article includes additional introduction of UTX.)

Contact

If you are interested in UTX, please contact us through the following form. We welcome organizations and individuals who can collaborate with us to develop the UTX specification, to provide or build dictionaries and tools, and to perform evaluation.

Please be sure to fill in the items marked with *.

Your name*
Telephone number*
E-mail address*
Company or organization
Type*: Individual / Company or organization / Others
Your target language(s), domain(s) of interest,
and other comments
If you wish to receive updates on UTX, please join the UTX mailing list. Non-AAMT members are also welcome. The mailing list is not directly managed by AAMT. The correspondence is mostly in Japanese. We may start another mailing list in English if there is enough demand.
 

Attention: Please CLICK the button AGAIN on the next screen to submit your message. This site currently does not support SSL. Therefore, the transmission of your data to our server is not encrypted.

 

Disclaimer

By using the specifications of UTX, UTX-Simple, and UTX-XML (hereinafter collectively called “UTX Specifications”) or the dictionaries based on UTX Specifications (hereinafter called “UTX Dictionaries”), you agree to be bound by the following terms. The invalidity or unenforceability of this disclaimer shall in no way affect the validity or enforceability of any other provision herein.

1. To the authors of UTX Dictionaries and related tools from the AAMT and its members:

(1) UTX Specifications are made public, and anyone can use them. The AAMT, however, does not waive any rights thereof and no one may alter UTX Specifications nor make them public.

(2) THE AAMT AND ITS MEMBERS PROVIDE UTX SPECIFICATIONS “AS IS,” WITH NO GUARANTEES WHATSOEVER. YOU SHOULD USE UTX SPECIFICATIONS AND UTX DICTIONARIES AT YOUR OWN RISK.

(3) THE AAMT AND ITS MEMBERS SHALL NOT ASSUME ANY RESPONSIBILITY FOR UTX SPECIFICATIONS AND THE RESULT OF THEIR USE INCLUDING, BUT NOT LIMITED TO, THE EXISTENCE OF INFRINGEMENT OF THIRD PARTIES’ RIGHTS AND THE ACCURACY, ADEQUACY AND QUALITY OF THE TRANSLATION.

(4) THE AAMT AND ITS MEMBERS SHALL NOT ASSUME ANY RESPONSIBILITY FOR VERIFYING NOR DO THEY GUARANTEE THE LEGITIMACY OF THE COPYRIGHT FOR EACH UTX DICTIONARY. YOU AND THE ORIGINAL AUTHOR OF EACH UTX DICTIONARY ARE RESPONSIBLE FOR THE LEGAL PROBLEM IF IN ANY CASE THAT THE ORIGINAL AUTHOR OF THE UTX DICTIONARY IS NOT THE LEGITIMATE HOLDER OF THE APPROPRIATE COPYRIGHT.

(5) The AAMT and its members grant you the permission to stipulate the terms and conditions for the use of UTX Dictionaries by their users for commercial or non-commercial purposes as long as you have the appropriate copyright; provided, however, that the author of UTX Dictionary is solely responsible for verifying the legitimacy of the copyright for data used in the UTX Dictionary.

(6) THE AAMT AND ITS MEMBERS SHALL NOT ASSUME ANY RESPONSIBILITY FOR THE RESULT OF USE OF THE TOOLS RELATED TO UTX DICTIONARIES.

2. To the users of UTX Dictionaries from their authors:

The users of UTX Dictionaries may make use of UTX Dictionaries, in accordance with their license terms and conditions. Since the license terms and conditions of UTX Dictionaries are varied, please confirm the license indicated in the UTX file header.

3. To the users of UTX Dictionaries and related tools from the AAMT and its members:

THE AAMT AND ITS MEMBERS SHALL NOT ASSUME ANY RESPONSIBILITY IN CONNECTION WITH THE UTX SPECIFICATIONS AND THE RESULT OF THEIR USE INCLUDING, BUT NOT LIMITED TO, THE EXISTENCE OF INFRINGEMENT OF THIRD PARTIES’ RIGHTS AND THE ACCURACY, ADEQUACY AND QUALITY OF THE TRANSLATION. You should resolve such problems between you and the author of the UTX Dictionaries.
 


Last Updated: 17 Jun. 2014