UTX (universal terminology exchange)
An open standard for machine translation user dictionaries
Asia-Pacific Association for Machine Translation
Updated: 2008/8/3
Introduction to UTX
To use MT (machine translation) systems such as translation software
effectively, user dictionaries are essential. When commercial high-end
translation software is used in a CAT (computer-aided translation)
workflow, specialized terminology, personal names, and place names in
the document being translated are often missing from the basic system
dictionaries, so they are not translated as well as one would expect.
Registering these terms in a user dictionary improves the accuracy of
the MT system.
Unfortunately, user dictionaries are not always compatible between
different MT systems, so the effort spent creating a dictionary for one
system is often wasted on another. To address this issue, AAMT
(Asia-Pacific Association for Machine Translation) has undertaken to
establish a set of specifications for sharable dictionaries that can be
used across different MT systems. In 1995, AAMT created the first
version of its specification, UPF (Universal PlatForm), with support
from IPA (Information-technology Promotion Agency, an institute in
Japan). In 2006, AAMT started to create new specifications that reflect
subsequent advances in technology and the changing usage of MT. In 2007,
the new format was named "UTX," short for universal terminology
eXchange. As of 2008, AAMT is working to establish UTX-Simple, a simple,
stripped-down version of UTX, before building the full XML version.
We are currently focusing on two tasks: producing and collecting
dictionary data, and creating a user community for generating, sharing,
and accumulating user dictionaries in a sustainable way. We will select
some domains for which translation is needed, and build and collect
actual dictionary data from real-world documents in accordance with the
UTX-Simple and UTX-XML specifications. We will then improve the UTX
specification further by translating with these dictionaries and
collecting feedback.
Simplicity
- "Dictionary for the user": simple and easy to use
  - UTX-Simple (a tab-delimited text format) requires only three types
    of information: a source word, its translation, and the part of
    speech of the source word. This makes dictionaries easy to build.
  - UTX accommodates multiple languages.
- Inclusion of information to support managing and sharing dictionaries
  - A UTX dictionary includes a creator name and a creation timestamp.
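As a minimal sketch of the three-field entry described above, the following Python reads tab-delimited lines into records. The column order (source word, translation, part of speech), the `Entry` type, and the `parse_entries` helper are assumptions made for illustration; the UTX-Simple specification itself defines the exact layout, including how the creator name and timestamp are stored.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """One dictionary entry with the three fields named in the text (hypothetical type)."""
    source: str
    target: str
    pos: str


def parse_entries(text: str) -> List[Entry]:
    """Parse tab-delimited lines into entries.

    Assumed layout (not taken from the specification): one entry per line,
    columns in the order source word, translation, part of speech.
    Blank lines are skipped; any other line must have exactly three fields.
    """
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        entries.append(Entry(*(field.strip() for field in fields)))
    return entries


# Illustrative English-to-Japanese sample data.
sample = "machine translation\t機械翻訳\tnoun\nuser dictionary\tユーザー辞書\tnoun\n"
entries = parse_entries(sample)
```

Because each entry needs only three values, a dictionary of this kind can be maintained in a spreadsheet or a plain text editor and still be read by a few lines of code.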
Entry as a "technical term"
- Clarification of domains
- One word, one meaning
  - In principle, one term has one meaning (i.e., one translation) in a
    specific domain.
  - An entry must be a unique term within its applicable domain.
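The "one word, one meaning" principle can be checked mechanically. The sketch below flags terms that carry more than one translation within the same domain. The `(domain, source, target)` tuple layout and the `find_conflicts` helper are hypothetical and are not part of the UTX specification.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_conflicts(
    entries: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Return terms that have more than one translation in a single domain.

    Each entry is a (domain, source, target) tuple (assumed layout).
    The result maps (domain, source) to the sorted list of distinct targets.
    """
    targets = defaultdict(set)
    for domain, source, target in entries:
        targets[(domain, source)].add(target)
    return {key: sorted(values) for key, values in targets.items() if len(values) > 1}


# "window" has two translations in the IT domain, which violates the principle.
# The same word in a different domain is allowed to have its own translation.
data = [
    ("IT", "window", "ウィンドウ"),
    ("IT", "window", "窓"),
    ("architecture", "window", "窓"),
]
conflicts = find_conflicts(data)
```

Keying on the pair of domain and source term, rather than on the source term alone, is what lets the same word keep different translations in different domains.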
- For users
  - Translation accuracy improves in each specific domain.
  - Dictionaries can be shared and reused through user communities,
    either locally or over the Internet.
- For manufacturers of MT systems
  - Promoting user dictionaries will invigorate the entire MT market.
  - New demands for and applications of MT can be explored.
  - The UTX-XML format retains entry properties that are proprietary to
    manufacturers, so no data is lost during conversion to or from
    UTX-XML.
- AAMT will establish two types of dictionary communities for
  producing, sharing, and accumulating dictionaries, as well as a
  framework for distribution.
  - The official dictionary community (managed by AAMT or its delegate)
    offers supervised dictionaries of guaranteed quality for a fee.
  - The open dictionary community offers free dictionaries under an
    open-source license and promotes mutual exchange. AAMT or its
    delegate provides only a hosting service, with no management or
    guarantee.
- AAMT will collaborate with Oki's community-oriented machine
  translation site Yakushite-net.
- Problem 1: Terminology and its translation (including user interface
  terms) vary among applications, reducing translators' efficiency and
  confusing end users.
- Solution by using UTX: Terminology used in various applications can be
  standardized. End users benefit from a smaller, more consistent set of
  terms.
- Problem 2: Translation assets are not effectively reused when
  translating a similar application. Since translations are not
  accumulated, every new translation project must start from scratch.
- Solution by using UTX: Translation becomes more efficient when
  translation assets such as user dictionaries and glossaries are
  accumulated, shared, and reused.
- Problem 3: Multilingualization is not an easy task.
- Solution by using UTX: UTX-XML accommodates multiple languages. Since
  the translation assets (user dictionaries and glossaries) are
  centralized and reusable through UTX-XML, multilingualization becomes
  significantly easier.
- As a technical-term dictionary and in-company glossary
  - Problem: The glossary for in-house documents and the glossary for
    translation are often separate, inconsistent, disorganized, and
    difficult to manage centrally. Accumulating technical knowledge of a
    specific domain within the company is desirable but difficult, since
    the sources are separate and all in different formats.
  - Solution by using UTX: UTX can also be used as a monolingual
    glossary. Exchanging data with various tools is easy through the
    standardized specification. Importing from existing glossaries is
    also straightforward.
- Supporting intercultural communication between individuals
  - Problem: Correct translations of proper nouns are often difficult to
    find, for example, when a fan wishes to write a fan letter to an
    overseas writer or movie star. When a user would like to chat with
    an overseas friend about sports or online games, the translations of
    sports players' names or the terminology of the online game may not
    be available in system dictionaries.
  - Solution by using UTX: Niche glossaries that are not included in
    commercial specialized dictionaries can be shared and used for
    machine translation.
- Translation assistance for developing countries
  - Problem 1: NPOs are always in need of more human resources and
    funds.
  - Solution by using UTX: Translation assistance and automatic
    translation can be carried out at low cost.
  - Problem 2: Some minor languages have only limited bilingual
    glossaries or dictionaries, and some domains, such as medical
    science, may have a high priority.
  - Solution by using UTX: Since UTX dictionaries can be accumulated in
    a dictionary community, dictionaries can be compiled gradually even
    for a language with limited bilingual glossaries.
- Term extraction and dictionary building tools
  We will need tools that analyze multiple documents, extract terms, and
  add them to an existing user dictionary or create a new one at once,
  rather than building a dictionary one word at a time.
- UTX converters (including parsers)
  We will need tools that convert between UTX and the formats specific
  to individual translation applications or translation sites, in both
  directions. A parser that verifies conformity to the UTX specification
  must also be included.
- Dictionary search tool (glossary search tool)
  We will need tools that search a dictionary or glossary directly to
  show the translation of a word.
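To illustrate the converter and search tools listed above, here is a rough Python sketch. It assumes a hypothetical vendor export in CSV form with the columns source, target, and part of speech, converts it to tab-delimited text, and builds a lookup index. The function names and the CSV layout are invented for this example; a real converter would follow each vendor's format and the UTX specification.

```python
import csv
import io
from typing import Dict, Tuple


def csv_to_tab(csv_text: str) -> str:
    """Convert a hypothetical CSV export (source,target,pos) to tab-delimited text.

    Rows without exactly three non-empty fields are rejected. This stands in
    for the conformity check that a real UTX parser would perform.
    """
    lines = []
    for row_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # csv.reader yields an empty list for a blank line
        fields = [field.strip() for field in row]
        if len(fields) != 3 or not all(fields):
            raise ValueError(f"row {row_no}: expected 3 non-empty fields, got {row!r}")
        if any("\t" in field for field in fields):
            raise ValueError(f"row {row_no}: a field contains a tab character")
        lines.append("\t".join(fields))
    return ("\n".join(lines) + "\n") if lines else ""


def build_index(tab_text: str) -> Dict[str, Tuple[str, str]]:
    """Map each lower-cased source term to (target, pos) for direct lookup."""
    index = {}
    for line in tab_text.splitlines():
        source, target, pos = line.split("\t")
        index[source.lower()] = (target, pos)
    return index


# Illustrative vendor export; the quoting is handled by the csv module.
vendor_export = 'user dictionary,ユーザー辞書,noun\n"translate",翻訳する,verb\n'
tab_text = csv_to_tab(vendor_export)
index = build_index(tab_text)
result = index.get("User Dictionary".lower())
```

Lower-casing the key is one simple way to make the search case-insensitive; a production tool would also need to decide how to handle inflected forms and multiple domains.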
(Anyone can participate in this mailing list, but the
correspondence is mostly in Japanese. We are planning to start another
mailing list in English.)
If you are interested in UTX, please contact us through the following
form. We welcome organizations and individuals who can collaborate with
us to develop the UTX specification, to provide or build dictionaries
and tools, and to perform evaluations.
Attention: Please click the button again on the next screen to submit
your message. This site currently does not support SSL, so the
transmission of your data to our server is not encrypted.