uroman version 1.2.8
Release date: April 23, 2021
Author: Ulf Hermjakob, USC Information Sciences Institute
uroman is a universal romanizer. It converts text in any script to the Latin alphabet.
Usage: uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN
where the optional <lang-code> is a 3-letter language code, e.g. ara, bel, bul, deu, ell, eng, fas,
grc, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid.
--chart specifies chart output (in JSON format) to represent alternative romanizations.
--no-cache disables caching.
Examples: bin/uroman.pl < text/zho.txt
          bin/uroman.pl -l tur < text/tur.txt
          bin/uroman.pl -l heb --chart < text/heb.txt
          bin/uroman.pl < test/multi-script.txt > test/multi-script.uroman.txt
Identifying the input as Arabic, Belarusian, Bulgarian, English, Farsi, German,
Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian,
Lithuanian, North Macedonian, Russian, Serbian, Turkish, Ukrainian, Uyghur or Yiddish
will improve romanization for those languages as some letters in those languages
have different sound values from other languages using the same script.
The language code has no effect for other languages in this version.
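The usage line and options above could be driven from Python roughly as follows. This is a minimal sketch, not part of uroman: the helper names build_uroman_command and romanize are hypothetical, and it assumes that perl is on the PATH and that the script lives at bin/uroman.pl as in the examples.

```python
import subprocess


def build_uroman_command(lang=None, chart=False, no_cache=False,
                         uroman_path="bin/uroman.pl"):
    """Assemble the argument list for one uroman.pl call.

    uroman_path is an assumption; point it at your own checkout.
    """
    cmd = ["perl", uroman_path]
    if lang is not None:
        cmd += ["-l", lang]          # optional language code, e.g. "tur"
    if chart:
        cmd.append("--chart")        # chart output in JSON format
    if no_cache:
        cmd.append("--no-cache")     # disable caching
    return cmd


def romanize(text, **options):
    """Pipe text through uroman.pl on STDIN and return its STDOUT."""
    result = subprocess.run(build_uroman_command(**options), input=text,
                            capture_output=True, encoding="utf-8", check=True)
    return result.stdout
```

For example, romanize(open("text/tur.txt", encoding="utf-8").read(), lang="tur") would correspond to the second command-line example.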
Bibliography: Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Demo Track. [Best Demo Paper Award]
Changes in version 1.2.8
* Improved support for Georgian.
* Updated UnicodeData.txt to version 13 (2021) with several new scripts (10% larger).
* Preserve various symbols (as opposed to mapping to the symbols' names).
* Various small improvements.
Changes in version 1.2.7
* Improved support for Pashto.
Changes in version 1.2.6
* Improved support for Ukrainian, Russian and Ogham (ancient Irish script).
* Added support for English Braille.
* Added alternative Romanization for North Macedonian and Serbian (mkd2/srp2)
reflecting a casual style that many native speakers of those languages use
when writing text in Latin script, e.g. non-accented single letters (e.g. "s")
rather than phonetically motivated combinations of letters (e.g. "sh").
* When a line starts with "::lcode xyz ", the new uroman version will switch to
that language for that line. This is used for the new reference test file.
* Various small improvements.
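The "::lcode xyz " line prefix described above makes it possible to mix languages in one input file. A small sketch of preparing such input in Python follows; the helper name tag_lines is hypothetical, and only the prefix format itself is taken from this README.

```python
def tag_lines(pairs):
    """Prefix each (lang_code, text) pair with a "::lcode xyz " directive.

    Returns one string with one tagged line per pair, ready to be sent
    to uroman on STDIN.
    """
    return "\n".join(f"::lcode {code} {text}" for code, text in pairs) + "\n"
```

Calling tag_lines([("tur", "merhaba"), ("rus", "привет")]) yields two lines, each starting with its own language directive.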
Changes in version 1.2.5
* Improved support for Armenian and eight languages using Cyrillic scripts.
-- For Serbian and Macedonian, which are often written in both Cyrillic
and Latin scripts, uroman will map both official versions to the same
romanized text, e.g. both "Ниш" and "Niš" will be mapped to "Nish" (which
properly reflects the pronunciation of the city's name).
For both Serbian and Macedonian, casual writers often use a simplified
Latin form without diacritics, e.g. "s" to represent not only Cyrillic "с"
and Latin "s", but also "ш" or "š", even if this conflates "s" and "sh" and
other such pairs. The casual romanization can be simulated by using
alternative uroman language codes "srp2" and "mkd2", which romanize
both "Ниш" and "Niš" to "Nis" to reflect the casual Latin spelling.
* Various small improvements.
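The difference between the official (srp/mkd) and casual (srp2/mkd2) styles described above can be illustrated with a toy character table. This is only an illustration built around the "Ниш"/"Niš" example; the tables and the function toy_romanize are made up for this sketch and are not uroman's actual data or algorithm.

```python
# Toy tables covering only the characters of the README's example.
OFFICIAL = {"Н": "N", "и": "i", "ш": "sh", "\u0161": "sh"}   # \u0161 is Latin "š"
CASUAL = {**OFFICIAL, "ш": "s", "\u0161": "s"}               # conflates "s" and "sh"


def toy_romanize(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)
```

With these tables, both spellings of the city name collapse to one romanization per style, which is the behavior the changelog entry describes.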
Changes in version 1.2.4
* Added support for Tifinagh (a script used for Berber languages).
* Fixed a bug that generated two empty lines for each empty line in cache mode.
Changes in version 1.2.3
* Exclude emojis, dingbats and many other pictographs from being romanized (e.g. to "face").
Changes in version 1.2
* Run-time improvement based on (1) token-based caching and (2) shortcut
romanization (identity) of ASCII strings for default 1-best (non-chart)
output. Speed-up by a factor of 10 for Bengali and Uyghur on medium and
large size texts.
* Incremental improvements for Farsi, Amharic, Russian, Hebrew and related
languages.
* Richer lattice structure (more alternatives) for "Romanization" of English
to support better matching to romanizations of other languages.
This changes output only when the --chart option is specified. The default
1-best output is unchanged; for ASCII characters it is always the input string.
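The two run-time ideas named above, token-based caching and the identity shortcut for ASCII strings, can be sketched in Python as follows. This is an illustration of the general technique, not uroman's Perl implementation; romanize_token stands in for whatever per-token romanizer is available, and splitting on single spaces is a simplifying assumption.

```python
def make_cached_romanizer(romanize_token):
    """Wrap a per-token romanizer with a cache and an ASCII shortcut."""
    cache = {}

    def romanize_line(line):
        out = []
        for token in line.split(" "):
            if token.isascii():
                # Shortcut: ASCII input is its own 1-best romanization.
                out.append(token)
            else:
                # Romanize each distinct non-ASCII token only once.
                if token not in cache:
                    cache[token] = romanize_token(token)
                out.append(cache[token])
        return " ".join(out)

    return romanize_line
```

The benefit depends on how often tokens repeat, which is why the speed-up is reported for medium and large texts.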
Changes in version 1.1 (major upgrade)
* Offers chart output (in JSON format) to represent alternative romanizations.
-- Location of first character is defined to be "line: 1, start:0, end:0".
* Incremental improvements of Hebrew and Greek romanization; Chinese numbers.
* Improved web-interface at http://www.isi.edu/~ulf/uroman.html
-- Shows corresponding original and romanization text in red
when hovering over a text segment.
-- Shows alternative romanizations when hovering over romanized text
marked by dotted underline.
-- Added right-to-left script detection and improved display for right-to-left
script text (as determined line by line).
-- On-page support for some scripts that are often not pre-installed on users'
computers (Burmese, Egyptian, Klingon).
Changes in version 1.0 (major upgrade)
* Upgraded principal internal data structure from string to lattice.
* Improvements mostly in vowelization of South and Southeast Asian languages.
* Vocalic 'r' more consistently treated as vowel (no additional vowel added).
* Repetition signs (Japanese/Chinese/Thai/Khmer/Lao) are mapped to superscript 2.
* Japanese Katakana middle dots now mapped to ASCII space.
* Tibetan intersyllabic mark now mapped to middle dot (U+00B7).
* Some corrections regarding analysis of Chinese numbers.
* Many more foreign diacritics and punctuation marks dropped or mapped to ASCII.
* Zero-width characters dropped, except line/sentence-initial byte order marks.
* Spaces normalized to ASCII space.
* Fixed bug that in some cases mapped signs (such as dagger or bullet) to their verbal descriptions.
* Tested against previous version of uroman with a new uroman visual diff tool.
* Almost an order of magnitude faster.
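Several of the character-level rules listed above (zero-width characters, space normalization, Katakana middle dot, Tibetan intersyllabic mark) can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not uroman's code: the set of zero-width characters is a guess at a reasonable selection, "space" is taken to mean Unicode category Zs, and only a line-initial byte order mark is preserved.

```python
import unicodedata

# Assumed selection of zero-width characters; uroman's own list may differ.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
PUNCT_MAP = {
    "\u30fb": " ",        # Katakana middle dot -> ASCII space
    "\u0f0b": "\u00b7",   # Tibetan intersyllabic mark -> middle dot (U+00B7)
}


def normalize_line(line):
    """Apply the simplified normalization rules to one line of text."""
    out = []
    for i, ch in enumerate(line):
        if ch in ZERO_WIDTH:
            if ch == "\ufeff" and i == 0:
                out.append(ch)          # keep a line-initial byte order mark
            continue                    # drop every other zero-width character
        if ch in PUNCT_MAP:
            out.append(PUNCT_MAP[ch])
        elif unicodedata.category(ch) == "Zs":
            out.append(" ")             # any space separator -> ASCII space
        else:
            out.append(ch)
    return "".join(out)
```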
Changes in version 0.7 (minor upgrade)
* Added script uroman-quick.pl for Arabic script languages, incl. Uyghur.
Much faster, pre-caching mapping of Arabic to Latin characters, simple greedy processing.
Will not convert material from non-Arabic blocks such as any (somewhat unusual) Cyrillic
or Chinese characters in Uyghur texts.
Changes in version 0.6 (minor upgrade)
* Added support for two letter-like characters used in Uzbek:
(1) character "ʻ" ("modifier letter turned comma", which modifies preceding "g" and "u" letters)
(2) character "ʼ" ("modifier letter apostrophe", which Uzbek uses to mark a glottal stop).
Both are now mapped to "'" (plain ASCII apostrophe).
* Added support for Uyghur vowel characters such as "ې" (Arabic e) and "ۆ" (Arabic oe)
even when they are not preceded by "ئ" (yeh with hamza above).
* Added support for Arabic semicolon "؛" and for Arabic ligature forms for phrases such as "ﷺ"
("sallallahou alayhe wasallam" = "prayer of God be upon him and his family and peace").
* Added robustness for Arabic letter presentation forms (initial/medial/final/isolated).
However, it is strongly recommended to normalize any presentation form Arabic letters
to their non-presentation form before calling uroman.
* Added force flush directive ($|=1;).
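The recommendation above, to normalize Arabic presentation forms before calling uroman, could be followed with a preprocessing step like the one below. It is a sketch, not an official uroman utility: the function names are hypothetical, and it relies on the compatibility decompositions in Python's unicodedata module. The word ligatures at U+FDF0 and above (such as "ﷺ") are deliberately left untouched so that uroman still sees them.

```python
import unicodedata

FORM_TAGS = {"<initial>", "<medial>", "<final>", "<isolated>"}


def is_presentation_form(ch):
    """True for Arabic Presentation Forms-A (below U+FDF0) and Forms-B."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDEF or 0xFE70 <= cp <= 0xFEFC


def normalize_presentation_forms(text):
    """Replace positional presentation forms by their base letter(s)."""
    out = []
    for ch in text:
        # decomposition() returns e.g. "<isolated> 0627", or "" if none.
        parts = unicodedata.decomposition(ch).split()
        if is_presentation_form(ch) and parts and parts[0] in FORM_TAGS:
            out.append("".join(chr(int(p, 16)) for p in parts[1:]))
        else:
            out.append(ch)
    return "".join(out)
```

A simpler alternative is unicodedata.normalize("NFKC", text), but that also rewrites many unrelated compatibility characters, so the narrower function may be preferable.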
Changes in version 0.5 (minor upgrade)
* Improvements for Uyghur (make sure to use language option: -l uig)
Changes in version 0.4 (minor upgrade)
* Improvements for Thai (special cases for vowel/consonant reordering, e.g. for "sara o"; dropped some aspiration 'h's)
* Minor change for Arabic (added "alef+fathatan" = "an")
New features in version 0.3
* Covers Mandarin (Chinese)
* Improved romanization for numerous languages
* Preserves capitalization (e.g. from Latin, Cyrillic, Greek scripts)
* Maps from native digits to Western numbers
* Faster for South Asian languages
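The mapping from native digits to Western numbers listed above can be illustrated for simple decimal digits with Python's unicodedata module. This sketch covers only characters that Unicode defines as decimal digits; it does not attempt number systems such as Chinese numerals, which uroman handles separately, and the function name is hypothetical.

```python
import unicodedata


def to_western_digits(text):
    """Replace every Unicode decimal digit by its ASCII counterpart."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)   # None for non-digits
        out.append(str(value) if value is not None else ch)
    return "".join(out)
```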
Other features
* Web interface: http://www.isi.edu/~ulf/uroman.html
* Vowelization is provided when locally computable, e.g. for many South Asian
languages and Tibetan.
Limitations
* This version of uroman assumes all CJK ideographs to be Mandarin (Chinese).
This means that Japanese kanji are incorrectly romanized; however, Japanese
hiragana and katakana are properly romanized.
* A romanizer is not a full transliterator. For example, this version of
uroman does not vowelize text that lacks explicit vowelization such as
normal text in Arabic and Hebrew (without diacritics/points).
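Because of the first limitation above, it can be useful to flag input lines that are probably Japanese and contain kanji before trusting their romanization. The following Python check is a heuristic sketch with hypothetical function names; it only looks at the basic Hiragana, Katakana and CJK Unified Ideographs ranges and will miss rarer characters.

```python
def has_kana(text):
    """True if the text contains a Hiragana or Katakana letter."""
    return any(0x3041 <= ord(ch) <= 0x3096 or 0x30A1 <= ord(ch) <= 0x30FA
               for ch in text)


def has_cjk_ideograph(text):
    """True if the text contains a character from the basic CJK block."""
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)


def likely_japanese_with_kanji(line):
    """Kana next to ideographs suggests Japanese text containing kanji,
    which this version of uroman would romanize as Mandarin."""
    return has_kana(line) and has_cjk_ideograph(line)
```

Lines for which this returns True could be routed to a Japanese-specific tool or reviewed manually.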