Homoglyph

From Wikipedia, the free encyclopedia

Different glyphs which are visually similar

The homoglyphs U+0061 a LATIN SMALL LETTER A and U+0430 а CYRILLIC SMALL LETTER A overlaid. In the image, both characters are set in Helvetica LT Std Roman.

In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar but may have differing meaning. The designation is also applied to sequences of characters sharing these properties.

In 2008, the Unicode Consortium published its Technical Report #36^[1] on a range of issues deriving from the visual similarity of characters, both within a single script and across different scripts.

Examples of homoglyphic symbols are (a) the diaeresis and umlaut (both a pair of dots, but with different meanings, although encoded with the same code points); and (b) the hyphen and minus sign (both a short horizontal stroke, but with different meanings, although often encoded with the same code point). Among digits and letters, the digit 1 and lowercase l, and likewise the digit 0 and capital O, are always encoded separately but are given very similar glyphs in many typefaces. Virtually every homoglyphic pair of characters can be differentiated graphically with clearly distinguishable glyphs and separate code points, but this is not always done. Typefaces that do not emphatically distinguish the one/el and zero/oh homoglyphs are considered unsuitable for writing formulas, URLs, source code, IDs and other text where characters cannot always be differentiated without context. Fonts that distinguish these glyphs, for example by means of a slashed zero, are preferred for those uses.

Related terms

The term homograph is sometimes misused as a synonym for homoglyph, but in the usual linguistic sense, homographs are words that are spelled the same but have different meanings: a property of words, not characters.

Allographs are typeface design variants that look different but mean the same thing - for example, two differently drawn forms of the same letter, or a dollar sign with one or two strokes. The term synoglyph has a similar but slightly more abstract meaning - for example the symbol <£> and the letter <L> (in Lsd) both mean the pound sterling,^[2] but only in that context. Allographs and synoglyphs are also known informally as display variants.

Trema for umlaut and diaeresis

The trema is used to indicate umlaut or diaeresis. In the days of early mechanical typewriters, it was typed with the same key as the double inverted comma, using the "backspace and over-type" technique. However, the umlaut originated specifically as a pair of short vertical lines, not two dots (see Sütterlin). Incidentally, the two dots above the letter <ë> in Albanian are described as a diaeresis but do not fulfil the function of a diaeresis.^[3]

0 and O; 1, l and I

Two common and important sets of homoglyphs in use today are the digit zero <0> and the capital letter <O>; and the digit one <1>, the lowercase letter <l> and the uppercase letter <I>. In the early days of mechanical typewriters it was common to omit keys for the digits <1> and <0>, and the keys for the letters <l> and <O> produced glyphs used for both characters. As typists who had used such typewriters transitioned in the 1970s and 1980s to being computer keyboard operators, their old keyboarding habits continued with them and were an occasional source of confusion.

Most current type designs carefully distinguish between these homoglyphs, usually by drawing the digit zero narrower and drawing the digit one with prominent serifs. Early computer print-outs went even further and marked the zero with a slash or dot, which led to a new conflict involving the Scandinavian letter <Ø> and the Greek letter <Φ> (phi). The redesigning of character types to differentiate these characters has meant less confusion.^[4]

Some type designs conform to the DIN 1450 legibility standard by carefully designing such characters to be easy to distinguish: a slashed zero to distinguish it from capital <O>; lowercase <l> with a tail and uppercase <I> with serifs to distinguish them from the digit <1>; distinguishing the numeral <5> from the capital <S>; etc.^[5]

An example of confusion due to near-homoglyphs arose from the use of a <y> to represent a <þ> (thorn). Early English typesetters imported Dutch typesets that did not contain the latter character, so they used the letter <y> instead because (in Blackletter typeface) the two look sufficiently similar.^[6] This has led in modern times to such phenomena as Ye olde shoppe, implying incorrectly that the word the was formerly written ye /jiː/ rather than þe. The spelling of the name Menzies (pronounced Mengis and originally spelled Menȝies) arose for the same reason: the letter <z> was substituted for <ȝ> (yogh).

Multi-letter homoglyphs

The letter <m> and the letters <rn> in the typefaces Arial, Calibri, Times New Roman, Cambria, Walbaum-Fraktur, and Comic Sans

Stefan Szczotkowski looks like Aeffan Szczotkowski on the gravestone.

Some other combinations of letters look similar, for instance <rn> looks similar to <m>, <cl> looks similar to <d>, and <vv> looks similar to <w>.
In certain narrow-spaced fonts (such as Tahoma), placing the letter <c> next to a letter such as <j>, <l> or <i> will create a homoglyph, such as <cj cl ci> (resembling <g d a>).
When some characters are placed next to each other, seen together at a glance they give the visual impression of another, unrelated character. More precisely, some typographic ligatures can look similar to standalone glyphs: in some typefaces or fonts, a ligature of two letters can resemble a single, unrelated letter. This potential for confusion is sometimes an argument made against the use of ligatures.^[citation needed]

Canonicalization

Homoglyphs of all kinds can be detected through a process called 'dual canonicalization'.^[4] The first step in this process is to identify homoglyph sets, namely characters appearing the same to a given observer. From here, a single token is specified to represent the homoglyph set. This token is called a canon. The next step is to convert each character in the text to the corresponding canon in a process called canonicalization. If the canons of two runs of text are the same but the original text is different, then a homoglyph exists in the text.
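The steps described above can be sketched in Python. This is only an illustration: the homoglyph sets below are a small hand-picked assumption rather than an authoritative table, and the names HOMOGLYPH_SETS, canonicalize and contains_homoglyph are hypothetical and do not come from the cited paper.

```python
# Hypothetical homoglyph sets (an assumption for illustration only).
# The first member of each set serves as the canon for that set.
HOMOGLYPH_SETS = [
    ["a", "\u0430"],            # Latin a, Cyrillic a
    ["e", "\u0435"],            # Latin e, Cyrillic ie
    ["o", "\u043e", "\u03bf"],  # Latin o, Cyrillic o, Greek omicron
    ["O", "0"],                 # capital O, digit zero
    ["l", "1", "I"],            # lowercase L, digit one, capital i
]

# Map every character in a set to the canon of that set.
CANON = {ch: group[0] for group in HOMOGLYPH_SETS for ch in group}


def canonicalize(text):
    """Replace each character with its canon; other characters pass through."""
    return "".join(CANON.get(ch, ch) for ch in text)


def contains_homoglyph(first, second):
    """True if the two texts differ but their canonical forms are the same."""
    return first != second and canonicalize(first) == canonicalize(second)


print(contains_homoglyph("paypal", "p\u0430yp\u0430l"))  # Cyrillic a substituted
print(contains_homoglyph("paypal", "paypal"))            # identical text
```

A practical detector would need a far larger table, such as one derived from the Unicode confusables data listed under External links.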

Homoglyph prevention

Homoglyph attacks can be mitigated through a combination of user awareness and technical measures. Users can be taught about the risks of homoglyph attacks and urged to inspect URLs carefully before clicking.^[7] Security tools that scan domain names for homoglyph variations can automate the detection of potential threats. Domain name monitoring and stricter registration policies can also help identify and neutralize homoglyph-related risks promptly.
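As a rough illustration of domain name monitoring, the following Python sketch flags observed names that differ from a protected name by a single homoglyph substitution. The substitution table and the function names are assumptions made for this example; real monitoring tools use much larger tables and also consider multiple substitutions.

```python
# Hypothetical single-character substitutions (an assumption for illustration).
CONFUSABLES = {
    "a": ["\u0430"],            # Cyrillic a
    "e": ["\u0435"],            # Cyrillic ie
    "o": ["\u043e", "0"],       # Cyrillic o, digit zero
    "l": ["1", "I"],            # digit one, capital i
    "i": ["\u0456", "\u00ed"],  # Cyrillic i, Latin i with acute
}


def single_substitution_variants(label):
    """Return every label formed by replacing one character with a lookalike."""
    variants = set()
    for index, ch in enumerate(label):
        for substitute in CONFUSABLES.get(ch, []):
            variants.add(label[:index] + substitute + label[index + 1:])
    return variants


def flag_lookalikes(protected_label, observed_labels):
    """Return the observed labels that are one-substitution lookalikes."""
    variants = single_substitution_variants(protected_label)
    return [name for name in observed_labels if name in variants]


observed = ["example", "examp1e", "sample", "ex\u0430mple"]
print(flag_lookalikes("example", observed))
```

The protected name itself is never flagged, because every variant differs from it in exactly one position.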

Unicode homoglyphs

The three most prominent European alphabets (Greek, Cyrillic and Latin) share many letter forms that are encoded in Unicode under separate code points.
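The separate encoding of shared letter forms can be inspected with Python's standard unicodedata module. The helper name describe is an assumption made for this example; the characters are written as escapes so that they cannot be confused in the source.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Latin, Greek and Cyrillic capital letters that share one letter form.
for ch in ["A", "\u0391", "\u0410"]:
    print(describe(ch))

# The lowercase Latin and Cyrillic letters are distinct strings, and
# Unicode normalization (here NFKC) does not merge them.
print("a" == "\u0430")
print(unicodedata.normalize("NFKC", "\u0430") == "a")
```

Both comparisons print False: the characters look alike but are unrelated as far as string equality and normalization are concerned.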

Unicode has code points for many strongly homoglyphic characters, known as "confusables".^[1] These present security risks in a variety of situations (addressed in UTR#36)^[8] and were called to particular attention in regard to internationalized domain names. In theory at least, one might deliberately spoof a domain name by replacing one character with its homoglyph, thus creating a second domain name, not readily distinguishable from the first, that can be exploited in phishing (see main article IDN homograph attack). In many typefaces, the Greek letter <Α>, the Cyrillic letter <А> and the Latin letter <A> are visually identical, as are the Latin letter <a> and the Cyrillic letter <а> (the same applies to the Latin letters "aBceHKopTxy" and the Cyrillic letters "аВсеНКорТху"). A domain name can be spoofed simply by substituting one of these forms for another in a separately registered name. There are also many examples of near-homoglyphs within the same script, such as <í> (with an acute accent) and <i> (with a tittle); <É> (E-acute), <Ė> (E with dot above) and <È> (E-grave); and <Í> (capital i with an acute accent) and <ĺ> (lowercase L with acute accent). When discussing this specific security issue, any two sequences of similar characters may be assessed in terms of their potential to be taken as a homoglyph pair or, if the sequences clearly appear to be words, as pseudo-homographs (noting again that these terms may themselves cause confusion in other contexts). In the Chinese language, many simplified Chinese characters are homoglyphs of the corresponding traditional Chinese characters.
Efforts by TLD registries and Web browser designers aim to minimize the risks of homoglyphic confusion. Commonly, this is achieved by prohibiting names which mix character sets from multiple languages (toys-Я-us.org, using the Cyrillic letter <Я>, would be invalid, but wíkipedia.org and wikipedia.org still exist as different websites); Canada's .ca registry goes one step further by requiring names which differ only in diacritics to have the same owner and same registrar.^[9] The handling of Chinese characters varies: in .org and .info, registration of one variant renders the other unavailable to anyone, while in .biz the traditional and simplified versions of the same name are delivered as a two-domain bundle, both of which point to the same domain name server.
Relevant documentation can be found both on the developers' Web sites and on an IDN Forum^[10] provided by ICANN.
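A mixed-script check of the kind described above might be approximated as follows. This sketch is a heuristic under stated assumptions: it takes the first word of each letter's Unicode character name as a stand-in for its script, which is not how registries or browsers actually determine scripts, and the function names are hypothetical.

```python
import unicodedata


def scripts_used(label):
    """Approximate the set of scripts used by the letters of a label.

    Heuristic assumption: the first word of a letter's Unicode name
    (LATIN, CYRILLIC, GREEK, ...) identifies its script. Digits and
    hyphens are ignored because they are shared between scripts.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label):
    """True if the letters of the label come from more than one script."""
    return len(scripts_used(label)) > 1


print(is_mixed_script("toys-\u042f-us"))  # Latin letters plus Cyrillic Ya
print(is_mixed_script("wikipedia"))       # Latin only
print(is_mixed_script("w\u00edkipedia"))  # Latin only, with an accented i
```

The last call returns False, which illustrates why a mixed-script rule alone does not catch near-homoglyphs within a single script.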

ES1845 JCUKEN-QWERTY hybrid layout keyboard

The Cyrillic letter <С> (U+0421 С CYRILLIC CAPITAL LETTER ES) not only looks like Latin <C> (U+0043 C LATIN CAPITAL LETTER C), but also occupies the same button in JCUKEN-QWERTY hybrid layout keyboards. This design nuance can be seen on the C/С button represented in the Keyboard Monument in Yekaterinburg.

See also

IDN homograph attack - Visually similar letters in domain names

Duplicate characters in Unicode - Unicode characters that have been encoded twice

Vehicle registration plates of Bosnia and Herzegovina use only numbers and letters that look the same in the Latin and Cyrillic alphabets.

Yaminjeongeum, South Korean language game of intentionally substituting Hangul characters for homoglyphs.

References

^ ^a ^b "UTR #36: Unicode Security Considerations". www.unicode.org.

^ Walton, Chas (October 7, 2020). "A writer's guide to diacritics and special characters". Text Wizard.

^ Describing these as homoglyphs is questionable as there are probably no languages in which the glyph can fulfil both these roles. It would be just as valid to describe, say, a grave accent as a homoglyph because it fulfils different roles in different languages.

^ ^a ^b Helfrich, James; Neff, Rick (2012). "Dual canonicalization: An answer to the homograph attack". 2012 eCrime Researchers Summit. eCrime Researchers Summit (eCrime), 2012. pp. 1-10. doi:10.1109/eCrime.2012.6489517. ISBN 978-1-4673-2543-1.

^ Nigel Tao, Chuck Bigelow, and Rob Pike. "Go fonts: DIN Legibility Standard". 2016.

^ Hill, Will (30 June 2020). "Chapter 25: Typography and the printed English text" (PDF). The Routledge Handbook of the English Writing System. Taylor & Francis. p. 6. ISBN 9780367581565. Archived from the original (PDF) on 10 July 2022. Retrieved 24 January 2024. The types used by Caxton and his contemporaries originated in Holland and Belgium, and did not provide for the continuing use of elements of the Old English alphabet such as thorn <þ>, eth <ð>, and yogh <ȝ>. The substitution of visually similar typographic forms has led to some anomalies which persist to this day in the reprinting of archaic texts and the spelling of regional words. The widely misunderstood 'ye' occurs through a habit of printer's usage that originates in Caxton's time, when printers would substitute the <y> (often accompanied by a superscript <e>) in place of the thorn <þ> or the eth <ð>, both of which were used to denote both the voiced and non-voiced sounds, /ð/ and /θ/ (Anderson, D. (1969) The Art of Written Forms. New York: Holt, Rinehart and Winston, p 169)

^ https://governance.dev/phishing-domain-check, accessed on February 12, 2024

^ "UTR #36: Unicode Security Considerations". unicode.org.

^ "Register a .CA in French!". Archived from the original on 2013-03-28. Retrieved 2013-03-29.

^ "ICANN Email Archives: [idn-guidelines]". forum.icann.org.

External links

https://www.unicode.org/Public/security/latest/confusables.txt - recommended confusable mapping for IDN.
