Internationalized domain names (IDNs)
What is an internationalized domain name (IDN)?
Internationalized domain names (IDNs) are domain names that either:
- are written in languages or scripts that use Latin letters with diacritics (accent marks, as in é or ü), or
- do not use the Latin alphabet at all
IDNs let people access the Internet using domain names in their own language and script. As Internet usage rises around the world, across many different languages and scripts, IDNs offer a way to connect with your target market in the language it uses.
What is ASCII?
ASCII, pronounced ask-ee, stands for the American Standard Code for Information Interchange. ASCII was originally based on the English alphabet and consists of 128 characters: the letters A-Z in upper and lower case, the digits 0-9, punctuation, the space, and a set of control codes. Each of these 128 characters is assigned a number from 0 to 127, which represents it when data is transferred from one computer to another. While ASCII was originally developed for teletypewriters (devices used to send and receive messages), it found broader application with the development of personal computers.
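As a small illustration of the character-to-number mapping described above, Python's built-in ord() and chr() convert between a character and its code number. The helper name ascii_table is hypothetical and exists only for this sketch.

```python
def ascii_table(chars: str) -> dict:
    """Map each character in `chars` to its ASCII code number."""
    return {ch: ord(ch) for ch in chars}

# A few printable ASCII characters and their numbers.
for ch, code in ascii_table("Aa0- ").items():
    print(repr(ch), code)

# chr() goes the other way: from the number back to the character.
print(chr(65))
```

The same two built-ins also work for characters outside ASCII, where the numbers are simply larger than 127.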
What is non-ASCII?
Non-ASCII characters are those that are not part of the ASCII character set. These characters include letters, symbols, and characters from various languages and scripts around the world. As computer systems evolved to support multiple languages and internationalization, extended character encodings like UTF-8 (Unicode Transformation Format 8-bit) became more common. UTF-8 can represent a much larger range of characters compared to ASCII, making it suitable for a global audience.
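A minimal sketch of how a program might pick out the non-ASCII characters in a name, assuming "non-ASCII" simply means a code number above 127. The function name non_ascii_chars is hypothetical.

```python
def non_ascii_chars(text: str) -> list:
    """Return the characters in `text` whose code number is above 127."""
    return [ch for ch in text if ord(ch) > 127]

print(non_ascii_chars("example.com"))
print(non_ascii_chars("münich.com"))
```

Python 3.7 and later also provide str.isascii() for a plain yes/no answer.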
What is UTF-8?
UTF-8 (Unicode Transformation Format 8-bit) is a character encoding that's designed to encode a wide range of characters from various languages and scripts in a standardized and efficient manner. It's part of the Unicode standard, which aims to provide a universal character encoding that covers all the world's writing systems and characters. UTF-8 is a variable-width encoding, meaning that different characters can be represented using different numbers of bytes: it uses one to four bytes per character. Basic ASCII characters (those in the ASCII character set) are represented using a single byte, which makes UTF-8 backward compatible with ASCII. UTF-8 can also encode characters from outside the ASCII range, such as non-Latin scripts, emojis, and more.
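The variable width can be seen by encoding a few sample characters and looking at the resulting bytes. This is only a sketch; the sample characters and the helper name utf8_hex are chosen for illustration (bytes.hex with a separator needs Python 3.8 or later).

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of `ch` as space-separated hex."""
    return ch.encode("utf-8").hex(" ")

for ch in ["a", "ü", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), "byte(s):", utf8_hex(ch))

# Backward compatibility: ASCII text has the same bytes in both encodings.
print("example.com".encode("utf-8") == "example.com".encode("ascii"))
```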
What is Unicode?
Unicode is a universal character encoding standard that supports characters in non-ASCII scripts as well as ASCII. The Internet was originally built on ASCII, which is based on the English alphabet and consists of only 128 characters. Unicode is designed to support the languages of the world and their unique character sets: it has room for over 1 million characters (1,114,112 code points). It does this by allowing more bits per character (a bit, short for binary digit, is the basic unit of information on a machine). An ASCII character needs only 7 bits, while a Unicode code point can need up to 21 bits. The extra range is necessary to cover languages such as Chinese, Arabic, and Russian. Unicode text can be stored in several encodings, of which UTF-8 and UTF-16 are the two most common. UTF-8 has become the usual standard on the web because it adjusts the number of bytes used depending on the character, so ASCII characters in UTF-8 take up only one byte each.
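To make the difference between a Unicode code point and its encoded size concrete, the following sketch prints the code point of a few sample characters next to their size in UTF-8 and UTF-16. The helper name describe is hypothetical, and "utf-16-le" is used so that no byte-order mark is counted.

```python
import sys

def describe(ch: str) -> tuple:
    """Return (code point, UTF-8 size in bytes, UTF-16 size in bytes)."""
    code_point = f"U+{ord(ch):04X}"
    return (code_point, len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))

for ch in ["A", "ü", "中", "😀"]:
    print(ch, describe(ch))

# The size of the Unicode code space: code points 0 through 0x10FFFF.
print(sys.maxunicode + 1)
```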
What is Punycode?
Punycode is a way to represent Internationalized Domain Names (IDNs) using the limited character set (a-z, 0-9, and the hyphen) supported by the domain name system. For example, "münich" is encoded as "mnich-kva". An IDN takes the Punycode encoding and adds the prefix "xn--" in front of it, so "münich.com" becomes "xn--mnich-kva.com".
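The conversion can be tried with Python's standard library, which ships a "punycode" codec and an "idna" codec. This is a sketch only: the built-in "idna" codec follows the older IDNA 2003 rules, so a registry or a newer library may treat some characters differently. The function names to_ascii_domain and to_unicode_domain are hypothetical.

```python
def to_ascii_domain(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII ("xn--") form."""
    return domain.encode("idna").decode("ascii")

def to_unicode_domain(domain: str) -> str:
    """Convert an ASCII ("xn--") domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")

# Raw Punycode for a single label, without the "xn--" prefix.
print("münich".encode("punycode").decode("ascii"))

# Full domain name conversion in both directions.
print(to_ascii_domain("münich.com"))
print(to_unicode_domain("xn--mnich-kva.com"))
```

Labels that are already plain ASCII, such as "com", pass through unchanged.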
Did the rules for .COM and .NET IDNs change?
Yes, there are new rules concerning .COM and .NET Internationalized Domain Name (IDN) registrations. The rules changed on March 21, 2005. The rules that were put into place are listed below:
1. Verisign, the central registry of .COM and .NET, is taking a more restrictive approach as to what characters are permitted within IDN registrations that contain the ENG and GER language tags. Specifically:
- Domains registered with the language tag ENG may consist only of the characters a-z, 0-9, and -. The ENG table is being retained because characters could be added to it in the future, and registrations using those characters would then be IDNs. In the interim, no new IDNs can be registered with the ENG language tag.
- Domains registered with the language tag GER may consist only of the characters a-z, 0-9, -, ä, ö, and ü. The ß character will continue to be disallowed, as is currently the case, following the IDNA RFCs.
- At this time, existing registrations that are tagged as ENG or GER will remain in the zone, unaffected by this change. No future changes are envisioned except as noted in the following.
2. With the exception of the characters 0-9 and the dash, domains that commingle Latin and Cyrillic characters will no longer be permitted for any language tag. At this time Verisign will not be making any changes to existing registrations that commingle Latin and Cyrillic characters. However, there may be a need in the future to place existing registrations on REGISTRY-HOLD, because they may no longer comply with registry specifications.
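The two rules above can be approximated in a few lines of code. This is a rough sketch under stated assumptions, not Verisign's actual validation: the character sets are copied from the rules above, Latin and Cyrillic characters are recognized by their Unicode character names, and every name in the code (ALLOWED_BY_TAG, allowed_for_tag, scripts_used, mixes_latin_and_cyrillic) is hypothetical. Each function expects a single label without the .COM or .NET suffix.

```python
import unicodedata

# Character sets taken from rule 1; these names are illustrative only
# and are not part of any registry API.
BASE = set("abcdefghijklmnopqrstuvwxyz0123456789-")
ALLOWED_BY_TAG = {
    "ENG": BASE,
    "GER": BASE | set("äöü"),
}

def allowed_for_tag(label: str, tag: str) -> bool:
    """Rule 1: every character must be in the set for the language tag."""
    return all(ch in ALLOWED_BY_TAG[tag] for ch in label.lower())

def scripts_used(label: str) -> set:
    """Return which of Latin and Cyrillic appear in the label."""
    found = set()
    for ch in label:
        # Digits and the dash have names like "DIGIT TWO", so they
        # match neither script and are ignored, as rule 2 allows.
        name = unicodedata.name(ch, "")
        for script in ("LATIN", "CYRILLIC"):
            if name.startswith(script):
                found.add(script)
    return found

def mixes_latin_and_cyrillic(label: str) -> bool:
    """Rule 2: True if the label commingles Latin and Cyrillic letters."""
    return scripts_used(label) == {"LATIN", "CYRILLIC"}

print(allowed_for_tag("münchen", "GER"))
print(allowed_for_tag("straße", "GER"))
# "\u0430" is the Cyrillic letter that looks like a Latin "a".
print(mixes_latin_and_cyrillic("p\u0430ypal"))
```

The last example shows why the commingling rule matters: the label looks like an ordinary Latin word but contains one Cyrillic letter.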