Why do Unicode code points appear as U+<codepoint>?

For example, U+2202 represents the character ∂. Why not U- (dash or hyphen character) or anything else?
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on the Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990s should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits.
The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines the symbols U+HHHH and U-HHHHHHHH (p. 559).
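Purely as an illustration (my own sketch, not anything defined by the standard; the helper names are made up), here is how those two notations could be produced in Python:

```python
def u_plus(cp: int) -> str:
    # "U+" followed by at least four hex digits; longer for code points above U+FFFF.
    return f"U+{cp:04X}"

def u_minus(cp: int) -> str:
    # "U-" followed by exactly eight hex digits.
    return f"U-{cp:08X}"

print(u_plus(0x2202), u_minus(0x2202))      # U+2202 U-00002202
print(u_plus(0x10FFFF), u_minus(0x10FFFF))  # U+10FFFF U-0010FFFF
```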
The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the
Python language defines the following string literals
:
u'xyz'
to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx'
to indicate a string with a unicode character denoted by four hex digits
'\Uxxxxxxxx'
to indicate a string with a unicode character denoted by eight hex digits
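A short sketch of these literals in practice (assuming Python 3, where the u prefix is accepted but redundant):

```python
s1 = u'xyz'           # a Unicode string literal; the u prefix is optional in Python 3
s2 = '\u2202'         # four hex digits: the character at code point U+2202, i.e. ∂
s3 = '\U0001F600'     # eight hex digits: a code point above U+FFFF

print(s2, f"U+{ord(s2):04X}")   # ∂ U+2202
print(s3, f"U+{ord(s3):04X}")   # 😀 U+1F600
```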
It depends on what version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.
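To make the code unit versus code point distinction concrete, here is a small Python sketch of my own (not from Wikipedia or the standard), using U+1D11E as an example: it is a single code point, but in UTF-16 it is carried by two 16-bit code units (a surrogate pair).

```python
ch = '\U0001D11E'   # MUSICAL SYMBOL G CLEF, one code point above U+FFFF
print(f"code point: U+{ord(ch):04X}")   # code point: U+1D11E

# Encode to UTF-16 (big-endian, no BOM) and read the 16-bit code units back out.
data = ch.encode('utf-16-be')
code_units = [int.from_bytes(data[i:i+2], 'big') for i in range(0, len(data), 2)]
print(['U+%04X' % u for u in code_units])   # ['U+D834', 'U+DD1E'] (the surrogate pair)
```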