Why do Unicode code points appear as U+<codepoint>?

For example, U+2202 represents the character ∂. Why not U- (dash or hyphen character) or anything else?
The Unicode Standard needs some notation for talking about code points and character names. It adopted the convention of "U+" followed by four or more hexadecimal digits at least as far back as The Unicode Standard, version 2.0.0, published in 1996 (source: archived PDF copy on the Unicode Consortium web site).
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
My personal recollection from early-1990s software industry discussions about Unicode is that a convention of "U+" followed by four hexadecimal digits was common during the Unicode 1.0 and Unicode 2.0 era. At the time, Unicode was seen as a 16-bit system. With the advent of Unicode 3.0 and the encoding of characters at code points of U+010000 and above, the convention of "U-" followed by six hexadecimal digits came into use, specifically to highlight the extra two digits in the number. (Or maybe it was the other way around, a shift from "U-" to "U+".) In my experience, the "U+" convention is now much more common than the "U-" convention, and few people use the difference between "U+" and "U-" to indicate the number of digits.
I wasn't able to find documentation of the shift from "U+" to "U-", though. Archived mailing list messages from the 1990s should have evidence of it, but I can't conveniently point to any. The Unicode Standard 2.0 declared, "Unicode character codes have a uniform width of 16 bits." (p. 2-3). It laid down its convention that "an individual Unicode value is expressed as U+nnnn, where nnnn is a four digit number in hexadecimal notation" (p. 1-5). Surrogate values were allocated, but no character codes were defined above U+FFFF, and there was no mention of UTF-16 or UTF-32. It used "U+" with four digits.
The Unicode Standard 3.0.0, published in 2000, defined UTF-16 (p. 46-47) and discussed code points of U+010000 and above. It used "U+" with four digits in some places, and with six digits in other places. The firmest trace I found was in The Unicode Standard, version 6.0.0, where a table of BNF syntax notation defines the symbols U+HHHH and U-HHHHHHHH (p. 559).
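Purely as an illustration (my own sketch, not anything defined by the standard; the helper names are made up), here is how those two notations could be produced in Python:

```python
def u_plus(cp: int) -> str:
    # "U+" followed by at least four hex digits; longer for code points above U+FFFF.
    return f"U+{cp:04X}"

def u_minus(cp: int) -> str:
    # "U-" followed by exactly eight hex digits.
    return f"U-{cp:08X}"

print(u_plus(0x2202), u_minus(0x2202))      # U+2202 U-00002202
print(u_plus(0x10FFFF), u_minus(0x10FFFF))  # U+10FFFF U-0010FFFF
```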
The "U+" notation is not the only convention for representing Unicode code points or code units. For instance, the
Python language defines the following string literals
:
u'xyz'
to indicate a Unicode string, a sequence of Unicode characters
'\uxxxx'
to indicate a string with a unicode character denoted by four hex digits
'\Uxxxxxxxx'
to indicate a string with a unicode character denoted by eight hex digits
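A short sketch of these literals in practice (assuming Python 3, where the u prefix is accepted but redundant):

```python
s1 = u'xyz'           # a Unicode string literal; the u prefix is optional in Python 3
s2 = '\u2202'         # four hex digits: the character at code point U+2202, i.e. ∂
s3 = '\U0001F600'     # eight hex digits: a code point above U+FFFF

print(s2, f"U+{ord(s2):04X}")   # ∂ U+2202
print(s3, f"U+{ord(s3):04X}")   # 😀 U+1F600
```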
It depends on what version of the Unicode standard you are talking about. From Wikipedia:

Older versions of the standard used similar notations, but with slightly different rules. For example, Unicode 3.0 used "U-" followed by eight digits, and allowed "U+" to be used only with exactly four digits to indicate a code unit, not a code point.
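To make the code unit versus code point distinction concrete, here is a small Python sketch of my own (not from Wikipedia or the standard), using U+1D11E as an example: it is a single code point, but in UTF-16 it is carried by two 16-bit code units (a surrogate pair).

```python
ch = '\U0001D11E'   # MUSICAL SYMBOL G CLEF, one code point above U+FFFF
print(f"code point: U+{ord(ch):04X}")   # code point: U+1D11E

# Encode to UTF-16 (big-endian, no BOM) and read the 16-bit code units back out.
data = ch.encode('utf-16-be')
code_units = [int.from_bytes(data[i:i+2], 'big') for i in range(0, len(data), 2)]
print(['U+%04X' % u for u in code_units])   # ['U+D834', 'U+DD1E'] (the surrogate pair)
```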