I realize this is pretty basic; I have been reading about Unicode on Wikipedia and wherever it points, but the meaning of the "U+0000" notation is never completely explained. It appears to me that "U" always equals 0.

Why is that "U+" part of the notation? What exactly does it mean? (It appears to be some base value, but I cannot understand when or why it would ever be non-zero.)

Also, if I receive a string of text from some other source, how do I know whether that string is encoded as UTF-8, UTF-16, or UTF-32? Is there some way I can automatically determine that from context?

  • From Wikipedia, article Unicode, section Architecture and Terminology:

    Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number. For code points in the Basic Multilingual Plane (BMP), four digits are used (e.g., U+0058 for the character LATIN CAPITAL LETTER X); for code points outside the BMP, five or six digits are used.

    This convention was introduced so that readers understand that the code point is specifically a Unicode code point. For example, the letter ă (LATIN SMALL LETTER A WITH BREVE) is U+0103; in Code Page 852 it has the code 0xC7, and in Code Page 1250 it has the code 0xE3, but when I write U+0103 everybody understands that I mean the Unicode code point and can look it up.
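
    As a quick illustration of that distinction, here is a small Python sketch ("cp852" and "cp1250" are the standard codec names Python uses for those code pages): the same abstract code point U+0103 turns into different bytes depending on the encoding.

     # One code point, several byte representations.
     ch = "\u0103"               # LATIN SMALL LETTER A WITH BREVE (ă)
     print(hex(ord(ch)))         # 0x103        -> the code point, written U+0103
     print(ch.encode("cp852"))   # b'\xc7'      -> Code Page 852
     print(ch.encode("cp1250"))  # b'\xe3'      -> Code Page 1250
     print(ch.encode("utf-8"))   # b'\xc4\x83'  -> UTF-8, two bytes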

  • For languages written with the Latin alphabet, UTF-16 and UTF-32 strings will most likely contain lots and lots of bytes with the value 0, which should not appear in UTF-8-encoded strings. By looking at which bytes are zero you can also infer the byte order of UTF-16 and UTF-32 strings, even in the absence of a Byte Order Mark.

    So for example if you get the bytes

     0xC3 0x89 0x70 0xC3 0xA9 0x65
    

    this is most likely Épée in UTF-8 encoding. In big-endian UTF-16 this would be

     0x00 0xC9 0x00 0x70 0x00 0xE9 0x00 0x65
    

    (Note how every byte at an even offset, counting from zero, is zero.)
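
    To make that rule of thumb concrete, here is a minimal sketch in Python (the function name guess_encoding is mine, not a standard API, and the heuristic only covers the Latin-script case discussed above; it ignores UTF-32 and the BOM):

     def guess_encoding(data: bytes) -> str:
         """Rough guess based only on where the zero bytes fall."""
         if 0 not in data:
             return "utf-8"      # no NUL bytes at all: UTF-8 (or plain ASCII)
         if all(b == 0 for b in data[0::2]):
             return "utf-16-be"  # zeros at even offsets: high byte comes first
         if all(b == 0 for b in data[1::2]):
             return "utf-16-le"  # zeros at odd offsets: low byte comes first
         return "unknown"

     print(guess_encoding(bytes([0xC3, 0x89, 0x70, 0xC3, 0xA9, 0x65])))              # utf-8
     print(guess_encoding(bytes([0x00, 0xC9, 0x00, 0x70, 0x00, 0xE9, 0x00, 0x65])))  # utf-16-be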

    So I learned that the "U" isn't added to anything; it's just a notation, similar to "0x" preceding a hexadecimal number or "H" following it (which seems to be deprecated, thankfully). It also looks like everything I am seeing, in terms of incoming text strings, is UTF-8, so no character outside the 7-bit range of ASCII is encoded as a single byte. E.g. the ¢ character is never the single byte 0xA2 in UTF-8; if you want ¢ in UTF-8, it must be encoded as the two bytes 0xC2 0xA2. So with UTF-8 there is no "extended ASCII", and all of those characters have to be two or more bytes. – robert bristow-johnson May 23, 2018 at 21:11

    @robertbristow-johnson: Yes, some languages use the escape notation \u1234 (and possibly \U103456). Note that U+XXXX (and \u) is decoded: it is a logical (and abstract) representation of the character, identifying a character in the Unicode tables without saying how it will be laid out as bytes. 0xAA, by contrast, is encoded, i.e. a physical representation of the bytes. – Giacomo Catenazzi May 24, 2018 at 8:42
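
    As a quick sanity check of the point in the first comment above (a Python sketch; latin-1 stands in here for a single-byte "extended ASCII" code page):

     print("¢".encode("utf-8"))    # b'\xc2\xa2'  -> always two bytes in UTF-8
     print("¢".encode("latin-1"))  # b'\xa2'      -> the single-byte 0xA2 exists only in legacy encodings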
