
Aaron Selonke
10,323 Points

The relationship between Unicode and UTF-16....?

What I can make out so far is

1) Unicode is a character set that uses a 'code point' (a code made of numbers, and sometimes letters, since it is usually written in hexadecimal) to identify a character. This covers every character or symbol on the keyboard, as well as all of the characters of foreign alphabets. :ok:

2) UTF-16 is the default encoding in C#. It translates machine-readable binary data into numbers, which are then mapped to characters using the Unicode character set.
UTF-16 characters are made up of TWO bytes :ok:

I think what I've stated above is accurate; after this, things get fuzzy.

A byte is made up of 8 binary bits (eight ones and zeros), correct?
And a UTF-16 character is made up of two of these bytes (16 bits), correct?
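
Just to check my understanding of those sizes, here's a minimal console sketch (my own, not from the video):

using System;

class SizeCheck
{
    static void Main()
    {
        // sizeof gives the storage size in bytes for these built-in types
        Console.WriteLine(sizeof(byte)); // 1  (8 bits)
        Console.WriteLine(sizeof(char)); // 2  (16 bits; a C# char is one UTF-16 code unit)
    }
}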

In the video, Carling uses the UnicodeEncoding class to get the 2 bytes and insert them into a 2-item array.

byte[] unicodeBytes = UnicodeEncoding.Unicode.GetBytes(new char[] {'h'});
// returns  byte[2]{104,0}

What is {104, 0}?

I was expecting either two 8-digit strings of binary bits, or something like a Unicode code point, which for the lowercase letter 'h' is u0068.

But {104, 0}? How should this two-number array be understood? :shipit:

Aaron Selonke
10,323 Points

I reviewed the video again. Carling shows that in C# a byte is an 'integral type' and an 'unassigned 8-bit integer' with a value between 0 and 255. That sort of explains the two numbers returned in the array: they are unassigned 8-bit integers. Still, there isn't a clear connection between a byte, C#'s version of a byte (an 'unassigned' 8-bit integer), and the Unicode code point...
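
For reference, here's a quick sketch (my own, not from the video) of what one of those 8-bit values can hold:

using System;

class ByteBasics
{
    static void Main()
    {
        byte b = 104;                     // any value from 0 to 255 fits in one byte
        Console.WriteLine(byte.MinValue); // 0
        Console.WriteLine(byte.MaxValue); // 255
        Console.WriteLine((char)b);       // h  (casting back to a char gives 'h' again)
    }
}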

2 Answers

Steven Parker
231,007 Points

Those are the decimal byte values ("unsigned", not "unassigned").

Good job figuring that out. :+1: But the remaining issue is about storage vs. values. A byte is a unit of storage that is 8 bits in size. The values it can store can be expressed in many ways (there's a sketch after this list):

  • unsigned decimal: 0 - 255
  • signed decimal: -128 - 127
  • hexadecimal: 00 - FF
  • octal: 000 - 377
  • binary: 00000000 - 11111111
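
For example, here's how the first byte from your array (104) looks in each of those notations (a quick console sketch, assuming .NET's Convert helpers):

using System;

class ByteNotations
{
    static void Main()
    {
        byte b = 104;                              // the first byte from your array

        Console.WriteLine(b);                      // 104      (unsigned decimal)
        Console.WriteLine((sbyte)b);               // 104      (signed decimal; same here because 104 < 128)
        Console.WriteLine(b.ToString("X2"));       // 68       (hexadecimal)
        Console.WriteLine(Convert.ToString(b, 8)); // 150      (octal)
        Console.WriteLine(Convert.ToString(b, 2)); // 1101000  (binary)
    }
}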

Unicode, on the other hand, is a set of values whose range is too big to fit in a single byte. It is commonly stored as two bytes used together to give 16 bits of storage. The value of a particular Unicode character can also be expressed in many ways, similar to the ranges listed above, plus more, since it can be shown as two individual bytes (as it was in the code you included) or as one 16-bit value (like "u0068").
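
To tie the two bytes back to the code point, here's a small sketch (reusing the GetBytes call from your question) that reassembles them into a single 16-bit value:

using System;
using System.Text;

class UnicodeRoundTrip
{
    static void Main()
    {
        byte[] unicodeBytes = UnicodeEncoding.Unicode.GetBytes(new char[] { 'h' });
        // unicodeBytes is { 104, 0 }: the low byte comes first (little-endian), then the high byte

        // 0 * 256 + 104 = 104 = 0x0068, the Unicode code point of 'h'
        int codeUnit = unicodeBytes[0] + (unicodeBytes[1] << 8);
        Console.WriteLine("U+{0:X4}", codeUnit);   // U+0068

        // The encoding can also do the reverse translation back to a char
        char[] chars = UnicodeEncoding.Unicode.GetChars(unicodeBytes);
        Console.WriteLine(chars[0]);               // h
    }
}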

Does that clear it up?

Aaron Selonke
10,323 Points

I think I've got it from here. Appreciate the help as always, thanks!

HIDAYATULLAH ARGHANDABI
21,058 Points

To be noted: most programming languages read UTF-16.