Characters, Strings, and Unicode

A string is a sequence of zero or more characters. A character may be an alphabetic letter, a digit, a punctuation mark, or a non-printing control code, which usually aids in formatting the text. On a computer, each character is represented by a specific number. For example, the character "A" is usually represented by the number 65, while the character "3" is usually represented by the number 51.

This representation is convenient, since a string of text characters can be readily stored as a series of small integers. For example, the word "Hello" is stored as 72, 101, 108, 108, 111. Couldn't be any simpler. How are the numbers assigned? It's just a matter of mutual consent among those who use them. As long as everyone agrees on the associations, the system works well. That said, we have experienced a certain amount of growing pains over the years. With the global growth of computer use, larger character sets are needed to represent all of the necessary characters. We have clearly reached the point where every programmer must consider alternate character sets for their applications. Failure to do so can carry severe penalties. By the time you find you can no longer read data files from an outside source, or text from the Internet, it will be too late. The following sections describe the most widely used character sets.
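As a small illustration of the mapping described above, the following sketch converts a string to its character codes and back again. Python is used here purely for illustration (it is not PowerBASIC code), and the helper names to_codes and from_codes are hypothetical.

```python
def to_codes(text):
    """Return the numeric code of each character in text."""
    return [ord(ch) for ch in text]


def from_codes(codes):
    """Rebuild a string from a list of character codes."""
    return "".join(chr(n) for n in codes)


hello_codes = to_codes("Hello")
print(hello_codes)               # [72, 101, 108, 108, 111]
print(from_codes(hello_codes))   # Hello
```

The round trip only works because both directions agree on the same table of associations.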

ASCII

ASCII was the first character set to be widely used on small computers. In fact, all of the other sets described here use the ASCII codes, unchanged, as a base. ASCII is a set of 128 characters, numbered from 0 to 127. It was designed for American English, so it defines only unaccented letters, digits, punctuation, and control codes. As long as you need only English text, ASCII works fairly well.

ASCII needs just 7 bits of storage per character, so it was convenient to store each character in an 8-bit byte. The eighth bit was simply ignored. Of course, that meant that the values from 128 to 255 were unused. That void wouldn't last long.
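The 7-bit limit can be checked directly: every ASCII value leaves the high (eighth) bit of its byte clear. The sketch below is illustrative Python, and fits_in_seven_bits is a hypothetical helper name.

```python
def fits_in_seven_bits(data):
    """Return True if every byte has its high (eighth) bit clear."""
    return all((byte & 0x80) == 0 for byte in data)


ascii_bytes = "Hello, world!".encode("ascii")
print(fits_in_seven_bits(ascii_bytes))        # True
print(fits_in_seven_bits(bytes([72, 200])))   # False: 200 needs the eighth bit
```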

OEM

OEM stands for "Original Equipment Manufacturer". IBM introduced the IBM PC in 1981, and along with it came IBM's own expanded character set. It's been known as the OEM character set ever since. In fact, that character set is still the default for the Windows Console Device on the very latest version of Windows.

The first 128 characters are identical to ASCII. However, IBM decided to put the remaining 128 values to use. They were assigned to the most common accented characters, line-drawing characters, and special symbols and punctuation. This was an improvement, but many characters needed by non-English languages were still unavailable. This led to new OEM character sets (German, Cyrillic...), each with a different interpretation of that second set of character codes. Naturally, this caused a good deal of confusion when trying to understand the contents of strings from an external source. Not an ideal solution.
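The confusion is easy to reproduce. The sketch below decodes one and the same byte under several OEM character sets, using the codec names that Python's standard library happens to give them (cp437 for the original IBM PC set, cp850 for Western Europe, cp866 for Cyrillic). The function name interpretations is hypothetical.

```python
def interpretations(raw, code_pages):
    """Decode the same bytes under each code page and collect the results."""
    return {name: raw.decode(name) for name in code_pages}


raw = bytes([0x82])  # a single byte from the upper half (128 to 255)

for name, text in interpretations(raw, ("cp437", "cp850", "cp866")).items():
    print(name, text)
```

The byte never changes; only the agreed table used to read it does. Without knowing which table the writer used, the reader cannot recover the intended character.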

ANSI

Some time later, the ANSI character set evolved. Once again, the first 128 characters are the same as ASCII, but there are many ways to handle the second set. A system called "code pages" selects which interpretation applies; it handles these characters accurately, even if it is cumbersome. In reality, many languages need hundreds or thousands of characters. Clearly, that many character codes can't possibly be squeezed into a byte. The solution? Multi-byte characters. Some characters are one byte, and some are more. If a particular character needs a multi-byte representation, a special ID byte (often called a lead byte) is inserted, followed by the identifying data. A multi-byte character may consist of two, three, or even more bytes. That special ID byte determines what data will follow.
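To make the ID byte idea concrete, here is a minimal sketch of a forward scan over text in code page 932 (Japanese), chosen only as one example of a multi-byte ANSI code page. The ID byte ranges in is_id_byte are an assumption that applies to that code page alone, and both function names are hypothetical.

```python
def is_id_byte(byte):
    """Assumed ID byte ranges for code page 932 only."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def split_characters(data):
    """Walk forward from the start, returning the bytes of each character."""
    chunks = []
    i = 0
    while i < len(data):
        width = 2 if is_id_byte(data[i]) else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


# "A", one two-byte character, then a real backslash.
encoded = "A\u30bd\\".encode("cp932")

print(encoded.find(b"\\"))         # naive byte search stops inside the two-byte character
print(split_characters(encoded))   # character-aware scan keeps it whole
```

The naive search reports a backslash one byte too early, because the data byte of the two-byte character has the same value as an ASCII backslash.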

Multi-byte ANSI poses a unique problem. You can't just scan your way through a string byte by byte, because some characters are multi-byte! You must take care to treat them as units, or your data will be corrupted. A word of warning... it's virtually impossible to scan backwards through a multi-byte string. That's because ANSI uses the same numeric values for both the ID byte and the data which follows. When you look backwards and find an ID value, you can't tell whether it's an ID or data. It just won't work well.
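The ambiguity can be shown with three bytes. In this sketch (again using code page 932 purely as an example), the string holds a two-byte character followed by "A". Reading from the true start gives the right answer; reading from the middle, as a backward scan might, does not.

```python
# A two-byte character followed by "A": three bytes in total.
encoded = "\u30e1A".encode("cp932")

# Correct reading, scanning forward from the start of the string.
correct = encoded.decode("cp932")

# Misreading: start at the second byte, which is really a data byte.
# Its value also happens to be a valid ID byte, so the decoder treats it
# as one and swallows the "A" as its data.
misread = encoded[1:].decode("cp932")

print(len(encoded), repr(correct), repr(misread))
```

Nothing in the second byte's value marks it as data, which is why the only safe approach is to scan forward from a position known to be a character boundary.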

UNICODE

Unicode was created to represent every language in a single character set. While there are several Unicode formats, we'll concentrate on the two in widest use: UTF-8 and UTF-16. PowerBASIC uses UTF-16, which stores each character as a two-byte unsigned word (the less common characters outside the Basic Multilingual Plane are stored as a pair of such words). UTF-16 is used natively by Windows, COM, Visual Basic, Java, etc.

UTF-16 UNICODE

Just as before, the first 128 values represent the ASCII characters. Other characters, primarily those used in non-English languages, have been assigned the higher values. At this time, and for the foreseeable future, UTF-16 is the character set of choice for all of your applications. It is the best way to store all of your text so that it remains intact and readable, whatever language it contains.
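A short sketch shows both points: ASCII characters keep their familiar values, and other characters simply receive higher ones, each held in a 16-bit word. This is illustrative Python rather than PowerBASIC, and utf16_words is a hypothetical helper name.

```python
import struct


def utf16_words(text):
    """Return the 16-bit words of text as unsigned integers."""
    raw = text.encode("utf-16-le")   # little-endian, no byte-order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


print(utf16_words("Hello"))          # [72, 101, 108, 108, 111]
print(utf16_words("\u00e9\u20ac"))   # [233, 8364]
```

The second line encodes an accented "e" and the euro sign, neither of which exists in ASCII.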

UTF-8 UNICODE

UTF-8 is something of a hybrid between ANSI and UTF-16. It is used when the size of the text is of utmost importance, which makes it an obvious choice for downloading from the Internet. UTF-8 uses the same single-byte characters for ASCII values. Further, it uses a similar scheme for multi-byte characters, with one notable exception: the values used for ID bytes and for data bytes never overlap! With that knowledge in hand, it is possible to scan backwards from any position. PowerBASIC does not support the use of UTF-8 within standard code. That's because UTF-8 is much slower in performance than UTF-16. That said, PowerBASIC does provide conversion functions to/from UTF-8, so you have it readily available for all of your Internet applications. UTF-8 files are byte-oriented and should be opened as an ANSI file (CHR=ANSI).
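Backward scanning in UTF-8 can be sketched in a few lines. Every data byte has the bit pattern 10xxxxxx, which no ID byte or ASCII byte ever uses, so stepping back over data bytes always lands on the start of a character. This is illustrative Python, not PowerBASIC's own conversion functions, and the function names are hypothetical.

```python
def is_data_byte(byte):
    """UTF-8 data bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def previous_char_start(data, pos):
    """Return the index where the character ending just before pos begins."""
    pos -= 1
    while pos > 0 and is_data_byte(data[pos]):
        pos -= 1
    return pos


# One-byte, two-byte, and three-byte characters, in that order.
encoded = "a\u00e9\u20ac".encode("utf-8")

pos = len(encoded)
starts = []
while pos > 0:
    pos = previous_char_start(encoded, pos)
    starts.append(pos)

print(starts)   # [3, 1, 0]
```

The same loop applied to multi-byte ANSI text could not be trusted, since there a data byte may carry the same value as an ID byte.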