MUCH ABOUT CHARACTER SETS FOR COMPUTERS

Computer History Vignettes

By Bob Bemer

This paper derives from the time that Berkeley Associates sued IBM for $300 million. Because it involved the ALT key on PCs, which is fairly similar to an Escape key, I was scheduled to be IBM's star witness. A question arose on how to explain the dispute to a jury that, at least at that time, might be unfamiliar with computers. I put the following explanations down to help do this.

NOTE: This story is not finished yet.

THEORY OF CODE SETS (for people not in that business)

Codes are things that stand for other things. Remember the old gag about the comedian's club where everyone laughed when one of them said "46", because they all knew what joke number 46 was in the comedian's jokebook? To children the number 25 may stand for the letter "Y", it being the 25th letter in the Roman alphabet. A picture of a heart may stand for love, a picture of a dagger for hate. That's the basis of the old rebus puzzles.

Codes are used for many reasons -- secrecy, ability to represent things in other ways, and (opposite to secrecy) ensuring that everyone understands some things in the same way.

Token is another name, perhaps more general, for a code. Tokens also stand for items, actions, anything. Macintosh users sometimes call a picture token an icon, which is really a Russian religious painting. Webster's Dictionary says that a white flag is a token of surrender (an action).

Computers usually handle data in fixed sizes. One hears the term "byte" for a group of (so many) bits. For 8 bits it is better to call it an "octet". Each byte can hold a coded character. So what is the difference between a coded character representation and a token? A token may be made up of several coded representations (character bytes). Like Chinese ideographs. For example, putting three characters (two characters for "woman" under one character for "roof") gives an ideograph for "trouble".

Sets are groups of things with certain properties in common. Such as paper money of the United States. We have bills for 1, 5, 10, 20, 50, 100, 500 (etc.) dollars. That's the "paper money set". Another set is the set of $1 money; it has two members, the dollar bill and the dollar coin. Note that the number of members in a set is not the number of examples of any one that you have. With five one-dollar bills and four one-dollar coins I have nine dollars, but the set still has only two members.

The more members a set has, the easier it is to find subsets of the members. If we only had bills for one, fifty, and a thousand dollars, paying a debt of $45 would take 45 copies of the dollar bill. As opposed to two twenties and a five. So when variety is needed, larger sets (those with more members) are easier to use.
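The $45 arithmetic can be checked with a few lines of Python. This is only a sketch: the function name bills_needed is made up for illustration, and the simple largest-bill-first rule is assumed, which happens to give the fewest bills for these particular denominations.

```python
def bills_needed(amount, denominations):
    """Count the bills used to pay `amount`, always taking the largest bill that fits."""
    count = 0
    for bill in sorted(denominations, reverse=True):
        used, amount = divmod(amount, bill)
        count += used
    return count

small_set = [1, 50, 1000]
large_set = [1, 5, 10, 20, 50, 100, 500]

print(bills_needed(45, small_set))  # 45 one-dollar bills
print(bills_needed(45, large_set))  # 3 bills: 20 + 20 + 5
```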

Other examples of sets include alphabets (Roman, Cyrillic, Hebrew, etc.), the Arabic digits 0 through 9, punctuation marks, dominoes, suits in a deck of cards, Mah Jongg tiles, Monopoly houses and hotels and properties.

To use computers, and typewriters, we must have a set of keys/codes that includes letters of the alphabet, digits, punctuation, and (not least) a space. Children's typewriters often have only capital letters. Adults' typewriters have capitals and lower case (not separately on the keyboard, but achieved by means of a shift key). The old Linotype for newspapers had even larger sets, with italic and bold letters, and even letters of different sizes and design.

The size of a set (its number of members) plays a great part in the flexibility of using it. In the old Linotype, with a larger set, we used italic letters for emphasis; on a typewriter we "make do" with the smaller set by underlining the regular letters.

EXTENSION and EXPANSION of SETS

When one "makes do" with an existing set to create more combinations, that is called "SET EXTENSION". In the case of coded sets, it is called "CODE EXTENSION". When the existing set is felt to be totally inadequate, new members may be added. This is called "SET EXPANSION", or "CODE EXPANSION". For example, the government may promote the two dollar bill, and add a new bill for two hundred dollars. The paper money set is expanded.

For another example of this distinction, imagine a set consisting of thirteen cards -- Ace, 2, 3, 4 ... 10, J, Q, K. Each is marked with a number or letter, and each has a number of black circles on it to match the count. Not the pips that you normally associate with cards, but just black circles. Suppose we have many of these sets of thirteen cards. How can we play bridge or poker?

We will have to "make do", by "extending the set". This can be done in several ways (but not by using different colors on the backs, which would be a giveaway to the other players). One way would be to write a big "S" for "spades" on one set, and "H", "D", and "C" on another three sets. Then put all four sets together. Another way would be to mark the upper left corner for spades, upper right for hearts, lower left ... etc. Still other methods could be devised.

But that 13-card set is awkward, which is why we use the current set of 52 cards, where the pips have colors and shapes for uniqueness. Note that here we doubled the set size twice. Could it be done by expansion that doubles the set size only once? Yes. Have a set of 26 cards, 13 with black circles and 13 with red circles. Now we need just two sets of those cards, and in each case we need to distinguish only between spades-clubs (for the black circles) and hearts-diamonds (for the red).

SETS FOR COMPUTERS

Most everyone knows or has heard that computers work by recognizing 1's and 0's, as usually represented by ON-OFF, punch/no-punch in a hole position, or some other means having only two states. No pictures, no colors, etc., enter into the encoding of information (although the reverse is true -- the bits (2-state items) can create colors and shapes, as in video games and movies).

Years ago, 6 bits were used to create the character set. Nowadays 8 are common. But those bits are NEVER arranged in anything other than what is effectively a straight line. When sent by telephone line or satellite they go either a) one after the other in "serial" transmission, or else b) side-by-side on multiple lines in "parallel" transmission. In other words, each bit goes either in its own position (like second in line) or on its own channel (like the fourth parallel wire). In army terms, by file, or by rank and file.

Not like dominoes. In dominoes you can turn around a 2-4 tile and get a 4-2 tile. In computers the 11001111 is different from 11110011. So the number of members in any set defined by two states (like ON-OFF) is 2 times 2 times 2 ... as many times as you have bits. 6 bits give 64 members in the set; 8 bits give 256 members.
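The "2 times 2 times 2" rule, and the point that bit order matters, can be sketched in Python. The function name set_size is just an illustrative choice.

```python
def set_size(bits):
    """Number of distinct combinations of `bits` two-state positions."""
    return 2 ** bits

for bits in (5, 6, 7, 8):
    print(bits, "bits ->", set_size(bits), "members")

# Unlike a domino, a bit pattern cannot be "turned around" and stay the same thing.
a = 0b11001111
b = 0b11110011
print(a == b)  # False: same number of 1's, but different members of the set
```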

Now for some surprises about set extension and expansion. Computers and communications were not always so interconnected. Once we had TeleTypes and TELEXes that were not themselves computers, but used sets of codes in the same way. Until about 1960 they used codes of 5 bits, coded in 5 tracks of punched round holes in paper tape. By our previous formula, how many members are in the set? 32. But wait, how did they send 26 letters and 10 digits? That's 36 (not to mention the space and a few others), which is more than 32. It was done by "code extension", using the fact that hole combinations can be assigned to represent "control" as well as "text" characters.

For plain typewriters, ask "if this key is depressed will it print something?". If it does, it is a text character. Not so the "backspace" key, which is a control. We used it for underlining, which extends the set. And for overstriking, which also extends the set. Nor does the shift key itself print. It just changes lower case to upper (capitals), digits to punctuation, etc.

The shift key is quite special. It can serve to affect only the next key you type, or it can remain shifted, affecting all the keys until it is released. That is why there are two keys, SHIFT and CAPS LOCK. The first is "nonlocking" and the second is "locking". Remember this for later.

Teletypewriters also use a backspace key to back up both print element and paper tape to hit the "delete" character, which does an "editing" function by overstriking any printed character with a black rectangle, and also by punching holes in all tracks. Then, no matter what character used to be there, it is now just the single delete character, which the reader on the other end, driving the receiving teletypewriter, just ignores. OK, our set really has 29 members, not 32, because Delete, BackSpace, and Shift can't count.
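Here is a small Python sketch of why the delete character works on paper tape: holes can be added but never removed, so punching every track turns any row into the same all-holes row. The row values used (a letter's position in the alphabet) and the function names are assumptions made only for this illustration.

```python
DEL = 0b11111  # all five tracks punched

def overpunch(tape, position):
    """Return a copy of the tape with every track punched in the row at `position`."""
    fixed = list(tape)
    fixed[position] |= DEL  # adding holes to any row always yields DEL
    return fixed

def read_tape(tape):
    """The receiving reader simply skips rows that are DEL."""
    return [row for row in tape if row != DEL]

typed = [8, 1, 24, 22, 5]        # H A X V E, with a stray X (hypothetical values)
corrected = overpunch(typed, 2)  # back up to the X and punch it out
print(read_tape(corrected))      # [8, 1, 22, 5], which is H A V E
```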

If one code is assigned to shift the ribbon color from black to red, the control set increases, and the text set decreases to 28. But in small sets, a useful balance must be struck. The lesson is: if controls are available, each may be used to "extend" the set in some way. Such a color code essentially doubles the set size, for now we have a black "A" and red "A".
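The counting in the last few paragraphs can be written out as arithmetic. This is a rough sketch: text_capacity is a made-up name, and it assumes every text code means something different in every shift state, which slightly overstates things (a space is a space whether shifted or not).

```python
def text_capacity(bits, controls, states=1):
    """Text combinations: the codes left after the controls, times the shift states."""
    return (2 ** bits - controls) * states

print(text_capacity(5, 3))     # 29 text codes once BS, SH and DEL are reserved
print(text_capacity(5, 3, 2))  # 58 with the two cases of the shift
print(text_capacity(5, 4, 4))  # 112 with a ribbon-color code: 28 codes x 2 cases x 2 colors
```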

PICTURES ARE EASIER TO UNDERSTAND

Here is a made-up paper tape code on 5 tracks, as we might have adapted today's internal code of personal computers.

     <------ direction of paper tape movement
  -------------------------------------------------------------------
 /                                o o o o o o o o o o o o o o  o  o /   16
  /               o o o o o o o o                 o o o o o o  o  o  /   8
 /        o o o o         o o o o         o o o o         o o  o  o /    4
  /   o o     o o     o o     o o     o o     o o     o o      o  o  /   2
 /  o   o   o   o   o   o   o   o   o   o   o   o   o   o   o     o /    1
  /------------------------------------------------------------------/
    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ; _ BS SH DEL
    " # $ % & ? ( ) * + , - . / ! 0 1 2 3 4 5 6 7 8 9 : < > BS SH DEL
             direction of reading the paper tape ------>
See those powers of 2 at the right? Give each punched hole the shown value, and add those values -- e.g., "G" has a value 7 (1+2+4), so it's the 7th letter in the alphabet -- the basis for this sample code. Suppose we want to send:
     /-----------------------------------------------------/
     /       o                       o         o         /
    /    o             o o       o   o o   o   o          /
     /       o o       o     o             o             /
    /        o         o   o           o           o      /
     /     o   o   o     o o o     o o o   o o o   o o   /
    /-----------------------------------------------------/
         H A V E   A   N I C E   D A Y ,   M A Y   3 1
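The hole values of the sample code can be worked out mechanically. The Python sketch below uses the unshifted row of the chart and assumes that a row with no holes stands for a space, as the blank columns in the picture suggest; the names value_of, holes and tape_rows are illustrative only.

```python
TRACKS = (16, 8, 4, 2, 1)                   # hole values, top track to bottom track
UNSHIFTED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ;_"  # values 1 through 28 in the sample code

def value_of(ch):
    """Code value of an unshifted character (assumption: no holes means space)."""
    return 0 if ch == " " else UNSHIFTED.index(ch) + 1

def holes(value):
    """Which track values are punched for a given code value."""
    return [t for t in TRACKS if value & t]

def tape_rows(text):
    """Draw the tape, one string per track, with 'o' wherever a hole is punched."""
    values = [value_of(ch) for ch in text]
    return ["".join("o" if v & t else " " for v in values) for t in TRACKS]

print(value_of("G"), holes(value_of("G")))  # 7 [4, 2, 1]
for row in tape_rows("HAVE A NICE DAY"):
    print("/" + row + "/")
```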
Assume the SHIFT key is LOCKING. First we have to shift out to get the comma. Then the shift back into the alphabet. Finally shift out again for the two digits:
      /----------------------------------------------------------/
      /      o                       o o   o       o   o       /
     /   o             o o       o   o o o o   o   o   o        /
      /      o o       o     o         o   o   o       o       /
     /       o         o   o           o o o           o o      /
      /    o   o   o     o o o     o o   o     o o o     o o   /
     /----------------------------------------------------------/
         H A V E   A   N I C E   D A Y   ,     M A Y     3 1
                                       :   :           :
                                   SHIFT   SHIFT   SHIFT
                                    OUT     IN      OUT
It took three codes here to get the comma. You can see why most early teletypewriters did not use extension. Remember old style messages?

    HAPPY BIRTHDAY STOP SEND MONEY STOP -- (no shifting necessary)
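A LOCKING shift can be sketched as an encoder that remembers which case it is in and emits the SH code only when the case has to change. The tables come from the sample chart; treating a no-holes row as a space is an assumption, and encode_locking is a hypothetical name. This sketch shifts back lazily, just before the next letter, so the SHIFT IN lands after the space rather than right after the comma, but the number of codes is the same.

```python
UNSHIFTED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ;_"
SHIFTED = "\"#$%&?()*+,-./!0123456789:<>"
SH = 30     # the shift code in the sample chart
SPACE = 0   # assumption: a row with no holes is a space in either case

def encode_locking(text):
    """Encode text, emitting SH only when the case must change."""
    codes, shifted = [], False
    for ch in text:
        if ch == " ":
            codes.append(SPACE)
            continue
        need_shift = ch in SHIFTED
        if need_shift != shifted:
            codes.append(SH)      # toggle the case and stay there
            shifted = need_shift
        table = SHIFTED if shifted else UNSHIFTED
        codes.append(table.index(ch) + 1)
    return codes

message = encode_locking("HAVE A NICE DAY, MAY 31")
print(len(message), message.count(SH))  # 26 codes, 3 of them shifts
print(encode_locking("HAPPY BIRTHDAY STOP SEND MONEY STOP").count(SH))  # 0
```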

Now assume the SHIFT key is NON-LOCKING. We shift for the comma and two digits:

      /--------------------------------------------------------/
      /     o                       o o         o   o   o    /
     /  o             o o       o   o o o   o   o   o   o     /
      /     o o       o     o         o     o       o   o    /
     /      o         o   o           o o           o o o     /
      /   o   o   o     o o o     o o   o   o o o     o   o  /
     /--------------------------------------------------------/
        H A V E   A   N I C E   D A Y   ,   M A Y     3   1
                                      :             :   :
                                  SHIFT         SHIFT   SHIFT
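The NON-LOCKING shift is even simpler to sketch, because nothing has to be remembered beyond the very next character. As before, the tables are those of the sample chart, a no-holes row is assumed to be a space, and the function names are hypothetical.

```python
UNSHIFTED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ;_"
SHIFTED = "\"#$%&?()*+,-./!0123456789:<>"
SH = 30  # the shift code in the sample chart

def encode_nonlocking(text):
    """Every shifted character is preceded by its own SH code."""
    codes = []
    for ch in text:
        if ch == " ":
            codes.append(0)  # assumption: no holes means space
        elif ch in SHIFTED:
            codes += [SH, SHIFTED.index(ch) + 1]
        else:
            codes.append(UNSHIFTED.index(ch) + 1)
    return codes

def decode_nonlocking(codes):
    """SH affects only the one code that follows it."""
    out, shift_next = [], False
    for code in codes:
        if code == SH:
            shift_next = True
            continue
        out.append(" " if code == 0 else (SHIFTED if shift_next else UNSHIFTED)[code - 1])
        shift_next = False
    return "".join(out)

codes = encode_nonlocking("HAVE A NICE DAY, MAY 31")
print(len(codes), codes.count(SH))  # 26 codes, 3 of them shifts
print(decode_nonlocking(codes))     # HAVE A NICE DAY, MAY 31
```

For this message the two modes cost the same. A long run of digits would favor the locking shift (one SH for the whole run), while isolated punctuation favors the non-locking one.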
All of this had to do with Code EXTENSION. Now let's do it with Code EXPANSION. We redesign the tape reader and its logic so as to have SIX tracks, not FIVE. Now we follow the NON-LOCKING mode, but instead of preceding those three characters by the SHIFT, we indicate that shifted quality by a hole in the 6th track. That looks like this:
       /-------------------------------------------------/
     /                                o           o o  /    <--- the shift
      /     o                       o         o         /          track
     /  o             o o       o   o o   o   o        /
      /     o o       o     o             o             /
     /      o         o   o           o           o    /
      /   o   o   o     o o o     o o o   o o o   o o   /
     /-------------------------------------------------/
        H A V E   A   N I C E   D A Y ,   M A Y   3 1
Now we shall have to rethink seriously. The 5-track code had 29 combinations devoted to TEXT characters, and 3 (Shift, Backspace, and Delete) devoted to CONTROL characters. So each row (character) was either TEXT or CONTROL.

But in our 6-track example the 6-bit characters are split into two parts -- 5 tracks for the TEXT part and 1 track for the CONTROL part. The five tracks were punched according to what text key was down, and the sixth track was punched whenever the shift key was down. And when read at the other end, the sixth track was processed ahead of time to actuate the shift key before the real text character was printed.
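The 6-track version can be sketched the same way. Here the shift is no longer a separate code but a sixth bit, given the value 32 in this illustration (the natural next power of 2, though the text does not assign it a number). Function names are hypothetical and a no-holes row is again assumed to be a space.

```python
UNSHIFTED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ;_"
SHIFTED = "\"#$%&?()*+,-./!0123456789:<>"
SHIFT_TRACK = 32  # assumed value of the added sixth track

def encode_expanded(text):
    """One 6-bit code per character: 5 text tracks plus the shift track."""
    codes = []
    for ch in text:
        if ch == " ":
            codes.append(0)
        elif ch in SHIFTED:
            codes.append(SHIFT_TRACK | (SHIFTED.index(ch) + 1))
        else:
            codes.append(UNSHIFTED.index(ch) + 1)
    return codes

def decode_expanded(codes):
    """Look at the shift track first, then print from the matching case."""
    out = []
    for code in codes:
        value = code & 0b11111
        table = SHIFTED if code & SHIFT_TRACK else UNSHIFTED
        out.append(" " if value == 0 else table[value - 1])
    return "".join(out)

codes = encode_expanded("HAVE A NICE DAY, MAY 31")
print(len(codes))              # 23: one code per character, no separate shifts
print(decode_expanded(codes))  # HAVE A NICE DAY, MAY 31
```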

The shift key code in the 5-track examples is called a "precedence" code. It is a signal to "treat the next character in another way". ASCII (American Standard Code for Information Interchange), the code of Personal Computers (and more), has lots of these precedence codes in it -- ESCape, CANcel, four Device Control codes, Data Link Escape, four information separators, etc. That is because ASCII was for a long time a 7-bit code, with 128 combinations, and a lot of extension had to be done. Now there are many 8-bit variants of ASCII, and it shows again the general principle of:

"What is done by a precedence code in an extended set may be done exactly the same way by an added bit in an expanded set."

WHAT DOES THIS MEAN?

  1. Code extension has been practiced since the beginning of the Chinese language!
  2. Devices like shift keys were used for code extension since the first use of typewriters!
  3. Reserved characters have been used for code extension since the first use of teletypewriters!
  4. Characters have been split into two classes, TEXT and CONTROL, since at least 1957. For computers, such characters did have effect upon programmed branching, thus allowing alternate actions.
  5. A precedence code can be mapped into the extra bits of a larger set, keeping the text or control meaning identical. Mapping in the reverse direction is also equivalent.
  6. Reserved characters for precedence codes, with meaning dependent upon the character following them, were defined in the pre-ASCII proposals of 1960. Among those meanings for following characters could be: a) a different text character, b) a different control character, or c) to put an ENTIRELY NEW set of text and control characters into force.
  7. The rationale for point 6 was published in 1960 by Bob Bemer [1], while employed by IBM. It was not patented by IBM then. Even if a patent application had been submitted, it would have had to refer to previous precedence code technology, although this was the furthest that the precedence code concept had been extended!
  8. The first standards proposals (mid-1969) for controls for video terminals and keyboards enumerated these controls (as we see them today) in the 8-bit expanded set, but with the clear proviso that they could also be done with ESCape sequences in the extended (smaller) set.
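Point 5 in the list above, the two-way mapping between a precedence code and an extra bit, can be sketched in Python using the made-up tape code from earlier (SH = 30 as a non-locking shift, and an assumed sixth-track value of 32). The function names are hypothetical.

```python
SH = 30           # precedence (shift) code in the 5-track sample code
SHIFT_TRACK = 32  # assumed value of the added bit in the 6-track code

def extended_to_expanded(codes):
    """Fold each SH into the extra bit of the code that follows it."""
    out, pending = [], 0
    for code in codes:
        if code == SH:
            pending = SHIFT_TRACK
            continue
        out.append(code | pending)
        pending = 0
    return out

def expanded_to_extended(codes):
    """Unfold the extra bit back into a preceding SH code."""
    out = []
    for code in codes:
        if code & SHIFT_TRACK:
            out.append(SH)
        out.append(code & 0b11111)
    return out

five_track = [4, 1, 25, SH, 11]  # D A Y , in the sample code
six_track = extended_to_expanded(five_track)
print(six_track)                        # [4, 1, 25, 43]
print(expanded_to_extended(six_track))  # [4, 1, 25, 30, 11]
```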

APPENDIX

Information Separators of ASCII

Paper tape and teletypewriters have been used in our examples, but similar methods were also used in computers. An early example is the IBM 1401, circa 1957. It had a 6-bit internal code, but there were actually seven bits for each character. The seventh was called the Word Mark. When it was 1, not 0, the computer circuitry was signalled that it was both a TEXT and a CONTROL character, and that the character was the last one in the word being read. This self-delimiting process permitted variable length words.
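The self-delimiting idea can be sketched as follows. This is a simplified model that follows the description above (a seventh bit set on the last character of a word), not the actual 1401 character code or its addressing rules; the constant and function names are made up for illustration.

```python
WORD_MARK = 0b1000000  # the seventh bit, on top of a 6-bit character

def split_words(cells):
    """Cut a run of 7-bit cells into variable-length words at each word mark."""
    words, current = [], []
    for cell in cells:
        current.append(cell & 0b111111)  # the TEXT part
        if cell & WORD_MARK:             # the CONTROL part: this word ends here
            words.append(current)
            current = []
    if current:
        words.append(current)            # trailing characters with no mark yet
    return words

cells = [1, 2, 3 | WORD_MARK, 4, 5 | WORD_MARK]
print(split_words(cells))  # [[1, 2, 3], [4, 5]]
```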

In fact, when ASCII was in the standardization process, I was quite familiar with how this worked. So the four information separators of ASCII were derived from the Word Mark principle, in the REVERSE process, going from a bit in a larger set to a separate character in a smaller set.
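The REVERSE process can be sketched too: a mark bit carried on a character becomes a separate separator character after it. The sketch uses ASCII US (Unit Separator, hexadecimal 1F) as the separator and models a marked character as a (character, flag) pair; both choices, and the function names, are assumptions for illustration only.

```python
US = "\x1f"  # ASCII Unit Separator, one of the four information separators

def marks_to_separators(marked):
    """Replace a mark bit on a character with a separator character after it."""
    out = []
    for ch, has_mark in marked:
        out.append(ch)
        if has_mark:
            out.append(US)
    return "".join(out)

def separators_to_marks(text):
    """Go back: each separator becomes a mark on the character before it."""
    marked = []
    for ch in text:
        if ch == US:
            if marked:
                marked[-1] = (marked[-1][0], True)
        else:
            marked.append((ch, False))
    return marked

words = [("A", False), ("B", True), ("C", True)]  # two words: AB and C
stream = marks_to_separators(words)
print(repr(stream))                          # 'AB\x1fC\x1f'
print(separators_to_marks(stream) == words)  # True
```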

Note on Limited Sets

This quote was found as a telegram in a novel [2]:

OUR SAFE MANUFACTURED BY EMPIRE SAFE CABINET COMPANY, MODEL G-23, DATED 1887 STOP SCOTLAND YARD INVENTORY ITEMS IN PAMELA'S ROOM MORNING AFTER BURGLARY ... IMMEDIATELY IF HELPFUL STOP
Why isn't it authentic?

Well, if there was no period in the telegraph code, thus forcing two uses of "STOP" to represent the period, where did the apostrophe, hyphen, and comma come from?

A Reminder of the 8x16 code "ASCII"


    NUL   DLE   SP   0   @   P   `   p
    SOH   DC1    !   1   A   Q   a   q
    STX   DC2    "   2   B   R   b   r
    ETX   DC3    #   3   C   S   c   s
    EOT   DC4    $   4   D   T   d   t
    ENQ   NAK    %   5   E   U   e   u
    ACK   SYN    &   6   F   V   f   v
    BEL   ETB    '   7   G   W   g   w
    BS    CAN    (   8   H   X   h   x
    HT    EM     )   9   I   Y   i   y
    LF    SUB    *   :   J   Z   j   z
    VT    ESC    +   ;   K   [   k   {
    FF    FS     ,   <   L   \   l   |
    CR    GS     -   =   M   ]   m   }
    SO    RS     .   >   N   ^   n   ~
    SI    US     /   ?   O   _   o  DEL
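The chart above is laid out so that the column is given by the high three bits of the 7-bit code and the row by the low four bits. A short Python sketch (table_position is a made-up name) shows how to find any character in it, and also that the capital and small letters differ in just one bit:

```python
def table_position(ch):
    """Return (column, row) of a 7-bit ASCII character in the 8x16 chart."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return code >> 4, code & 0x0F

print(table_position("A"))     # (4, 1)
print(table_position("\x1b"))  # (1, 11), the ESC position
print(table_position("\x7f"))  # (7, 15), the DEL position

# Columns 4-5 and 6-7 differ only in one bit (value 32), much like a shift track.
print(chr(ord("a") ^ 0x20))    # A
```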

REFERENCES

  1. R.W.Bemer, "ESCape - a proposal for character code compatibility",
    Commun. ACM 3, No. 2, 71-72 (1960 Feb)
  2. Elliott Roosevelt, "Murder and the First Lady", Readers Digest
    Condensed Books, 1984 Vol. 4, p. 326
  3. Jukka Korpela, "A tutorial on character code issues" (available on the Web).
A superb paper. The best I've seen. You'll have to use your brain, your English language training, and all your education to understand this master teacher from Finland. But he gives all the other references you could need.
