Non-ASCII Chars and UTF-8

February 23, 2010

ISO-8859-1

ISO-8859-1 has been the default character set in most browsers.
The first 128 characters of ISO-8859-1 are the original ASCII character set (the digits 0-9, the uppercase and lowercase English alphabet, and some special characters).

The second part of ISO-8859-1 (codes 160-255) contains characters used in Western European countries and some commonly used special characters.

Entities are used to implement reserved characters or to express characters that cannot easily be entered with the keyboard.

ISO 8859-1, more formally cited as ISO/IEC 8859-1 or less formally as Latin-1, is part 1 of ISO/IEC 8859, a standard character encoding defined by ISO. It encodes what it refers to as Latin alphabet no. 1, consisting of 191 characters from the Latin script, each encoded as a single 8-bit code value.

ISO/IEC 8859-1 suffers from a number of deficiencies, including the omission of a few French diacritics and the lack of a Euro symbol. For this reason ISO/IEC 8859-15 was developed as an update of ISO/IEC 8859-1 to add the required characters. (This, however, required the removal of some less-used characters from ISO/IEC 8859-1, including fraction symbols and letter-free diacritics: ¤, ¦, ¨, ´, ¸, ¼, ½ and ¾.)
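The differences between the two sets can be checked directly. The following is a minimal sketch, assuming Python 3 and its built-in "iso-8859-1" and "iso-8859-15" codecs; the function name differing_bytes is only illustrative.

```python
def differing_bytes():
    """Return (byte, latin1_char, latin9_char) for every byte value
    that ISO-8859-1 and ISO-8859-15 decode to different characters."""
    rows = []
    for value in range(256):
        raw = bytes([value])
        old = raw.decode("iso-8859-1")
        new = raw.decode("iso-8859-15")
        if old != new:
            rows.append((value, old, new))
    return rows


for value, old, new in differing_bytes():
    print(f"0x{value:02X}: {old} -> {new}")
```

Only the eight positions listed above should show up, with ¤ giving way to the Euro sign.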

The name Latin-1 is an informal alias unrecognized by ISO or the IANA, but is perhaps meaningful in some computer software.

The following table shows ISO-8859-1, with the 3-letter abbreviations for the control characters.

ISO-8859-1
     x0    x1    x2    x3    x4    x5    x6    x7    x8    x9    xA    xB    xC    xD    xE    xF
0x   NUL   SOH   STX   ETX   EOT   ENQ   ACK   BEL   BS    HT    LF    VT    FF    CR    SO    SI
1x   DLE   DC1   DC2   DC3   DC4   NAK   SYN   ETB   CAN   EM    SUB   ESC   FS    GS    RS    US
2x   SP    !     "     #     $     %     &     '     (     )     *     +     ,     -     .     /
3x   0     1     2     3     4     5     6     7     8     9     :     ;     <     =     >     ?
4x   @     A     B     C     D     E     F     G     H     I     J     K     L     M     N     O
5x   P     Q     R     S     T     U     V     W     X     Y     Z     [     \     ]     ^     _
6x   `     a     b     c     d     e     f     g     h     i     j     k     l     m     n     o
7x   p     q     r     s     t     u     v     w     x     y     z     {     |     }     ~     DEL
8x   PAD   HOP   BPH   NBH   IND   NEL   SSA   ESA   HTS   HTJ   VTS   PLD   PLU   RI    SS2   SS3
9x   DCS   PU1   PU2   STS   CCH   MW    SPA   EPA   SOS   SGCI  SCI   CSI   ST    OSC   PM    APC
Ax   NBSP  ¡     ¢     £     ¤     ¥     ¦     §     ¨     ©     ª     «     ¬     SHY   ®     ¯
Bx   °     ±     ²     ³     ´     µ     ¶     ·     ¸     ¹     º     »     ¼     ½     ¾     ¿
Cx   À     Á     Â     Ã     Ä     Å     Æ     Ç     È     É     Ê     Ë     Ì     Í     Î     Ï
Dx   Ð     Ñ     Ò     Ó     Ô     Õ     Ö     ×     Ø     Ù     Ú     Û     Ü     Ý     Þ     ß
Ex   à     á     â     ã     ä     å     æ     ç     è     é     ê     ë     ì     í     î     ï
Fx   ð     ñ     ò     ó     ô     õ     ö     ÷     ø     ù     ú     û     ü     ý     þ     ÿ


IBM PC or MS-DOS Codepage 437, often abbreviated to CP437 and also known as DOS-US or OEM-US, is the original character set of the IBM PC, circa 1981.

The following is a table representing CP437, covering the ASCII range (0-127) and the MS-DOS extended range (128-255):

The title of that table, "ASCII Character Codes", is deceptive: the second half is NOT ASCII, and it was changed in Windows.

The repertoire of CP437 was taken from the character set of Wang word-processing machines, as explicitly admitted by Bill Gates in the interview with him and Paul Allen in the October 2, 1995 edition of Fortune magazine:

"… we were also fascinated by dedicated word processors from Wang, because we believed that general-purpose machines could do that just as well. That’s why, when it came time to design the keyboard for the IBM PC, we put the funny Wang character set into the machine–you know, smiley faces and boxes and triangles and stuff. We were thinking we’d like to do a clone of Wang word-processing software someday."

CP437 is inadequate for internationalisation, as it lacks characters necessary for some languages, such as À (capital A with grave) for French, and has only a few Greek letters. Later MS-DOS character sets, such as CP850 (DOS Latin-1), CP852 (DOS Central-European) and CP737 (DOS Greek), filled the gaps for international use while still being nearly compatible with CP437 by retaining the box-drawing characters. All CP437 characters are in Unicode and in Microsoft’s WGL4 character set, therefore in most of the fonts on Microsoft Windows, and also in the VGA font of Linux, and the ISO 10646 fonts for X11.
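These gaps are easy to probe. The following is a rough sketch, assuming Python 3 and its bundled "cp437", "cp850" and "cp737" codecs; the helper can_encode is a made-up name for illustration.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Capital A with grave is missing from CP437 but present in CP850.
print(can_encode("À", "cp437"), can_encode("À", "cp850"))

# CP437 has a few Greek letters (such as Ω) but not the full alphabet.
print(can_encode("Ω", "cp437"), can_encode("λ", "cp437"), can_encode("λ", "cp737"))

# A box-drawing character kept at the same position in CP437 and CP850.
print("═".encode("cp437") == "═".encode("cp850"))
```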


&middot; [·]     &copy; [©]     &reg; [®]     &trade; [™]     &lsquo; [‘]     &deg; [°] ex: 32&deg;F

For info on more special characters see
HTML Escape Character Codes (and ‘cursor over’)


Reserved Characters in HTML

Some characters are reserved in HTML and XHTML. For example, you cannot use
the greater than or less than signs within your text because the browser could
mistake them for markup.

HTML and XHTML processors must support the five special characters listed in
the table below:

Character  Entity Number  Entity Name                     Description
"          &#34;          &quot;                          quotation mark
'          &#39;          &apos; (does not work in IE)    apostrophe
&          &#38;          &amp;                           ampersand
<          &#60;          &lt;                            less-than
>          &#62;          &gt;                            greater-than

Note: Entity names are case sensitive!
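As an illustration of escaping the reserved characters, here is a small sketch using Python's standard html module (an assumption of this example; the post itself does not name any tool). Note that html.escape writes the apostrophe as a numeric reference rather than &apos;.

```python
import html

raw = '<a href="x">Tom & Jerry</a>'

# Replace the reserved characters with entities so a browser shows them as text.
escaped = html.escape(raw)
print(escaped)

# Converting back restores the original string.
restored = html.unescape(escaped)
print(restored == raw)

# Entity names are case sensitive: these are two different characters.
print(html.unescape("&Eacute;"), html.unescape("&eacute;"))
```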


ISO 8859-1 Symbols

Char  (&#xxx;)  Entity Name  Description
      (160)     &nbsp;       non-breaking space
¡     (161)     &iexcl;      inverted exclamation mark
¢     (162)     &cent;       cent
£     (163)     &pound;      pound
¤     (164)     &curren;     currency
¥     (165)     &yen;        yen
¦     (166)     &brvbar;     broken vertical bar
§     (167)     &sect;       section
¨     (168)     &uml;        spacing diaeresis
©     (169)     &copy;       copyright
ª     (170)     &ordf;       feminine ordinal indicator
«     (171)     &laquo;      angle quotation mark (left)
¬     (172)     &not;        negation
      (173)     &shy;        soft hyphen
®     (174)     &reg;        registered trademark
¯     (175)     &macr;       spacing macron
°     (176)     &deg;        degree
±     (177)     &plusmn;     plus-or-minus
²     (178)     &sup2;       superscript 2
³     (179)     &sup3;       superscript 3
´     (180)     &acute;      spacing acute
µ     (181)     &micro;      micro
¶     (182)     &para;       paragraph
·     (183)     &middot;     middle dot
¸     (184)     &cedil;      spacing cedilla
¹     (185)     &sup1;       superscript 1
º     (186)     &ordm;       masculine ordinal indicator
»     (187)     &raquo;      angle quotation mark (right)
¼     (188)     &frac14;     fraction 1/4
½     (189)     &frac12;     fraction 1/2
¾     (190)     &frac34;     fraction 3/4
¿     (191)     &iquest;     inverted question mark
×     (215)     &times;      multiplication
÷     (247)     &divide;     division

Char.  (#)    Entity Name  Description
À      (192)  &Agrave;     capital a, grave accent
Á      (193)  &Aacute;     capital a, acute accent
Â      (194)  &Acirc;      capital a, circumflex accent
Ã      (195)  &Atilde;     capital a, tilde
Ä      (196)  &Auml;       capital a, umlaut mark
Å      (197)  &Aring;      capital a, ring
Æ      (198)  &AElig;      capital ae
Ç      (199)  &Ccedil;     capital c, cedilla
È      (200)  &Egrave;     capital e, grave accent
É      (201)  &Eacute;     capital e, acute accent
Ê      (202)  &Ecirc;      capital e, circumflex accent
Ë      (203)  &Euml;       capital e, umlaut mark
Ì      (204)  &Igrave;     capital i, grave accent
Í      (205)  &Iacute;     capital i, acute accent
Î      (206)  &Icirc;      capital i, circumflex accent
Ï      (207)  &Iuml;       capital i, umlaut mark
Ð      (208)  &ETH;        capital eth, Icelandic
Ñ      (209)  &Ntilde;     capital n, tilde
Ò      (210)  &Ograve;     capital o, grave accent
Ó      (211)  &Oacute;     capital o, acute accent
Ô      (212)  &Ocirc;      capital o, circumflex accent
Õ      (213)  &Otilde;     capital o, tilde
Ö      (214)  &Ouml;       capital o, umlaut mark
Ø      (216)  &Oslash;     capital o, slash
Ù      (217)  &Ugrave;     capital u, grave accent
Ú      (218)  &Uacute;     capital u, acute accent
Û      (219)  &Ucirc;      capital u, circumflex accent
Ü      (220)  &Uuml;       capital u, umlaut mark
Ý      (221)  &Yacute;     capital y, acute accent
Þ      (222)  &THORN;      capital THORN, Icelandic
ß      (223)  &szlig;      small sharp s, German
à      (224)  &agrave;     small a, grave accent
á      (225)  &aacute;     small a, acute accent
â      (226)  &acirc;      small a, circumflex accent
ã      (227)  &atilde;     small a, tilde
ä      (228)  &auml;       small a, umlaut mark
å      (229)  &aring;      small a, ring
æ      (230)  &aelig;      small ae
ç      (231)  &ccedil;     small c, cedilla
è      (232)  &egrave;     small e, grave accent
é      (233)  &eacute;     small e, acute accent
ê      (234)  &ecirc;      small e, circumflex accent
ë      (235)  &euml;       small e, umlaut mark
ì      (236)  &igrave;     small i, grave accent
í      (237)  &iacute;     small i, acute accent
î      (238)  &icirc;      small i, circumflex accent
ï      (239)  &iuml;       small i, umlaut mark
ð      (240)  &eth;        small eth, Icelandic
ñ      (241)  &ntilde;     small n, tilde
ò      (242)  &ograve;     small o, grave accent
ó      (243)  &oacute;     small o, acute accent
ô      (244)  &ocirc;      small o, circumflex accent
õ      (245)  &otilde;     small o, tilde
ö      (246)  &ouml;       small o, umlaut mark
ø      (248)  &oslash;     small o, slash
ù      (249)  &ugrave;     small u, grave accent
ú      (250)  &uacute;     small u, acute accent
û      (251)  &ucirc;      small u, circumflex accent
ü      (252)  &uuml;       small u, umlaut mark
ý      (253)  &yacute;     small y, acute accent
þ      (254)  &thorn;      small thorn, Icelandic
ÿ      (255)  &yuml;       small y, umlaut mark
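Rows like these do not have to be typed by hand. As a sketch, assuming Python 3, the standard html.entities module carries the same mapping between code numbers and entity names; the helper entity_row is a hypothetical name used only here.

```python
from html.entities import codepoint2name


def entity_row(code_point):
    """Format one table row: character, decimal number, named entity."""
    name = codepoint2name[code_point]
    return f"{chr(code_point)} ({code_point}) &{name};"


for cp in (169, 233, 247):
    print(entity_row(cp))
```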

Unicode Transformation Format

UTF-8 vs. UTF-16 and UTF-32

"UTF-8 is also the most common Unicode encoding used in HTML documents on the World Wide Web."


If you want to store arbitrary international text in filenames, you might consider UTF-16 or UTF-32, but both need to be able to store byte 0; since you can’t use byte 0 in a filename, they don’t work at all.
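The byte 0 problem shows up as soon as ordinary text is encoded. A minimal sketch, assuming Python 3; the file name and the helper has_nul are made up for illustration.

```python
def has_nul(text, codec):
    """Return True if encoding text with codec produces a zero byte."""
    return b"\x00" in text.encode(codec)


name = "résumé.txt"
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    print(codec, len(name.encode(codec)), has_nul(name, codec))
```

UTF-16 and UTF-32 pad every ASCII character with zero bytes, while UTF-8 only ever emits byte 0 for the NUL character itself.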

Other filesystems are also not flexible in another way: there’s no mechanism to find out what encoding is used on a given filesystem. If strange characters show up in a given filename, there’s no obvious way to find out what encoding was used. In theory, you could store the encoding with the filename, and then use multiple system calls to find out what encoding was used for each name, but really, who needs that kind of complexity?

If you want to store arbitrary language characters in filenames using today’s Unix/Linux/POSIX filesystems, the only widely used answer that "simply works" for all languages is UTF-8. Wikipedia’s UTF-8 entry and Markus Kuhn’s UTF-8 and Unicode FAQ (at the University of Cambridge) have more information about UTF-8. UTF-8 was developed by Unix luminaries Ken Thompson and Rob Pike, specifically to support arbitrary language characters on Unix-like systems, and it’s widely acknowledged to have a great design.

Unicode defines an adequate character set but an unreasonable representation. The Unicode standard states that all characters are 16 bits wide and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode committee was thinking of files, not pipes.) To adopt Unicode, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines by different manufacturers, it is impossible.
– Ken Thompson
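The byte-order issue in the quote can be reproduced in a few lines. This is only a sketch, assuming Python 3 and its standard codecs module; the sample text is arbitrary.

```python
import codecs

text = "hi"

# The same text has two different 16-bit serializations.
big = text.encode("utf-16-be")
little = text.encode("utf-16-le")
print(big, little)

# A reader that sees only bytes needs the FEFF marker to pick the byte order.
with_bom_be = codecs.BOM_UTF16_BE + big
with_bom_le = codecs.BOM_UTF16_LE + little
print(with_bom_be.decode("utf-16"), with_bom_le.decode("utf-16"))

# UTF-8 is a plain byte sequence, so there is no byte order to detect.
print(text.encode("utf-8"))
```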

The problems of filenames in Unix/Linux/POSIX are particularly jarring in part because so many other things in POSIX systems are well designed. In contrast, Microsoft Windows has a legion of design problems, often caused by its legacy, that will probably be harder to fix over time. These include its irregular filesystem rules (so that "c:\stuff\com1.txt" refers to the COM1 serial port, not to a file), its distinction between binary and text files, its monolithic design, and the Windows registry.

The Windows character set is often called "ANSI character set" or "8-bit ASCII" or "ASCII-8", but this is seriously misleading. It has not been approved by ANSI. (Historical background: Microsoft based the design of the set on a draft for an ANSI standard. A glossary by Microsoft explicitly admits this.)

IETF Policy on Character Sets and Languages (RFC 2277) clearly favors UTF-8. It requires Internet protocols to support it.

Note that UTF-8 is efficient if the data consists predominantly of ASCII characters with just a few "special characters" in addition to them, and reasonably efficient for predominantly ISO Latin 1 text.
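To put rough numbers on that, here is a small sketch, assuming Python 3, that counts encoded bytes for a few sample strings. The samples and the helper encoded_sizes are invented for this illustration.

```python
def encoded_sizes(text):
    """Return the number of bytes text occupies in each Unicode encoding."""
    codecs_to_try = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: len(text.encode(codec)) for codec in codecs_to_try}


for sample in ("hello", "café", "€"):
    print(sample, encoded_sizes(sample))
```

ASCII characters cost one byte each in UTF-8, Latin-1 letters such as é cost two, and the Euro sign costs three.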

FONTS
Even in circumstances where Unicode is supported in principle, the support usually does not cover all Unicode characters. For example, a font available may cover just some part of Unicode which is practically important in some area. On the other hand, for data transfer it is essential to know which Unicode characters the recipient is able to handle. For such reasons, various subsets of the Unicode character repertoire have been and will be defined. For example, the Minimum European Subset specified by ENV 1973:1995 was intended to provide a first step towards the implementation of large character sets in Europe. It was replaced by three Multilingual European Subsets (MES-1, MES-2, MES-3, with MES-2 based on the Minimum European Subset), defined in a CEN Workshop Agreement, namely CWA 13873.

(In Unicode terminology, "abstract character" is a character as an element of a character repertoire, whereas "character" refers to "coded character representation", which effectively means a code value. It would be natural to assume that the opposite of an abstract character is a concrete character, as something that actually appears in some physical form on paper or screen; but oh no, the Unicode concept "character" is more concrete than an "abstract character" only in the sense that it has a fixed code position! An actual physical form of an abstract character, with a specific shape and size, is a glyph. Confusing, isn’t it?)

A glyph – a visual appearance

It is important to distinguish the character concept from the glyph concept. A glyph is a presentation of a particular shape which a character may have when rendered or displayed.

http://www.cs.tut.fi/~jkorpela/chars.html

Red Hat Linux 8.0 (September 2002) was the first distribution to take the leap of switching to UTF-8 as the default encoding for most locales. The only exceptions were Chinese/Japanese/Korean locales, for which there were at the time still too many specialized tools available that did not yet support UTF-8. This first mass deployment of UTF-8 under Linux caused most remaining issues to be ironed out rather quickly during 2003. SuSE Linux then switched its default locales to UTF-8 as well, as of version 9.1 (May 2004). It was followed by Ubuntu Linux, the first Debian-derivative that switched to UTF-8 as the system-wide default encoding. With the migration of the three most popular Linux distributions, UTF-8 related bugs have now been fixed in practically all well-maintained Linux tools.

UTF-8 (UCS[1] Transformation Format – 8-bit) is a multibyte character encoding for Unicode. UTF-8 is like UTF-16 and UTF-32, because it can represent every character in the Unicode character set. But unlike UTF-16 and UTF-32, it possesses the advantages of being backward-compatible with ASCII. And it has the advantage of avoiding the complications of endianness and the resulting need to use byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages.[2][3] The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[4] The Internet Mail Consortium (IMC) recommends that all e‑mail programs be able to display and create mail using UTF-8.[5] UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.

Advantages

The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take byte strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.

UTF-8 is the only encoding for XML entities that does not require a BOM or an indication of the encoding.[22]

UTF-8 and UTF-16 are the standard encodings for Unicode text in HTML documents, with UTF-8 as the preferred and most used encoding.

UTF-8 strings can be fairly reliably recognized as such by a simple heuristic algorithm.[23] The chance of a random string of bytes being valid UTF-8 and not pure ASCII is 3.9% for a two-byte sequence, 0.41% for a three-byte sequence and 0.026% for a four-byte sequence.[24] ISO/IEC 8859-1 is even less likely to be mis-recognized as UTF-8: the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol. This is an advantage that most other encodings do not have, causing errors (mojibake) if the receiving application isn’t told and can’t guess the correct encoding. Even UTF-16 can be mistaken for other encodings (as in the "Bush hid the facts" bug).
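One simple form of such a heuristic is to attempt a strict decode and see whether it succeeds. The sketch below assumes Python 3; it is not the specific algorithm the cited sources describe, and looks_like_utf8 is a hypothetical name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence.

    Pure ASCII input also returns True, since ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(looks_like_utf8("café".encode("utf-8")))
print(looks_like_utf8("café".encode("iso-8859-1")))
```

The ISO-8859-1 bytes end in a lone 0xE9, which UTF-8 treats as the start of a multi-byte sequence that never completes, so the check rejects them.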

Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points.
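This ordering property can be checked with a handful of strings. A minimal sketch, assuming Python 3 (where plain string comparison goes by code point); the sample list is arbitrary, and UTF-16 is included only as a contrast.

```python
# Includes U+FF5E and U+10000, the pair that trips up UTF-16 byte ordering.
words = ["z", "é", "\uff5e", "\U00010000", "a"]

by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print(by_utf8_bytes == by_code_point)
print(by_utf16_bytes == by_code_point)
```

In UTF-16, U+10000 is stored as a surrogate pair starting with byte 0xD8, so it sorts before U+FF5E even though its code point is higher.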

UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time. For many languages there has been more than one single-byte encoding in usage, so even knowing the language was insufficient information to display it correctly.

UTF-8 can encode any Unicode character. Files in different languages can be displayed correctly without having to choose the correct code page or font. For instance Chinese and Arabic can be in the same text without special codes inserted to switch the encoding.

UTF-8 is "self-synchronizing": character boundaries are easily found when searching either forwards or backwards. If bytes are lost due to error or corruption, one can always locate the beginning of the next character and thus limit the damage. Many multi-byte encodings are much harder to resynchronize.
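Finding a character boundary only requires looking at the top two bits of each byte, because continuation bytes always have the form 10xxxxxx. The following is a sketch assuming Python 3; char_start is an illustrative helper, not a library function.

```python
def char_start(data: bytes, index: int) -> int:
    """Step backwards from index to the first byte of the character containing it."""
    # Continuation bytes have their top two bits set to 10.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


data = "a€b".encode("utf-8")  # the Euro sign occupies bytes 1 to 3
print([char_start(data, i) for i in range(len(data))])
```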

Any byte oriented string searching algorithm can be used with UTF-8 data, since the sequence of bytes for a character cannot occur anywhere else. Some older variable-length encodings (such as Shift JIS) did not have this property and thus made string-matching algorithms rather complicated.
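The difference is visible with a single character. As a sketch, assuming Python 3 and its "shift_jis" codec: the katakana letter ソ is encoded in Shift JIS as two bytes, the second of which equals the ASCII backslash.

```python
needle = b"\\"  # an ASCII backslash, as a path-handling routine might search for

utf8_text = "ソ".encode("utf-8")
sjis_text = "ソ".encode("shift_jis")

# In UTF-8 no byte of a multi-byte character falls in the ASCII range.
print(needle in utf8_text)

# In Shift JIS the trailing byte of this character is 0x5C, a false match.
print(needle in sjis_text)
```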

The Quick Brown Fox… and other Pangrams

These were traditionally used in typewriter instruction; now they are useful for stress-testing computer fonts and keyboard input methods. Here are a few examples.

English: The quick brown fox jumps over the lazy dog.
Jamaican: Chruu, a kwik di kwik brong fox a jomp huova di liezi daag de, yu no siit?
Irish: "An ḃfuil do ċroí ag bualaḋ ó ḟaitíos an ġrá a ṁeall lena ṗóg éada ó ṡlí do leasa ṫú?" "D’ḟuascail Íosa Úrṁac na hÓiġe Beannaiṫe pór Éava agus Áḋaiṁ."
Dutch: Pa’s wijze lynx bezag vroom het fikse aquaduct.
German: Falsches Üben von Xylophonmusik quält jeden größeren Zwerg. (1)
German: Im finſteren Jagdſchloß am offenen Felsquellwaſſer patzte der affig-flatterhafte kauzig-höfliche Bäcker über ſeinem verſifften kniffligen C-Xylophon. (2)
Norwegian: Blåbærsyltetøy ("blueberry jam", includes every extra letter used in Norwegian).
Swedish: Flygande bäckasiner söka strax hwila på mjuka tuvor.
Icelandic: Sævör grét áðan því úlpan var ónýt.
Finnish: (5) Törkylempijävongahdus (This is a perfect pangram, every letter appears only once. Translating it is an art on its own, but I’ll say "rude lover’s yelp". :-D)
Finnish: (5) Albert osti fagotin ja töräytti puhkuvan melodian. (Albert bought a bassoon and hooted an impressive melody.)
Finnish: (5) On sangen hauskaa, että polkupyörä on maanteiden jokapäiväinen ilmiö. (It’s pleasantly amusing, that the bicycle is an everyday sight on the roads.)
Polish: Pchnąć w tę łódź jeża lub osiem skrzyń fig.
Czech: Příliš žluťoučký kůň úpěl ďábelské kódy.
Slovak: Starý kôň na hŕbe kníh žuje tíško povadnuté ruže, na stĺpe sa ďateľ učí kvákať novú ódu o živote.
Greek (monotonic): ξεσκεπάζω την ψυχοφθόρα βδελυγμία
Greek (polytonic): ξεσκεπάζω τὴν ψυχοφθόρα βδελυγμία
Russian: Съешь же ещё этих мягких французских булок да выпей чаю.
Russian: В чащах юга жил-был цитрус? Да, но фальшивый экземпляр! ёъ.
Bulgarian: Жълтата дюля беше щастлива, че пухът, който цъфна, замръзна като гьон.
Sami (Northern): Vuol Ruoŧa geđggiid leat máŋga luosa ja čuovžža.
Hungarian: Árvíztűrő tükörfúrógép.
Spanish: El pingüino Wenceslao hizo kilómetros bajo exhaustiva lluvia y frío, añoraba a su querido cachorro.
Portuguese: O próximo vôo à noite sobre o Atlântico, põe freqüentemente o único médico. (3)
French: Les naïfs ægithales hâtifs pondant à Noël où il gèle sont sûrs d’être déçus en voyant leurs drôles d’œufs abîmés.
Esperanto: Eĥoŝanĝo ĉiuĵaŭde.
Hebrew: זה כיף סתם לשמוע איך תנצח קרפד עץ טוב בגן.
