
Invalid characters

Sent by Lachlan Hunt on 4 September 2004 12:12


David Leader wrote:

> Bruce asked about a program to 'fix' smart quotes to provide valid HTML 
> characters. The answers he got told him how to turn off smart quotes in 
> Word (ugh!)...

In my experience, there's no problem with smart quotes in Word if you 
know how to handle them.  The issue occurs when the declared character 
encoding differs from the encoding the file was actually saved in.

The problem is that Word and Notepad use windows-1252 as the default 
character encoding.  (This seems to be the case in Notepad when ANSI is 
selected while saving files with these quotes.)

The left and right double quotes are saved as the bytes 0x93 (decimal 
147) and 0x94 (148).  These are the windows-1252 codes that are mapped 
[1] to U+201C (8220) and U+201D (8221) respectively.  So, if the 
character encoding is declared as windows-1252, there is no problem.  The 
problem occurs when the character encoding is declared as iso-8859-1, 
us-ascii or another encoding that uses those bytes for other 
characters, or does not use them at all.
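
The mapping is easy to check for yourself.  What follows is only a 
minimal sketch in Python (the language and the variable names are my 
own choice for illustration, nothing from this thread); it relies on 
Python's built-in "cp1252" codec, which is its name for windows-1252.

```python
# Bytes that Word and Notepad write for the left and right double quotes
# when the file is saved as windows-1252 ("ANSI").
raw = bytes([0x93, 0x94])

# Decoding with the correct encoding yields the intended characters.
decoded = raw.decode("cp1252")

code_points = [f"U+{ord(ch):04X}" for ch in decoded]
decimal_values = [ord(ch) for ch in decoded]

print(code_points)     # ['U+201C', 'U+201D']
print(decimal_values)  # [8220, 8221]
```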

In iso-8859-1, these bytes are defined as control characters [2].  In 
us-ascii, these bytes aren't used since it only uses from 0x00 to 0x7F 
(127).  These quotes do not exist as characters in either of these 
character encodings.
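
To illustrate, here is a small Python sketch (again an assumption on my 
part that Python is handy; the variable names are hypothetical) that 
decodes the same two bytes under the other two encodings.  It uses the 
standard unicodedata module to show that ISO-8859-1 turns them into 
control characters, and that a strict US-ASCII decoder rejects them.

```python
import unicodedata

raw = bytes([0x93, 0x94])

# ISO-8859-1 maps every byte to the code point with the same value,
# so 0x93 and 0x94 become U+0093 and U+0094.
as_latin1 = raw.decode("iso-8859-1")
categories = [unicodedata.category(ch) for ch in as_latin1]
print(categories)  # ['Cc', 'Cc']  (Cc = control character)

# US-ASCII only defines 0x00 to 0x7F, so strict decoding fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)  # False
```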

When I tested this, I served the same file as ISO-8859-1, US-ASCII and 
Windows-1252.  These were the results:
Firefox 0.9.3 and Opera 7.52
   ISO-8859-1:   Displayed quotes
   US-ASCII:     Displayed quotes
   Windows-1252: Displayed quotes

IE6 WinXP SP2
   ISO-8859-1:   Displayed quotes
   US-ASCII:     Displayed invalid characters
   Windows-1252: Displayed quotes

W3C Markup Validator
   ISO-8859-1:   Invalid Characters Found
   US-ASCII:     Invalid Characters Found
   Windows-1252: Validated Correctly

Thus, the simple solution is to declare the character encoding as 
windows-1252, but I do not recommend that.  The better solution is 
to use UTF-8 (recommended) or, if you have to use ISO-8859-1, to use 
decimal or hexadecimal character references as already pointed out 
(below).  If you decide to use UTF-8, make sure your editor actually 
saves the file as UTF-8.  Don't just put that in the meta element 
or change the HTTP charset parameter and expect it to work.
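
One way to check what an editor really saved is to test whether the 
bytes decode as UTF-8.  The following is a rough Python sketch, not a 
definitive tool; the function names is_valid_utf8 and cp1252_to_utf8 
are hypothetical, and the conversion assumes the input really is 
windows-1252.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes saved as windows-1252 into UTF-8."""
    return data.decode("cp1252").encode("utf-8")


# "Hello" in smart quotes, as Word would save it in windows-1252.
saved_as_cp1252 = b"\x93Hello\x94"
print(is_valid_utf8(saved_as_cp1252))  # False

converted = cp1252_to_utf8(saved_as_cp1252)
print(converted)                 # b'\xe2\x80\x9cHello\xe2\x80\x9d'
print(is_valid_utf8(converted))  # True
```

In practice you would read the bytes from the file opened in binary 
mode.  Note that a few byte values (0x81, for example) are undefined 
in windows-1252, so the conversion can raise an error on input that is 
not really in that encoding.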

> Left double quotation mark;        &#8220;
> Right double quotation mark;        &#8221;
> Left single quotation mark;        &#8216;
> Right single quotation mark;        &#8217;

> ...and for real typographic style:
> 
> 'en dash'                &#8211;
> 'em dash'                &#8212;
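
As for a program to 'fix' these characters, a few lines of Python 
would do as a starting point.  This is only a sketch under the 
assumption that the text has already been decoded correctly; the 
function name to_char_refs is hypothetical.  It uses the standard 
"xmlcharrefreplace" error handler, which replaces every character 
outside US-ASCII with a decimal character reference.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Smart double quotes, an en dash, an em dash and smart single quotes.
sample = "\u201cquoted\u201d \u2013 \u2014 \u2018x\u2019"
print(to_char_refs(sample))
# &#8220;quoted&#8221; &#8211; &#8212; &#8216;x&#8217;
```

The result contains only US-ASCII bytes, so it is safe whichever of 
the three encodings discussed above is declared.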

If you require any other characters, then the full list is available in 
the Unicode Code Charts [3].  Remember that the codes in those charts 
are in hexadecimal, not decimal, so if you need to use them then either 
convert them to decimal and write them as decimal character references 
(&#8220; for the left double quote), or use hexadecimal references such 
as &#x201C; and &#x201D; for whatever character you are using.
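
The conversion itself is simple arithmetic.  As a small worked example 
in Python (the variable names are only illustrative), this takes a 
code from the charts, converts it to decimal, builds both forms of 
reference and confirms with the standard html module that they name 
the same character.

```python
import html

hex_code = "201C"               # as listed in the Unicode code charts
code_point = int(hex_code, 16)  # hexadecimal to decimal

decimal_ref = f"&#{code_point};"
hex_ref = f"&#x{hex_code};"

print(code_point)   # 8220
print(decimal_ref)  # &#8220;
print(hex_ref)      # &#x201C;

# Both references resolve to the left double quotation mark.
same = html.unescape(decimal_ref) == html.unescape(hex_ref) == "\u201c"
print(same)         # True
```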

For more information, Jukka Korpela's character histories [4] and Ian 
Hickson's Unicode character encoder, decoder and identifier [5] are very 
useful resources for various character encoding issues.

[1] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
[2] http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
[3] http://www.unicode.org/charts/
[4] http://www.cs.tut.fi/~jkorpela/chars.html
[5] http://software.hixie.ch/utilities/cgi/unicode-decoder/

-- 
Lachlan Hunt
http://www.lachy.id.au/

______________________________________________________________________
css-discuss [EMAIL-REMOVED]
http://www.css-discuss.org/mailman/listinfo/css-d
List wiki/FAQ -- http://css-discuss.incutio.com/
Supported by evolt.org -- http://www.evolt.org/help_support_evolt/