Sent by Lachlan Hunt on 4 September 2004 12:12
David Leader wrote:
> Bruce asked about a program to 'fix' smart quotes to provide valid HTML
> characters. The answers he got told him how to turn off smart quotes in
> Word (ugh!)...
From my experience, there's no problem with smart quotes in Word if you
know how to handle them. The issue occurs when the declared character
encoding differs from the encoding the file is actually saved in.
The problem is that Word and Notepad use windows-1252 as the default
character encoding. (This seems to be the case in Notepad when ANSI is
selected while saving files containing these quotes.)
The left and right double quotes are saved as the bytes 0x93 (decimal
147) and 0x94 (148). These are the windows-1252 codes that are mapped
[1] to U+201C (8220) and U+201D (8221) respectively. So, if the
character encoding is declared as windows-1252, there is no problem. The
problem occurs when the character encoding is declared as iso-8859-1,
us-ascii or any other encoding that uses those bytes for other
characters, or does not use them at all.
In iso-8859-1, these bytes are defined as control characters [2]. In
us-ascii, these bytes aren't used since it only uses from 0x00 to 0x7F
(127). These quotes do not exist as characters in either of these
character encodings.
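
To make the byte-level difference concrete, here is a small Python
sketch (my own illustration, not part of the original test) that decodes
the same two bytes under each of the three encodings. The helper name
decode_as is hypothetical; the codec names are Python's standard
aliases.

```python
# The two bytes Word writes for curly double quotes under windows-1252.
SMART_QUOTES = b"\x93\x94"


def decode_as(data: bytes, encoding: str) -> str:
    """Decode data and list the resulting code points, or report failure."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # us-ascii has no meaning for bytes above 0x7F.
        return "undecodable"
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for enc in ("windows-1252", "iso-8859-1", "us-ascii"):
    print(enc, "->", decode_as(SMART_QUOTES, enc))
# windows-1252 -> U+201C U+201D   (the curly quotes)
# iso-8859-1 -> U+0093 U+0094     (C1 control characters)
# us-ascii -> undecodable
```

Browsers that show quotes for an iso-8859-1 page are presumably treating
it as windows-1252 anyway, which would explain the results that follow.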
When I tested this, I served the same file as ISO-8859-1, US-ASCII and
Windows-1252. These were the results:

Firefox 0.9.3 and Opera 7.52
  ISO-8859-1:   Displayed quotes
  US-ASCII:     Displayed quotes
  Windows-1252: Displayed quotes

IE6 WinXP SP2
  ISO-8859-1:   Displayed quotes
  US-ASCII:     Displayed invalid characters
  Windows-1252: Displayed quotes

W3C Markup Validator
  ISO-8859-1:   Invalid characters found
  US-ASCII:     Invalid characters found
  Windows-1252: Validated correctly

Thus, the simple solution is to declare the character encoding as
windows-1252; however, I do not recommend that. The better solution is
to use UTF-8 (recommended) or, if you have to use ISO-8859-1, to use
decimal or hexadecimal character references as already pointed out
(below). If you decide to use UTF-8, make sure your editor actually
does save the file as UTF-8. Don't just put "utf-8" in the meta element
or change the HTTP charset parameter and expect it to work.
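
If the editor can't save as UTF-8, an existing windows-1252 file can be
re-encoded by a script. This is only a sketch under the assumption that
the input really is windows-1252; the function name cp1252_to_utf8 is
made up for the example.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode windows-1252 bytes as UTF-8.

    Raises UnicodeDecodeError if the input contains one of the few
    byte values that windows-1252 leaves undefined (e.g. 0x81).
    """
    return data.decode("windows-1252").encode("utf-8")


sample = b"\x93Hello\x94"          # curly-quoted text as Word saves it
converted = cp1252_to_utf8(sample)
print(converted)                   # b'\xe2\x80\x9cHello\xe2\x80\x9d'
```

Each curly quote becomes three bytes in UTF-8, so the declared charset
and the saved bytes have to change together.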
> Left double quotation mark; &#8220;
> Right double quotation mark; &#8221;
> Left single quotation mark; &#8216;
> Right single quotation mark; &#8217;
> ...and for real typographic style:
>
> 'en dash' &#8211;
> 'em dash' &#8212;
If you require any other characters, then the full list is available in
the Unicode Code Charts [3]. Remember that the codes in those charts
are in hexadecimal, not decimal, so if you need to use them then either
convert them to decimal and encode them as shown above, or use the
hexadecimal form directly: &#x201C;, &#x201D; or whatever character you
are using.
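
For anyone who wants to script the conversion Bruce originally asked
about, the following Python sketch replaces every non-ASCII character
with a numeric character reference, in either decimal or hexadecimal
form. The function to_char_refs is a hypothetical helper written for
this example, and it assumes the text has already been decoded
correctly.

```python
def to_char_refs(text: str, hexadecimal: bool = False) -> str:
    """Replace each non-ASCII character with a numeric character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)               # plain ASCII passes through
        elif hexadecimal:
            out.append(f"&#x{cp:X};")    # e.g. &#x201C;
        else:
            out.append(f"&#{cp};")       # e.g. &#8220;
    return "".join(out)


quoted = "\u201cHi\u201d \u2013 ok"
print(to_char_refs(quoted))                    # &#8220;Hi&#8221; &#8211; ok
print(to_char_refs(quoted, hexadecimal=True))  # &#x201C;Hi&#x201D; &#x2013; ok
print(int("201C", 16))                         # 8220, chart value in decimal
```

The output is pure ASCII, so it stays valid whichever of the encodings
discussed above the page declares.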
For more information, Jukka Korpela's character histories [4] and Ian
Hickson's Unicode character encoder, decoder and identifier [5] are very
useful resources for various character encoding issues.
[1] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
[2] http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
[3] http://www.unicode.org/charts/
[4] http://www.cs.tut.fi/~jkorpela/chars.html
[5] http://software.hixie.ch/utilities/cgi/unicode-decoder/
--
Lachlan Hunt
http://www.lachy.id.au/
______________________________________________________________________
css-discuss [EMAIL-REMOVED]]
http://www.css-discuss.org/mailman/listinfo/css-d
List wiki/FAQ -- http://css-discuss.incutio.com/
Supported by evolt.org -- http://www.evolt.org/help_support_evolt/