Package org.w3c.tidy
Class EncodingUtils
java.lang.Object
org.w3c.tidy.EncodingUtils
Version:
$Revision: 622 $ ($Author: fgiust $)
Author:
Fabrizio Giustina
Field Summary
Fields
Modifier and Type    Field                         Description
static final int     FSM_ASCII                     States for ISO 2022. A document in an ISO-2022 based encoding uses ESC sequences called "designators" to switch character sets.
static final int     FSM_ESC                       State ESC.
static final int     FSM_ESCD                      State ESCD.
static final int     FSM_ESCDP                     State ESCDP.
static final int     FSM_ESCP                      State ESCP.
static final int     FSM_NONASCII                  State NONASCII.
static final int     HIGH_UTF16_SURROGATE          UTF-16 high surrogate.
static final int     LOW_UTF16_SURROGATE           UTF-16 low surrogate.
static final int     MAX_UTF16_FROM_UCS4           Max UTF-16 value.
static final int     MAX_UTF8_FROM_UCS4            Max UTF-8 valid char value.
static final int     UNICODE_BOM                   The default (big-endian) Unicode BOM.
static final int     UNICODE_BOM_BE                The big-endian (default) Unicode BOM.
static final int     UNICODE_BOM_LE                The little-endian Unicode BOM.
static final int     UNICODE_BOM_UTF8              The UTF-8 Unicode BOM.
static final int     UTF16_HIGH_SURROGATE_BEGIN    UTF-16 surrogate pair areas: high surrogates begin.
static final int     UTF16_HIGH_SURROGATE_END      UTF-16 surrogate pair areas: high surrogates end.
static final int     UTF16_LOW_SURROGATE_BEGIN     UTF-16 surrogate pair areas: low surrogates begin.
static final int     UTF16_LOW_SURROGATE_END       UTF-16 surrogate pair areas: low surrogates end.
static final int     UTF16_SURROGATES_BEGIN        UTF-16 surrogates begin.
Method Summary
Modifier and Type       Method                   Description
protected static int    decodeMacRoman(int c)    Function to convert from MacRoman to Unicode.
protected static int    decodeWin1252(int c)     Function for conversion from Windows-1252 to Unicode.
-
Field Details
-
UNICODE_BOM_BE
public static final int UNICODE_BOM_BE
The big-endian (default) Unicode BOM.
-
UNICODE_BOM
public static final int UNICODE_BOM
The default (big-endian) Unicode BOM.
-
UNICODE_BOM_LE
public static final int UNICODE_BOM_LE
The little-endian Unicode BOM.
-
UNICODE_BOM_UTF8
public static final int UNICODE_BOM_UTF8
The UTF-8 Unicode BOM.
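The four BOM constants above identify an encoding from the first bytes of a stream. The following is a minimal sketch of that idea, not the library's implementation: `BomSniffer` and `detect` are hypothetical names, and the byte sequences are the signatures defined by the Unicode standard rather than values read from `EncodingUtils`, whose constant values this page does not show.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of org.w3c.tidy. */
final class BomSniffer {

    private BomSniffer() {
    }

    /**
     * Returns the charset signalled by a leading byte order mark,
     * or null when the buffer does not start with one.
     */
    static Charset detect(byte[] head) {
        // UTF-8 signature: EF BB BF (checked first because it is three bytes long)
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int first = head[0] & 0xFF;
            int second = head[1] & 0xFF;
            // U+FEFF serialized big-endian is FE FF
            if (first == 0xFE && second == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            // U+FEFF serialized little-endian is FF FE
            if (first == 0xFF && second == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(detect(sample));
    }
}
```

Masking each byte with `0xFF` matters because Java bytes are signed; without it, `0xEF` would never compare equal.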
-
FSM_ASCII
public static final int FSM_ASCII
States for ISO 2022. A document in an ISO-2022 based encoding uses ESC sequences called "designators" to switch character sets. The designators defined and used in ISO-2022-JP are:
"ESC" + "(" + ? for ISO646 variants
"ESC" + "$" + ? and "ESC" + "$" + "(" + ? for multibyte character sets
State ASCII.
-
FSM_ESC
public static final int FSM_ESC
State ESC.
-
FSM_ESCD
public static final int FSM_ESCD
State ESCD.
-
FSM_ESCDP
public static final int FSM_ESCDP
State ESCDP.
-
FSM_ESCP
public static final int FSM_ESCP
State ESCP.
-
FSM_NONASCII
public static final int FSM_NONASCII
State NONASCII.
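The six FSM states can be read as a small state machine driven by the ISO-2022-JP designators listed under FSM_ASCII. The sketch below is one plausible reading of those transitions, assuming ESCD means "ESC $ seen", ESCP means "ESC ( seen" and ESCDP means "ESC $ ( seen"; the page does not spell this out, and `Iso2022Sketch`, its enum and `feed` are hypothetical names rather than library API.

```java
/** Hypothetical illustration of the ISO-2022 designator states; not library code. */
final class Iso2022Sketch {

    /** Mirrors the FSM_* constant names; the numeric values are not modelled here. */
    enum State { ASCII, ESC, ESCD, ESCDP, ESCP, NONASCII }

    private State state = State.ASCII;

    /** Consumes one byte and returns the state the machine is in afterwards. */
    State feed(int c) {
        if (c == 0x1B) {
            // ESC always starts a new designator, whatever the current state
            state = State.ESC;
            return state;
        }
        switch (state) {
            case ESC:
                if (c == '$') {
                    state = State.ESCD;   // "ESC $": multibyte designator begins
                } else if (c == '(') {
                    state = State.ESCP;   // "ESC (": ISO646 variant designator
                } else {
                    state = State.ASCII;  // assumption: unknown sequences fall back to ASCII
                }
                break;
            case ESCD:
                // "ESC $ (" needs one more byte; "ESC $ ?" is already complete
                state = (c == '(') ? State.ESCDP : State.NONASCII;
                break;
            case ESCDP:
                state = State.NONASCII;   // final byte of "ESC $ ( ?"
                break;
            case ESCP:
                state = State.ASCII;      // final byte of "ESC ( ?"
                break;
            default:
                // ASCII and NONASCII persist until the next ESC
                break;
        }
        return state;
    }

    public static void main(String[] args) {
        Iso2022Sketch fsm = new Iso2022Sketch();
        for (int c : new int[] {0x1B, '$', 'B', 0x30, 0x1B, '(', 'B'}) {
            System.out.println(fsm.feed(c));
        }
    }
}
```

A caller would treat bytes seen in the NONASCII state as parts of multibyte characters and bytes seen in the ASCII state as single characters.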
-
MAX_UTF8_FROM_UCS4
public static final int MAX_UTF8_FROM_UCS4
Max UTF-8 valid char value.
-
MAX_UTF16_FROM_UCS4
public static final int MAX_UTF16_FROM_UCS4
Max UTF-16 value.
-
LOW_UTF16_SURROGATE
public static final int LOW_UTF16_SURROGATE
UTF-16 low surrogate.
-
UTF16_SURROGATES_BEGIN
public static final int UTF16_SURROGATES_BEGIN
UTF-16 surrogates begin.
-
UTF16_LOW_SURROGATE_BEGIN
public static final int UTF16_LOW_SURROGATE_BEGIN
UTF-16 surrogate pair areas: low surrogates begin.
-
UTF16_LOW_SURROGATE_END
public static final int UTF16_LOW_SURROGATE_END
UTF-16 surrogate pair areas: low surrogates end.
-
UTF16_HIGH_SURROGATE_BEGIN
public static final int UTF16_HIGH_SURROGATE_BEGIN
UTF-16 surrogate pair areas: high surrogates begin.
-
UTF16_HIGH_SURROGATE_END
public static final int UTF16_HIGH_SURROGATE_END
UTF-16 surrogate pair areas: high surrogates end.
-
HIGH_UTF16_SURROGATE
public static final int HIGH_UTF16_SURROGATE
UTF-16 high surrogate.
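The surrogate constants above describe the ranges UTF-16 uses to encode code points beyond the Basic Multilingual Plane. As a worked example of that arithmetic, the sketch below splits a supplementary code point into two 16-bit units and joins them again. The class name, method names and constant values are assumptions taken from the Unicode standard; this page does not show the values of the `EncodingUtils` fields, nor which of its LOW/HIGH names corresponds to the first unit of a pair, so the sketch uses the neutral terms "lead" and "trail".

```java
/** Hypothetical worked example of UTF-16 surrogate arithmetic; not library code. */
final class SurrogateSketch {

    // Values from the Unicode standard, not read from EncodingUtils
    static final int SUPPLEMENTARY_BEGIN = 0x10000;
    static final int MAX_CODE_POINT = 0x10FFFF;
    static final int LEAD_BEGIN = 0xD800;   // first unit of a pair: D800..DBFF
    static final int TRAIL_BEGIN = 0xDC00;  // second unit of a pair: DC00..DFFF

    private SurrogateSketch() {
    }

    /** Splits a supplementary code point into {lead, trail}. */
    static int[] split(int codePoint) {
        if (codePoint < SUPPLEMENTARY_BEGIN || codePoint > MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - SUPPLEMENTARY_BEGIN;  // 20 significant bits
        int lead = LEAD_BEGIN + (offset >> 10);        // top 10 bits
        int trail = TRAIL_BEGIN + (offset & 0x3FF);    // bottom 10 bits
        return new int[] {lead, trail};
    }

    /** Joins a lead and trail unit back into a code point. */
    static int combine(int lead, int trail) {
        return SUPPLEMENTARY_BEGIN + ((lead - LEAD_BEGIN) << 10) + (trail - TRAIL_BEGIN);
    }

    public static void main(String[] args) {
        int[] pair = split(0x1F600);
        System.out.printf("%04X %04X%n", pair[0], pair[1]);
    }
}
```

The JDK offers the same operations as `Character.toChars` and `Character.toCodePoint`, which are the better choice outside of an illustration.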
-
-
Method Details
-
decodeWin1252
protected static int decodeWin1252(int c)
Function for conversion from Windows-1252 to Unicode.
Parameters:
c - char to decode
Returns:
decoded char
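Because `decodeWin1252` is protected and this page does not show its body, the following is only a sketch of what a Windows-1252 to Unicode conversion involves, not the library's code. Windows-1252 matches ISO-8859-1 except in the range 0x80 to 0x9F, where it assigns printable characters. `Win1252Sketch` and `decode` are hypothetical names, and passing the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) through unchanged is an assumption; the library may treat them differently.

```java
/** Hypothetical illustration of Windows-1252 decoding; not library code. */
final class Win1252Sketch {

    // Unicode code points for the bytes 0x80..0x9F in Windows-1252.
    // Unassigned bytes map to themselves (an assumption of this sketch).
    private static final int[] HIGH_RANGE = {
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
    };

    private Win1252Sketch() {
    }

    /** Maps one Windows-1252 byte value (0..255) to a Unicode code point. */
    static int decode(int c) {
        if (c >= 0x80 && c <= 0x9F) {
            return HIGH_RANGE[c - 0x80];
        }
        // Everything else coincides with ISO-8859-1, hence with Unicode
        return c;
    }

    public static void main(String[] args) {
        System.out.printf("%04X%n", decode(0x80));
    }
}
```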
-
decodeMacRoman
protected static int decodeMacRoman(int c)
Function to convert from MacRoman to Unicode.
Parameters:
c - char to decode
Returns:
decoded char
-