Package org.w3c.tidy

Class EncodingUtils


  • public final class EncodingUtils
    extends java.lang.Object
    Version:
    $Revision: 622 $ ($Author: fgiust $)
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      (package private) static interface  EncodingUtils.GetBytes
      Getter callback: called to retrieve 1 or more additional UTF-8 bytes.
      (package private) static interface  EncodingUtils.PutBytes
      Putter callback: called to store 1 or more additional UTF-8 bytes.
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private EncodingUtils()
      Private constructor; this utility class is not meant to be instantiated.
    • Method Summary

      Modifier and Type Method Description
      protected static int decodeMacRoman(int c)
      Function to convert from MacRoman to Unicode.
      (package private) static int decodeSymbolFont(int c)
      Function to convert from Symbol Font chars to Unicode.
      (package private) static boolean decodeUTF8BytesToChar(int[] c, int firstByte, byte[] successorBytes, EncodingUtils.GetBytes getter, int[] count, int startInSuccessorBytesArray)
      Decodes an array of bytes to a char.
      protected static int decodeWin1252(int c)
      Function for conversion from Windows-1252 to Unicode.
      (package private) static boolean encodeCharToUTF8Bytes(int c, byte[] encodebuf, EncodingUtils.PutBytes putter, int[] count)
      Encode a char to an array of bytes.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • UNICODE_BOM_BE

        public static final int UNICODE_BOM_BE
        The big-endian (default) Unicode BOM.
        See Also:
        Constant Field Values
      • UNICODE_BOM

        public static final int UNICODE_BOM
        The default (big-endian) Unicode BOM.
        See Also:
        Constant Field Values
      • UNICODE_BOM_LE

        public static final int UNICODE_BOM_LE
        The little-endian Unicode BOM.
        See Also:
        Constant Field Values
      • UNICODE_BOM_UTF8

        public static final int UNICODE_BOM_UTF8
        The UTF-8 Unicode BOM.
        See Also:
        Constant Field Values
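        The numeric values of the BOM constants are not shown on this page. As a rough illustration of how such constants are typically used, the sketch below detects a BOM at the start of a byte array. The class BomSketch, its detect method, and the three constant values are assumptions taken from the Unicode standard (U+FEFF, its byte-swapped form, and the UTF-8 byte sequence EF BB BF); they are not copied from EncodingUtils.

```java
// Hypothetical helper for illustration only; not part of org.w3c.tidy.
final class BomSketch {
    // Assumed values from the Unicode standard, not taken from this page.
    static final int BOM_BE = 0xFEFF;     // bytes FE FF
    static final int BOM_LE = 0xFFFE;     // bytes FF FE (byte-swapped U+FEFF)
    static final int BOM_UTF8 = 0xEFBBBF; // bytes EF BB BF

    /** Returns "UTF-8", "UTF-16BE", "UTF-16LE", or null when no BOM is present. */
    static String detect(byte[] b) {
        if (b.length >= 3) {
            int three = ((b[0] & 0xFF) << 16) | ((b[1] & 0xFF) << 8) | (b[2] & 0xFF);
            if (three == BOM_UTF8) {
                return "UTF-8";
            }
        }
        if (b.length >= 2) {
            int two = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
            if (two == BOM_BE) {
                return "UTF-16BE";
            }
            if (two == BOM_LE) {
                return "UTF-16LE";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        System.out.println(detect(sample));
    }
}
```

        The UTF-8 check runs first because it needs three bytes; the two UTF-16 checks only look at the first two.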
      • FSM_ASCII

        public static final int FSM_ASCII
        States for ISO 2022. A document in an ISO-2022-based encoding uses ESC sequences called "designators" to switch character sets. The designators defined and used in ISO-2022-JP are:
          "ESC" + "(" + ? for ISO646 variants;
          "ESC" + "$" + ? and "ESC" + "$" + "(" + ? for multibyte character sets.
        This constant is the ASCII state.
        See Also:
        Constant Field Values
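        The designator shapes described for FSM_ASCII can be sketched as a small classifier. This is only an illustration under stated assumptions: the class Iso2022Sketch, its Designator enum, and the classify method are hypothetical names, and the code does not reproduce the state machine JTidy actually uses.

```java
// Hypothetical helper for illustration only; not part of org.w3c.tidy.
final class Iso2022Sketch {
    enum Designator { ISO646_VARIANT, MULTIBYTE, NONE }

    private static final byte ESC = 0x1B;

    /** Classifies the escape sequence starting at data[pos]. */
    static Designator classify(byte[] data, int pos) {
        // Every designator needs at least three bytes: ESC, an intermediate, a final byte.
        if (pos < 0 || pos + 2 >= data.length || data[pos] != ESC) {
            return Designator.NONE;
        }
        byte second = data[pos + 1];
        if (second == '(') {
            return Designator.ISO646_VARIANT;          // ESC ( ?
        }
        if (second == '$') {
            if (data[pos + 2] != '(') {
                return Designator.MULTIBYTE;           // ESC $ ?
            }
            // ESC $ ( ? needs one more byte for the final character.
            return pos + 3 < data.length ? Designator.MULTIBYTE : Designator.NONE;
        }
        return Designator.NONE;
    }

    public static void main(String[] args) {
        byte[] toAscii = {0x1B, '(', 'B'};
        byte[] toKanji = {0x1B, '$', 'B'};
        System.out.println(classify(toAscii, 0));
        System.out.println(classify(toKanji, 0));
    }
}
```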
      • MAX_UTF8_FROM_UCS4

        public static final int MAX_UTF8_FROM_UCS4
        Maximum valid UTF-8 char value.
        See Also:
        Constant Field Values
      • MAX_UTF16_FROM_UCS4

        public static final int MAX_UTF16_FROM_UCS4
        Max UTF-16 value.
        See Also:
        Constant Field Values
      • LOW_UTF16_SURROGATE

        public static final int LOW_UTF16_SURROGATE
        UTF-16 low surrogate.
        See Also:
        Constant Field Values
      • UTF16_SURROGATES_BEGIN

        public static final int UTF16_SURROGATES_BEGIN
        UTF-16 surrogates begin.
        See Also:
        Constant Field Values
      • UTF16_LOW_SURROGATE_BEGIN

        public static final int UTF16_LOW_SURROGATE_BEGIN
        UTF-16 surrogate pair areas: low surrogates begin.
        See Also:
        Constant Field Values
      • UTF16_LOW_SURROGATE_END

        public static final int UTF16_LOW_SURROGATE_END
        UTF-16 surrogate pair areas: low surrogates end.
        See Also:
        Constant Field Values
      • UTF16_HIGH_SURROGATE_BEGIN

        public static final int UTF16_HIGH_SURROGATE_BEGIN
        UTF-16 surrogate pair areas: high surrogates begin.
        See Also:
        Constant Field Values
      • UTF16_HIGH_SURROGATE_END

        public static final int UTF16_HIGH_SURROGATE_END
        UTF-16 surrogate pair areas: high surrogates end.
        See Also:
        Constant Field Values
      • HIGH_UTF16_SURROGATE

        public static final int HIGH_UTF16_SURROGATE
        UTF-16 high surrogate.
        See Also:
        Constant Field Values
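        The values of the surrogate constants are not listed on this page. To illustrate what surrogate range boundaries are generally used for, the sketch below combines and splits a UTF-16 surrogate pair using the ranges defined by the Unicode standard (D800 to DBFF for the leading unit, DC00 to DFFF for the trailing unit). The class SurrogateSketch and all of its names are hypothetical and are not guaranteed to correspond one-to-one to the fields above.

```java
// Hypothetical helper for illustration only; not part of org.w3c.tidy.
final class SurrogateSketch {
    // Ranges from the Unicode standard; assumed, not read from EncodingUtils.
    static final int LEAD_BEGIN = 0xD800;
    static final int LEAD_END = 0xDBFF;
    static final int TRAIL_BEGIN = 0xDC00;
    static final int TRAIL_END = 0xDFFF;

    static boolean isLead(int c) {
        return c >= LEAD_BEGIN && c <= LEAD_END;
    }

    static boolean isTrail(int c) {
        return c >= TRAIL_BEGIN && c <= TRAIL_END;
    }

    /** Combines a leading and a trailing surrogate into one code point. */
    static int combine(int lead, int trail) {
        if (!isLead(lead) || !isTrail(trail)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return ((lead - LEAD_BEGIN) << 10) + (trail - TRAIL_BEGIN) + 0x10000;
    }

    /** Splits a supplementary code point into {lead, trail}. */
    static int[] split(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;
        return new int[] {LEAD_BEGIN + (v >> 10), TRAIL_BEGIN + (v & 0x3FF)};
    }

    public static void main(String[] args) {
        int cp = combine(0xD83D, 0xDE00);
        System.out.printf("U+%X%n", cp);
    }
}
```

        The same arithmetic is available in the JDK as Character.toCodePoint, Character.highSurrogate and Character.lowSurrogate.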
      • UTF8_BYTE_SWAP_NOT_A_CHAR

        private static final int UTF8_BYTE_SWAP_NOT_A_CHAR
        UTF-8 byte swap: invalid char.
        See Also:
        Constant Field Values
      • UTF8_NOT_A_CHAR

        private static final int UTF8_NOT_A_CHAR
        UTF-8 invalid char.
        See Also:
        Constant Field Values
      • WIN2UNICODE

        private static final int[] WIN2UNICODE
        Mapping for Windows Western character set (128-159) to Unicode.
      • MAC2UNICODE

        private static final int[] MAC2UNICODE
        John Love-Jensen contributed this table for mapping MacRoman character set to Unicode.
      • SYMBOL2UNICODE

        private static final int[] SYMBOL2UNICODE
        Table mapping Symbol font characters to Unicode; undefined characters are mapped to 0x0000, and characters without a Unicode equivalent are mapped to '?'. Is this appropriate?
      • VALID_UTF8

        private static final ValidUTF8Sequence[] VALID_UTF8
        Array of valid UTF-8 sequences.
      • NUM_UTF8_SEQUENCES

        private static final int NUM_UTF8_SEQUENCES
        Number of valid UTF-8 sequences.
      • OFFSET_UTF8_SEQUENCES

        private static final int[] OFFSET_UTF8_SEQUENCES
        Offset for UTF-8 sequences.
    • Constructor Detail

      • EncodingUtils

        private EncodingUtils()
        Private constructor; this utility class is not meant to be instantiated.
    • Method Detail

      • decodeWin1252

        protected static int decodeWin1252(int c)
        Function for conversion from Windows-1252 to Unicode.
        Parameters:
        c - char to decode
        Returns:
        decoded char
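        As an illustration of this kind of conversion, the sketch below maps the Windows-1252 range 128-159 to Unicode with a lookup table and passes every other value through unchanged. The class Win1252Sketch is hypothetical, the table follows the published Windows-1252 code page rather than the private WIN2UNICODE array, and returning 0 for the five positions Windows-1252 leaves undefined is an assumption of this sketch.

```java
// Hypothetical helper for illustration only; not part of org.w3c.tidy.
final class Win1252Sketch {
    // Unicode code points for bytes 0x80..0x9F of Windows-1252.
    // 0x0000 marks the undefined positions (0x81, 0x8D, 0x8F, 0x90, 0x9D);
    // that choice is an assumption made for this sketch.
    private static final int[] TABLE = {
        0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
        0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
    };

    /** Maps 128..159 through the table; all other values are returned as-is. */
    static int decode(int c) {
        return (c >= 128 && c <= 159) ? TABLE[c - 128] : c;
    }

    public static void main(String[] args) {
        // 0x80 is the euro sign in Windows-1252.
        System.out.printf("U+%04X%n", decode(0x80));
    }
}
```

        Values outside 128-159 need no table because Windows-1252 agrees with ISO-8859-1, and therefore with the first 256 Unicode code points, everywhere else.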
      • decodeMacRoman

        protected static int decodeMacRoman(int c)
        Function to convert from MacRoman to Unicode.
        Parameters:
        c - char to decode
        Returns:
        decoded char
      • decodeSymbolFont

        static int decodeSymbolFont(int c)
        Function to convert from Symbol Font chars to Unicode.
        Parameters:
        c - char to decode
        Returns:
        decoded char
      • decodeUTF8BytesToChar

        static boolean decodeUTF8BytesToChar(int[] c,
                                             int firstByte,
                                             byte[] successorBytes,
                                             EncodingUtils.GetBytes getter,
                                             int[] count,
                                             int startInSuccessorBytesArray)
        Decodes an array of bytes to a char.
        Parameters:
        c - will contain the decoded char
        firstByte - first input byte
        successorBytes - array containing successor bytes (can be null if a getter is provided).
        getter - callback used to get new bytes if successorBytes doesn't contain enough bytes
        count - will contain the number of bytes read
        startInSuccessorBytesArray - starting offset for bytes in successorBytes
        Returns:
        true if an error occurred, false otherwise
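        The calling convention (output arrays for the decoded char and the byte count, and a boolean that is true on error) can be illustrated with a simplified decoder. This is a sketch under stated assumptions, not JTidy's implementation: the class Utf8DecodeSketch is hypothetical, the getter callback is left out because the signature of EncodingUtils.GetBytes is not shown on this page, and the validity checks follow the general UTF-8 rules (no overlong forms, no surrogates, nothing above U+10FFFF).

```java
// Hypothetical helper for illustration only; not part of org.w3c.tidy.
final class Utf8DecodeSketch {
    /**
     * Decodes one code point from firstByte plus the bytes in successors,
     * starting at index start. On success c[0] holds the code point.
     * count[0] holds the number of bytes examined, including firstByte.
     * Returns true on error, false on success.
     */
    static boolean decode(int[] c, int firstByte, byte[] successors, int start, int[] count) {
        int b0 = firstByte & 0xFF;
        int needed; // number of continuation bytes expected
        int value;  // payload bits collected so far
        int min;    // smallest code point allowed for this length
        if ((b0 & 0x80) == 0) {
            needed = 0; value = b0; min = 0;
        } else if ((b0 & 0xE0) == 0xC0) {
            needed = 1; value = b0 & 0x1F; min = 0x80;
        } else if ((b0 & 0xF0) == 0xE0) {
            needed = 2; value = b0 & 0x0F; min = 0x800;
        } else if ((b0 & 0xF8) == 0xF0) {
            needed = 3; value = b0 & 0x07; min = 0x10000;
        } else {
            count[0] = 1;
            return true; // stray continuation byte or invalid lead byte
        }
        int available = (successors == null) ? 0 : successors.length - start;
        if (needed > available) {
            count[0] = 1;
            return true; // truncated sequence
        }
        for (int i = 0; i < needed; i++) {
            int b = successors[start + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                count[0] = 1 + i;
                return true; // not a continuation byte
            }
            value = (value << 6) | (b & 0x3F);
        }
        count[0] = 1 + needed;
        boolean surrogate = value >= 0xD800 && value <= 0xDFFF;
        if (value < min || value > 0x10FFFF || surrogate) {
            return true; // overlong form, out of range, or surrogate
        }
        c[0] = value;
        return false;
    }

    public static void main(String[] args) {
        int[] c = new int[1];
        int[] count = new int[1];
        byte[] rest = {(byte) 0x82, (byte) 0xAC}; // continuation bytes of U+20AC
        boolean error = decode(c, 0xE2, rest, 0, count);
        System.out.printf("error=%b U+%04X bytes=%d%n", error, c[0], count[0]);
    }
}
```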
      • encodeCharToUTF8Bytes

        static boolean encodeCharToUTF8Bytes(int c,
                                             byte[] encodebuf,
                                             EncodingUtils.PutBytes putter,
                                             int[] count)
        Encode a char to an array of bytes.
        Parameters:
        c - char to encode
        encodebuf - will contain the encoded bytes
        putter - if not null, it will be called to write the bytes out
        count - number of bytes written
        Returns:
        false if OK, true on error
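        The encoding direction can be sketched in the same style. This is an illustration under stated assumptions, not JTidy's implementation: the class Utf8EncodeSketch is hypothetical, the putter callback is left out because the signature of EncodingUtils.PutBytes is not shown on this page, the buffer is assumed to hold at least four bytes, and the upper limit of 0x10FFFF is the Unicode maximum rather than a value read from MAX_UTF8_FROM_UCS4.

```java
// Hypothetical helper for illustration only; not part of org.w3c.tidy.
final class Utf8EncodeSketch {
    /**
     * Encodes code point c into buf (assumed length >= 4).
     * count[0] receives the number of bytes written.
     * Returns true on error, false on success.
     */
    static boolean encode(int c, byte[] buf, int[] count) {
        count[0] = 0;
        boolean surrogate = c >= 0xD800 && c <= 0xDFFF;
        if (c < 0 || c > 0x10FFFF || surrogate) {
            return true; // not an encodable Unicode scalar value
        }
        int n;
        if (c < 0x80) {
            buf[0] = (byte) c;
            n = 1;
        } else if (c < 0x800) {
            buf[0] = (byte) (0xC0 | (c >> 6));
            buf[1] = (byte) (0x80 | (c & 0x3F));
            n = 2;
        } else if (c < 0x10000) {
            buf[0] = (byte) (0xE0 | (c >> 12));
            buf[1] = (byte) (0x80 | ((c >> 6) & 0x3F));
            buf[2] = (byte) (0x80 | (c & 0x3F));
            n = 3;
        } else {
            buf[0] = (byte) (0xF0 | (c >> 18));
            buf[1] = (byte) (0x80 | ((c >> 12) & 0x3F));
            buf[2] = (byte) (0x80 | ((c >> 6) & 0x3F));
            buf[3] = (byte) (0x80 | (c & 0x3F));
            n = 4;
        }
        count[0] = n;
        return false;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4];
        int[] count = new int[1];
        boolean error = encode(0x20AC, buf, count);
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < count[0]; i++) {
            hex.append(String.format("%02X ", buf[i] & 0xFF));
        }
        System.out.println("error=" + error + " bytes=" + hex.toString().trim());
    }
}
```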