What international encodings are supported by Xerces-J?

- UTF-8
- UTF-16 Big Endian, UTF-16 Little Endian
- IBM-1208
- ISO Latin-1 (ISO-8859-1)
- ISO Latin-2 (ISO-8859-2) [Bosnian, Croatian, Czech, Hungarian, Polish, Romanian, Serbian (in Latin transcription), Serbocroatian, Slovak, Slovenian, Upper and Lower Sorbian]
- ISO Latin-3 (ISO-8859-3) [Maltese, Esperanto]
- ISO Latin-4 (ISO-8859-4)
- ISO Latin Cyrillic (ISO-8859-5)
- ISO Latin Arabic (ISO-8859-6)
- ISO Latin Greek (ISO-8859-7)
- ISO Latin Hebrew (ISO-8859-8)
- ISO Latin-5 (ISO-8859-9) [Turkish]
- ISO Latin-7 (ISO-8859-13)
- ISO Latin-9 (ISO-8859-15)
- Extended Unix Code, packed for Japanese (euc-jp, eucjis)
- Japanese Shift JIS (shift-jis)
- Chinese (big5)
- Chinese for PRC (mixed 1/2 byte) (gb2312)
- Japanese ISO-2022-JP (iso-2022-jp)
- Cyrillic (koi8-r)
- Extended Unix Code, packed for Korean (euc-kr)
- Russian Unix, Cyrillic (koi8-r)
- Windows Thai (cp874)
- Latin 1 Windows (cp1252) (and all other cp125? encodings recognized by IANA)
- cp858
- EBCDIC encodings:
  - EBCDIC US (ebcdic-cp-us)
  - EBCDIC Canada (ebcdic-cp-ca)
  - EBCDIC Netherlands (ebcdic-cp-nl)
  - EBCDIC Denmark (ebcdic-cp-dk)
  - EBCDIC Norway (ebcdic-cp-no)
  - EBCDIC Finland (ebcdic-cp-fi)
  - EBCDIC Sweden (ebcdic-cp-se)
  - EBCDIC Italy (ebcdic-cp-it)
  - EBCDIC Spain, Latin America (ebcdic-cp-es)
  - EBCDIC Great Britain (ebcdic-cp-gb)
  - EBCDIC France (ebcdic-cp-fr)
  - EBCDIC Hebrew (ebcdic-cp-he)
  - EBCDIC Switzerland (ebcdic-cp-ch)
  - EBCDIC Roece (ebcdic-cp-roece)
  - EBCDIC Yugoslavia (ebcdic-cp-yu)
  - EBCDIC Iceland (ebcdic-cp-is)
  - EBCDIC Urdu (ebcdic-cp-ar2)
  - Latin 0 EBCDIC
  - EBCDIC Arabic (ebcdic-cp-ar1)

UCS-4 is not yet supported, but it is hoped that support will be available soon.
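
As a rough illustration of how one of the listed encodings reaches the parser, the sketch below encodes a small document as bytes in a named encoding, declares that encoding in the XML declaration, and parses the bytes back through SAX. It goes through the standard JAXP `SAXParserFactory` rather than any Xerces-specific class, so which parser actually runs, and whether it accepts a given encoding name, depends on the runtime. The class name `EncodingDemo` and the method `roundTrip` are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EncodingDemo {
    /**
     * Builds a one-element document whose XML declaration names the given
     * encoding, encodes it to bytes in that encoding, parses the bytes, and
     * returns the character data the parser reported.
     */
    static String roundTrip(String encodingName, String text) {
        try {
            String xml = "<?xml version=\"1.0\" encoding=\"" + encodingName + "\"?>"
                       + "<word>" + text + "</word>";
            byte[] bytes = xml.getBytes(encodingName);

            final StringBuilder seen = new StringBuilder();
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Character data may arrive in several chunks, so append.
                    seen.append(ch, start, length);
                }
            };

            // The parser decodes the byte stream using the declared encoding.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(bytes)), handler);
            return seen.toString();
        } catch (Exception e) {
            // Unsupported encoding name, malformed input, or parser setup failure.
            throw new IllegalStateException("could not parse as " + encodingName, e);
        }
    }
}
```

Calling `EncodingDemo.roundTrip("ISO-8859-2", "\u0141\u00f3d\u017a")` should hand back the same four characters if the parser in use supports ISO Latin-2; an unsupported name surfaces as an `IllegalStateException` in this sketch.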

Is there any way I can determine what encoding an entity was written in, or what XML version the document conformed to, if I'm using SAX?

Yes, there is a way, but it is not particularly elegant. SAX 2.0.0
and 2.0.1 provide no way to obtain this information. The SAX
Locator2 interface from the 1.1 extensions (still in alpha at
the time of writing) does provide methods for this, but because
Sun TCK rules require Xerces to support precisely SAX 2.0.0,
we cannot ship that interface. We can, however, still support
the appropriate methods on the objects we provide to implement
the SAX Locator interface. Therefore, assuming locator is an
instance of the SAX Locator interface that Xerces has passed
back in a setDocumentLocator call, you can use a method like
this to determine the encoding of the entity currently being
parsed:
    import java.lang.reflect.Method;
    import org.xml.sax.Locator;

    private String getEncoding(Locator locator) {
        String encoding = null;
        try {
            // Look up a public, no-argument getEncoding() on the locator's class.
            // getMethod never returns null; it throws if the method is missing.
            Method getEncoding = locator.getClass().getMethod("getEncoding", new Class[0]);
            encoding = (String) getEncoding.invoke(locator, (Object[]) null);
        } catch (Exception e) {
            // either this locator object doesn't have this
            // method, or we're on an old JDK
        }
        return encoding;
    }
This code has the advantage that it will compile on JDK
1.1.8, though it will only produce non-null results on 1.2.x
JDKs and later. Substituting getXMLVersion for
getEncoding will enable you to determine the
version of XML to which the instance document conforms.
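
The substitution described above can be folded into a single helper that takes the method name as a parameter. The following is a minimal sketch, not part of Xerces: the class name `LocatorInfo` and the method `callStringGetter` are hypothetical, and the code uses newer Java syntax (varargs reflection calls), so unlike the method above it is not intended to compile on JDK 1.1.8.

```java
import java.lang.reflect.Method;
import org.xml.sax.Locator;

class LocatorInfo {
    /**
     * Reflectively calls a public, no-argument getter on the locator and
     * returns its result if it is a String. Returns null when the locator is
     * null, the method does not exist, or it cannot be invoked.
     */
    static String callStringGetter(Locator locator, String methodName) {
        if (locator == null) {
            return null;
        }
        try {
            Method getter = locator.getClass().getMethod(methodName);
            Object value = getter.invoke(locator);
            return (value instanceof String) ? (String) value : null;
        } catch (Exception e) {
            // The locator has no such method, or it is not accessible here.
            return null;
        }
    }

    /** Encoding of the entity currently being parsed, or null if unknown. */
    static String getEncoding(Locator locator) {
        return callStringGetter(locator, "getEncoding");
    }

    /** XML version of the entity currently being parsed, or null if unknown. */
    static String getXMLVersion(Locator locator) {
        return callStringGetter(locator, "getXMLVersion");
    }
}
```

A handler would typically keep the Locator it receives in setDocumentLocator and pass it to `LocatorInfo.getEncoding` or `LocatorInfo.getXMLVersion` from a later callback such as startElement. A null result only means the information was not reachable through reflection on that particular locator object.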