org.crosswire.common.xml
Class XMLUtil

java.lang.Object
  extended by org.crosswire.common.xml.XMLUtil

public final class XMLUtil
extends Object

Utilities for working with SAX XML parsing.

Author:
Joe Walker, DM Smith
See Also:
The GNU Lesser General Public License for details.

Field Summary
private static PropertyMap badEntities
           
private static Set<String> goodEntities
           
private static Pattern invalidCharacterPattern
          Pattern that matches any character outside the allowable XML character set, which extends to 4-byte Unicode characters.
private static org.slf4j.Logger log
          The log stream
private static Pattern openHTMLTagPattern
          Pattern that matches open <br>, <hr> and <img> tags.
private static Pattern validCharacterEntityPattern
          Pattern for numeric entities.
 
Constructor Summary
private XMLUtil()
          Prevent instantiation
 
Method Summary
static String cleanAllCharacters(String broken)
          Remove all invalid characters in the input, replacing them with a space.
static String cleanAllEntities(String broken)
          For each entity in the input that is not allowed in XML, replace the entity with its unicode equivalent or remove it.
static String cleanAllTags(String broken)
          XML parse failed, so we can try getting rid of all the tags and having another go.
static String closeEmptyTags(String broken)
          Common HTML tags such as <br>, <hr> and <img> may be left open, causing XML parsing to fail.
static void debugSAXAttributes(Attributes attrs)
          Log the attributes of an element at debug level.
static String escape(String s)
          Normalizes the given string by escaping it for use in XML.
static String getAttributeName(Attributes attrs, int index)
          Get the full name of the attribute, including the namespace if any.
static org.jdom2.Document getDocument(String subject)
          Get and load an XML file from the classpath and a few other places into a JDOM Document object.
private static String handleEntity(String entity)
          Replace entity with its unicode equivalent, if it is not a valid XML entity.
static String recloseTags(String broken)
          Strip all closing tags from the end of the XML fragment, and then re-close all tags that are open at the end of the string.
static String writeToString(SAXEventProvider provider)
          Serialize a SAXEventProvider into an XML String
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

goodEntities

private static Set<String> goodEntities

badEntities

private static PropertyMap badEntities

validCharacterEntityPattern

private static Pattern validCharacterEntityPattern
Pattern for numeric entities.


invalidCharacterPattern

private static Pattern invalidCharacterPattern
Pattern that matches any character outside the allowable XML character set, which extends to 4-byte Unicode characters. Valid are: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]


openHTMLTagPattern

private static Pattern openHTMLTagPattern
Pattern that matches open <br>, <hr> and <img> tags.


log

private static final org.slf4j.Logger log
The log stream

Constructor Detail

XMLUtil

private XMLUtil()
Prevent instantiation

Method Detail

getDocument

public static org.jdom2.Document getDocument(String subject)
                                      throws org.jdom2.JDOMException,
                                             IOException
Get and load an XML file from the classpath and a few other places into a JDOM Document object.

Parameters:
subject - The name of the desired resource (without any extension)
Returns:
The requested resource
Throws:
IOException - if there is a problem reading the file
org.jdom2.JDOMException - If the resource is not valid XML

writeToString

public static String writeToString(SAXEventProvider provider)
                            throws SAXException
Serialize a SAXEventProvider into an XML String

Parameters:
provider - The source of SAX events
Returns:
a serialized string
Throws:
SAXException

getAttributeName

public static String getAttributeName(Attributes attrs,
                                      int index)
Get the full name of the attribute, including the namespace if any.

Parameters:
attrs - the collection of attributes
index - the index of the desired attribute
Returns:
the full name of the requested attribute
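
The description above does not say how the full name is assembled. The following is a minimal sketch, assuming that "full name" means the SAX qualified name (prefix plus local name) with a fallback to the local name when no qualified name is reported. The class and method names are hypothetical and are not part of XMLUtil.

```java
import org.xml.sax.Attributes;

final class AttributeNameSketch {
    private AttributeNameSketch() {
    }

    // Assumption: prefer the qualified name (for example "x:id");
    // fall back to the local name when the parser reports no qualified name.
    static String fullName(Attributes attrs, int index) {
        String qName = attrs.getQName(index);
        if (qName != null && !qName.isEmpty()) {
            return qName;
        }
        return attrs.getLocalName(index);
    }
}
```

Both getQName(int) and getLocalName(int) are standard org.xml.sax.Attributes methods; which one a parser populates depends on its namespace settings.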

debugSAXAttributes

public static void debugSAXAttributes(Attributes attrs)
Log the attributes of an element at debug level.

Parameters:
attrs - the attributes to show

escape

public static String escape(String s)
Normalizes the given string by escaping it for use in XML.

Parameters:
s - the string to escape
Returns:
the escaped string
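
The page does not list which characters are escaped. The following is a minimal sketch, assuming that the method replaces the markup characters <, >, & and " with the four entities named elsewhere on this page. The class name is hypothetical and the real method may treat additional characters.

```java
final class XmlEscapeSketch {
    private XmlEscapeSketch() {
    }

    // Assumption: only <, >, & and " are replaced; everything else is copied.
    static String escape(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '&':
                    out.append("&amp;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(ch);
            }
        }
        return out.toString();
    }
}
```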

cleanAllEntities

public static String cleanAllEntities(String broken)
For each entity in the input that is not allowed in XML, replace the entity with its unicode equivalent or remove it. For each instance of a bare &, replace it with &amp;.
XML only allows 4 entities: &amp;, &quot;, &lt; and &gt;.

Parameters:
broken - the string to handle entities
Returns:
the string with entities appropriately fixed up
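
The rules above can be sketched as a single regular-expression pass. This is an illustration under stated assumptions, not the class's actual code: the replacement table holds only two example entries (the real one is the private badEntities map, whose contents are not shown here), numeric entities are assumed to be left untouched, and all names are hypothetical. It requires Java 9 or later for Set.of and Map.of.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityCleanerSketch {
    // The four entities this page lists as allowed.
    private static final Set<String> GOOD = Set.of("amp", "quot", "lt", "gt");

    // Hypothetical sample of entity-to-unicode replacements.
    private static final Map<String, String> REPLACEMENTS =
            Map.of("nbsp", "\u00A0", "copy", "\u00A9");

    // Matches either a complete entity (&name; &#123; &#x1F;) or a bare &.
    private static final Pattern ENTITY_OR_AMP =
            Pattern.compile("&(?:(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)?");

    private EntityCleanerSketch() {
    }

    static String clean(String broken) {
        Matcher m = ENTITY_OR_AMP.matcher(broken);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement;
            if (name == null) {
                replacement = "&amp;";            // bare ampersand
            } else if (name.startsWith("#") || GOOD.contains(name)) {
                replacement = m.group();          // keep as is (assumption for numeric)
            } else {
                // Known entity becomes its unicode equivalent, unknown is removed.
                replacement = REPLACEMENTS.getOrDefault(name, "");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Matcher.quoteReplacement keeps any $ or \ in a replacement from being read as a group reference.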

cleanAllCharacters

public static String cleanAllCharacters(String broken)
Remove all invalid characters in the input, replacing them with a space. XML has stringent requirements as to which characters are or are not allowed. The set of allowable characters is:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note: a Java char handles values only up to \uFFFF.

Parameters:
broken - the string to be cleaned
Returns:
the cleaned string
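
The character set above translates directly into a negated character class. The following is a minimal sketch, assuming a plain java.util.regex replacement; the class name and pattern text are illustrative and may differ from the private invalidCharacterPattern field.

```java
import java.util.regex.Pattern;

final class XmlCharCleanerSketch {
    // Negated class over the valid ranges:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // Supplementary characters are written as code points with \x{...}.
    private static final Pattern INVALID = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    private XmlCharCleanerSketch() {
    }

    // Replaces every character outside the valid set with a single space.
    static String clean(String broken) {
        return INVALID.matcher(broken).replaceAll(" ");
    }
}
```

Because java.util.regex matches by code point, a well-formed surrogate pair counts as one supplementary character and is kept, while an unpaired surrogate falls outside every valid range and is replaced.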

recloseTags

public static String recloseTags(String broken)
Strip all closing tags from the end of the XML fragment, and then re-close all tags that are open at the end of the string.

Parameters:
broken - the string to be cleaned.
Returns:
cleaned string, or null if the string could not be cleaned due to more broken XML
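
One way to read the description above is: drop the run of closing tags at the end, track open tags on a stack, and append the missing closers. The sketch below follows that reading. The tag-matching patterns, the handling of self-closed tags, and the class name are assumptions for illustration; the real method may parse tags differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecloseTagsSketch {
    // One or more closing tags (with optional whitespace) at the very end.
    private static final Pattern TRAILING_CLOSERS =
            Pattern.compile("(?:\\s*</[A-Za-z][\\w:.-]*\\s*>)+\\s*$");

    // Group 1: "/" for a closing tag. Group 2: tag name. Group 3: "/" if self-closed.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][\\w:.-]*)[^>]*?(/?)>");

    private RecloseTagsSketch() {
    }

    static String reclose(String broken) {
        String body = TRAILING_CLOSERS.matcher(broken).replaceFirst("");
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(body);
        while (m.find()) {
            String name = m.group(2);
            if (!m.group(1).isEmpty()) {
                // Closing tag: it must match the innermost open tag.
                if (open.isEmpty() || !open.peek().equals(name)) {
                    return null; // more broken than this sketch can repair
                }
                open.pop();
            } else if (m.group(3).isEmpty()) {
                open.push(name); // opening tag that is not self-closed
            }
        }
        StringBuilder out = new StringBuilder(body);
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, reclose("<p><b>text</b></i>") strips the trailing "</b></i>" and then closes b and p in order, while a mismatched closer in the middle of the string yields null.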

closeEmptyTags

public static String closeEmptyTags(String broken)
Common HTML tags such as <br>, <hr> and <img> may be left open, causing XML parsing to fail. This method closes these tags.

Parameters:
broken - the string to be cleaned
Returns:
the cleaned string
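
A single regular expression is enough to illustrate the idea. This is a minimal sketch, assuming that "closing" means rewriting an open tag in self-closed form; the pattern and class name are hypothetical and may differ from the private openHTMLTagPattern field.

```java
import java.util.regex.Pattern;

final class CloseEmptyTagsSketch {
    // Matches <br>, <hr> or <img ...> that does not already end in "/>".
    // The lookbehind (?<!/) rejects tags whose last character before > is a slash.
    private static final Pattern OPEN_VOID_TAG =
            Pattern.compile("<(br|hr|img)\\b([^>]*?)(?<!/)>", Pattern.CASE_INSENSITIVE);

    private CloseEmptyTagsSketch() {
    }

    // Rewrites each matched tag as <name attributes/>, keeping attributes intact.
    static String close(String broken) {
        return OPEN_VOID_TAG.matcher(broken).replaceAll("<$1$2/>");
    }
}
```

Tags that are already self-closed, such as <br/>, are left unchanged, and slashes inside attribute values do not prevent a match.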

cleanAllTags

public static String cleanAllTags(String broken)
XML parse failed, so we can try getting rid of all the tags and having another go. We define a tag to start at a < and end at the end of the next word (where a word is what comes in between spaces) that does not contain an = sign, or at a >, whichever is earlier.

Parameters:
broken - the string to be cleaned
Returns:
the string without any tags
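
The tag definition above can be sketched as a character scan. This is one reading of the rule, not the class's actual code: the tag name itself is assumed not to count as a "word", so the scan ends a tag at a >, at the end of the first later word that contains no = sign, or at the end of the input. The class and method names are hypothetical.

```java
final class TagStripperSketch {
    private TagStripperSketch() {
    }

    static String stripTags(String broken) {
        StringBuilder out = new StringBuilder(broken.length());
        int i = 0;
        while (i < broken.length()) {
            char c = broken.charAt(i);
            if (c == '<') {
                i = endOfTag(broken, i); // skip the whole tag
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // Returns the index of the first character after the tag that starts at lt.
    private static int endOfTag(String s, int lt) {
        int wordStart = -1; // -1 until the tag name has been passed
        for (int i = lt + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '>') {
                return i + 1; // well-formed end of tag
            }
            if (c == ' ') {
                if (wordStart >= 0 && s.substring(wordStart, i).indexOf('=') < 0) {
                    return i; // word without '=' ends the tag; the space is kept
                }
                wordStart = i + 1;
            }
        }
        return s.length(); // unterminated tag runs to the end of the input
    }
}
```

With this reading, "Hello <b>world</b>!" becomes "Hello world!", and the unterminated "<font color=red oops" is removed up to the end of "oops".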

handleEntity

private static String handleEntity(String entity)
Replace entity with its unicode equivalent, if it is not a valid XML entity. Otherwise strip it out. XML only allows 4 entities: &amp;, &quot;, &lt; and &gt;.

Parameters:
entity - the entity to be replaced
Returns:
the substitution for the entity, either itself, the unicode equivalent or an empty string.

Copyright © 2003-2015