TagSoup - Parse even the worst HTML as XML : Kasun's Tech Thoughts

If you have ever needed an XML parser to pull some content out of an HTML file, you know how irritating it can be. Ideally, an HTML file would also be well-formed XML, but in practice it is often far from it. HTML files contain mismatched tags, missing end tags, and so on. So if you parse such a file with a popular Java XML parser like Xerces, it throws an exception and stops parsing. Here I have used the SAX interface through Java JAXP; the outcome is the same for DOM.

[java] SaxParseException: The indexing file contains incorrect xml syntax.
[java] org.xml.sax.SAXParseException: The element type "div" must be terminated by the matching end-tag "</div>".
[java]     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
[java]     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
[java]     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
---
[java]     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
[java]     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
---
[java]     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[java]     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
[java]     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
[java]     at com.nexwave.nquindexer.SaxDocFileParser.parseDocument(SaxDocFileParser.java:96)

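Why does this happen? Because the HTML file I provided contained the following markup, which is not well-formed XML: the <B> tag is closed as </b> (XML is case-sensitive) and it is closed inside the <div> that was opened after it.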

This is <B>bold, <div>a div with bold text </b> plain div </div>normal text
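
To see the failure in isolation, here is a minimal sketch that feeds the snippet above to whatever SAX parser JAXP picks by default. The class name StrictParseDemo and the wrapping <html> root element are my additions for illustration (without a root element the parser would complain about that first); the exact error text depends on the parser in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the indexer mentioned in this post.
class StrictParseDemo {

    // The malformed snippet, wrapped in a root element (an assumption made
    // here) so that the mismatched tags are the only problem left.
    static final String MALFORMED =
        "<html>This is <B>bold, <div>a div with bold text </b> plain div </div>normal text</html>";

    /** Returns null if the input is well-formed, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // A strict XML parser gives up at the first well-formedness error.
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException("Unexpected failure while parsing", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Malformed input:   " + parseError(MALFORMED));
        System.out.println("Well-formed input: " + parseError("<html><b>bold</b></html>"));
    }
}
```

The parser reports an error for the malformed string and returns normally for the well-formed one, which is exactly the behaviour that makes strict XML parsers awkward for real-world HTML.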

Unfortunately, HTML files on the web contain a lot of syntax issues like this. TagSoup to the rescue.

TagSoup

TagSoup is a SAX-compliant parser written in Java (a C++ port is now available) that parses malformed HTML like the above according to a specially designed schema for HTML. TagSoup is also free and open source software, released under the Apache License 2.0. (aah, I saw that little smile on your face ;-) ) Read more about it at the TagSoup home page.

How to use it?

TagSoup can be invoked directly from the command line. See http://home.ccil.org/~cowan/tagsoup/#program for more info. This blog post is primarily intended to show you how to use it from Java.

TagSoup provides a SAX interface, so if you are already familiar with SAX (Simple API for XML) and JAXP (Java API for XML Processing), you are almost there. If not, read these to-the-point articles from IBM developerWorks: JAXP, SAX, Vendor independence in SAX.

If your project already has SAX code designed to work with other parsers like Xerces, you can switch to TagSoup just by changing two system properties, "org.xml.sax.driver" and "javax.xml.parsers.SAXParserFactory".

java -Dorg.xml.sax.driver=org.ccil.cowan.tagsoup.Parser -Djavax.xml.parsers.SAXParserFactory=org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl -cp myjar-1.0.jar:tagsoup-1.2.1.jar org.Main

//Runs the main class org.Main from myjar-1.0.jar, using TagSoup as the SAX parser.
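
The system properties only help if your code never names a parser class itself. Here is a minimal sketch of such parser-neutral SAX code; the class TextExtractor and its method names are hypothetical, made up for this example and not taken from my indexer. It asks JAXP for a parser and collects the character data it reports.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the names here are illustrative only.
class TextExtractor {

    /** Collects all character data reported by whichever SAX parser JAXP selects. */
    static String extractText(String markup) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            // No parser class is named here: JAXP checks the
            // javax.xml.parsers.SAXParserFactory system property first and
            // uses the JDK's built-in parser if nothing else is configured.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler); // DefaultHandler rethrows fatal errors
            reader.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }

    /** Name of the factory implementation that JAXP picked for this JVM. */
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("Factory: " + factoryClassName());
        System.out.println(extractText("<p>This is <b>bold</b> text</p>"));
    }
}
```

Run as is, this uses the JDK's default parser. Run with the -D options shown above and the TagSoup jar on the classpath, the same code should pick up TagSoup instead, with no changes to the source.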

I use this in my Apache Ant script for the DocBook WebHelp Search Indexer. For that, I've added the following content to my build.xml. Have a look at it here from line 91.

<java classname="org.Main" fork="true">
    <sysproperty key="org.xml.sax.driver" value="org.ccil.cowan.tagsoup.Parser"/>
    <sysproperty key="javax.xml.parsers.SAXParserFactory" value="org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl"/>
    <classpath>
        <path refid="classpath"/>
        <pathelement location="myjar-1.0.jar"/>
    </classpath>
</java>

As you can see, we can achieve vendor independence easily here. If you want to use Xerces as your SAX parser instead, just replace "org.ccil.cowan.tagsoup.Parser" with "org.apache.xerces.parsers.SAXParser", and "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl" with "org.apache.xerces.jaxp.SAXParserFactoryImpl".

Sunday, October 9, 2011
