Thursday, October 20, 2011

Gentoo Maven Integration - Progress Report

Here's my progress report for the first phase of the Gentoo Maven 
Integration project. It marks the end of my work under Google Summer 
of Code and the start of my move into a voluntary position. 

The goal of this project was to build Maven and its huge number of 
dependencies from source, and then to make it easier for packagers to 
package Maven-based Java packages. There are two eclasses, which take 
care of bootstrapping Maven, building Maven-based packages, and 
packaging Maven plugins. These eclasses address some fundamental 
incompatibilities between the Gentoo build system and the Maven build 
system. 

There were two main goals for the project. The first was building 
Maven from source. It is now complete and has been thoroughly tested. 
More than 40 ebuilds that are direct dependencies of Maven were 
packaged or bumped during the project period. General users of Maven 
can get the full benefit of this package now. Please file bugs at 
https://bugs.gentoo.org/ if you find any. 

The second phase was a lengthy process, and its scope didn't fit into 
one and a half months. But with my mentors' blessing, I've made quite 
a lot of progress and was able to emerge a minimal package built via 
native Maven. 

Let me describe the second phase at a high level. The idea was to make 
it easier for packagers to package Maven-based packages. This has been 
a long-time blocker for Gentoo-Java (going back more than three or 
four years). This phase had several requirements, including handling 
dependency management and rewriting pom.xml to match Gentoo's needs. 
One requirement was a mechanism to use the installed system jars 
instead of downloading jars from Maven repositories. Another was that 
the pom needs to respect the Gentoo SLOT system. Further, 
configuration details had to be added to specify the JDK and JRE 
versions needed for building (i.e. config bits in the 
maven-compiler-plugin section of the pom). On top of that, building 
packages requires several Maven plugins. There is an enormous number 
of plugins available, and most of them need separate, special 
attention. 
For the second phase, the hard part is over. As I said, I was able to 
emerge a minimal Maven-based implementation. Maven isn't very 
cooperative when it comes to dependency management, but our solution 
worked well. 
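
To make the "use the installed system jars" requirement a bit more concrete, here is a minimal sketch in Java. It is not how the eclasses or javatoolkit actually do it; it only illustrates one possible approach, using Maven's standard system scope and <systemPath> element. The class name PomSystemJarSketch, the helper names, the junit coordinates, and the jar path are all made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PomSystemJarSketch {

    /** Returns the text of the first direct child element with the given name, or null. */
    static String childText(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return null;
    }

    /** Sets the text of a direct child element, creating the child if it is missing. */
    static void setChild(Document doc, Element parent, String name, String value) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                n.setTextContent(value);
                return;
            }
        }
        Element child = doc.createElement(name);
        child.setTextContent(value);
        parent.appendChild(child);
    }

    /** Points every <dependency> with the given artifactId at a jar on the local file system. */
    static String useSystemJar(String pomXml, String artifactId, String jarPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(pomXml)));
        NodeList deps = doc.getElementsByTagName("dependency");
        for (int i = 0; i < deps.getLength(); i++) {
            Element dep = (Element) deps.item(i);
            if (artifactId.equals(childText(dep, "artifactId"))) {
                setChild(doc, dep, "scope", "system");
                setChild(doc, dep, "systemPath", jarPath);
            }
        }
        // Serialize the modified DOM back to a string.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String pom = "<project><dependencies><dependency>"
                + "<groupId>junit</groupId><artifactId>junit</artifactId><version>4.8.2</version>"
                + "</dependency></dependencies></project>";
        // Hypothetical install location of a slotted system jar.
        System.out.println(useSystemJar(pom, "junit", "/usr/share/junit-4/lib/junit.jar"));
    }
}
```

A real rewrite would also have to map each dependency to the right SLOT and handle the <parent> and plugin sections, which this sketch leaves out.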

With that, the first iteration of work is complete. I'm hoping 
to be the maintainer for Maven under the Gentoo Java herd for the 
foreseeable future. And I'm eagerly waiting to wear the Gentoo 
Developer hat one day. I'm also interested in knowing the general plan 
for recruiting developers who come via Summer of Code. 

There are a few things left to do to bring the Maven integration to 
its full potential. These are more like plug-ins to the core base, and 
polishing of the process. I need to make new plans for these with help 
from the Java herd. 

 * Only five Maven plugins have been packaged so far. 
maven-surefire-plugin needs a fresh look, and all the other plugins 
still need to be added. 

 * Currently, when MAVEN_PARENTPOM_UNIQUE_ID is set to rewrite the 
<parent> node of the pom, it rewrites all the poms in the project, 
including sub-modules. The most probable use case is rewriting only 
the parent element of the top-level/aggregator pom. The configuration 
bits needed are already there (-w option), but the implementation 
still needs to be done. 

 * Merge the separate ebuilds of the maven-2.2.1 release into one. 
There are more than 20 ebuilds dedicated to this. These ebuilds 
probably won't be needed separately, so it's OK to merge them. The 
possible issues of having them all together still need to be 
evaluated. 

Here are some references if you are interested in digging deeper into 
Maven in Gentoo. Feel free to contact me if you'd like to lend a 
helping hand to the project. I'm at kasunbg +spamfree at gmail.com
 
 * The wiki 1 - Developer and User guide for Maven in Gentoo - 
https://overlays.gentoo.org/proj/java/wiki/Maven_Integration
 * The wiki 2 - Manpage for java-maven-2 eclass - 
http://overlays.gentoo.org/proj/java/wiki/Maven_Eclass_Manpage
 * Repository 1 - gsoc-maven-overlay - 
https://overlays.gentoo.org/svn/proj/java/gsoc-maven-overlay/
 * Repository 2 - Branch for Javatoolkit - 
http://overlays.gentoo.org/proj/java/browser/projects/javatoolkit/branches/kasun/
       * TracBrowser view - 
http://overlays.gentoo.org/proj/java/browser/gsoc-maven-overlay/

Sunday, October 9, 2011

Siddhi CEP has won a bronze award at NBQSA 2011

Siddhi Complex Event Processing Engine has won the bronze award in the Tertiary category at the recently concluded National Best Quality Software Awards (NBQSA) 2011, held at Hotel Galadari, Colombo, Sri Lanka. The winners are now published on the official NBQSA website. Further, Siddhi has been nominated for APICTA (Asia Pacific ICT Awards) 2011, which will be held November 8 – 11, 2011 in Thailand.


Siddhi was our final year project at the Dept. of Computer Science & Engineering, Faculty of Engineering, University of Moratuwa. It was a four-member project, with the other members being Suho, Isuru, and Subash. We were overwhelmed with joy on the day we heard that we had secured a place at NBQSA. They weren't specific about the exact award we would get, so we waited nervously, full of excitement, for the moment the award was announced. It's really a pleasure to know that a year of hard work has paid off.

Along with us, two other projects from our department won a merit award and a special recognition award. Further, two projects from the Dept. of Electronic & Telecommunication of the University of Moratuwa were also able to secure a merit award and a special recognition award. In total, that's five awards.

Special thanks should go to our supervisors, Dr. Srinath Perera and Ms. Vishaka Nanayakkara, who have given us enormous support. Our project coordinator, Dr. Shantha Fernando, and Dr. Sanjiva Weerawarana, who provided guidance whenever we needed it, should also be remembered.

The Siddhi Team (from left-to-right Suho, me, and Subash; Isuru missed it!)

Award winners at NBQSA 2011


TagSoup - Parse even the worst HTML as XML

If you have ever needed to use an XML parser to retrieve some content from an HTML file, you know how irritating it is. Ideally, HTML would be well-formed XML, but quite often it is far from it. HTML files contain mismatched tags, missing end-tags, etc. So, if you parse these HTML files with a popular Java XML parser like Xerces, it will throw an exception and stop parsing. Here, I have used the SAX interface with Java JAXP; the outcome is the same for DOM.

[java] SaxParseException: The indexing file contains incorrect xml syntax.
[java] org.xml.sax.SAXParseException: The element type "div" must be terminated by the matching end-tag "</div>".
[java]     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
[java]     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
[java]     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
---
[java]     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
[java]     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
---
[java]     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[java]     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
[java]     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
[java]     at com.nexwave.nquindexer.SaxDocFileParser.parseDocument(SaxDocFileParser.java:96)

Why does this happen? Because the provided HTML file contained the following malformed XML.

This is <B>bold, <div>a div with bold text </b> plain div </div>normal text
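
The failure is easy to reproduce outside of any project. The sketch below feeds the same snippet to whatever SAX parser JAXP selects (the JDK's built-in one by default) and reports the parse error. The class name StrictParseSketch and the helper parseError are made up for illustration, and the snippet is wrapped in a <body> element only so that the parser gets a single root element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictParseSketch {

    /** Returns null if the markup is well-formed XML, otherwise the parser's error message. */
    static String parseError(String markup) throws Exception {
        try {
            // DefaultHandler rethrows fatal errors, so malformed input ends up in the catch block.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String malformed = "<body>This is <B>bold, <div>a div with bold text </b>"
                + " plain div </div>normal text</body>";
        System.out.println("Malformed:   " + parseError(malformed));
        System.out.println("Well-formed: " + parseError("<body>This is <b>bold</b> text</body>"));
    }
}
```

With a strict XML parser, the first call returns an error message and the second returns null.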

Unfortunately, HTML files on the web contain a lot of syntax issues like this. TagSoup to the rescue. 

TagSoup

TagSoup is a SAX-compliant parser written in Java (a C++ port is now available) that parses malformed HTML like the above according to a schema specially designed for HTML. TagSoup is Free and Open Source software released under the Apache License 2.0. (aah, I saw that little smile on your face ;-) ) Read more about it at the TagSoup home page.

How to use it?

TagSoup can be invoked directly from the command line. See http://home.ccil.org/~cowan/tagsoup/#program for more info. This blog post is primarily intended to show you how to use it in Java. 
TagSoup provides a SAX interface, so if you are already familiar with SAX (Simple API for XML) and JAXP (Java API for XML Processing), you are almost there. If not, read these to-the-point articles from IBM developerWorks: JAXP, SAX, Vendor independence in SAX.

If your project already has a SAX implementation designed to work with other parsers like Xerces, then you can use TagSoup just by changing two system properties, "org.xml.sax.driver" and "javax.xml.parsers.SAXParserFactory".
 
 java -Dorg.xml.sax.driver=org.ccil.cowan.tagsoup.Parser -Djavax.xml.parsers.SAXParserFactory=org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl -cp myjar-1.0.jar:tagsoup-1.2.1.jar org.Main
// Runs the main class org.Main from myjar-1.0.jar using TagSoup as the SAX parser.

I use this in my Apache Ant script for the DocBook WebHelp Search Indexer. For that, I've added the following content to my build.xml. Have a look at it here, from line 91.

<java classname="org.Main" fork="true">
    <sysproperty key="org.xml.sax.driver" value="org.ccil.cowan.tagsoup.Parser"/>
    <sysproperty key="javax.xml.parsers.SAXParserFactory" value="org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl"/>
    <classpath>
        <path refid="classpath"/>
        <pathelement location="myjar-1.0.jar"/>
        <pathelement location="tagsoup-1.2.1.jar"/>
    </classpath>
</java>

As you can see, we can achieve vendor independence easily here. So, in case you want to use Xerces as your SAX parser, just replace "org.ccil.cowan.tagsoup.Parser" with "org.apache.xerces.parsers.SAXParser", and "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl" with "org.apache.xerces.jaxp.SAXParserFactoryImpl".
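
For completeness, here is a minimal sketch of what the parser-agnostic application code might look like. It is not the actual code of org.Main or the search indexer; the class name TextExtractorSketch and the helper extractText are made up for illustration. The code only talks to JAXP (SAXParserFactory.newInstance()), so the parser it gets is decided by the "javax.xml.parsers.SAXParserFactory" system property; code that uses XMLReaderFactory instead would be steered by "org.xml.sax.driver".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextExtractorSketch {

    /** Collects all character data reported by whichever SAX parser JAXP selects. */
    static String extractText(String markup) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // No vendor class is named here; the factory lookup picks the implementation.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(markup)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // Shows which implementation the system properties and classpath selected.
        System.out.println("Parser in use: " + parser.getXMLReader().getClass().getName());
        System.out.println(extractText("<p>This is <b>bold</b> text</p>"));
    }
}
```

Run as is, this uses the JDK's default parser. Run with the two -D options shown earlier and tagsoup-1.2.1.jar on the classpath, the same class should report a TagSoup parser instead, without any change to the code.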