Here's my progress report for the first phase of the Gentoo Maven
Integration project. It marks the end of my work under Google Summer of
Code and the start of my move into a voluntary position.
The goal of this project was to build Maven and its huge number of
dependencies from source, and then to make it easier for packagers to
package Maven-based Java packages. There are two eclasses which
facilitate bootstrapping Maven, building Maven-based packages, and
packaging Maven plugins. These eclasses address some fundamental
incompatibilities between the Gentoo build system and the Maven build
system.
There were two main goals for the project. The first was building
Maven from source. It is now complete and has been thoroughly tested.
More than 40 ebuilds that are direct dependencies of Maven were
packaged or bumped during the project period. General users of Maven
can get the full benefit of this package now. Please file bugs at
https://bugs.gentoo.org/ if you find any.
The second phase was a lengthy process, and its scope did not fit into
one and a half months. But with my mentors' blessing, I made quite a
lot of progress and was able to emerge a minimal package built via
native Maven.
Let me describe the second phase at a high level. The idea was to make
it easier for packagers to package Maven-based packages. This has been
a long-standing blocker for Gentoo Java (going back more than 3-4
years). This phase had several requirements, including dependency
management and rewriting pom.xml to match Gentoo's needs. One
requirement was a mechanism to use the installed system jars instead
of downloading jars from Maven repositories. Another is that the pom
needs to respect the Gentoo SLOT system. Further, configuration details
had to be added to specify the JDK and JRE versions needed for building
(i.e. config bits added to the maven-compiler-plugin section of the
pom). Finally, building packages requires several Maven plugins. There
is an enormous number of plugins available, and most of them need
special attention individually.
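To make the maven-compiler-plugin requirement concrete, here is a rough sketch of that kind of pom rewrite. It is written in Java with the JDK's DOM API purely for illustration and is not how the eclass or javatoolkit does it. The class name PomCompilerLevel, its helper methods, and the level "1.6" are made up for this example, and it assumes a pom without XML namespaces.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: illustrates the idea only, not the project's real tooling.
class PomCompilerLevel {

    // First direct child element called `name`, or null if there is none.
    static Element find(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    // Like find(), but appends an empty element when it is missing.
    static Element findOrCreate(Element parent, String name) {
        Element e = find(parent, name);
        if (e == null) {
            e = parent.getOwnerDocument().createElement(name);
            parent.appendChild(e);
        }
        return e;
    }

    // Sets <source> and <target> of maven-compiler-plugin to `level`,
    // creating the <build>/<plugins>/<plugin> chain when it is absent.
    public static String setCompilerLevel(String pomXml, String level) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(pomXml)));
        Element plugins = findOrCreate(findOrCreate(doc.getDocumentElement(), "build"), "plugins");

        Element compiler = null;
        for (Node n = plugins.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "plugin".equals(n.getNodeName())) {
                Element id = find((Element) n, "artifactId");
                if (id != null && "maven-compiler-plugin".equals(id.getTextContent().trim())) {
                    compiler = (Element) n;
                }
            }
        }
        if (compiler == null) {
            compiler = doc.createElement("plugin");
            plugins.appendChild(compiler);
            findOrCreate(compiler, "artifactId").setTextContent("maven-compiler-plugin");
        }
        Element config = findOrCreate(compiler, "configuration");
        findOrCreate(config, "source").setTextContent(level);
        findOrCreate(config, "target").setTextContent(level);

        // Serialize the modified document back to a string.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String pom = "<project><artifactId>demo</artifactId></project>";
        System.out.println(setCompilerLevel(pom, "1.6"));
    }
}
```

The same find-or-create pattern would apply to the other rewrites mentioned above, such as pointing dependencies at installed system jars, but those details are specific to the eclass and are not reproduced here.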
For the second phase, the hard part is over. As I said, I was able to
emerge a minimal Maven-based implementation. Maven isn't very
cooperative when it comes to dependency management, but our solution
worked well.
With that, the first iteration of work is complete. I'm hoping to be
the maintainer for Maven under the Gentoo Java herd for the foreseeable
future. And I'm eagerly waiting to wear the Gentoo Developer hat one
day. I'm also interested in knowing the general plan for recruiting
developers who come via Summer of Code.
There are a few things to be done to bring the Maven integration to its
full potential. These are more like plug-ins to the core base, and
polishing of the process. I need to make new plans for these with help
from the Java herd.
* Only five Maven plugins have been packaged so far.
maven-surefire-plugin needs a fresh look, and all the other plugins
need to be added.
* Currently, when MAVEN_PARENTPOM_UNIQUE_ID is set to rewrite the
<parent> node of the pom, it rewrites all the poms in the project,
including sub-modules. The most likely use case is rewriting only the
parent element of the top-level/aggregator pom. The configuration bits
needed are already there (-w option), but the implementation still
needs to be done.
* Merge the separate ebuilds of the maven-2.2.1 release into one. There
are more than 20 ebuilds dedicated to this. These ebuilds probably
won't be needed separately, so it's OK to merge them. The possible
issues of having them all together still need to be evaluated.
Here are some references if you are interested in digging deeper into
Maven in Gentoo. Feel free to contact me if you would like to lend a
hand with the project. I'm at kasunbg +spamfree at gmail.com
* The wiki 1 - Developer and User guide for Maven in Gentoo -
https://overlays.gentoo.org/proj/java/wiki/Maven_Integration
* The wiki 2 - Manpage for java-maven-2 eclass -
http://overlays.gentoo.org/proj/java/wiki/Maven_Eclass_Manpage
* Repository 1 - gsoc-maven-overlay -
https://overlays.gentoo.org/svn/proj/java/gsoc-maven-overlay/
* Repository 2 - Branch for Javatoolkit -
http://overlays.gentoo.org/proj/java/browser/projects/javatoolkit/branches/kasun/
* TracBrowser view -
http://overlays.gentoo.org/proj/java/browser/gsoc-maven-overlay/
Thursday, October 20, 2011
Sunday, October 9, 2011
Siddhi CEP has won a bronze award at NBQSA 2011
Siddhi Complex Event Processing Engine has won the bronze award in the Tertiary category at the recently concluded National Best Quality Software Awards 2011, held at Hotel Galadari, Colombo, Sri Lanka. The winners are now published on the NBQSA official website. Further, Siddhi has been nominated for APICTA (Asia Pacific ICT Awards) 2011, which will be held during November 8 – 11, 2011 in Thailand.
Siddhi was our final year project at the Dept. of Computer Science & Engineering, Faculty of Engineering, University of Moratuwa. It was a four-member project, with the other members being Suho, Isuru, and Subash. We were overwhelmed with joy on the day we heard that we had secured a place at NBQSA. They weren't specific about the exact award we had won, so we waited nervously and full of excitement for the moment the award was announced. It's really a pleasure to know that a year of hard work has paid off.
Along with us, two other projects from our department won a merit award and a special recognition award. Further, two projects from the Dept. of Electronic & Telecommunication of University of Moratuwa were also able to secure a merit award and a special recognition award. In total, that's five awards.
Special thanks should go to our supervisors, Dr. Srinath Perera and Ms. Vishaka Nanayakkara, who have given us enormous support. Our project coordinator, Dr. Shantha Fernando, and Dr. Sanjiva Weerawarana, who provided guidance whenever we needed it, should also be remembered.
The Siddhi Team (from left-to-right Suho, me, and Subash; Isuru missed it!)
Award winners at NBQSA 2011
Keywords:
cep,
nbqsa awards,
siddhi
TagSoup - Parse even the worst HTML as XML
If you have ever needed an XML parser to retrieve some content from an HTML file, you know how irritating it is. Ideally, HTML would be well-formed XML, but in practice it is often far from it. HTML files contain mismatched tags, missing end-tags, etc. So, if you parse these HTML files with a popular Java XML parser like Xerces, it will throw an exception and stop parsing. Here, I have used the SAX interface with Java JAXP; the outcome is the same for DOM.
[java] SaxParseException: The indexing file contains incorrect xml syntax.
[java] org.xml.sax.SAXParseException: The element type "div" must be terminated by the matching end-tag "</div>".
[java] at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
[java] at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
[java] at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
---
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
[java] at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
---
[java] at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
[java] at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
[java] at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
[java] at com.nexwave.nquindexer.SaxDocFileParser.parseDocument(SaxDocFileParser.java:96)
Why does this happen? Because the provided HTML file contained the following malformed XML.
This is <B>bold, <div>a div with bold text </b> plain div </div>normal text
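To see the failure in isolation, here is a small sketch that feeds the fragment above to the JDK's default JAXP SAX parser. The wrapping html and body elements and the class name StrictParseDemo are additions for illustration (an XML document needs a single root element), and the exact error text depends on the parser in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; uses only the JDK's built-in parser.
class StrictParseDemo {

    // Returns null when the markup is well-formed XML,
    // otherwise the message of the parser's fatal error.
    public static String firstFatalError(String markup) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        try {
            // DefaultHandler ignores all content events and rethrows fatal errors.
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        String bad = "<html><body>This is <B>bold, <div>a div with bold text </b>"
                + " plain div </div>normal text</body></html>";
        System.out.println("Parser said: " + firstFatalError(bad));
    }
}
```

A strict XML parser gives up at the first mismatched end-tag, so nothing after that point in the file is ever seen by the handler.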
Unfortunately, HTML files on the web contain a lot of syntax issues like this. TagSoup to the rescue.
TagSoup
TagSoup is a SAX-compliant parser written in Java (a C++ port is now available) that parses malformed HTML like the above according to a schema specially designed for HTML. TagSoup is Free and Open Source software released under the Apache License 2.0. (aah, I saw that little smile on your face ;-) ) Read more about it at the TagSoup home page.
How to use it?
TagSoup can be invoked directly from the command line. See http://home.ccil.org/~cowan/tagsoup/#program for more info. This blog post is primarily intended to show you how to use it in Java.
TagSoup provides a SAX interface, so if you are already familiar with SAX (Simple API for XML) and JAXP (Java API for XML Processing), then you are almost there. If not, read these to-the-point articles from IBM developerWorks: JAXP, SAX, Vendor independence in SAX.
If your project already has a SAX implementation designed to work with other parsers like Xerces, then you can use TagSoup just by changing two system properties, "org.xml.sax.driver" and "javax.xml.parsers.SAXParserFactory".
java -Dorg.xml.sax.driver=org.ccil.cowan.tagsoup.Parser -Djavax.xml.parsers.SAXParserFactory=org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl -cp myjar-1.0.jar:tagsoup-1.2.1.jar org.Main
// Runs the main class org.Main from myjar-1.0.jar, using TagSoup as the SAX parser.
I use this in my Apache Ant script for the DocBook WebHelp Search Indexer. For that, I've added the following content to my build.xml. Have a look at it here from line 91.
<java classname="org.Main" fork="true">
    <sysproperty key="org.xml.sax.driver" value="org.ccil.cowan.tagsoup.Parser"/>
    <sysproperty key="javax.xml.parsers.SAXParserFactory" value="org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl"/>
    <classpath>
        <path refid="classpath"/>
        <pathelement location="myjar-1.0.jar"/>
        <pathelement location="tagsoup-1.2.1.jar"/>
    </classpath>
</java>
As you can see, we can achieve vendor independence easily here. So, in case you want to use Xerces as your SAX parser, just replace "org.ccil.cowan.tagsoup.Parser" with "org.apache.xerces.parsers.SAXParser", and "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl" with "org.apache.xerces.jaxp.SAXParserFactoryImpl".
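Because JAXP looks up the factory by name at run time, application code never has to mention a vendor class. The sketch below (the class name WhichParser is made up for illustration) prints which XMLReader implementation the factory actually hands out. Running it with and without the two -D properties shown earlier is one way to check that the switch took effect. No expected output is shown, since it depends on the JDK and the classpath.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical helper class; contains no vendor-specific imports.
class WhichParser {

    // Name of the XMLReader class that JAXP selected for this JVM.
    public static String readerClassName() throws Exception {
        // newInstance() consults the javax.xml.parsers.SAXParserFactory
        // system property before falling back to the JDK default.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        XMLReader reader = factory.newSAXParser().getXMLReader();
        return reader.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("javax.xml.parsers.SAXParserFactory = "
                + System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println("XMLReader in use: " + readerClassName());
    }
}
```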
Keywords:
java,
sax,
tagsoup,
xml,
xml-parsing