Wednesday, April 28, 2010

Proposal "Web Help Output for DocBook" for GSoC 2010

UPDATE: The DocBook WebHelp XSL customization is now integrated into the DocBook XSL release, starting from version 1.76.0. The release is available for download; release notes are at DocBook WebHelp Project. (22/10/2010)

UPDATE: Google Summer of Code 2010 program finished on 20th August. See DocBook WebHelp Project for the end notes, features and to view the demo of the beta release. 

The modified schedule can be found in WebHelpGsoc2010. Though the schedule is no longer needed, it may give an idea of how the development process went, which might help a new developer coming in to WebHelp.
- 23/08/2010

Google Summer of Code 2010 - Project Proposal

Project: WebHelp Output for DocBook
Student Name: Kasun Gajasinghe
IM: kasun (irc://
Time zone:
Mentors: David Cramer, Jirka Kosek

DocBook is a set of standards and tools for technical documentation. A vital requirement for technical publications is a Web-based help format that stays synchronized with the content, so that the documentation is always up to date and the site is easier to maintain. The output will include client-side searching with support for stemming, a table of contents, an index, and an HTML export ability. The main idea is to generate WebHelp output from the DocBook content XML files using an Ant build.

About me
Participating in GSoC
I am passionate about the open source world and love contributing to free software. I hope that Google Summer of Code will be a great opportunity for me to become part of another open source community, contribute to the development of the project, make new friends, and develop new skills. I believe that GSoC will be an excellent starting point for this.
Why DocBook

DocBook is a leading format for documentation and is especially popular with open source projects. I am therefore particularly interested in DocBook and hope to become a permanent member of the DocBook project.
I plan to devote 35-40 hours per week to this project.

With suggestions from my co-mentors, I have researched ways of implementing client-side searching and came up with the following options.

Use Lucene QueryParser
Use the Java Indexer of the htmlsearch demo plugin as a base and add needed features

Because Lucene runs on the server side, it would have to be compiled into JavaScript to work on the client side. For that, Jirka suggested using GWT. Unfortunately, Lucene has not been ported to GWT yet. I looked into it and found that Luke, the Lucene Index Toolbox, has been ported to GWT. I then went to the Lucene IRC channel (#lucene @ freenode) to get further details. There I found out that it is not possible to use Lucene purely on the client side. They said that a JavaScript QueryParser cannot be done, and that Luke uses a Java back end on the server side for searching. So I had to give up this option.
The Java indexer is a good starting point. It does basic indexing and stores the result in .js files as keys (words) with their relevant file names. It then does basic searching based on the given keywords. This could be used as the base, with the code improved and new features added. I have downloaded the source code and studied it. The proposed enhancements are listed under the Proposal section.
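To make the idea of a word-to-file index stored in a JavaScript file concrete, here is a minimal sketch. It is written in Python purely for illustration (the indexer discussed above is a Java application), and the names build_index, to_javascript, searchIndex and the sample file names are hypothetical. It does not reproduce the actual file format used by the htmlsearch plugin.

```python
import json
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the sorted list of files containing it.

    `pages` maps a file name to the plain text of that page.
    """
    index = defaultdict(set)
    for filename, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(filename)
    return {word: sorted(files) for word, files in sorted(index.items())}


def to_javascript(index, variable="searchIndex"):
    """Serialize the index as the body of a .js file.

    A JSON object of strings and arrays is also a valid JavaScript literal,
    so the browser can load the result with a plain <script> tag.
    """
    return "var %s = %s;\n" % (variable, json.dumps(index, indent=1))


# Hypothetical sample input: two generated pages and their text.
pages = {
    "ch01.html": "Installing the DocBook stylesheets",
    "ch02.html": "Customizing the stylesheets with XSL",
}
index = build_index(pages)
js_source = to_javascript(index)
```

Because the index is computed once at build time, the browser only has to load the generated file and look words up; no server-side component is involved.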
For table of contents (TOC) tree generation, I considered:

Frameset approach with the tree included in a separate file.
Generate the complete TOC in every generated file and make it appear as a pane

I chose the second method. With it, "deep linking" happens automatically, and the output remains functional to a good extent in an environment without JavaScript. It is also the method my mentor recommended. Further research will follow in the coming days.

DocBook is a set of standards and tools for technical documentation. It was initially, and is primarily, intended for technical documentation, but has been extended for use in other domains. The current DocBook schema is available in several languages, including RELAX NG and DTD, and is maintained by the DocBook Technical Committee of OASIS. The DocBook Open Repository is a project hosted on SourceForge that maintains a set of XSL stylesheets for converting a DocBook instance into a variety of output formats, including various HTML formats, PDF (via XSL-FO), and man pages. The currently supported HTML output formats include monolithic HTML, chunked HTML, Microsoft HTML Help (.chm), Eclipse documentation plugins, and JavaHelp.

This proposal is to add a browser-independent, platform-independent documentation “Web Help” output format using a combination of HTML, CSS, and JavaScript with a search index created at build time by an indexer application written in Java.

Search is done on the client side. For that, I plan to use the “htmlsearch1.04” demo plugin from the DITA Open Toolkit as a base and enhance it with the needed features. Since DocBook is listed as one of its supported products, it should be compatible with this project. Further, its license allows it to be used in commercial applications as well.

The proposed design for searching is as follows. First, JavaScript files are generated which contain all glossary terms of the HTML files together with the matching file locations (i.e. as key-value pairs). Then, for a given query, the keywords are extracted and looked up in the generated glossary/index, and the matching pages are displayed.
The enhancements currently planned are:

  • Support for stemming and lemmatization for a given query
  • Search with Boolean operators (AND, OR)
  • Meta-data such as 'Prev' and 'Next' in the content page will be ignored when indexing.
  • Improved support for Asian languages (for Japanese and other Asian languages, the meta tag content is used).
  • As searching on the client side may slow down the application, the necessary optimizations will be made.
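As a rough illustration of how stemming and Boolean operators could fit together at query time, here is a small Python sketch. Everything in it is an assumption made for the example: the suffix list is a toy rule set (not the Porter algorithm or any stemmer the project has committed to), and stem, build_stemmed_index and search are hypothetical names. The key point is that the same stemmer is applied both when indexing and when querying, so that "installed" finds a page containing "Installing".

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # toy rule set, not the Porter algorithm


def stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stemmed_index(pages):
    """Map each stemmed word to the set of files that contain it."""
    index = {}
    for filename, text in pages.items():
        for word in re.findall(r"[A-Za-z0-9]+", text):
            index.setdefault(stem(word), set()).add(filename)
    return index


def search(index, query):
    """Evaluate 'a AND b OR c' as (a AND b) OR c.

    Terms separated only by whitespace are treated as AND.
    Operators must be written in upper case.
    """
    results = set()
    for clause in re.split(r"\s+OR\s+", query.strip()):
        terms = [t for t in re.split(r"\s+AND\s+|\s+", clause) if t]
        matches = [index.get(stem(t), set()) for t in terms]
        if matches:
            results |= set.intersection(*matches)
    return sorted(results)


# Hypothetical sample pages.
pages = {
    "ch01.html": "Installing the stylesheets",
    "ch02.html": "Customized stylesheet parameters",
    "ch03.html": "Indexing Japanese content",
}
index = build_stemmed_index(pages)
hits_and = search(index, "installed AND stylesheets")
hits_or = search(index, "indexes OR parameter")
```

In this sample, hits_and contains only ch01.html, while hits_or contains ch02.html and ch03.html. A real implementation would run the query side in JavaScript in the browser and would need language-specific stemmers.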

I plan to use the YUI library for the TOC tree generation. I will abandon the frameset-based approach and instead use a CSS-based mechanism in which the TOC is generated in every page and CSS is used to format it properly for viewing. With this approach, synchronization with the content file happens automatically, and so does deep linking.
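The sketch below shows why embedding the TOC in every page gives synchronization for free: each page is generated with its own entry already marked, so no script has to work out the current position. It is a Python illustration only. The function names, the "current" class and the div ids are hypothetical, and the YUI tree widget is not shown; it would be layered on top of markup like this.

```python
from html import escape


def render_toc(entries, current):
    """Render nested (title, href, children) entries as an HTML list.

    The entry whose href equals `current` gets class="current", so CSS
    alone can highlight the reader's position in the TOC pane.
    """
    items = []
    for title, href, children in entries:
        css = ' class="current"' if href == current else ""
        link = '<a href="%s">%s</a>' % (escape(href), escape(title))
        nested = render_toc(children, current) if children else ""
        items.append("<li%s>%s%s</li>" % (css, link, nested))
    return "<ul>%s</ul>" % "".join(items)


def render_page(toc, current, body):
    """Embed the full TOC next to the content, with no frameset involved."""
    return '<div id="toc">%s</div><div id="content">%s</div>' % (
        render_toc(toc, current),
        body,
    )


# Hypothetical TOC for a small book.
toc = [
    ("Introduction", "ch01.html", []),
    ("Installation", "ch02.html", [("Requirements", "ch02s01.html", [])]),
]
page = render_page(toc, "ch02s01.html", "<p>Requirements</p>")
```

Since every page is a complete document with ordinary links, a URL to any chunk can be bookmarked or shared directly, and the TOC still works as a plain nested list when JavaScript is disabled.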
The UI design will be developed using CSS and other technologies and will be somewhat similar to Eclipse Help.
The planned development schedule is given below.

Development Schedule
I am already proficient in Java, JavaScript, XML and CSS, but will start studying XSL in the bonding period and continue to learn it while doing the programming.

Community Bonding Period: April 26 - May 24
  • Get to know the mentor and the community
  • Study the required API and features for WebHelp
  • Prepare the development environment
  • Look for a good searching approach
  • Start designing a good model

Interim Period: May 24 - July 12
  • Divide the development process into stages with the help of the mentor
  • Develop the TOC tree using a CSS-based mechanism (YUI)
  • Implement the synchronization with the content
  • Add an index with the help of the DocBook schema
  • Design the client-side search mechanism, taking stemming, lemmatization and the other requirements into consideration, and start coding
  • Design a better user interface

July 12 - July 16
  • Submit mid-term evaluations and continue with the development

Interim Period: July 16 - August 9
  • Complete the TOC with synchronization
  • Continue developing the search mechanism
  • Test the synchronization and searching
  • Develop the user interface

August 9 - August 16
  • Refine and test the code and make the necessary improvements

August 20
  • Final evaluation deadline

August 30
  • Submit the required code to Google

