an article:
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
It covers the following 4 parsers, all of which target "dirty, ill-formed and unsuitable" HTML:
NekoHTML
http://nekohtml.sourceforge.net/
HtmlCleaner
Appears to be DOM-based.
http://htmlcleaner.sourceforge.net/javause.php
TagSoup
SAX parser. Documentation is sparse.
There are a variety of other HTML SAX parsers written in Java, notably NekoHTML, JTidy (a port of the C library and tool HTML Tidy), and HTML Parser. All have their good and bad points: the general view around the Web seems to be that TagSoup is the slowest, but also the most robust and reliable.
http://ccil.org/~cowan/XML/tagsoup/
jTidy
DOM parser
http://jtidy.sourceforge.net/
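To make the SAX vs. DOM labels above concrete, here is a minimal sketch that pulls the href values out of the same document both ways. It uses only the JDK's built-in XML parsers, not any of the libraries listed here, so the input has to be well-formed XHTML; the class name, method names and sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxVsDomSketch {
    // Hypothetical sample input. It is well-formed on purpose: the JDK
    // parsers reject real-world tag soup, which is why the libraries
    // in this list exist.
    public static final String SAMPLE =
        "<html><body><a href=\"/a\">A</a><p><a href=\"/b\">B</a></p></body></html>";

    /** Event-based (SAX): react to each start tag as it streams past. */
    public static List<String> hrefsWithSax(String xhtml) throws Exception {
        List<String> hrefs = new ArrayList<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xhtml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    // No tree is built; only the current element is visible.
                    if ("a".equals(qName) && atts.getValue("href") != null) {
                        hrefs.add(atts.getValue("href"));
                    }
                }
            });
        return hrefs;
    }

    /** Tree-based (DOM): build the whole document first, then query it. */
    public static List<String> hrefsWithDom(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hrefsWithSax(SAMPLE));
        System.out.println(hrefsWithDom(SAMPLE));
    }
}
```

Both methods return the same list for this input. The SAX version never holds the document in memory, while the DOM version lets you walk the tree again afterwards.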
another doc:
http://www.javalobby.org/java/forums/t105916.html
http://stackprinter.appspot.com/export?question=3152138&format=HTML&service=stackoverflow&linktohome=false
other parsers:
HTML Parser
http://htmlparser.sourceforge.net/
cobra
DOM parser. Supports XPath.
http://lobobrowser.org/cobra/java-html-parser.jsp
Java Mozilla Html Parser
http://mozillaparser.sourceforge.net/
HotSAX a small fast SAX2 parser for HTML, XHTML and XML http://hotsax.sourceforge.net/
Jericho http://jericho.htmlparser.net/docs/index.html
Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom. Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
jsoup http://jsoup.org/
HtmlUnit http://htmlunit.sourceforge.net/
VietSpider
Dedicated to spidering. Seems to be a standalone tool rather than a library. http://binhgiang.sourceforge.net/site/home2.jsp
Jaxen Jaxen is an open source XPath library written in Java. It is adaptable to many different object models, including DOM, XOM, dom4j, and JDOM. It is also possible to write adapters that treat non-XML trees such as compiled Java byte code or Java beans as XML, thus enabling you to query these trees with XPath too. http://jaxen.org/
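For a feel of what querying a parsed tree with XPath looks like, here is a rough sketch. Note that it uses the JDK's own javax.xml.xpath over a W3C DOM rather than Jaxen or cobra, and it assumes the HTML has already been cleaned into well-formed markup by one of the parsers above. The class name, the select helper and the sample markup are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Hypothetical, already well-formed sample markup.
    public static final String SAMPLE =
        "<html><body>"
        + "<div class=\"post\"><a href=\"/one\">One</a></div>"
        + "<div><a href=\"/two\">Two</a></div>"
        + "</body></html>";

    /** Evaluates an XPath expression and returns the text of each matched node. */
    public static List<String> select(String xhtml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // For attribute nodes this is the attribute value,
            // for elements it is the concatenated text inside them.
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(select(SAMPLE, "//div[@class='post']/a/@href"));
        System.out.println(select(SAMPLE, "//a"));
    }
}
```

The first query only matches the link inside the div with class "post"; the second matches every anchor in the document.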
======================
jsoup and Jericho look the most attractive to me. jsoup in particular reminds me of hpricot.