Java HTML parsers

Tech, 2025-06-02

A comparison article:

http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/

It covers the following four parsers, all of which target "dirty, ill-formed and unsuitable" HTML:

NekoHTML

http://nekohtml.sourceforge.net/

HtmlCleaner

Appears to be DOM-based.

http://htmlcleaner.sourceforge.net/javause.php

TagSoup

SAX parser. Documentation is sparse.

There are a variety of other HTML SAX parsers written in Java, notably NekoHTML, JTidy (a port of the C library and tool HTML Tidy), and HTML Parser. All have their good and bad points: the general view around the Web seems to be that TagSoup is the slowest, but also the most robust and reliable.

http://ccil.org/~cowan/XML/tagsoup/

jTidy

DOM parser

http://jtidy.sourceforge.net/

Other references:

http://www.javalobby.org/java/forums/t105916.html

http://stackprinter.appspot.com/export?question=3152138&format=HTML&service=stackoverflow&linktohome=false

Other parsers:

HTML Parser

http://htmlparser.sourceforge.net/

Cobra

DOM parser. Supports XPath.

http://lobobrowser.org/cobra/java-html-parser.jsp

Java Mozilla Html Parser

http://mozillaparser.sourceforge.net/

HotSAX

A small, fast SAX2 parser for HTML, XHTML and XML.

http://hotsax.sourceforge.net/

Jericho

http://jericho.htmlparser.net/docs/index.html

Compared to a tree based parser such as DOM, the memory and resource requirements can be far better if only small sections of the document need to be parsed or modified. Incorrect or badly formatted HTML can easily be ignored, unlike tree based parsers which must identify every node in the document from top to bottom.

Compared to an event based parser such as SAX, the interface is on a much higher level and more intuitive, and a tree representation of the document element hierarchy is easily created if required.
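To make the tree-based vs. event-based distinction concrete, here is a minimal sketch that extracts link targets both ways. It uses only the JDK's built-in XML parsers, not any of the libraries listed here, so it assumes the input is already well-formed XHTML; real-world "dirty" HTML would need one of the tolerant parsers above. The class name, method names and sample page are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxVsDom {
    // Hypothetical sample input; deliberately well-formed so the JDK parsers accept it.
    static final String PAGE =
        "<html><body>"
        + "<a href=\"http://jsoup.org/\">jsoup</a>"
        + "<p>some text</p>"
        + "<a href=\"http://jaxen.org/\">Jaxen</a>"
        + "</body></html>";

    // Event-based (SAX): the parser calls back for each start tag as it streams
    // through the input; no document tree is kept in memory.
    static List<String> linksWithSax(String xhtml) throws Exception {
        List<String> hrefs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("a".equals(qName)) {
                    String href = attrs.getValue("href");
                    if (href != null) {
                        hrefs.add(href);
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return hrefs;
    }

    // Tree-based (DOM): the whole document is parsed into nodes first,
    // and the tree is queried afterwards.
    static List<String> linksWithDom(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(linksWithSax(PAGE));
        System.out.println(linksWithDom(PAGE));
    }
}
```

Both methods return the same two URLs for the sample page. The SAX version never holds more than the current event, while the DOM version pays for the full tree up front but can be queried repeatedly afterwards.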

jsoup

http://jsoup.org/

HtmlUnit

http://htmlunit.sourceforge.net/

VietSpider

Dedicated to web crawling. Seems to be a standalone tool rather than a library.

http://binhgiang.sourceforge.net/site/home2.jsp

Jaxen

Jaxen is an open source XPath library written in Java. It is adaptable to many different object models, including DOM, XOM, dom4j, and JDOM. It is also possible to write adapters that treat non-XML trees such as compiled Java byte code or Java beans as XML, thus enabling you to query these trees with XPath too.

http://jaxen.org/
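As a rough illustration of what querying a parsed tree with XPath looks like, the sketch below runs an XPath expression over a W3C DOM. It does not use Jaxen: it swaps in the JDK's own javax.xml.xpath API, which works on the same DOM object model. The class name, the select helper and the sample markup are assumptions made for this example, and the input is assumed to be well-formed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathOverDom {
    // Hypothetical sample input: parsers from this list, tagged by parsing style.
    static final String PAGE =
        "<html><body><ul>"
        + "<li class=\"sax\">TagSoup</li>"
        + "<li class=\"dom\">jTidy</li>"
        + "<li class=\"dom\">cobra</li>"
        + "</ul></body></html>";

    // Parses the markup into a DOM tree, evaluates the XPath expression against it,
    // and returns the text content of every matching node in document order.
    static List<String> select(String xhtml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Select the list items marked as DOM parsers.
        System.out.println(select(PAGE, "//li[@class='dom']"));
    }
}
```

With a tolerant HTML parser that emits a W3C DOM, the same select step could in principle be applied to its output instead of the hand-written sample.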

======================

jsoup and Jericho look the most attractive to me. jsoup in particular reminds me of Hpricot.
