
Forgiving HTML Parser for Node and Browsers

Chris Winberry needed an HTML parser for a project he was working on. He started out with John's parser but found it a touch too strict for some of the HTML he was feeding it (sloppy HTML? never). It was also too heavy to run on a server that would see considerable traffic. So, being lazy, he wrote a new one from the ground up that is both lightweight (it builds an extremely simple DOM) and very forgiving.

Which brings us to node-htmlparser, which works in both Node:

JAVASCRIPT:
  var sys = require("sys"); // needed for sys.puts / sys.inspect below
  var htmlparser = require("node-htmlparser");
  // Deliberately sloppy input: unquoted attribute, "</  script>", nested comment opener
  var rawHtml = "Xyz <script language= javascript>var foo = '<<bar>>';</  script><!--<!-- Waah! -- -->";
  var handler = new htmlparser.DefaultHandler(function (error) {
      if (error) {
          // ...do something for errors...
      } else {
          // ...parsing done, do something...
      }
  });
  var parser = new htmlparser.Parser(handler);
  parser.ParseComplete(rawHtml);
  // Print the resulting DOM at full depth
  sys.puts(sys.inspect(handler.dom, false, null));

and on a modern browser:

JAVASCRIPT:
  var handler = new Tautologistics.NodeHtmlParser.DefaultHandler(function (error) {
      if (error) {
          // ...do something for errors...
      } else {
          // ...parsing done, do something...
      }
  });
  var parser = new Tautologistics.NodeHtmlParser.Parser(handler);
  // Parse the current page's own markup
  parser.ParseComplete(document.body.innerHTML);
  // Show the resulting DOM as indented JSON
  alert(JSON.stringify(handler.dom, null, 2));
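To give a feel for what "forgiving" with an "extremely simple DOM" can mean in practice, here is a minimal sketch. It is not node-htmlparser's code or its output format: the function name `forgivingParse`, the `VOID_TAGS` table and the node shape (`type`, `name`, `data`, `children`) are all made up for illustration. It also does not treat the contents of script or style elements specially, which a real parser would need to do.

```javascript
// Hypothetical sketch only: NOT node-htmlparser's implementation or DOM format.
// It builds a tree of plain objects and never throws on sloppy input.
var VOID_TAGS = { br: true, hr: true, img: true, input: true, meta: true, link: true };

function forgivingParse(html) {
    var root = { type: "root", children: [] };
    var stack = [root]; // currently open elements, innermost last
    // Alternatives: comment | close tag | open tag | text run or stray "<"
    var re = /<!--([\s\S]*?)-->|<\/\s*([a-zA-Z][\w-]*)[^>]*>|<([a-zA-Z][\w-]*)([^>]*)>|([^<]+|<)/g;
    var m;
    while ((m = re.exec(html)) !== null) {
        var parent = stack[stack.length - 1];
        if (m[1] !== undefined) {
            parent.children.push({ type: "comment", data: m[1] });
        } else if (m[2] !== undefined) {
            // Close tag: unwind to the nearest matching open element.
            // A close tag that matches nothing is silently ignored.
            var closeName = m[2].toLowerCase();
            for (var i = stack.length - 1; i > 0; i--) {
                if (stack[i].name === closeName) {
                    stack.length = i;
                    break;
                }
            }
        } else if (m[3] !== undefined) {
            var node = { type: "tag", name: m[3].toLowerCase(), children: [] };
            parent.children.push(node);
            var selfClosing = /\/\s*$/.test(m[4]);
            if (!selfClosing && !VOID_TAGS[node.name]) {
                stack.push(node);
            }
        } else {
            // Text (or a stray "<"): merge into a preceding text node if any.
            var last = parent.children[parent.children.length - 1];
            if (last && last.type === "text") {
                last.data += m[5];
            } else {
                parent.children.push({ type: "text", data: m[5] });
            }
        }
    }
    return root.children;
}

// Unclosed <b>, a void <br>, and a stray </i> are all tolerated.
var sketchDom = forgivingParse("<p>Hi <b>there<br>friend</p> bye</i>");
console.log(JSON.stringify(sketchDom, null, 2));
```

The design choice worth noticing is that nothing is ever an error: an unclosed element is closed implicitly when an ancestor closes, and a close tag with no partner is simply dropped.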

