TagSoup is the fastest pure JS SAX/DOM XML/HTML parser.
npm install --save-prod tag-soup
⚠️ API documentation is available here.
import {createSaxParser} from 'tag-soup';
// Or use
// import {createXmlSaxParser, createHtmlSaxParser} from 'tag-soup';
const saxParser = createSaxParser({
  startTag(token) {
    console.log(token); // → {tokenType: 1, name: 'foo', …}
  },
  endTag(token) {
    console.log(token); // → {tokenType: 101, name: 'foo', …}
  },
});
saxParser.parse('<foo>okay');
The SAX parser invokes callbacks during parsing.
Callbacks receive tokens that represent structures read from the input. Tokens are pooled objects: when a handler callback finishes, they are returned to the pool and reused. Object pooling drastically reduces memory consumption and allows passing a lot of data to the callback.
If you need to retain a token after the callback finishes, use token.clone(), which returns a deep copy of the token.
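The effect of pooling on retained references can be sketched without the library. The snippet below is only an illustration: PooledToken, cloneToken and emitStartTags are hypothetical names, and the "pool" is a single reused object rather than TagSoup's actual implementation.

```typescript
// Hypothetical token shape; real TagSoup tokens carry more fields.
interface PooledToken {
  tagName: string;
}

// Stand-in for token.clone(): returns an independent copy.
function cloneToken(token: PooledToken): PooledToken {
  return {...token};
}

// Emits one startTag callback per tag name, reusing the same token object each time.
function emitStartTags(tagNames: string[], startTag: (token: PooledToken) => void): void {
  const token: PooledToken = {tagName: ''};
  for (const tagName of tagNames) {
    token.tagName = tagName;
    startTag(token);
  }
}

const retained: PooledToken[] = [];
const cloned: PooledToken[] = [];

emitStartTags(['foo', 'bar'], (token) => {
  retained.push(token); // keeps a reference to the reused object
  cloned.push(cloneToken(token)); // keeps an independent snapshot
});

console.log(retained.map((token) => token.tagName)); // → ['bar', 'bar']
console.log(cloned.map((token) => token.tagName)); // → ['foo', 'bar']
```

Both retained entries point at the same object, so they show whatever was written to it last. Only the cloned copies preserve the values seen inside the callback.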
The startTag and endTag callbacks are always invoked in the correct order, even if tags in the input were incorrectly nested or missing.
For self-closing tags, only the startTag callback is invoked.
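One common way to get balanced callbacks from unbalanced input is to track open tags on a stack. The sketch below shows that general technique under simplifying assumptions. TagEvent and balanceTags are hypothetical names, orphan end tags are simply ignored, and none of this is claimed to mirror TagSoup's internals.

```typescript
// Hypothetical raw event as it might be read from the input.
type TagEvent = {kind: 'start' | 'end'; tagName: string; selfClosing?: boolean};

// Returns the callback invocations in the order a balanced parser would make them.
function balanceTags(events: TagEvent[]): string[] {
  const openTags: string[] = [];
  const output: string[] = [];

  for (const tagEvent of events) {
    if (tagEvent.kind === 'start') {
      output.push(`startTag ${tagEvent.tagName}`);
      if (!tagEvent.selfClosing) {
        openTags.push(tagEvent.tagName); // self-closing tags never get an endTag
      }
      continue;
    }
    const index = openTags.lastIndexOf(tagEvent.tagName);
    if (index === -1) {
      continue; // orphan end tag: nothing to close (an assumption of this sketch)
    }
    // Close everything opened after the matching start tag, innermost first.
    while (openTags.length > index) {
      output.push(`endTag ${openTags.pop()}`);
    }
  }

  // End of input: close whatever is still open.
  while (openTags.length > 0) {
    output.push(`endTag ${openTags.pop()}`);
  }
  return output;
}

// Events for the input <a><b></a><br/><c>
const balanced = balanceTags([
  {kind: 'start', tagName: 'a'},
  {kind: 'start', tagName: 'b'},
  {kind: 'end', tagName: 'a'},
  {kind: 'start', tagName: 'br', selfClosing: true},
  {kind: 'start', tagName: 'c'},
]);

console.log(balanced);
// → ['startTag a', 'startTag b', 'endTag b', 'endTag a', 'startTag br', 'startTag c', 'endTag c']
```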
All SAX parser factories accept two arguments: the handler with callbacks and options. The most generic parser factory, createSaxParser, doesn't have any defaults.
For createXmlSaxParser, the defaults are xmlParserOptions.
For createHtmlSaxParser, the defaults are htmlParserOptions: p, li, td and others follow implicit end rules, so <p>foo<p>bar is parsed as <p>foo</p><p>bar</p>.
You can alter how the parser works through options, which give you fine-grained control over the parsing dialect.
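An implicit end rule can be pictured as a lookup table consulted on every start tag. The following is a simplified sketch, not TagSoup's rule set: implicitEndRules and applyImplicitEnds are hypothetical, the input is pre-split into tokens, and only the innermost open tag is checked.

```typescript
// Hypothetical rule table: a start tag named by the key implicitly ends
// an innermost open tag that appears in the value list.
const implicitEndRules = new Map<string, string[]>([
  ['p', ['p']],
  ['li', ['li']],
  ['td', ['td']],
]);

// Takes start tags such as '<p>' and plain text pieces, returns balanced markup.
function applyImplicitEnds(tokens: string[]): string {
  const openTags: string[] = [];
  let output = '';

  for (const token of tokens) {
    if (!token.startsWith('<')) {
      output += token; // plain text is copied as is
      continue;
    }
    const tagName = token.slice(1, -1);
    const endedTags = implicitEndRules.get(tagName) ?? [];
    const innermost = openTags[openTags.length - 1];

    if (innermost !== undefined && endedTags.includes(innermost)) {
      openTags.pop();
      output += `</${innermost}>`; // the new start tag implicitly ends the open one
    }
    openTags.push(tagName);
    output += token;
  }

  // End of input closes everything that is still open.
  while (openTags.length > 0) {
    output += `</${openTags.pop()}>`;
  }
  return output;
}

console.log(applyImplicitEnds(['<p>', 'foo', '<p>', 'bar'])); // → '<p>foo</p><p>bar</p>'
```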
By default, TagSoup uses speedy-entities to decode XML and HTML entities. A parser created by createHtmlSaxParser decodes only legacy HTML entities. This is done to reduce the bundle size.
To decode all HTML entities, use the snippet below. It adds 10 kB gzipped to the bundle size.
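import {createHtmlSaxParser} from 'tag-soup';
import {decodeHtml} from 'speedy-entities/lib/full';
// The handler comes first; the decoders are passed in the options argument.
const htmlParser = createHtmlSaxParser({/*callbacks*/}, {
  decodeText: decodeHtml,
  decodeAttribute: decodeHtml,
});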
With speedy-entities you can create a custom decoder that recognizes custom entities.
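To make the idea of a custom decoder concrete, here is a hand-rolled sketch that does not use speedy-entities at all. customEntities and decodeCustom are hypothetical names, and the string-to-string function shape is assumed to be what decodeText and decodeAttribute expect, based on how decodeHtml is passed to them above.

```typescript
// Hypothetical entity table; "brand" is a made-up custom entity.
const customEntities = new Map<string, string>([
  ['amp', '&'],
  ['brand', 'TagSoup'],
]);

// Replaces known named references and leaves unknown ones untouched.
function decodeCustom(input: string): string {
  return input.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (reference: string, entityName: string) => {
    return customEntities.get(entityName) ?? reference;
  });
}

console.log(decodeCustom('&brand; &amp; &unknown;')); // → 'TagSoup & &unknown;'
```

This sketch handles only named references terminated by a semicolon. Numeric references and semicolon-less legacy entities would need extra handling.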
Legacy HTML entities: aacute, Aacute, acirc, Acirc, acute, aelig, AElig, agrave, Agrave, amp, AMP, aring, Aring, atilde, Atilde, auml, Auml, brvbar, ccedil, Ccedil, cedil, cent, copy, COPY, curren, deg, divide, eacute, Eacute, ecirc, Ecirc, egrave, Egrave, eth, ETH, euml, Euml, frac12, frac14, frac34, gt, GT, iacute, Iacute, icirc, Icirc, iexcl, igrave, Igrave, iquest, iuml, Iuml, laquo, lt, LT, macr, micro, middot, nbsp, not, ntilde, Ntilde, oacute, Oacute, ocirc, Ocirc, ograve, Ograve, ordf, ordm, oslash, Oslash, otilde, Otilde, ouml, Ouml, para, plusmn, pound, quot, QUOT, raquo, reg, REG, sect, shy, sup1, sup2, sup3, szlig, thorn, THORN, times, uacute, Uacute, ucirc, Ucirc, ugrave, Ugrave, uml, uuml, Uuml, yacute, Yacute, yen and yuml.
SAX parsers support streaming. You can use saxParser.write(chunk) to parse input data chunk by chunk.
const saxParser = createSaxParser({/*callbacks*/});
saxParser.write('<foo>ok');
// Triggers startTag callback for "foo" tag.
saxParser.write('ay');
// Doesn't trigger any callbacks.
saxParser.write('</foo>');
// Triggers text callback for "okay" and endTag callback for "foo" tag.
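Text is reported late in the example above because a streaming reader has to hold text until it knows the text has ended. The sketch below shows that buffering idea in isolation. createChunkReader is a hypothetical helper that only splits input into tags and text, and it is not how TagSoup is implemented.

```typescript
// Hypothetical helper, not part of TagSoup: splits chunked input into tags and text.
function createChunkReader(emit: (entry: string) => void): {write(chunk: string): void} {
  let buffer = '';

  return {
    write(chunk: string): void {
      buffer += chunk;

      while (buffer.length > 0) {
        if (buffer.startsWith('<')) {
          const tagEnd = buffer.indexOf('>');
          if (tagEnd === -1) {
            break; // the tag is incomplete, wait for the next chunk
          }
          emit(`tag ${buffer.slice(0, tagEnd + 1)}`);
          buffer = buffer.slice(tagEnd + 1);
        } else {
          const tagStart = buffer.indexOf('<');
          if (tagStart === -1) {
            break; // the text may continue in the next chunk
          }
          emit(`text ${buffer.slice(0, tagStart)}`);
          buffer = buffer.slice(tagStart);
        }
      }
    },
  };
}

const seen: string[] = [];
const reader = createChunkReader((entry) => seen.push(entry));

reader.write('<foo>ok'); // emits 'tag <foo>', keeps 'ok' buffered
reader.write('ay'); // emits nothing, the buffer is now 'okay'
reader.write('</foo>'); // emits 'text okay', then 'tag </foo>'

console.log(seen); // → ['tag <foo>', 'text okay', 'tag </foo>']
```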
import {createDomParser} from 'tag-soup';
// Or use
// import {createXmlDomParser, createHtmlDomParser} from 'tag-soup';
// Minimal DOM handler example
const domParser = createDomParser<any>({
  element(token) {
    return {tagName: token.name, children: []};
  },
  appendChild(parentNode, node) {
    parentNode.children.push(node);
  },
});
const domNode = domParser.parse('<foo>okay');
console.log(domNode[0].children[0].data); // → 'okay'
The DOM parser assembles a node tree using a handler that describes how nodes are created and appended.
The generic parser factory createDomParser requires a handler to be provided.
Both createXmlDomParser and createHtmlDomParser use domHandler if no other handler is provided, and use default options (xmlParserOptions and htmlParserOptions respectively), which can be overridden.
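The division of labour between a parser and its DOM handler can be sketched as a small driver that walks tag events and delegates node creation and linking to the handler. SketchDomHandler, SketchElement and buildTree below are hypothetical. They borrow the element and appendChild callback names from the minimal handler example but are not TagSoup's types.

```typescript
// Hypothetical, reduced handler shape with the two callbacks used in this sketch.
interface SketchDomHandler<N> {
  element(token: {name: string}): N;
  appendChild(parentNode: N, node: N): void;
}

// Builds a tree from balanced tag events, asking the handler to create and link nodes.
function buildTree<N>(
  tagEvents: Array<{kind: 'start' | 'end'; name: string}>,
  handler: SketchDomHandler<N>,
): N[] {
  const roots: N[] = [];
  const ancestors: N[] = [];

  for (const tagEvent of tagEvents) {
    if (tagEvent.kind === 'end') {
      ancestors.pop(); // leave the current element
      continue;
    }
    const node = handler.element({name: tagEvent.name});
    const parentNode = ancestors[ancestors.length - 1];

    if (parentNode === undefined) {
      roots.push(node); // no open element, so this is a top-level node
    } else {
      handler.appendChild(parentNode, node);
    }
    ancestors.push(node);
  }
  return roots;
}

interface SketchElement {
  tagName: string;
  children: SketchElement[];
}

// Events for the input <foo><bar></bar></foo>
const roots = buildTree<SketchElement>(
  [
    {kind: 'start', name: 'foo'},
    {kind: 'start', name: 'bar'},
    {kind: 'end', name: 'bar'},
    {kind: 'end', name: 'foo'},
  ],
  {
    element(token) {
      return {tagName: token.name, children: []};
    },
    appendChild(parentNode, node) {
      parentNode.children.push(node);
    },
  },
);

console.log(JSON.stringify(roots));
// → [{"tagName":"foo","children":[{"tagName":"bar","children":[]}]}]
```

Because the driver never inspects the nodes itself, the handler is free to produce any node shape, which is the point of making the handler pluggable.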
DOM parsers support streaming. You can use domParser.write(chunk) to parse input data chunk by chunk.
import {createXmlDomParser} from 'tag-soup';
const domParser = createXmlDomParser();
domParser.write('<foo>ok');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]
domParser.write('ay');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]
domParser.write('</foo>');
// → [{nodeType: 1, tagName: 'foo', children: [{nodeType: 3, data: 'okay', …}], …}]
To run a performance test use npm ci && npm run build && npm run perf.
Performance was measured when parsing the 3.81 MB HTML file.
Results are in operations per second. The higher number is better.
| | Ops/sec |
|---|---|
| createSaxParser ¹ | |
| createXmlSaxParser ¹ | |
| createHtmlSaxParser ¹ | |
| createSaxParser | |
| createXmlSaxParser | |
| createHtmlSaxParser | |
| @fb55/htmlparser2 | |
| @isaacs/sax-js | |

¹ Parsers were provided a handler with a single text callback. This configuration can be useful if you want to strip tags from the input.
| | Ops/sec |
|---|---|
| createDomParser | |
| createXmlDomParser | |
| createHtmlDomParser | |
| @fb55/htmlparser2 | |
| @inikulin/parse5 | |
The performance was measured when parsing 258 files, 95 kB in size on average, from htmlparser-benchmark.
Results are in operations per second. The higher number is better.
| | Ops/sec |
|---|---|
| createSaxParser | |
| createXmlSaxParser | |
| createHtmlSaxParser | |
| @fb55/htmlparser2 | |

| | Ops/sec |
|---|---|
| createDomParser | |
| createXmlDomParser | |
| createHtmlDomParser | |
| @fb55/htmlparser2 | |
| @inikulin/parse5 | |
TagSoup doesn't resolve some weird element structures that malformed HTML may cause.
For example, assume the following markup:
<p><strong>okay
<p>nope
With DOMParser this markup would be transformed to:
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
TagSoup doesn't insert the second strong tag:
<p><strong>okay</strong></p>
<p>nope</p> <!-- Note the absent "strong" tag -->