# he [![Build status](https://travis-ci.org/mathiasbynens/he.svg?branch=master)](https://travis-ci.org/mathiasbynens/he) [![Code coverage status](http://img.shields.io/coveralls/mathiasbynens/he/master.svg)](https://coveralls.io/r/mathiasbynens/he) [![Dependency status](https://gemnasium.com/mathiasbynens/he.svg)](https://gemnasium.com/mathiasbynens/he) _he_ (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports [all standardized named character references as per HTML](http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html), handles [ambiguous ampersands](https://mathiasbynens.be/notes/ambiguous-ampersands) and other edge cases [just like a browser would](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references), has an extensive test suite, and — contrary to many other JavaScript solutions — _he_ handles astral Unicode symbols just fine. [An online demo is available.](http://mothereff.in/html-entities) ## Installation Via [npm](http://npmjs.org/): ```bash npm install he ``` Via [Bower](http://bower.io/): ```bash bower install he ``` Via [Component](https://github.com/component/component): ```bash component install mathiasbynens/he ``` In a browser: ```html ``` In [Narwhal](http://narwhaljs.org/), [Node.js](http://nodejs.org/), and [RingoJS](http://ringojs.org/): ```js var he = require('he'); ``` In [Rhino](http://www.mozilla.org/rhino/): ```js load('he.js'); ``` Using an AMD loader like [RequireJS](http://requirejs.org/): ```js require( { 'paths': { 'he': 'path/to/he' } }, ['he'], function(he) { console.log(he); } ); ``` ## API ### `he.version` A string representing the semantic version number. ### `he.encode(text, options)` This function takes a string of text and encodes (by default) any symbols that aren’t printable ASCII symbols and `&`, `<`, `>`, `"`, `'`, and `` ` ``, replacing them with character references. ```js he.encode('foo © bar ≠ baz 𝌆 qux'); // → 'foo © bar ≠ baz 𝌆 qux' ``` As long as the input string contains [allowed code points](http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream) only, the return value of this function is always valid HTML. Any [(invalid) code points that cannot be represented using a character reference](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#table-charref-overrides) in the input are not encoded. ```js he.encode('foo \0 bar'); // → 'foo \0 bar' ``` The `options` object is optional. It recognizes the following properties: #### `useNamedReferences` The default value for the `useNamedReferences` option is `false`. This means that `encode()` will not use any named character references (e.g. `©`) in the output — hexadecimal escapes (e.g. `©`) will be used instead. Set it to `true` to enable the use of named references. **Note that if compatibility with older browsers is a concern, this option should remain disabled.** ```js // Using the global default setting (defaults to `false`): he.encode('foo © bar ≠ baz 𝌆 qux'); // → 'foo © bar ≠ baz 𝌆 qux' // Passing an `options` object to `encode`, to explicitly disallow named references: he.encode('foo © bar ≠ baz 𝌆 qux', { 'useNamedReferences': false }); // → 'foo © bar ≠ baz 𝌆 qux' // Passing an `options` object to `encode`, to explicitly allow named references: he.encode('foo © bar ≠ baz 𝌆 qux', { 'useNamedReferences': true }); // → 'foo © bar ≠ baz 𝌆 qux' ``` #### `encodeEverything` The default value for the `encodeEverything` option is `false`. This means that `encode()` will not use any character references for printable ASCII symbols that don’t need escaping. Set it to `true` to encode every symbol in the input string. When set to `true`, this option takes precedence over `allowUnsafeSymbols` (i.e. setting the latter to `true` in such a case has no effect). ```js // Using the global default setting (defaults to `false`): he.encode('foo © bar ≠ baz 𝌆 qux'); // → 'foo © bar ≠ baz 𝌆 qux' // Passing an `options` object to `encode`, to explicitly encode all symbols: he.encode('foo © bar ≠ baz 𝌆 qux', { 'encodeEverything': true }); // → 'foo © bar ≠ baz 𝌆 qux' // This setting can be combined with the `useNamedReferences` option: he.encode('foo © bar ≠ baz 𝌆 qux', { 'encodeEverything': true, 'useNamedReferences': true }); // → 'foo © bar ≠ baz 𝌆 qux' ``` #### `strict` The default value for the `strict` option is `false`. This means that `encode()` will encode any HTML text content you feed it, even if it contains any symbols that cause [parse errors](http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream). To throw an error when such invalid HTML is encountered, set the `strict` option to `true`. This option makes it possible to use _he_ as part of HTML parsers and HTML validators. ```js // Using the global default setting (defaults to `false`, i.e. error-tolerant mode): he.encode('\x01'); // → '' // Passing an `options` object to `encode`, to explicitly enable error-tolerant mode: he.encode('\x01', { 'strict': false }); // → '' // Passing an `options` object to `encode`, to explicitly enable strict mode: he.encode('\x01', { 'strict': true }); // → Parse error ``` #### `allowUnsafeSymbols` The default value for the `allowUnsafeSymbols` option is `false`. This means that characters that are unsafe for use in HTML content (`&`, `<`, `>`, `"`, `'`, and `` ` ``) will be encoded. When set to `true`, only non-ASCII characters will be encoded. If the `encodeEverything` option is set to `true`, this option will be ignored. ```js he.encode('foo © and & ampersand', { 'allowUnsafeSymbols': true }); // → 'foo © and & ampersand' ``` #### Overriding default `encode` options globally The global default setting can be overridden by modifying the `he.encode.options` object. This saves you from passing in an `options` object for every call to `encode` if you want to use the non-default setting. ```js // Read the global default setting: he.encode.options.useNamedReferences; // → `false` by default // Override the global default setting: he.encode.options.useNamedReferences = true; // Using the global default setting, which is now `true`: he.encode('foo © bar ≠ baz 𝌆 qux'); // → 'foo © bar ≠ baz 𝌆 qux' ``` ### `he.decode(html, options)` This function takes a string of HTML and decodes any named and numerical character references in it using [the algorithm described in section 12.2.4.69 of the HTML spec](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references). ```js he.decode('foo © bar ≠ baz 𝌆 qux'); // → 'foo © bar ≠ baz 𝌆 qux' ``` The `options` object is optional. It recognizes the following properties: #### `isAttributeValue` The default value for the `isAttributeValue` option is `false`. This means that `decode()` will decode the string as if it were used in [a text context in an HTML document](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state). HTML has different rules for [parsing character references in attribute values](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#character-reference-in-attribute-value-state) — set this option to `true` to treat the input string as if it were used as an attribute value. ```js // Using the global default setting (defaults to `false`, i.e. HTML text context): he.decode('foo&bar'); // → 'foo&bar' // Passing an `options` object to `decode`, to explicitly assume an HTML text context: he.decode('foo&bar', { 'isAttributeValue': false }); // → 'foo&bar' // Passing an `options` object to `decode`, to explicitly assume an HTML attribute value context: he.decode('foo&bar', { 'isAttributeValue': true }); // → 'foo&bar' ``` #### `strict` The default value for the `strict` option is `false`. This means that `decode()` will decode any HTML text content you feed it, even if it contains any entities that cause [parse errors](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references). To throw an error when such invalid HTML is encountered, set the `strict` option to `true`. This option makes it possible to use _he_ as part of HTML parsers and HTML validators. ```js // Using the global default setting (defaults to `false`, i.e. error-tolerant mode): he.decode('foo&bar'); // → 'foo&bar' // Passing an `options` object to `decode`, to explicitly enable error-tolerant mode: he.decode('foo&bar', { 'strict': false }); // → 'foo&bar' // Passing an `options` object to `decode`, to explicitly enable strict mode: he.decode('foo&bar', { 'strict': true }); // → Parse error ``` #### Overriding default `decode` options globally The global default settings for the `decode` function can be overridden by modifying the `he.decode.options` object. This saves you from passing in an `options` object for every call to `decode` if you want to use a non-default setting. ```js // Read the global default setting: he.decode.options.isAttributeValue; // → `false` by default // Override the global default setting: he.decode.options.isAttributeValue = true; // Using the global default setting, which is now `true`: he.decode('foo&bar'); // → 'foo&bar' ``` ### `he.escape(text)` This function takes a string of text and escapes it for use in text contexts in XML or HTML documents. Only the following characters are escaped: `&`, `<`, `>`, `"`, `'`, and `` ` ``. ```js he.escape(''); // → '<img src='x' onerror="prompt(1)">' ``` ### `he.unescape(html, options)` `he.unescape` is an alias for `he.decode`. It takes a string of HTML and decodes any named and numerical character references in it. ### Using the `he` binary To use the `he` binary in your shell, simply install _he_ globally using npm: ```bash npm install -g he ``` After that you will be able to encode/decode HTML entities from the command line: ```bash $ he --encode 'föo ♥ bår 𝌆 baz' föo ♥ bår 𝌆 baz $ he --encode --use-named-refs 'föo ♥ bår 𝌆 baz' föo ♥ bår 𝌆 baz $ he --decode 'föo ♥ bår 𝌆 baz' föo ♥ bår 𝌆 baz ``` Read a local text file, encode it for use in an HTML text context, and save the result to a new file: ```bash $ he --encode < foo.txt > foo-escaped.html ``` Or do the same with an online text file: ```bash $ curl -sL "http://git.io/HnfEaw" | he --encode > escaped.html ``` Or, the opposite — read a local file containing a snippet of HTML in a text context, decode it back to plain text, and save the result to a new file: ```bash $ he --decode < foo-escaped.html > foo.txt ``` Or do the same with an online HTML snippet: ```bash $ curl -sL "http://git.io/HnfEaw" | he --decode > decoded.txt ``` See `he --help` for the full list of options. ## Support he has been tested in at least Chrome 27-29, Firefox 3-22, Safari 4-6, Opera 10-12, IE 6-10, Node.js v0.10.0, Narwhal 0.3.2, RingoJS 0.8-0.9, PhantomJS 1.9.0, and Rhino 1.7RC4. ## Unit tests & code coverage After cloning this repository, run `npm install` to install the dependencies needed for he development and testing. You may want to install Istanbul _globally_ using `npm install istanbul -g`. Once that’s done, you can run the unit tests in Node using `npm test` or `node tests/tests.js`. To run the tests in Rhino, Ringo, Narwhal, and web browsers as well, use `grunt test`. To generate the code coverage report, use `grunt cover`. ## Acknowledgements Thanks to [Simon Pieters](http://simon.html5.org/) ([@zcorpan](https://twitter.com/zcorpan)) for the many suggestions. ## Author | [![twitter/mathias](https://gravatar.com/avatar/24e08a9ea84deb17ae121074d0f17125?s=70)](https://twitter.com/mathias "Follow @mathias on Twitter") | |---| | [Mathias Bynens](https://mathiasbynens.be/) | ## License _he_ is available under the [MIT](http://mths.be/mit) license.