# he [![Build status](https://travis-ci.org/mathiasbynens/he.svg?branch=master)](https://travis-ci.org/mathiasbynens/he) [![Code coverage status](http://img.shields.io/coveralls/mathiasbynens/he/master.svg)](https://coveralls.io/r/mathiasbynens/he) [![Dependency status](https://gemnasium.com/mathiasbynens/he.svg)](https://gemnasium.com/mathiasbynens/he)

_he_ (for “HTML entities”) is a robust HTML entity encoder/decoder written in JavaScript. It supports [all standardized named character references as per HTML](http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html), handles [ambiguous ampersands](https://mathiasbynens.be/notes/ambiguous-ampersands) and other edge cases [just like a browser would](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references), has an extensive test suite, and, contrary to many other JavaScript solutions, handles astral Unicode symbols just fine. [An online demo is available.](http://mothereff.in/html-entities)

## Installation

Via [npm](http://npmjs.org/):

```bash
npm install he
```

Via [Bower](http://bower.io/):

```bash
bower install he
```

Via [Component](https://github.com/component/component):

```bash
component install mathiasbynens/he
```

In a browser:

```html
<script src="he.js"></script>
```

In [Narwhal](http://narwhaljs.org/), [Node.js](http://nodejs.org/), and [RingoJS](http://ringojs.org/):

```js
var he = require('he');
```

In [Rhino](http://www.mozilla.org/rhino/):

```js
load('he.js');
```

Using an AMD loader like [RequireJS](http://requirejs.org/):

```js
require(
  {
    'paths': {
      'he': 'path/to/he'
    }
  },
  ['he'],
  function(he) {
    console.log(he);
  }
);
```

## API

### `he.version`

A string representing the semantic version number.

### `he.encode(text, options)`

This function takes a string of text and encodes (by default) any symbols that aren’t printable ASCII symbols, as well as `&`, `<`, `>`, `"`, `'`, and `` ` ``, replacing them with character references.

```js
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'
```
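
To see why astral symbols need care, here is a minimal sketch of how a hexadecimal character reference can be derived from a code point. It is not how _he_ is implemented: `toHexReferences` is a hypothetical helper written for this illustration, and it deliberately ignores `&`, `<`, `>`, `"`, `'`, and `` ` ``, which `he.encode` also handles by default.

```js
// Hypothetical helper, not part of he's API: replace every symbol outside
// the printable ASCII range with a hexadecimal character reference.
function toHexReferences(text) {
  let result = '';
  // `for...of` iterates by code point, so a surrogate pair such as 𝌆
  // (U+1D306) is visited once instead of as two separate code units.
  for (const symbol of text) {
    const codePoint = symbol.codePointAt(0);
    const isPrintableAscii = codePoint >= 0x20 && codePoint <= 0x7E;
    result += isPrintableAscii
      ? symbol
      : '&#x' + codePoint.toString(16).toUpperCase() + ';';
  }
  return result;
}

console.log(toHexReferences('foo © bar ≠ baz 𝌆 qux'));
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'
```

A solution that walks the string one UTF-16 code unit at a time would emit two references for `𝌆` (one per surrogate half) instead of the single `&#x1D306;`.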

As long as the input string contains [allowed code points](http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream) only, the return value of this function is always valid HTML. Any [(invalid) code points that cannot be represented using a character reference](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#table-charref-overrides) are left unencoded in the output.

```js
he.encode('foo \0 bar');
// → 'foo \0 bar'
```

The `options` object is optional. It recognizes the following properties:

#### `useNamedReferences`

The default value for the `useNamedReferences` option is `false`. This means that `encode()` will not use any named character references (e.g. `&copy;`) in the output; hexadecimal escapes (e.g. `&#xA9;`) will be used instead. Set it to `true` to enable the use of named references.

**Note that if compatibility with older browsers is a concern, this option should remain disabled.**

```js
// Using the global default setting (defaults to `false`):
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly disallow named references:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'useNamedReferences': false
});
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly allow named references:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'useNamedReferences': true
});
// → 'foo &copy; bar &ne; baz &#x1D306; qux'
```

#### `encodeEverything`

The default value for the `encodeEverything` option is `false`. This means that `encode()` will not use any character references for printable ASCII symbols that don’t need escaping. Set it to `true` to encode every symbol in the input string. When set to `true`, this option takes precedence over `allowUnsafeSymbols` (i.e. setting the latter to `true` in such a case has no effect).

```js
// Using the global default setting (defaults to `false`):
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &#xA9; bar &#x2260; baz &#x1D306; qux'

// Passing an `options` object to `encode`, to explicitly encode all symbols:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'encodeEverything': true
});
// → '&#x66;&#x6F;&#x6F;&#x20;&#xA9;&#x20;&#x62;&#x61;&#x72;&#x20;&#x2260;&#x20;&#x62;&#x61;&#x7A;&#x20;&#x1D306;&#x20;&#x71;&#x75;&#x78;'

// This setting can be combined with the `useNamedReferences` option:
he.encode('foo © bar ≠ baz 𝌆 qux', {
  'encodeEverything': true,
  'useNamedReferences': true
});
// → '&#x66;&#x6F;&#x6F;&#x20;&copy;&#x20;&#x62;&#x61;&#x72;&#x20;&ne;&#x20;&#x62;&#x61;&#x7A;&#x20;&#x1D306;&#x20;&#x71;&#x75;&#x78;'
```

#### `strict`

The default value for the `strict` option is `false`. This means that `encode()` will encode any HTML text content you feed it, even if it contains symbols that cause [parse errors](http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#preprocessing-the-input-stream). To throw an error when such invalid HTML is encountered, set the `strict` option to `true`. This option makes it possible to use _he_ as part of HTML parsers and HTML validators.

```js
// Using the global default setting (defaults to `false`, i.e. error-tolerant mode):
he.encode('\x01');
// → '&#x1;'

// Passing an `options` object to `encode`, to explicitly enable error-tolerant mode:
he.encode('\x01', {
  'strict': false
});
// → '&#x1;'

// Passing an `options` object to `encode`, to explicitly enable strict mode:
he.encode('\x01', {
  'strict': true
});
// → Parse error
```
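
Conceptually, strict mode adds a validation step in front of the encoder. The following is a rough sketch of that idea, not _he_’s implementation: `encodeWithStrictCheck` is a hypothetical helper, the error message is made up for this example, and the regular expression is an assumed subset of the code points the spec treats as parse errors.

```js
// Assumption: a small subset of the code points that count as parse errors
// (C0/C1 control characters other than tab, LF, FF, and CR).
const FORBIDDEN_CODE_POINTS = /[\x01-\x08\x0B\x0E-\x1F\x7F-\x9F]/;

// Hypothetical helper, not part of he's API.
function encodeWithStrictCheck(text, options) {
  const strict = Boolean(options && options.strict);
  if (strict && FORBIDDEN_CODE_POINTS.test(text)) {
    // Strict mode: refuse the input instead of encoding it.
    throw new Error('Parse error: forbidden code point');
  }
  // Error-tolerant mode: emit a hexadecimal reference for each such symbol.
  const pattern = new RegExp(FORBIDDEN_CODE_POINTS.source, 'g');
  return text.replace(pattern, (symbol) =>
    '&#x' + symbol.charCodeAt(0).toString(16).toUpperCase() + ';'
  );
}

console.log(encodeWithStrictCheck('\x01'));
// → '&#x1;'

try {
  encodeWithStrictCheck('\x01', { strict: true });
} catch (error) {
  console.log(error.message);
  // → 'Parse error: forbidden code point'
}
```

When calling `he.encode` with `strict` set to `true`, wrapping the call in a similar `try`/`catch` lets a validator report the problem rather than crash.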

#### `allowUnsafeSymbols`

The default value for the `allowUnsafeSymbols` option is `false`. This means that characters that are unsafe for use in HTML content (`&`, `<`, `>`, `"`, `'`, and `` ` ``) will be encoded. When set to `true`, only non-ASCII characters will be encoded. If the `encodeEverything` option is set to `true`, this option will be ignored.

```js
he.encode('foo © and & ampersand', {
  'allowUnsafeSymbols': true
});
// → 'foo &#xA9; and & ampersand'
```

#### Overriding default `encode` options globally

The global default settings for the `encode` function can be overridden by modifying the `he.encode.options` object. This saves you from passing in an `options` object for every call to `encode` if you want to use a non-default setting.

```js
// Read the global default setting:
he.encode.options.useNamedReferences;
// → `false` by default

// Override the global default setting:
he.encode.options.useNamedReferences = true;

// Using the global default setting, which is now `true`:
he.encode('foo © bar ≠ baz 𝌆 qux');
// → 'foo &copy; bar &ne; baz &#x1D306; qux'
```
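
The pattern of storing defaults on the function itself can be sketched in a few lines. This is only an illustration of the idea under stated assumptions: `encodeCopyright` is a hypothetical function that handles a single symbol, and the way _he_ merges options internally may differ.

```js
// Hypothetical function illustrating defaults stored as a property of the
// function, with per-call options taking precedence.
function encodeCopyright(text, options) {
  // Later sources win in Object.assign, so per-call options override the
  // shared defaults; a missing `options` argument is simply skipped.
  const settings = Object.assign({}, encodeCopyright.options, options);
  const reference = settings.useNamedReferences ? '&copy;' : '&#xA9;';
  return text.replace(/©/g, reference);
}
encodeCopyright.options = { useNamedReferences: false };

console.log(encodeCopyright('foo © bar'));
// → 'foo &#xA9; bar'

console.log(encodeCopyright('foo © bar', { useNamedReferences: true }));
// → 'foo &copy; bar'

// Changing the shared defaults affects every later call without options:
encodeCopyright.options.useNamedReferences = true;
console.log(encodeCopyright('foo © bar'));
// → 'foo &copy; bar'
```

Because the defaults are shared, an override made in one module is visible to every other caller in the same process, so it is best done once during start-up.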

### `he.decode(html, options)`

This function takes a string of HTML and decodes any named and numerical character references in it using [the algorithm described in section 12.2.4.69 of the HTML spec](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references).

```js
he.decode('foo &copy; bar &ne; baz &#x1D306; qux');
// → 'foo © bar ≠ baz 𝌆 qux'
```

The `options` object is optional. It recognizes the following properties:

#### `isAttributeValue`

The default value for the `isAttributeValue` option is `false`. This means that `decode()` will decode the string as if it were used in [a text context in an HTML document](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state). HTML has different rules for [parsing character references in attribute values](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#character-reference-in-attribute-value-state); set this option to `true` to treat the input string as if it were used as an attribute value.

```js
// Using the global default setting (defaults to `false`, i.e. HTML text context):
he.decode('foo&ampbar');
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly assume an HTML text context:
he.decode('foo&ampbar', {
  'isAttributeValue': false
});
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly assume an HTML attribute value context:
he.decode('foo&ampbar', {
  'isAttributeValue': true
});
// → 'foo&ampbar'
```
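
The difference comes from a rule in the HTML spec: inside an attribute value, a named reference that lacks its trailing semicolon is left untouched when the next character is `=` or an ASCII alphanumeric. The sketch below illustrates that rule for the `&amp` reference only. `decodeAmp` is a hypothetical helper written for this example and is not how _he_ is implemented.

```js
// Hypothetical helper, not part of he's API: decodes only `&amp` / `&amp;`.
function decodeAmp(html, isAttributeValue) {
  // Capture the optional semicolon and, if present, a following `=` or
  // ASCII alphanumeric character.
  return html.replace(/&amp(;?)([=a-zA-Z0-9]?)/g, (match, semicolon, next) => {
    if (isAttributeValue && !semicolon && next) {
      // Attribute value context: leave the text exactly as written.
      return match;
    }
    return '&' + next;
  });
}

console.log(decodeAmp('foo&ampbar', false));
// → 'foo&bar'

console.log(decodeAmp('foo&ampbar', true));
// → 'foo&ampbar'

console.log(decodeAmp('foo&amp;bar', true));
// → 'foo&bar'
```

This rule exists so that URLs in attributes such as `href="?a=1&ampb=2"` keep working as their authors wrote them.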

#### `strict`

The default value for the `strict` option is `false`. This means that `decode()` will decode any HTML text content you feed it, even if it contains any entities that cause [parse errors](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references). To throw an error when such invalid HTML is encountered, set the `strict` option to `true`. This option makes it possible to use _he_ as part of HTML parsers and HTML validators.

```js
// Using the global default setting (defaults to `false`, i.e. error-tolerant mode):
he.decode('foo&ampbar');
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly enable error-tolerant mode:
he.decode('foo&ampbar', {
  'strict': false
});
// → 'foo&bar'

// Passing an `options` object to `decode`, to explicitly enable strict mode:
he.decode('foo&ampbar', {
  'strict': true
});
// → Parse error
```

#### Overriding default `decode` options globally

The global default settings for the `decode` function can be overridden by modifying the `he.decode.options` object. This saves you from passing in an `options` object for every call to `decode` if you want to use a non-default setting.

```js
// Read the global default setting:
he.decode.options.isAttributeValue;
// → `false` by default

// Override the global default setting:
he.decode.options.isAttributeValue = true;

// Using the global default setting, which is now `true`:
he.decode('foo&ampbar');
// → 'foo&ampbar'
```

### `he.escape(text)`

This function takes a string of text and escapes it for use in text contexts in XML or HTML documents. Only the following characters are escaped: `&`, `<`, `>`, `"`, `'`, and `` ` ``.

```js
he.escape('<img src=\'x\' onerror="prompt(1)">');
// → '&lt;img src=&#x27;x&#x27; onerror=&quot;prompt(1)&quot;&gt;'
```
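
Because only six characters are involved, this kind of escaping boils down to a lookup table and a single replacement pass. Here is a minimal sketch of the technique; `escapeHtml` and `ESCAPE_MAP` are hypothetical names for this illustration, the references chosen for `&` and `` ` `` are assumptions, and `he.escape` itself may be implemented differently.

```js
// Hypothetical stand-in for illustration only.
const ESCAPE_MAP = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#x27;',
  '`': '&#x60;'
};

function escapeHtml(text) {
  // A single pass over the input, so the `&` inside an inserted reference
  // is never escaped a second time.
  return text.replace(/[&<>"'`]/g, (symbol) => ESCAPE_MAP[symbol]);
}

console.log(escapeHtml('<img src=\'x\' onerror="prompt(1)">'));
// → '&lt;img src=&#x27;x&#x27; onerror=&quot;prompt(1)&quot;&gt;'
```

Chaining six separate `replace` calls also works, but then the `&` replacement has to run first; otherwise the ampersands introduced by the other replacements would be escaped again.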

### `he.unescape(html, options)`

`he.unescape` is an alias for `he.decode`. It takes a string of HTML and decodes any named and numerical character references in it.

### Using the `he` binary

To use the `he` binary in your shell, simply install _he_ globally using npm:

```bash
npm install -g he
```

After that you will be able to encode/decode HTML entities from the command line:

```bash
$ he --encode 'föo ♥ bår 𝌆 baz'
f&#xF6;o &#x2665; b&#xE5;r &#x1D306; baz

$ he --encode --use-named-refs 'föo ♥ bår 𝌆 baz'
f&ouml;o &hearts; b&aring;r &#x1D306; baz

$ he --decode 'f&ouml;o &hearts; b&aring;r &#x1D306; baz'
föo ♥ bår 𝌆 baz
```

Read a local text file, encode it for use in an HTML text context, and save the result to a new file:

```bash
$ he --encode < foo.txt > foo-escaped.html
```

Or do the same with an online text file:

```bash
$ curl -sL "http://git.io/HnfEaw" | he --encode > escaped.html
```

Or, the opposite: read a local file containing a snippet of HTML in a text context, decode it back to plain text, and save the result to a new file:

```bash
$ he --decode < foo-escaped.html > foo.txt
```

Or do the same with an online HTML snippet:

```bash
$ curl -sL "http://git.io/HnfEaw" | he --decode > decoded.txt
```

See `he --help` for the full list of options.

## Support

_he_ has been tested in at least Chrome 27-29, Firefox 3-22, Safari 4-6, Opera 10-12, IE 6-10, Node.js v0.10.0, Narwhal 0.3.2, RingoJS 0.8-0.9, PhantomJS 1.9.0, and Rhino 1.7RC4.

## Unit tests & code coverage

After cloning this repository, run `npm install` to install the dependencies needed for _he_ development and testing. You may want to install Istanbul _globally_ using `npm install istanbul -g`.

Once that’s done, you can run the unit tests in Node using `npm test` or `node tests/tests.js`. To run the tests in Rhino, Ringo, Narwhal, and web browsers as well, use `grunt test`.

To generate the code coverage report, use `grunt cover`.

## Acknowledgements

Thanks to [Simon Pieters](http://simon.html5.org/) ([@zcorpan](https://twitter.com/zcorpan)) for the many suggestions.

## Author

| [![twitter/mathias](https://gravatar.com/avatar/24e08a9ea84deb17ae121074d0f17125?s=70)](https://twitter.com/mathias "Follow @mathias on Twitter") |
|---|
| [Mathias Bynens](https://mathiasbynens.be/) |

## License

_he_ is available under the [MIT](http://mths.be/mit) license.