// Copyright 2010 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

/*
Package html implements an HTML5-compliant tokenizer and parser.

Tokenization is done by creating a Tokenizer for an io.Reader r. It is the
caller's responsibility to ensure that r provides UTF-8 encoded HTML.

	z := html.NewTokenizer(r)

Given a Tokenizer z, the HTML is tokenized by repeatedly calling z.Next(),
which parses the next token and returns its type, or an error:

	for {
		tt := z.Next()
		if tt == html.ErrorToken {
			// ...
			return ...
		}
		// Process the current token.
	}

There are two APIs for retrieving the current token. The high-level API is to
call Token; the low-level API is to call Text or TagName / TagAttr. Both APIs
allow optionally calling Raw after Next but before Token, Text, TagName, or
TagAttr. In EBNF notation, the valid call sequence per token is:

	Next {Raw} [ Token | Text | TagName {TagAttr} ]

Token returns an independent data structure that completely describes a token.
Entities (such as "&lt;") are unescaped, tag names and attribute keys are
lower-cased, and attributes are collected into a []Attribute. For example:

	for {
		if z.Next() == html.ErrorToken {
			// Returning io.EOF indicates success.
			return z.Err()
		}
		emitToken(z.Token())
	}

The low-level API performs fewer allocations and copies, but the contents of
the []byte values returned by Text, TagName and TagAttr may change on the next
call to Next. For example, to extract an HTML page's anchor text:

	depth := 0
	for {
		tt := z.Next()
		switch tt {
		case html.ErrorToken:
			return z.Err()
		case html.TextToken:
			if depth > 0 {
				// emitBytes should copy the []byte it receives,
				// if it doesn't process it immediately.
				emitBytes(z.Text())
			}
		case html.StartTagToken, html.EndTagToken:
			tn, _ := z.TagName()
			if len(tn) == 1 && tn[0] == 'a' {
				if tt == html.StartTagToken {
					depth++
				} else {
					depth--
				}
			}
		}
	}
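
The reason emitBytes must copy can be shown with a small standalone sketch.
It deliberately does not use this package: reusingSource is a hypothetical
stand-in that reuses one internal buffer, which is an assumption about how
such an API might behave rather than a description of the Tokenizer's
internals. Slices retained without copying end up showing the most recent
item.

```go
package main

import "fmt"

// reusingSource is a hypothetical stand-in for a reader whose Text method
// returns a view into one reused buffer, valid only until the next Next.
type reusingSource struct {
	pending []string
	buf     []byte
}

func newReusingSource(items ...string) *reusingSource {
	// A fixed capacity keeps every short item in the same backing array.
	return &reusingSource{pending: items, buf: make([]byte, 0, 64)}
}

// Next overwrites the shared buffer with the next pending item.
func (s *reusingSource) Next() bool {
	if len(s.pending) == 0 {
		return false
	}
	s.buf = append(s.buf[:0], s.pending[0]...)
	s.pending = s.pending[1:]
	return true
}

// Text returns the shared buffer without copying it.
func (s *reusingSource) Text() []byte { return s.buf }

// collect retains every item and converts to strings only at the end.
// With copyBytes false it keeps the aliased slices; with true it copies.
func collect(s *reusingSource, copyBytes bool) []string {
	var kept [][]byte
	for s.Next() {
		b := s.Text()
		if copyBytes {
			b = append([]byte(nil), b...)
		}
		kept = append(kept, b)
	}
	out := make([]string, len(kept))
	for i, b := range kept {
		out[i] = string(b)
	}
	return out
}

func main() {
	fmt.Println(collect(newReusingSource("alpha", "bravo"), false)) // prints "[bravo bravo]"
	fmt.Println(collect(newReusingSource("alpha", "bravo"), true))  // prints "[alpha bravo]"
}
```

Converting to a string immediately, as in string(z.Text()), is another way to
take a copy before the next call to Next.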

Parsing is done by calling Parse with an io.Reader, which returns the root of
the parse tree (the document element) as a *Node. It is the caller's
responsibility to ensure that the Reader provides UTF-8 encoded HTML. For
example, to process each anchor node in depth-first order:

	doc, err := html.Parse(r)
	if err != nil {
		// ...
	}
	var f func(*html.Node)
	f = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			// Do something with n...
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			f(c)
		}
	}
	f(doc)

The relevant specifications include:
https://html.spec.whatwg.org/multipage/syntax.html and
https://html.spec.whatwg.org/multipage/syntax.html#tokenization
*/
package html // import "golang.org/x/net/html"

// The tokenization algorithm implemented by this package is not a line-by-line
// transliteration of the relatively verbose state-machine in the WHATWG
// specification. A more direct approach is used instead, where the program
// counter implies the state, such as whether it is tokenizing a tag or a text
// node. Specification compliance is verified by checking expected and actual
// outputs over a test suite rather than aiming for algorithmic fidelity.

// TODO(nigeltao): Does a DOM API belong in this package or a separate one?
// TODO(nigeltao): How does parsing interact with a JavaScript engine?