Golang HTML Tokenizer: extract text from a web page — Kanan Rahimov

Kanan Rahimov
5 min read · Aug 17, 2020
Photo by Max Duzij on Unsplash

Loading the content of a web page is one thing; extracting valuable information from it is another. For the latter, it is often useful to get only the textual content, because most of the time it is a particular piece of text that we are interested in. One way to do this in Go is to use the HTML tokenizer.

In Go, there is a sub-repository package, golang.org/x/net/html, which implements an HTML5-compliant tokenizer. Using this package, it is possible to retrieve information about the page in the form of tokens: tag names, attributes, and text data.

Fetch a web page and parse just the text content

The following simple example fetches the given URL and prints only the text content found between its tags:

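package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// Load the page; the response body is what the tokenizer reads from.
	response, err := http.Get("https://kenanbek.github.io/about")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	tokenizer := html.NewTokenizer(response.Body)
	for {
		// Next advances to the next token and returns its type;
		// Token returns the details (data, attributes) of that token.
		tt := tokenizer.Next()
		t := tokenizer.Token()

		// io.EOF means the whole document has been read.
		err := tokenizer.Err()
		if err == io.EOF {
			break
		}

		switch tt {
		case html.ErrorToken:
			// Any error other than io.EOF is a real failure.
			log.Fatal(err)
		case html.TextToken:
			// Print the text with surrounding whitespace removed.
			data := strings.TrimSpace(t.Data)
			fmt.Println(data)
		}
	}
}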

The above example is quite straightforward:

  1. With http.Get("https://kenanbek.github.io/about"), we load the content of the web page.
  2. If there is no error, we initialize a new tokenizer with the body of the response: tokenizer := html.NewTokenizer(response.Body)
  3. We iterate over the tokens and check the type of each one. To do so, we use tokenizer.Next() to advance to the next token and tokenizer.Token() to get additional information about the current token.
  4. tokenizer.Next() returns the type of the current token, which tells us whether it is an error, a start or end tag, or a text token (there are also token types for comments, doctypes, and self-closing tags).
  5. If the token is a text token, using data := strings.TrimSpace(t.Data), we strip the surrounding whitespace from its text and print the result.
