Golang HTML Tokenizer: extract text from a web page — Kanan Rahimov

Kanan Rahimov
5 min read · Aug 17, 2020
Photo by Max Duzij on Unsplash

Loading the content of a web page is one thing; extracting useful information from it is another. For the second task, it often helps to reduce the page to its textual content, because the text is usually what we are interested in. One way to do this in Go is to use the HTML tokenizer.

In Go, the golang.org/x/net/html package, which lives in the x/net sub-repository, implements an HTML5-compliant tokenizer. With this package, a page can be read as a stream of tokens: tag names, attributes, and text data.

Fetch a web page and extract only its text content

The following simple example fetches the given URL and prints only the text content found between the tags:

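package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// Fetch the page; the response body is streamed into the tokenizer.
	response, err := http.Get("https://kenanbek.github.io/about")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	tokenizer := html.NewTokenizer(response.Body)
	for {
		tt := tokenizer.Next()
		t := tokenizer.Token()

		// io.EOF signals the normal end of the document.
		err := tokenizer.Err()
		if err == io.EOF {
			break
		}

		// The original listing is cut off at this switch. The cases below are
		// an assumed completion, inferred from the imports and the stated goal.
		switch tt {
		case html.ErrorToken:
			// Any error other than io.EOF is fatal.
			log.Fatal(err)
		case html.TextToken:
			// Print only non-empty text, with surrounding whitespace removed.
			if text := strings.TrimSpace(t.Data); text != "" {
				fmt.Println(text)
			}
		}
	}
}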
