Golang HTML Tokenizer: extract text from a web page — Kanan Rahimov
Loading the content of a web page is one thing; extracting valuable information from it is another. For the latter, it often helps to pull out only the text, because most of the time the text is what we are interested in. One way to do this in Go is to use an HTML tokenizer.
Go has a sub-repository package, golang.org/x/net/html, which implements an HTML5-compliant tokenizer. With this package, a page can be read as a stream of tokens: tag names, attributes, and text data.
Fetch a web page and parse just the text content
The following example fetches the given URL and prints only the text content of the relevant tags:
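package main

import (
	"fmt"
	"io"
	"log"
	"net/http" // required for http.Get
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// Fetch the page; the response body is streamed into the tokenizer.
	response, err := http.Get("https://kenanbek.github.io/about")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	tokenizer := html.NewTokenizer(response.Body)
	for {
		tt := tokenizer.Next()
		// Next returns ErrorToken both at the end of input (io.EOF)
		// and on a real read error, so stop in either case.
		if tt == html.ErrorToken {
			if err := tokenizer.Err(); err != io.EOF {
				log.Fatal(err)
			}
			break
		}
		t := tokenizer.Token()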
switch…