[VIDEO] GZIP compress web page’s content and save in MySQL using GORM (Golang) — Kanan Rahimov
In the attached video I discuss the following topics:
- Save links and images from the webpage.
- Mark the URL as complete in the pipeline once it is fully parsed.
- Refactor: extract the text compressor into a separate function (similar to the decompressor).
- To-do: define a task for the “webpage data parser” worker.
Compress using GZIP
We retrieve the web page content as a text body. Since we expect to save many URLs locally, it is worth compressing this data so that it takes less storage. In my benchmarks, I see an average compression level of 60–85%. See the video for examples.
I use gzip (compress/gzip) to compress the text. In this video, I refactored the text compression into a separate function. I then ran the whole pipeline to check that the webpage's data was fetched and saved in compressed form (a manual test).
Here is the main gzip function:
import (
	"compress/gzip"
	"io"
)

// gzipWrite compresses respBody with gzip and streams the result to w.
func gzipWrite(w io.Writer, respBody []byte) error {
	gz := gzip.NewWriter(w)
	if _, err := gz.Write(respBody); err != nil {
		return err
	}
	// Close flushes buffered data and writes the gzip footer;
	// skipping it would leave a truncated stream.
	return gz.Close()
}
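To sanity-check the compression level, you can write into a bytes.Buffer and compare sizes. The snippet below is a minimal sketch, not code from the project; the function and variable names are illustrative:

import (
	"bytes"
	"fmt"
)

// compressAndReport is an illustrative helper: it compresses respBody
// in memory and prints the size reduction.
func compressAndReport(respBody []byte) ([]byte, error) {
	var buf bytes.Buffer
	if err := gzipWrite(&buf, respBody); err != nil {
		return nil, err
	}
	fmt.Printf("compressed %d -> %d bytes (%.0f%% saved)\n",
		len(respBody), buf.Len(),
		100*(1-float64(buf.Len())/float64(len(respBody))))
	return buf.Bytes(), nil
}

The refactor above mentions a matching decompressor. Assuming the same io-based style, its counterpart could look like the following sketch (gzipRead is a hypothetical name, not necessarily the one used in the project):

// gzipRead reverses gzipWrite: it reads a gzip stream from r and
// returns the original uncompressed bytes.
func gzipRead(r io.Reader) ([]byte, error) {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	return io.ReadAll(gz)
}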
GORM Models for URLLink and URLImage
In this section, we introduce two new models: URLLink and URLImage. The corresponding tables keep reference information (outgoing links and images) extracted from a given web page.
// URLLink stores an outgoing link found on a parsed page.
type URLLink struct {
	CreatedAt time.Time
	UpdatedAt time.Time
	DeletedAt *time.Time `sql:"index"`
	URL       string     `gorm:"index:idx_url;not null"` // the parsed page
	LinkURL   string     `gorm:"not null"`               // the link it points to
	LinkTitle string
}

func (URLLink) TableName() string {
	return TablePrefix + "url_links"
}

// URLImage stores an image reference found on a parsed page.
type URLImage struct {
	CreatedAt  time.Time
	UpdatedAt  time.Time
	DeletedAt  *time.Time `sql:"index"`
	URL        string     `gorm:"index:idx_url;not null"` // the parsed page
	ImageURL   string     `gorm:"not null"`
	ImageTitle string
}

func (URLImage) TableName() string {
	return TablePrefix + "url_images"
}
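For completeness, here is a minimal sketch of how these models could be wired up: auto-migrate the tables once at startup, then insert a row for each link found on a parsed page. It assumes GORM v1 (matching the sql:"index" tags above) and illustrative pageURL, href, and title values; none of this is taken verbatim from the project:

import (
	"github.com/jinzhu/gorm"
)

// saveLink persists one outgoing link found while parsing pageURL.
// db is an open *gorm.DB handle; the tables are assumed to have been
// created once at startup with db.AutoMigrate(&URLLink{}, &URLImage{}).
func saveLink(db *gorm.DB, pageURL, href, title string) error {
	return db.Create(&URLLink{
		URL:       pageURL,
		LinkURL:   href,
		LinkTitle: title,
	}).Error
}

Saving images would follow the same pattern with URLImage. Keeping the page URL indexed (idx_url) makes it cheap to fetch all links and images for a given page later in the pipeline.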
Originally published at https://kananrahimov.com on January 3, 2021.