mirror of
https://github.com/NohamR/papeer.git
synced 2026-05-25 20:00:47 +00:00
Compare commits
16 Commits
| Author | SHA1 | Date |
|---|---|---|
| | d8a3cc027f | |
| | be45a8f744 | |
| | 4b760c9562 | |
| | 26b144fb73 | |
| | 900ee8c5d7 | |
| | 008e4ebd7a | |
| | 5e735f9c52 | |
| | 29008185a8 | |
| | ff3d09c727 | |
| | 0009435769 | |
| | 4e9b0611e8 | |
| | e7ffd8c66c | |
| | 84e6ad8585 | |
| | d593a74e6e | |
| | d5971a2819 | |
| | 2d1d5a964a | |
14
Makefile
Normal file
@@ -0,0 +1,14 @@
+format:
+	gofmt -s -w .
+
+test:
+	go test github.com/lapwat/papeer/book
+
+install:
+	go install
+
+clean:
+	find . -maxdepth 1 -not -name 'README.md' -name '*.md' -delete
+	find . -maxdepth 1 -name '*.epub' -delete
+	find . -maxdepth 1 -name '*.mobi' -delete
+	find . -maxdepth 1 -name 'papeer-v*' -delete
191
README.md
@@ -1,17 +1,125 @@
|
||||
# Papeer
|
||||
|
||||
Papeer is a powerful **ereader internet vacuum**. It can scrape any website, removing ads and keeping only the relevant content (formatted text and images). You can export the content to Markdown, EPUB or MOBI files.
|
||||
|
||||
# Table of contents
|
||||
|
||||
- [Usage](#usage)
|
||||
* [Scrape a web page](#scrape-a-web-page)
|
||||
* [Scrape a whole website](#scrape-a-whole-website)
|
||||
+ [`depth` option](#-depth--option)
|
||||
+ [`selector` option](#-selector--option)
|
||||
+ [Display the table of contents](#display-the-table-of-contents)
|
||||
+ [Scrape time](#scrape-time)
|
||||
- [Installation](#installation)
|
||||
* [From source](#from-source)
|
||||
* [From binary](#from-binary)
|
||||
+ [Linux / MacOS](#linux---macos)
|
||||
+ [Windows](#windows)
|
||||
* [MOBI support](#mobi-support)
|
||||
- [Autocompletion](#autocompletion)
|
||||
- [Dependencies](#dependencies)
|
||||
|
||||
# Usage
|
||||
|
||||
## Scrape a web page
|
||||
|
||||
The `get` command lets you retrieve the content of any web page.
|
||||
|
||||
```
|
||||
❯ papeer get --format epub --recursive --delay 500 --limit 10 https://news.ycombinator.com/
|
||||
6s [===============================================>--------------------] 70% Status: 7 out of 10 chapters
|
||||
0s [====================================================================] 100% 1. Three ex-US intelligence officers admit hacking for UAE
|
||||
0s [====================================================================] 100% 2. Show HN: Time Travel Debugger
|
||||
0s [====================================================================] 100% 3. How much faster is Java 17?
|
||||
0s [====================================================================] 100% 4. The First Webcam Was Invented to Keep an Eye on a Coffee Pot
|
||||
0s [====================================================================] 100% 5. Nikon's 2021 Photomicrography Competition Winners
|
||||
0s [====================================================================] 100% 6. HTTP Status 418 – I'm a teapot
|
||||
0s [====================================================================] 100% 7. H3: Hexagonal hierarchical geospatial indexing system
|
||||
--- [--------------------------------------------------------------------] 0% 8. Automatic cipher suite ordering in Go’s crypto/tls
|
||||
--- [--------------------------------------------------------------------] 0% 9. Find engineering roles at over 800 YC-funded startups
|
||||
--- [--------------------------------------------------------------------] 0% 10. Futarchy: Robin Hanson on prediction markets
|
||||
Ebook saved to "Hacker_News.epub"
|
||||
Scrape URL content
|
||||
|
||||
Usage:
|
||||
papeer get URL [flags]
|
||||
|
||||
Examples:
|
||||
papeer get https://www.eff.org/cyberspace-independence
|
||||
|
||||
Flags:
|
||||
-a, --author string book author
|
||||
--delay int time in milliseconds to wait before downloading next chapter, use with depth/selector (default -1)
|
||||
-d, --depth int scraping depth
|
||||
-f, --format string file format [stdout, md, epub, mobi] (default "md")
|
||||
-h, --help help for get
|
||||
--images retrieve images only
|
||||
-i, --include include URL as first chapter, use with depth/selector
|
||||
-l, --limit int limit number of chapters, use with depth/selector (default -1)
|
||||
-n, --name string book name (default: page title)
|
||||
-o, --offset int skip first chapters, use with depth/selector
|
||||
--output string file name (default: book name)
|
||||
-q, --quiet hide progress bar
|
||||
-s, --selector strings table of contents CSS selector
|
||||
-t, --threads int download concurrency, use with depth/selector (default -1)
|
||||
--use-link-name use link name for chapter title
|
||||
```
|
||||
|
||||
## Scrape a whole website
|
||||
|
||||
If a navigation menu is present on a website, you can scrape the content of each page.
|
||||
|
||||
You can activate this mode by using the `depth` or `selector` options.
|
||||
|
||||
### `depth` option
|
||||
|
||||
This option defaults to 0, in which case `papeer` grabs only the main page.
|
||||
|
||||
If you specify a value greater than 0, `papeer` will grab pages as deep as the value you specify.
|
||||
|
||||
> Using the `include` option includes all intermediary levels in the book.
|
||||
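As a rough illustration of what a depth limit means, the sketch below walks a small hard-coded link graph. The `site` map and the `collect` function are hypothetical and are not part of `papeer`; a real scraper would fetch pages over HTTP and would also need to guard against visiting the same page twice.

```go
package main

import "fmt"

// site is a hypothetical in-memory stand-in for a website:
// each page maps to the pages it links to.
var site = map[string][]string{
	"/":    {"/a", "/b"},
	"/a":   {"/a/1"},
	"/b":   nil,
	"/a/1": nil,
}

// collect returns page followed by every page reachable from it,
// following links at most depth levels down.
// With depth 0, only the page itself is returned.
func collect(page string, depth int) []string {
	pages := []string{page}
	if depth == 0 {
		return pages
	}
	for _, child := range site[page] {
		pages = append(pages, collect(child, depth-1)...)
	}
	return pages
}

func main() {
	fmt.Println(collect("/", 0)) // main page only
	fmt.Println(collect("/", 1)) // main page and the pages it links to
	fmt.Println(collect("/", 2)) // one level further
}
```

With a depth of 0 only the starting page is returned, which matches the default described above.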
|
||||
### `selector` option
|
||||
|
||||
If this option is not specified, `papeer` will grab only the main page.
|
||||
|
||||
If this option is specified, `papeer` will select the links (`a` HTML tags) present on the main page, then grab each one of them.
|
||||
|
||||
You can chain this option to grab several levels of pages with different selectors for each level.
|
||||
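One way to picture chained selectors is as a list with one entry per level, where the position in the chain decides which level a selector applies to. The sketch below is only an illustration under that assumption: `levelConfig` and `configsFromSelectors` are hypothetical names, and the selectors in `main` are made up.

```go
package main

import "fmt"

// levelConfig is a hypothetical, trimmed-down per-level configuration.
type levelConfig struct {
	Depth    int    // level the selector applies to, starting at 0
	Selector string // CSS selector used to find links on that level
}

// configsFromSelectors builds one configuration per chained selector.
// The index of a selector in the chain is the level it applies to.
func configsFromSelectors(selectors []string) []levelConfig {
	configs := make([]levelConfig, 0, len(selectors))
	for depth, s := range selectors {
		configs = append(configs, levelConfig{Depth: depth, Selector: s})
	}
	return configs
}

func main() {
	// Two chained selectors: the first for the main page,
	// the second for the pages found through the first.
	for _, c := range configsFromSelectors([]string{"nav a", "article h2>a"}) {
		fmt.Printf("level %d uses selector %q\n", c.Depth, c.Selector)
	}
}
```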
|
||||
### Display the table of contents
|
||||
|
||||
Before actually scraping a whole website, it is a good idea to use the `list` command. This command is like a **dry run**, which lets you visualize the content before actually retrieving it. You can use several options to customize the table of contents extraction, such as `selector`, `limit`, `offset` and `include`. Type `papeer list --help` for more information about those options.
|
||||
|
||||
```sh
|
||||
papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'
|
||||
```
|
||||
```
|
||||
# NAME URL
|
||||
1 I. Codebase https://12factor.net/codebase
|
||||
2 II. Dependencies https://12factor.net/dependencies
|
||||
3 III. Config https://12factor.net/config
|
||||
4 IV. Backing services https://12factor.net/backing-services
|
||||
5 V. Build, release, run https://12factor.net/build-release-run
|
||||
6 VI. Processes https://12factor.net/processes
|
||||
7 VII. Port binding https://12factor.net/port-binding
|
||||
8 VIII. Concurrency https://12factor.net/concurrency
|
||||
9 IX. Disposability https://12factor.net/disposability
|
||||
10 X. Dev/prod parity https://12factor.net/dev-prod-parity
|
||||
11 XI. Logs https://12factor.net/logs
|
||||
12 XII. Admin processes https://12factor.net/admin-processes
|
||||
```
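The `limit` and `offset` options mentioned above can be thought of as taking a window out of the table of contents. The sketch below assumes that the offset is applied before the limit and that a limit of -1 means "no limit" (the flag's default); `window` is a hypothetical helper, not a function of `papeer`.

```go
package main

import "fmt"

// window skips the first offset items, then keeps at most limit items.
// A negative limit means "no limit". Out-of-range values are clamped.
func window(items []string, offset, limit int) []string {
	if offset < 0 {
		offset = 0
	}
	if offset > len(items) {
		offset = len(items)
	}
	items = items[offset:]
	if limit >= 0 && limit < len(items) {
		items = items[:limit]
	}
	return items
}

func main() {
	toc := []string{"I. Codebase", "II. Dependencies", "III. Config", "IV. Backing services"}

	fmt.Println(window(toc, 0, -1)) // everything
	fmt.Println(window(toc, 1, 2))  // skip the first chapter, keep the next two
}
```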
|
||||
|
||||
### Scrape time
|
||||
|
||||
Once you are satisfied with the table of contents listed by the `ls` command, you can actually scrape the content of those pages. You can use the same options that you specified for the `ls` command. You can specify `delay` and `threads` options when using `selector` or `depth` options.
|
||||
|
||||
```sh
|
||||
papeer get https://12factor.net/ --selector='section.concrete>article>h2>a'
|
||||
```
|
||||
```
|
||||
[======================================>-----------------------------] Chapters 7 / 12
|
||||
[====================================================================] 1. I. Codebase
|
||||
[====================================================================] 2. II. Dependencies
|
||||
[====================================================================] 3. III. Config
|
||||
[====================================================================] 4. IV. Backing services
|
||||
[====================================================================] 5. V. Build, release, run
|
||||
[====================================================================] 6. VI. Processes
|
||||
[====================================================================] 7. VII. Port binding
|
||||
[--------------------------------------------------------------------] 8. VIII. Concurrency
|
||||
[--------------------------------------------------------------------] 9. IX. Disposability
|
||||
[--------------------------------------------------------------------] 10. X. Dev/prod parity
|
||||
[--------------------------------------------------------------------] 11. XI. Logs
|
||||
[--------------------------------------------------------------------] 12. XII. Admin processes
|
||||
Markdown saved to "The_Twelve-Factor_App.md"
|
||||
```
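The `delay` and `threads` options describe two ways of downloading chapters: one after the other with a pause in between, or several at once with a cap on how many run at the same time. The sketch below shows one common way to do this in Go, using a buffered channel as a semaphore together with `sync.WaitGroup`. It assumes that a negative delay selects the concurrent mode and that a thread count below 1 means "no cap"; `fetch` and `fetchAll` are hypothetical stand-ins that do no network access.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetch is a hypothetical stand-in for downloading one chapter.
func fetch(url string) string {
	return "content of " + url
}

// fetchAll returns the content of every URL, in the same order as urls.
// If delay >= 0, downloads run one after the other with a pause in between.
// Otherwise they run concurrently, with at most threads downloads in flight.
func fetchAll(urls []string, delay time.Duration, threads int) []string {
	results := make([]string, len(urls))

	if delay >= 0 {
		for i, u := range urls {
			results[i] = fetch(u)
			// no need to wait after the last chapter
			if i < len(urls)-1 {
				time.Sleep(delay)
			}
		}
		return results
	}

	if threads < 1 {
		threads = len(urls)
	}
	semaphore := make(chan struct{}, threads)
	var wg sync.WaitGroup

	for i, u := range urls {
		wg.Add(1)
		semaphore <- struct{}{} // blocks while threads downloads are running
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-semaphore }()
			results[i] = fetch(u) // each goroutine writes to its own index
		}(i, u)
	}
	wg.Wait()

	return results
}

func main() {
	urls := []string{"/codebase", "/dependencies", "/config"}

	fmt.Println(fetchAll(urls, 10*time.Millisecond, -1)) // sequential, 10 ms pause
	fmt.Println(fetchAll(urls, -1, 2))                   // concurrent, 2 at a time
}
```

Writing each result to its own slice index keeps the chapters in table-of-contents order, whatever order the downloads finish in.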
|
||||
|
||||
# Installation
|
||||
@@ -24,21 +132,29 @@ go get -u github.com/lapwat/papeer
|
||||
|
||||
## From binary
|
||||
|
||||
### On Linux / MacOS
|
||||
### Linux / MacOS
|
||||
|
||||
```sh
|
||||
# use platform=darwin for MacOS
|
||||
platform=linux
|
||||
# platform=darwin for MacOS
|
||||
curl -L https://github.com/lapwat/papeer/releases/download/v0.1.0/papeer-v0.1.0-$platform-amd64 > papeer
|
||||
chmod +x papeer
|
||||
release=0.4.1
|
||||
|
||||
# download and extract
|
||||
curl -L https://github.com/lapwat/papeer/releases/download/v$release/papeer-v$release-$platform-amd64.tar.gz > papeer.tar.gz
|
||||
tar xzvf papeer.tar.gz
|
||||
rm papeer.tar.gz
|
||||
|
||||
# move to user binaries
|
||||
sudo mv papeer /usr/local/bin
|
||||
```
|
||||
|
||||
### On Windows
|
||||
### Windows
|
||||
|
||||
Download [latest release](https://github.com/lapwat/papeer/releases/download/v0.1.0/papeer-v0.1.0-windows-amd64.exe).
|
||||
Download [latest release](https://github.com/lapwat/papeer/releases/download/v0.4.1/papeer-v0.4.1-windows-amd64.exe.zip).
|
||||
|
||||
## Install kindlegen to export websites to MOBI (optional)
|
||||
## MOBI support
|
||||
|
||||
Install kindlegen to convert websites to MOBI (Linux only).
|
||||
|
||||
```sh
|
||||
TMPDIR=$(mktemp -d -t papeer-XXXXX)
|
||||
@@ -49,37 +165,6 @@ sudo mv $TMPDIR/kindlegen /usr/local/bin
|
||||
rm -rf $TMPDIR
|
||||
```
|
||||
|
||||
# Usage
|
||||
|
||||
```
|
||||
Browse the web in the eink era
|
||||
|
||||
Usage:
|
||||
papeer [flags]
|
||||
papeer [command]
|
||||
|
||||
Available Commands:
|
||||
completion generate the autocompletion script for the specified shell
|
||||
get Scrape URL content
|
||||
help Help about any command
|
||||
ls Print table of content
|
||||
version Print the version number of papeer
|
||||
|
||||
Flags:
|
||||
-d, --delay int time to wait before downloading next chapter, in milliseconds (default -1)
|
||||
-f, --format string file format [md, epub, mobi] (default "md")
|
||||
-h, --help help for papeer
|
||||
-i, --include include URL as first chapter, in resursive mode
|
||||
-l, --limit int limit number of chapters, in recursive mode (default -1)
|
||||
-o, --output string output file
|
||||
-q, --quiet do not show logs
|
||||
-r, --recursive create one chapter per natigation item
|
||||
-s, --selector string table of content CSS selector
|
||||
--stdout print to standard output
|
||||
|
||||
Use "papeer [command] --help" for more information about a command.
|
||||
```
|
||||
|
||||
# Autocompletion
|
||||
|
||||
Execute this command in your current shell, or add it to your `.bashrc`.
|
||||
@@ -88,10 +173,10 @@ Execute this command in your current shell, or add it to your `.bashrc`.
|
||||
. <(papeer completion bash)
|
||||
```
|
||||
|
||||
Type `papeer completion bash -h` for more information.
|
||||
|
||||
You can replace `bash` with your own shell (zsh, fish or powershell).
|
||||
|
||||
Type `papeer completion bash -h` for more information.
|
||||
|
||||
# Dependencies
|
||||
|
||||
- `cobra` command line interface
|
||||
|
||||
@@ -1,13 +1,20 @@
 package book

 type chapter struct {
-	name string
-	author string
-	content string
+	body string
+	name string
+	author string
+	content string
+	subChapters []chapter
+	config *ScrapeConfig
 }

-func NewChapter(name, author, content string) chapter {
-	return chapter{name, author, content}
+func NewChapter(body, name, author, content string, subChapters []chapter, config *ScrapeConfig) chapter {
+	return chapter{body, name, author, content, subChapters, config}
 }

+func (c chapter) Body() string {
+	return c.body
+}
+
 func (c chapter) Name() string {
@@ -21,3 +28,7 @@ func (c chapter) Author() string {
 func (c chapter) Content() string {
 	return c.content
 }
+
+func (c chapter) SubChapters() []chapter {
+	return c.subChapters
+}
164
book/format.go
Normal file
@@ -0,0 +1,164 @@
+package book
+
+import (
+	"fmt"
+	"log"
+	"os"
+	"os/exec"
+	"strings"
+
+	md "github.com/JohannesKaufmann/html-to-markdown"
+	"github.com/PuerkitoBio/goquery"
+	epub "github.com/bmaupin/go-epub"
+)
+
+func Filename(name string) string {
+	filename := name
+
+	filename = strings.ReplaceAll(filename, " ", "_")
+	filename = strings.ReplaceAll(filename, "/", "")
+
+	return filename
+}
+
+func ToMarkdownString(c chapter) string {
+	markdown := ""
+
+	if c.config.Include {
+		// title
+		markdown += fmt.Sprintf("%s\n", c.Name())
+		markdown += fmt.Sprintf("%s\n\n", strings.Repeat("=", len(c.Name())))
+
+		// convert content to markdown
+		content, err := md.NewConverter("", true, nil).ConvertString(c.Content())
+		if err != nil {
+			log.Fatal(err)
+		}
+		markdown += fmt.Sprintf("%s\n\n\n", content)
+	}
+
+	for _, sc := range c.SubChapters() {
+		// subchapters content
+		markdown += fmt.Sprintf("%s\n\n\n", ToMarkdownString(sc))
+	}
+
+	return markdown
+}
+
+func ToMarkdown(c chapter, filename string) string {
+	if len(filename) == 0 {
+		filename = fmt.Sprintf("%s.md", Filename(c.Name()))
+	}
+
+	markdown := ToMarkdownString(c)
+
+	// write to file
+	f, err := os.Create(filename)
+	if err != nil {
+		log.Fatal(err)
+	}
+	_, err2 := f.WriteString(markdown)
+	if err2 != nil {
+		log.Fatal(err2)
+	}
+	f.Close()
+
+	return filename
+}
+
+func ToEpub(c chapter, filename string) string {
+	if len(filename) == 0 {
+		filename = fmt.Sprintf("%s.epub", Filename(c.Name()))
+	}
+
+	// init ebook
+	e := epub.NewEpub(c.Name())
+	e.SetAuthor(c.Author())
+
+	AppendToEpub(e, c)
+
+	err := e.Write(filename)
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	return filename
+}
+
+func AppendToEpub(e *epub.Epub, c chapter) {
+	content := ""
+
+	if c.config.Include {
+
+		if c.config.ImagesOnly == false {
+			content = c.Content()
+		}
+
+		// parse content
+		doc, err := goquery.NewDocumentFromReader(strings.NewReader(c.Content()))
+		if err != nil {
+			log.Fatal(err)
+		}
+
+		// download images and replace src in img tags of content
+		doc.Find("img").Each(func(i int, s *goquery.Selection) {
+			src, _ := s.Attr("src")
+			src = strings.Split(src, "?")[0] // remove query part
+			imagePath, _ := e.AddImage(src, "")
+
+			if c.config.ImagesOnly {
+				imageTag, _ := goquery.OuterHtml(s)
+				content += strings.Replace(imageTag, src, imagePath, 1)
+			} else {
+				content = strings.Replace(content, src, imagePath, 1)
+			}
+		})
+
+		html := ""
+		// add title only if ImagesOnly = false
+		if c.config.ImagesOnly == false {
+			html += fmt.Sprintf("<h1>%s</h1>", c.Name())
+		}
+		html += content
+
+		// write to epub file
+		_, err = e.AddSection(html, c.Name(), "", "")
+		if err != nil {
+			log.Fatal(err)
+		}
+
+	}
+
+	for _, sc := range c.SubChapters() {
+		AppendToEpub(e, sc)
+	}
+}
+
+func ToMobi(c chapter, filename string) string {
+	if len(filename) == 0 {
+		filename = fmt.Sprintf("%s.mobi", Filename(c.Name()))
+	} else {
+
+		// add .mobi extension if not specified
+		if strings.HasSuffix(filename, ".mobi") == false {
+			filename = fmt.Sprintf("%s.mobi", filename)
+		}
+
+	}
+
+	filenameEPUB := strings.ReplaceAll(filename, ".mobi", ".epub")
+	ToEpub(c, filenameEPUB)
+
+	exec.Command("kindlegen", filenameEPUB).Run()
+	// exec command always return status 1 even if it succeed
+	// if err != nil {
+	// log.Fatal(err)
+	// }
+
+	err := os.Remove(filenameEPUB)
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	return filename
+}
127
book/format_test.go
Normal file
@@ -0,0 +1,127 @@
+package book
+
+import (
+	"errors"
+	"os"
+	"testing"
+)
+
+func TestFilename(t *testing.T) {
+
+	got := Filename("This is a chapter / book")
+	want := "This_is_a_chapter__book"
+
+	if got != want {
+		t.Errorf("got %q, wanted %q", got, want)
+	}
+
+}
+
+func TestToMarkdownString(t *testing.T) {
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+
+	got := ToMarkdownString(c)
+	want := "Books\n=====\n\n- [Discours de la Méthode](https://books.lapw.at/posts/ren%C3%A9-descartes-discours-de-la-m%C3%A9thode/)clock 98 min read -\n1637\n\n- [The Twelve-Factor App](https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/)clock 22 min read -\n2011\n\n\n"
+
+	if got != want {
+		t.Errorf("got %q, wanted %q", got, want)
+	}
+
+}
+
+func TestToMarkdown(t *testing.T) {
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMarkdown(c, "")
+
+	filename := "Books.md"
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToMarkdownFilename(t *testing.T) {
+
+	filename := "ebook.md"
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMarkdown(c, filename)
+
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToEpub(t *testing.T) {
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToEpub(c, "")
+
+	filename := "Books.epub"
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToEpubFilename(t *testing.T) {
+
+	filename := "ebook.epub"
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToEpub(c, filename)
+
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToMobi(t *testing.T) {
+
+	filename := "ebook.mobi"
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMobi(c, filename)
+
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToMobiFilename(t *testing.T) {
+
+	filename := "ebook.mobi"
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMobi(c, filename)
+
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
@@ -3,11 +3,10 @@ package book
 type link struct {
 	href string
 	text string
-	class string
 }

-func NewLink(href, text, class string) link {
-	return link{href, text, class}
+func NewLink(href, text string) link {
+	return link{href, text}
 }

 func (c link) Href() string {
@@ -17,7 +16,3 @@ func (c link) Href() string {
 func (c link) Text() string {
 	return c.text
 }
-
-func (c link) Class() string {
-	return c.class
-}
59
book/progress.go
Normal file
@@ -0,0 +1,59 @@
+package book
+
+import (
+	"fmt"
+	"strings"
+
+	"github.com/gosuri/uiprogress"
+)
+
+type progress struct {
+	global *uiprogress.Bar
+	individuals []*uiprogress.Bar
+}
+
+func NewProgress(links []link, parent string, depth int) progress {
+	uiprogress.Start()
+
+	global := uiprogress.AddBar(len(links))
+	indentGlobal := strings.Repeat("> ", depth)
+	global.AppendFunc(func(b *uiprogress.Bar) string {
+		return fmt.Sprintf("%v%v (%v / %v)", indentGlobal, parent, b.Current(), len(links))
+	})
+
+	// hide individual bars if more than 50 chapters
+	individuals := []*uiprogress.Bar{}
+	indent := strings.Repeat("- ", depth)
+	if len(links) <= 50 {
+		for index, link := range links {
+			bar := uiprogress.AddBar(1)
+			barText := fmt.Sprintf("%v#%v %v", indent, index+1, link.Text())
+			bar.AppendFunc(func(b *uiprogress.Bar) string {
+				return barText
+			})
+			individuals = append(individuals, bar)
+		}
+	}
+
+	return progress{global, individuals}
+}
+
+func (p *progress) IncrementGlobal() {
+	p.global.Incr()
+}
+
+func (p *progress) Increment(index int) {
+	p.IncrementGlobal()
+	if len(p.individuals) > index {
+		p.individuals[index].Incr()
+	}
+}
+
+func (p *progress) UpdateName(index int, name string) {
+	if len(p.individuals) > index {
+		barText := fmt.Sprintf("%s", name)
+		p.individuals[index].AppendFunc(func(b *uiprogress.Bar) string {
+			return barText
+		})
+	}
+}
430
book/scraper.go
@@ -1,9 +1,12 @@
|
||||
package book
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"fmt"
|
||||
"io"
|
||||
"log"
|
||||
"math"
|
||||
"net/http"
|
||||
urllib "net/url"
|
||||
"strings"
|
||||
"sync"
|
||||
@@ -12,97 +15,320 @@ import (
|
||||
"github.com/PuerkitoBio/goquery"
|
||||
readability "github.com/go-shiori/go-readability"
|
||||
colly "github.com/gocolly/colly/v2"
|
||||
"github.com/gosuri/uiprogress"
|
||||
)
|
||||
|
||||
func NewBookFromURL(url, selector string, recursive, include bool, limit, delay int) book {
|
||||
if recursive {
|
||||
home := NewChapterFromURL(url)
|
||||
b := New(home.Name(), home.Author())
|
||||
type ScrapeConfig struct {
|
||||
Depth int
|
||||
Selector string
|
||||
Quiet bool
|
||||
Limit int
|
||||
Offset int
|
||||
Delay int
|
||||
Threads int
|
||||
Include bool
|
||||
ImagesOnly bool
|
||||
UseLinkName bool
|
||||
}
|
||||
|
||||
chapters := tableOfContent(url, selector, limit, delay)
|
||||
if include {
|
||||
b.AddChapter(home)
|
||||
}
|
||||
for _, c := range chapters {
|
||||
b.AddChapter(c)
|
||||
}
|
||||
func NewScrapeConfig() *ScrapeConfig {
|
||||
return &ScrapeConfig{0, "", false, -1, 0, -1, -1, true, false, false}
|
||||
}
|
||||
|
||||
return b
|
||||
func NewScrapeConfigs(selectors []string) []*ScrapeConfig {
|
||||
configs := []*ScrapeConfig{}
|
||||
|
||||
for _, s := range selectors {
|
||||
config := NewScrapeConfig()
|
||||
config.Selector = s
|
||||
|
||||
configs = append(configs, config)
|
||||
}
|
||||
|
||||
return configs
|
||||
}
|
||||
|
||||
func NewScrapeConfigsAjin() []*ScrapeConfig {
|
||||
config0 := NewScrapeConfig()
|
||||
config0.Depth = 0
|
||||
config0.Selector = ".dt>a"
|
||||
config0.Limit = 3
|
||||
config0.Offset = 0
|
||||
config0.Delay = 5000
|
||||
config0.Include = false
|
||||
|
||||
config1 := NewScrapeConfig()
|
||||
config1.Depth = 1
|
||||
config1.Selector = ".nav_apb>a"
|
||||
config1.Limit = 3
|
||||
config1.Offset = 1
|
||||
config1.Delay = 5000
|
||||
config1.Include = false
|
||||
|
||||
config2 := NewScrapeConfig()
|
||||
config2.Depth = 2
|
||||
config2.ImagesOnly = true
|
||||
|
||||
return []*ScrapeConfig{config0, config1, config2}
|
||||
}
|
||||
|
||||
func NewScrapeConfigsWikipedia() []*ScrapeConfig {
|
||||
config0 := NewScrapeConfig()
|
||||
config0.Depth = 0
|
||||
config0.Threads = -1
|
||||
config0.Include = true
|
||||
|
||||
config1 := NewScrapeConfig()
|
||||
config1.Depth = 1
|
||||
config1.Include = true
|
||||
|
||||
return []*ScrapeConfig{config0, config1}
|
||||
}
|
||||
|
||||
func NewScrapeConfigFake() *ScrapeConfig {
|
||||
config := NewScrapeConfig()
|
||||
config.Include = false
|
||||
|
||||
return config
|
||||
}
|
||||
|
||||
func NewBookFromURL(url string, selector []string, name, author string, include, ImagesOnly, useLinkName, quiet bool, limit, offset, delay, threads int) book {
|
||||
config1 := NewScrapeConfig()
|
||||
config1.ImagesOnly = ImagesOnly
|
||||
config1.UseLinkName = useLinkName
|
||||
|
||||
var chapters []chapter
|
||||
var home chapter
|
||||
|
||||
if len(selector) > 0 {
|
||||
config2 := NewScrapeConfig()
|
||||
config2.Selector = selector[0]
|
||||
config2.Limit = limit
|
||||
config2.Offset = offset
|
||||
config2.Delay = delay
|
||||
config2.Threads = threads
|
||||
config2.Include = include
|
||||
config2.ImagesOnly = ImagesOnly
|
||||
config2.UseLinkName = useLinkName
|
||||
chapters, home = tableOfContent(url, config2, config1, quiet)
|
||||
} else {
|
||||
c := NewChapterFromURL(url)
|
||||
b := New(c.Name(), c.Author())
|
||||
chapters = []chapter{NewChapterFromURL(url, "", []*ScrapeConfig{config1}, 0, func(index int, name string) {})}
|
||||
home = chapters[0]
|
||||
}
|
||||
|
||||
if len(name) == 0 {
|
||||
name = home.Name()
|
||||
}
|
||||
|
||||
if len(author) == 0 {
|
||||
author = home.Author()
|
||||
}
|
||||
|
||||
b := New(name, author)
|
||||
for _, c := range chapters {
|
||||
b.AddChapter(c)
|
||||
return b
|
||||
}
|
||||
}
|
||||
|
||||
func NewChapterFromURL(url string) chapter {
|
||||
article, err := readability.FromURL(url, 30*time.Second)
|
||||
if err != nil {
|
||||
log.Fatalf("failed to parse %s, %v\n", url, err)
|
||||
}
|
||||
|
||||
return chapter{article.Title, article.Byline, article.Content}
|
||||
return b
|
||||
}
|
||||
|
||||
func tableOfContent(url, selector string, limit, delay int) []chapter {
|
||||
func NewChapterFromURL(url, linkName string, configs []*ScrapeConfig, index int, updateProgressBarName func(index int, name string)) chapter {
|
||||
config := configs[0]
|
||||
|
||||
base, err := urllib.Parse(url)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
links := GetLinks(base, selector)
|
||||
if limit != -1 {
|
||||
limit = int(math.Min(float64(limit), float64(len(links))))
|
||||
links = links[:limit]
|
||||
// get page body
|
||||
response, err := http.Get(url)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
defer response.Body.Close()
|
||||
|
||||
// duplicate response stream
|
||||
readabilityReader := &bytes.Buffer{}
|
||||
bodyReader := io.TeeReader(response.Body, readabilityReader)
|
||||
|
||||
// extract HTML body
|
||||
body, err := io.ReadAll(bodyReader)
|
||||
|
||||
// extract article content and metadata
|
||||
article, err := readability.FromReader(readabilityReader, base)
|
||||
if err != nil {
|
||||
log.Fatalf("failed to parse %s, %v\n", url, err)
|
||||
}
|
||||
|
||||
chapters := make([]chapter, len(links))
|
||||
name := linkName
|
||||
if config.UseLinkName == false {
|
||||
name = article.Title
|
||||
|
||||
// init global progress bar
|
||||
uiprogress.Start()
|
||||
barGlobal := uiprogress.AddBar(len(links)).AppendCompleted().PrependElapsed()
|
||||
barGlobal.AppendFunc(func(b *uiprogress.Bar) string {
|
||||
return fmt.Sprintf("Status: %d out of %d chapters", b.Current(), len(links))
|
||||
})
|
||||
|
||||
// init progress bars
|
||||
bars := []*uiprogress.Bar{}
|
||||
for index, link := range links {
|
||||
bar := uiprogress.AddBar(1).AppendCompleted().PrependElapsed()
|
||||
barText := fmt.Sprintf("%d. %s", index+1, link.text)
|
||||
bar.AppendFunc(func(b *uiprogress.Bar) string {
|
||||
return barText
|
||||
})
|
||||
bars = append(bars, bar)
|
||||
// notify progressbar with new name
|
||||
updateProgressBarName(index, name)
|
||||
}
|
||||
|
||||
if delay >= 0 {
|
||||
for index, link := range links {
|
||||
// and then use it to parse relative URLs
|
||||
u, err := base.Parse(link.href)
|
||||
subchapters := []chapter{}
|
||||
if len(configs) > 1 {
|
||||
// add subchapters
|
||||
|
||||
links, _, _, err := GetLinks(base, config.Selector, config.Limit, config.Offset, false)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
subchapters = make([]chapter, len(links))
|
||||
var p progress
|
||||
if config.Quiet == false {
|
||||
p = NewProgress(links, name, config.Depth)
|
||||
}
|
||||
|
||||
if config.Delay >= 0 {
|
||||
|
||||
// synchronous mode
|
||||
for index, link := range links {
|
||||
// and then use it to parse relative URLs
|
||||
u, err := base.Parse(link.href)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
sc := NewChapterFromURL(u.String(), link.text, configs[1:], index, p.UpdateName)
|
||||
subchapters[index] = sc
|
||||
if config.Quiet == false {
|
||||
p.Increment(index)
|
||||
}
|
||||
|
||||
time.Sleep(time.Duration(config.Delay) * time.Millisecond)
|
||||
}
|
||||
|
||||
} else {
|
||||
// asynchronous mode
|
||||
var wg sync.WaitGroup
|
||||
|
||||
threads := config.Threads
|
||||
if threads == -1 {
|
||||
threads = len(links)
|
||||
}
|
||||
semaphore := make(chan bool, threads)
|
||||
|
||||
for index, l := range links {
|
||||
|
||||
wg.Add(1)
|
||||
semaphore <- true
|
||||
|
||||
go func(index int, l link) {
|
||||
defer wg.Done()
|
||||
|
||||
// and then use it to parse relative URLs
|
||||
u, err := base.Parse(l.href)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
sc := NewChapterFromURL(u.String(), l.text, configs[1:], index, p.UpdateName)
|
||||
subchapters[index] = sc
|
||||
|
||||
if config.Quiet == false {
|
||||
p.Increment(index)
|
||||
}
|
||||
|
||||
<-semaphore
|
||||
}(index, l)
|
||||
}
|
||||
wg.Wait()
|
||||
}
|
||||
}
|
||||
|
||||
content := ""
|
||||
if config.Include {
|
||||
|
||||
// we care about the content only if:
|
||||
// - we include this level
|
||||
// - we use the page name
|
||||
content = article.Content
|
||||
|
||||
// extract images
|
||||
if config.ImagesOnly {
|
||||
|
||||
// parse HTML
|
||||
doc, err := goquery.NewDocumentFromReader(strings.NewReader(content))
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
chapters[index] = NewChapterFromURL(u.String())
|
||||
// append every image to content
|
||||
content = ""
|
||||
doc.Find("img").Each(func(i int, s *goquery.Selection) {
|
||||
imageTag, _ := goquery.OuterHtml(s)
|
||||
imageTag = strings.ReplaceAll(imageTag, "\n", "")
|
||||
|
||||
bars[index].Incr()
|
||||
barGlobal.Incr()
|
||||
content += imageTag
|
||||
})
|
||||
|
||||
// do not wait after downloading last chapter
|
||||
if index < len(links)-1 {
|
||||
time.Sleep(time.Duration(delay) * time.Millisecond)
|
||||
}
|
||||
}
|
||||
|
||||
return chapter{string(body), name, article.Byline, content, subchapters, config}
|
||||
}
|
||||
|
||||
func tableOfContent(url string, config *ScrapeConfig, subConfig *ScrapeConfig, quiet bool) ([]chapter, chapter) {
|
||||
base, err := urllib.Parse(url)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
links, _, home, err := GetLinks(base, config.Selector, config.Limit, config.Offset, config.Include)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
chapters := make([]chapter, len(links))
|
||||
delay := config.Delay
|
||||
|
||||
var p progress
|
||||
if quiet == false {
|
||||
p = NewProgress(links, "", 0)
|
||||
}
|
||||
|
||||
if delay >= 0 {
|
||||
// synchronous mode
|
||||
|
||||
for index, l := range links {
|
||||
// and then use it to parse relative URLs
|
||||
u, err := base.Parse(l.href)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
chapters[index] = NewChapterFromURL(u.String(), l.text, []*ScrapeConfig{subConfig}, 0, func(index int, name string) {})
|
||||
|
||||
if quiet == false {
|
||||
p.Increment(index)
|
||||
}
|
||||
|
||||
// short sleep for last chapter to let the progress bar update
|
||||
if index == len(links)-1 {
|
||||
delay = 100
|
||||
}
|
||||
|
||||
time.Sleep(time.Duration(delay) * time.Millisecond)
|
||||
}
|
||||
|
||||
} else {
|
||||
// asynchronous mode
|
||||
var wg sync.WaitGroup
|
||||
|
||||
threads := config.Threads
|
||||
if threads == -1 {
|
||||
threads = len(links)
|
||||
}
|
||||
semaphore := make(chan bool, threads)
|
||||
|
||||
for index, l := range links {
|
||||
|
||||
wg.Add(1)
|
||||
semaphore <- true
|
||||
|
||||
go func(index int, l link) {
|
||||
defer wg.Done()
|
||||
|
||||
@@ -112,15 +338,19 @@ func tableOfContent(url, selector string, limit, delay int) []chapter {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
chapters[index] = NewChapterFromURL(u.String())
|
||||
chapters[index] = NewChapterFromURL(u.String(), l.text, []*ScrapeConfig{subConfig}, 0, func(index int, name string) {})
|
||||
|
||||
bars[index].Incr()
|
||||
barGlobal.Incr()
|
||||
if quiet == false {
|
||||
p.Increment(index)
|
||||
}
|
||||
|
||||
<-semaphore
|
||||
}(index, l)
|
||||
}
|
||||
wg.Wait()
|
||||
}
|
||||
return chapters
|
||||
|
||||
return chapters, home
|
||||
}
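The asynchronous branch above limits parallel downloads with a buffered channel used as a semaphore, plus a `sync.WaitGroup`. The following is a minimal standalone sketch of that pattern, not the project's code: `runBounded` and its `work` callback are hypothetical names, and treating any limit below 1 as "no limit" is an assumption that goes slightly beyond the `threads == -1` check in the diff.

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded calls work(i) for every i in [0, n) with at most limit calls
// running at the same time, and returns the results in index order.
// Assumption: any limit below 1 means "one slot per item" (no throttling).
func runBounded(n, limit int, work func(i int) string) []string {
	results := make([]string, n)
	if n == 0 {
		return results
	}
	if limit < 1 {
		limit = n
	}

	semaphore := make(chan struct{}, limit)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		semaphore <- struct{}{} // blocks while limit goroutines are in flight

		go func(i int) {
			defer wg.Done()
			defer func() { <-semaphore }() // free the slot when this call ends

			// each goroutine writes to its own index, so no extra locking is needed
			results[i] = work(i)
		}(i)
	}

	wg.Wait()
	return results
}

func main() {
	out := runBounded(5, 2, func(i int) string {
		return fmt.Sprintf("chapter-%d", i)
	})
	fmt.Println(out)
}
```

Acquiring the slot before starting the goroutine, as the diff does, keeps the number of live goroutines bounded as well as the number of concurrent downloads.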
|
||||
|
||||
func GetPath(elm *goquery.Selection) string {
|
||||
@@ -128,7 +358,7 @@ func GetPath(elm *goquery.Selection) string {
|
||||
|
||||
for {
|
||||
selector := strings.ToLower(goquery.NodeName(elm))
|
||||
if selector == "" {
|
||||
if len(selector) == 0 {
|
||||
break
|
||||
}
|
||||
|
||||
@@ -140,60 +370,74 @@ func GetPath(elm *goquery.Selection) string {
|
||||
return join
|
||||
}
|
||||
|
||||
|
||||
func GetLinks(url *urllib.URL, selector string) []link {
|
||||
func GetLinks(url *urllib.URL, selector string, limit, offset int, include bool) ([]link, string, chapter, error) {
|
||||
selectorSet := true
|
||||
if selector == "" {
|
||||
if len(selector) == 0 {
|
||||
selector = "a"
|
||||
selectorSet = false
|
||||
}
|
||||
|
||||
// visit and count link classes
|
||||
pathLinks := map[string][]link{}
|
||||
pathCount := map[string]int{}
|
||||
pathMax := ""
|
||||
|
||||
// visit and count link classes
|
||||
c := colly.NewCollector()
|
||||
c.OnHTML(selector, func(e *colly.HTMLElement) {
|
||||
href := e.Attr("href")
|
||||
text := strings.TrimSpace(e.Text)
|
||||
path := GetPath(e.DOM)
|
||||
class := e.Attr("class")
|
||||
key := fmt.Sprintf("%s.%s", path, class)
|
||||
key := path
|
||||
|
||||
if selectorSet || text != "" {
|
||||
pathLinks[key] = append(pathLinks[key], NewLink(href, text, class))
|
||||
pathCount[key] += len(text)
|
||||
// pathCount[key]++
|
||||
if selectorSet {
|
||||
|
||||
if pathCount[key] > pathCount[pathMax] {
|
||||
pathMax = key
|
||||
// if selector is set, we use the selector specified by the user
|
||||
|
||||
key = selector
|
||||
pathLinks[key] = append(pathLinks[key], NewLink(href, text))
|
||||
pathCount[key] += 1
|
||||
pathMax = key
|
||||
|
||||
} else {
|
||||
|
||||
// if selector is not set, we compute the selector ourselves
|
||||
|
||||
class := e.Attr("class")
|
||||
// include the element class to make sure we have the exact same path for every link in the table of contents
|
||||
key = fmt.Sprintf("%s.%s", path, class)
|
||||
|
||||
// we count this key if the link text is not empty
|
||||
if text != "" {
|
||||
pathLinks[key] = append(pathLinks[key], NewLink(href, text))
|
||||
pathCount[key] += len(text)
|
||||
|
||||
if pathCount[key] > pathCount[pathMax] {
|
||||
pathMax = key
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
})
|
||||
c.Visit(url.String())
|
||||
return pathLinks[pathMax]
|
||||
|
||||
// // visit and count link classes
|
||||
// classesLinks := map[string][]link{}
|
||||
// classesCount := map[string]int{}
|
||||
// classMax := ""
|
||||
links := pathLinks[pathMax]
|
||||
if len(links) == 0 {
|
||||
return []link{}, pathMax, chapter{}, fmt.Errorf("no link found for selector: %s", selector)
|
||||
}
|
||||
|
||||
// c := colly.NewCollector()
|
||||
// c.OnHTML(selector, func(e *colly.HTMLElement) {
|
||||
// href := e.Attr("href")
|
||||
// text := strings.TrimSpace(e.Text)
|
||||
// class := e.Attr("class")
|
||||
end := len(links)
|
||||
if limit != -1 {
|
||||
end = int(math.Min(float64(limit+offset), float64(len(links))))
|
||||
}
|
||||
|
||||
// if selectorSet || class != "" && text != "" {
|
||||
// classesLinks[class] = append(classesLinks[class], NewLink(href, text))
|
||||
// classesCount[class]++
|
||||
links = links[offset:end]
|
||||
|
||||
// if classesCount[class] > classesCount[classMax] {
|
||||
// classMax = class
|
||||
// }
|
||||
// }
|
||||
// })
|
||||
// c.Visit(url.String())
|
||||
// return classesLinks[classMax]
|
||||
home := NewChapterFromURL(url.String(), "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
|
||||
|
||||
if include {
|
||||
l := NewLink(url.String(), home.Name())
|
||||
links = append([]link{l}, links...)
|
||||
}
|
||||
|
||||
return links, pathMax, home, nil
|
||||
}
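`GetLinks` trims the link list to a window defined by `limit` and `offset`. The sketch below isolates that step with integer arithmetic instead of `math.Min`. The `window` helper is a hypothetical name, and clamping an out-of-range `offset` is an added assumption: the diff slices `links[offset:end]` directly, which would panic if `offset` exceeded the number of links.

```go
package main

import "fmt"

// window returns items[offset:end], where end is offset+limit clamped to
// len(items). A negative limit means "no limit", like the -1 default above.
// Assumption: an offset outside the slice is clamped instead of panicking.
func window(items []string, limit, offset int) []string {
	if offset < 0 {
		offset = 0
	}
	if offset > len(items) {
		offset = len(items)
	}

	end := len(items)
	if limit >= 0 && offset+limit < end {
		end = offset + limit
	}

	return items[offset:end]
}

func main() {
	links := []string{"a", "b", "c", "d", "e"}
	fmt.Println(window(links, 3, 1))  // [b c d]
	fmt.Println(window(links, -1, 2)) // [c d e]
	fmt.Println(window(links, 3, 9))  // []
}
```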
|
||||
|
||||
199
book/scraper_test.go
Normal file
@@ -0,0 +1,199 @@
|
||||
package book
|
||||
|
||||
import (
|
||||
"testing"
|
||||
"time"
|
||||
)
|
||||
|
||||
func TestBody(t *testing.T) {
|
||||
|
||||
config := NewScrapeConfig()
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
|
||||
|
||||
got := c.Body()
|
||||
want := "<!doctype html>\n<html lang=\"en-us\">\n <head>\n <title>Books</title>\n <link rel=\"shortcut icon\" href=\"/favicon.ico\" />\n <meta charset=\"utf-8\" />\n <meta name=\"generator\" content=\"Hugo 0.59.1\" />\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n <meta name=\"author\" content=\"John Doe\" />\n <meta name=\"description\" content=\" \" />\n <link rel=\"stylesheet\" href=\"https://books.lapw.at/css/main.min.88e7083eff65effb7485b6e6f38d10afbec25093a6fac42d734ce9024d3defbd.css\" />\n\n \n <meta name=\"twitter:card\" content=\"summary\"/>\n<meta name=\"twitter:title\" content=\"Books\"/>\n<meta name=\"twitter:description\" content=\" \"/>\n\n <meta property=\"og:title\" content=\"Books\" />\n<meta property=\"og:description\" content=\" \" />\n<meta property=\"og:type\" content=\"website\" />\n<meta property=\"og:url\" content=\"https://books.lapw.at/\" />\n\n\n\n </head>\n <body>\n <header class=\"app-header\">\n <a href=\"https://books.lapw.at/\"><img class=\"app-header-avatar\" src=\"/book.svg\" alt=\"John Doe\" /></a>\n <h1>Books</h1>\n <p> </p>\n <div class=\"app-header-social\">\n \n </div>\n </header>\n <main class=\"app-container\">\n \n <article>\n <h1>Books</h1>\n <ul class=\"posts-list\">\n \n <li class=\"posts-list-item\">\n <a class=\"posts-list-item-title\" href=\"https://books.lapw.at/posts/ren%C3%A9-descartes-discours-de-la-m%C3%A9thode/\">Discours de la Méthode</a>\n <span class=\"posts-list-item-description\">\n <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" class=\"icon icon-clock\">\n <title>clock</title>\n <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 98 min read -\n 1637\n </span>\n </li>\n \n <li class=\"posts-list-item\">\n <a class=\"posts-list-item-title\" 
href=\"https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/\">The Twelve-Factor App</a>\n <span class=\"posts-list-item-description\">\n <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" class=\"icon icon-clock\">\n <title>clock</title>\n <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 22 min read -\n 2011\n </span>\n </li>\n \n </ul>\n \n\n\n\n </article>\n\n </main>\n </body>\n</html>\n"
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestName(t *testing.T) {
|
||||
|
||||
config := NewScrapeConfig()
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
|
||||
|
||||
got := c.Name()
|
||||
want := "Books"
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestCustomName(t *testing.T) {
|
||||
|
||||
config := NewScrapeConfig()
|
||||
config.UseLinkName = true
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "Custom Name", []*ScrapeConfig{config}, 0, func(index int, name string) {})
|
||||
|
||||
got := c.Name()
|
||||
want := "Custom Name"
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestAuthor(t *testing.T) {
|
||||
|
||||
config := NewScrapeConfig()
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
|
||||
|
||||
got := c.Author()
|
||||
want := "John Doe"
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestContent(t *testing.T) {
|
||||
|
||||
config := NewScrapeConfig()
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
|
||||
|
||||
got := c.Content()
|
||||
want := "<div id=\"readability-page-1\" class=\"page\">\n \n <main>\n \n <article>\n \n <ul>\n \n <li>\n <a href=\"https://books.lapw.at/posts/ren%C3%A9-descartes-discours-de-la-m%C3%A9thode/\">Discours de la Méthode</a>\n <span>\n <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\">\n <title>clock</title>\n <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 98 min read -\n 1637\n </span>\n </li>\n \n <li>\n <a href=\"https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/\">The Twelve-Factor App</a>\n <span>\n <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\">\n <title>clock</title>\n <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 22 min read -\n 2011\n </span>\n </li>\n \n </ul>\n \n\n\n\n </article>\n\n </main>\n \n\n</div>"
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestDelay(t *testing.T) {
|
||||
|
||||
config0 := NewScrapeConfig()
|
||||
config0.Delay = 500
|
||||
|
||||
config1 := NewScrapeConfig()
|
||||
|
||||
start := time.Now()
|
||||
NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
|
||||
elapsed := time.Since(start)
|
||||
|
||||
got := elapsed
|
||||
want := time.Duration(500) * time.Millisecond
|
||||
|
||||
if got < want {
|
||||
t.Errorf("got %v, wanted min %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestContentImagesOnly(t *testing.T) {
|
||||
|
||||
config := NewScrapeConfig()
|
||||
config.ImagesOnly = true
|
||||
|
||||
c := NewChapterFromURL("https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
|
||||
|
||||
got := c.Content()
|
||||
want := "<img src=\"https://books.lapw.at/images/codebase-deploys.png\" alt=\"One codebase maps to many deploys\"/><img src=\"https://books.lapw.at/images/attached-resources.png\" alt=\"A production deploy attached to four backing services.\"/><img src=\"https://books.lapw.at/images/release.png\" alt=\"Code becomes a build, which is combined with config to create a release.\"/><img src=\"https://books.lapw.at/images/process-types.png\" alt=\"Scale is expressed as running processes, workload diversity is expressed as process types.\"/>"
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestSubChapters(t *testing.T) {
|
||||
|
||||
config0 := NewScrapeConfig()
|
||||
config1 := NewScrapeConfig()
|
||||
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
|
||||
|
||||
got := len(c.SubChapters())
|
||||
want := 2
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestSubChaptersSelector(t *testing.T) {
|
||||
|
||||
config0 := NewScrapeConfig()
|
||||
config0.Selector = "section.concrete > article > h2 > a"
|
||||
|
||||
config1 := NewScrapeConfig()
|
||||
|
||||
c := NewChapterFromURL("https://12factor.net/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
|
||||
|
||||
got := len(c.SubChapters())
|
||||
want := 12
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestSubChaptersLimit(t *testing.T) {
|
||||
|
||||
config0 := NewScrapeConfig()
|
||||
config0.Limit = 1
|
||||
|
||||
config1 := NewScrapeConfig()
|
||||
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
|
||||
|
||||
got := len(c.SubChapters())
|
||||
want := 1
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestSubChaptersLimitOver(t *testing.T) {
|
||||
|
||||
config0 := NewScrapeConfig()
|
||||
config0.Limit = 3
|
||||
|
||||
config1 := NewScrapeConfig()
|
||||
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
|
||||
|
||||
got := len(c.SubChapters())
|
||||
want := 2
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
func TestNotInclude(t *testing.T) {
|
||||
|
||||
config := NewScrapeConfig()
|
||||
config.Include = false
|
||||
|
||||
c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
|
||||
|
||||
got := c.Content()
|
||||
want := ""
|
||||
|
||||
if got != want {
|
||||
t.Errorf("got %v, wanted %v", got, want)
|
||||
}
|
||||
|
||||
}
|
||||
235
cmd/get.go
@@ -3,157 +3,174 @@ package cmd
|
||||
import (
|
||||
"errors"
|
||||
"fmt"
|
||||
"log"
|
||||
"os"
|
||||
"os/exec"
|
||||
"strings"
|
||||
|
||||
md "github.com/JohannesKaufmann/html-to-markdown"
|
||||
epub "github.com/bmaupin/go-epub"
|
||||
cobra "github.com/spf13/cobra"
|
||||
"github.com/spf13/cobra"
|
||||
|
||||
"github.com/lapwat/papeer/book"
|
||||
)
|
||||
|
||||
var quiet, stdout, recursive, include bool
|
||||
var format, output, selector string
|
||||
var limit, delay int
|
||||
type GetOptions struct {
|
||||
// url string
|
||||
|
||||
name string
|
||||
author string
|
||||
Format string
|
||||
output string
|
||||
images bool
|
||||
// ImagesOnly bool
|
||||
quiet bool
|
||||
|
||||
Selector []string
|
||||
depth int
|
||||
limit int
|
||||
offset int
|
||||
delay int
|
||||
threads int
|
||||
// includeUrl bool
|
||||
include bool
|
||||
useLinkName bool
|
||||
}
|
||||
|
||||
var getOpts *GetOptions
|
||||
|
||||
func init() {
|
||||
getOpts = &GetOptions{}
|
||||
|
||||
getCmd.PersistentFlags().StringVarP(&getOpts.name, "name", "n", "", "book name (default: page title)")
|
||||
getCmd.PersistentFlags().StringVarP(&getOpts.author, "author", "a", "", "book author")
|
||||
getCmd.PersistentFlags().StringVarP(&getOpts.Format, "format", "f", "md", "file format [stdout, md, epub, mobi]")
|
||||
getCmd.PersistentFlags().StringVarP(&getOpts.output, "output", "", "", "file name (default: book name)")
|
||||
getCmd.PersistentFlags().BoolVarP(&getOpts.images, "images", "", false, "retrieve images only")
|
||||
getCmd.PersistentFlags().BoolVarP(&getOpts.quiet, "quiet", "q", false, "hide progress bar")
|
||||
|
||||
// common with list command
|
||||
getCmd.Flags().StringSliceVarP(&getOpts.Selector, "selector", "s", []string{}, "table of contents CSS selector")
|
||||
getCmd.Flags().IntVarP(&getOpts.depth, "depth", "d", 0, "scraping depth")
|
||||
getCmd.Flags().IntVarP(&getOpts.limit, "limit", "l", -1, "limit number of chapters, use with depth/selector")
|
||||
getCmd.Flags().IntVarP(&getOpts.offset, "offset", "o", 0, "skip first chapters, use with depth/selector")
|
||||
getCmd.Flags().IntVarP(&getOpts.delay, "delay", "", -1, "time in milliseconds to wait before downloading next chapter, use with depth/selector")
|
||||
getCmd.Flags().IntVarP(&getOpts.threads, "threads", "t", -1, "download concurrency, use with depth/selector")
|
||||
getCmd.Flags().BoolVarP(&getOpts.include, "include", "i", false, "include URL as first chapter, use with depth/selector")
|
||||
getCmd.Flags().BoolVarP(&getOpts.useLinkName, "use-link-name", "", false, "use link name for chapter title")
|
||||
|
||||
rootCmd.AddCommand(getCmd)
|
||||
}
|
||||
|
||||
var getCmd = &cobra.Command{
|
||||
Use: "get",
|
||||
Short: "Scrape URL content",
|
||||
Use: "get URL",
|
||||
Short: "Scrape URL content",
|
||||
Example: "papeer get https://www.eff.org/cyberspace-independence",
|
||||
Args: func(cmd *cobra.Command, args []string) error {
|
||||
if len(args) < 1 {
|
||||
return errors.New("requires a URL argument")
|
||||
}
|
||||
|
||||
formatEnum := map[string]bool{
|
||||
"md": true,
|
||||
"epub": true,
|
||||
"mobi": true,
|
||||
}
|
||||
if formatEnum[format] != true {
|
||||
return fmt.Errorf("invalid format specified: %s", format)
|
||||
"stdout": true,
|
||||
"md": true,
|
||||
"epub": true,
|
||||
"mobi": true,
|
||||
}
|
||||
|
||||
if format == "epub" || format == "mobi" {
|
||||
if stdout {
|
||||
return errors.New("cannot print EPUB/MOBI file to standard output")
|
||||
if formatEnum[getOpts.Format] != true {
|
||||
return fmt.Errorf("invalid format specified: %s", getOpts.Format)
|
||||
}
|
||||
|
||||
// add .mobi to filename if not specified
|
||||
if getOpts.Format == "mobi" {
|
||||
if len(getOpts.output) > 0 && strings.HasSuffix(getOpts.output, ".mobi") == false {
|
||||
getOpts.output = fmt.Sprintf("%s.mobi", getOpts.output)
|
||||
}
|
||||
}
|
||||
|
||||
if format == "mobi" {
|
||||
if len(output) > 0 && strings.HasSuffix(output, ".mobi") == false {
|
||||
output = fmt.Sprintf("%s.mobi", output)
|
||||
}
|
||||
if cmd.Flags().Changed("include") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
|
||||
return errors.New("cannot use include option if depth/selector is not specified")
|
||||
}
|
||||
|
||||
if cmd.Flags().Changed("selector") && recursive == false {
|
||||
return errors.New("cannot use selector option if not in recursive mode")
|
||||
if cmd.Flags().Changed("limit") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
|
||||
return errors.New("cannot use limit option if depth/selector is not specified")
|
||||
}
|
||||
|
||||
if cmd.Flags().Changed("include") && recursive == false {
|
||||
return errors.New("cannot use include option if not in recursive mode")
|
||||
if cmd.Flags().Changed("offset") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
|
||||
return errors.New("cannot use offset option if depth/selector is not specified")
|
||||
}
|
||||
|
||||
if cmd.Flags().Changed("limit") && recursive == false {
|
||||
return errors.New("cannot use limit option if not in recursive mode")
|
||||
if cmd.Flags().Changed("delay") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
|
||||
return errors.New("cannot use delay option if depth/selector is not specified")
|
||||
}
|
||||
|
||||
if cmd.Flags().Changed("delay") && recursive == false {
|
||||
return errors.New("cannot use delay option if not in recursive mode")
|
||||
if cmd.Flags().Changed("threads") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
|
||||
return errors.New("cannot use threads option if depth/selector is not specified")
|
||||
}
|
||||
|
||||
if cmd.Flags().Changed("use-link-name") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
|
||||
return errors.New("cannot use use-link-name option if depth/selector is not specified")
|
||||
}
|
||||
|
||||
if cmd.Flags().Changed("delay") && cmd.Flags().Changed("threads") {
|
||||
return errors.New("cannot use delay and threads options at the same time")
|
||||
}
|
||||
|
||||
return nil
|
||||
},
|
||||
Run: func(cmd *cobra.Command, args []string) {
|
||||
url := args[0]
|
||||
b := book.NewBookFromURL(url, selector, recursive, include, limit, delay)
|
||||
|
||||
if len(output) == 0 {
|
||||
// set default output
|
||||
output = strings.ReplaceAll(b.Name(), " ", "_")
|
||||
output = strings.ReplaceAll(output, "/", "")
|
||||
output = fmt.Sprintf("%s.%s", output, format)
|
||||
// fill selector array with empty selectors to match depth
|
||||
getOpts.Selector = append(getOpts.Selector, "")
|
||||
for len(getOpts.Selector) < getOpts.depth+1 {
|
||||
getOpts.Selector = append(getOpts.Selector, "")
|
||||
}
|
||||
fmt.Println(len(getOpts.Selector))
|
||||
|
||||
// generate config for each level
|
||||
configs := make([]*book.ScrapeConfig, len(getOpts.Selector))
|
||||
for index, s := range getOpts.Selector {
|
||||
config := book.NewScrapeConfig()
|
||||
config.Selector = s
|
||||
config.Quiet = getOpts.quiet
|
||||
config.Limit = getOpts.limit
|
||||
config.Offset = getOpts.offset
|
||||
config.Delay = getOpts.delay
|
||||
config.Threads = getOpts.threads
|
||||
config.ImagesOnly = getOpts.images
|
||||
config.Include = getOpts.include
|
||||
config.UseLinkName = getOpts.useLinkName
|
||||
|
||||
// do not use link name for root level as there is no parent link
|
||||
if index == 0 {
|
||||
config.UseLinkName = false
|
||||
}
|
||||
|
||||
// always include last level by default
|
||||
if index == len(getOpts.Selector)-1 {
|
||||
config.Include = true
|
||||
}
|
||||
|
||||
configs[index] = config
|
||||
}
|
||||
|
||||
if format == "md" {
|
||||
f, err := os.Create(output)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
c := book.NewChapterFromURL(url, "", configs, 0, func(index int, name string) {})
|
||||
|
||||
defer f.Close()
|
||||
|
||||
for _, c := range b.Chapters() {
|
||||
content, err := md.NewConverter("", true, nil).ConvertString(c.Content())
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
text := fmt.Sprintf("%s\n%s\n\n%s\n\n\n", c.Name(), strings.Repeat("=", len(c.Name())), content)
|
||||
|
||||
if stdout {
|
||||
fmt.Println(text)
|
||||
} else {
|
||||
|
||||
_, err := f.WriteString(text)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
|
||||
if stdout == false {
|
||||
fmt.Printf("Markdown saved to \"%s\"\n", output)
|
||||
}
|
||||
if getOpts.Format == "stdout" {
|
||||
markdown := book.ToMarkdownString(c)
|
||||
fmt.Println(markdown)
|
||||
}
|
||||
|
||||
if format == "epub" {
|
||||
e := epub.NewEpub(b.Name())
|
||||
e.SetAuthor(b.Author())
|
||||
|
||||
for _, c := range b.Chapters() {
|
||||
html := fmt.Sprintf("<h1>%s</h1>%s", c.Name(), c.Content())
|
||||
e.AddSection(html, c.Name(), "", "")
|
||||
}
|
||||
|
||||
err := e.Write(output)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
fmt.Printf("Ebook saved to \"%s\"\n", output)
|
||||
if getOpts.Format == "md" {
|
||||
filename := book.ToMarkdown(c, getOpts.output)
|
||||
fmt.Printf("Markdown saved to \"%s\"\n", filename)
|
||||
}
|
||||
|
||||
if format == "mobi" {
|
||||
e := epub.NewEpub(b.Name())
|
||||
e.SetAuthor(b.Author())
|
||||
if getOpts.Format == "epub" {
|
||||
filename := book.ToEpub(c, getOpts.output)
|
||||
fmt.Printf("Ebook saved to \"%s\"\n", filename)
|
||||
}
|
||||
|
||||
for _, chapter := range b.Chapters() {
|
||||
e.AddSection(chapter.Content(), chapter.Name(), "", "")
|
||||
}
|
||||
|
||||
outputEPUB := strings.ReplaceAll(output, ".mobi", ".epub")
|
||||
|
||||
err := e.Write(outputEPUB)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
exec.Command("kindlegen", outputEPUB).Run()
|
||||
// exec command always returns status 1, even if it fails
|
||||
// if err != nil {
|
||||
// log.Fatal(err)
|
||||
// }
|
||||
|
||||
fmt.Printf("Ebook saved to \"%s\"\n", output)
|
||||
|
||||
err2 := os.Remove(outputEPUB)
|
||||
if err2 != nil {
|
||||
log.Fatal(err2)
|
||||
}
|
||||
if getOpts.Format == "mobi" {
|
||||
filename := book.ToMobi(c, getOpts.output)
|
||||
fmt.Printf("Ebook saved to \"%s\"\n", filename)
|
||||
}
|
||||
},
|
||||
}
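Before building one `ScrapeConfig` per level, the `Run` function above pads the selector list: it always appends one empty selector, then keeps appending until there are at least `depth+1` entries. A minimal sketch of that rule in isolation follows; `padSelectors` is a hypothetical helper name, and copying the input slice first is an assumption made so the caller's slice is left untouched.

```go
package main

import "fmt"

// padSelectors mirrors the padding rule in the get command: append one empty
// selector for the final (content) level, then pad with empty selectors until
// the list holds at least depth+1 entries.
func padSelectors(selectors []string, depth int) []string {
	// copy first so the caller's slice is not modified (assumption of this sketch)
	padded := append([]string{}, selectors...)

	padded = append(padded, "")
	for len(padded) < depth+1 {
		padded = append(padded, "")
	}

	return padded
}

func main() {
	fmt.Printf("%q\n", padSelectors([]string{"nav a"}, 0)) // ["nav a" ""]
	fmt.Printf("%q\n", padSelectors(nil, 2))               // ["" "" ""]
}
```

An empty selector means that level falls back to the automatic link detection in `GetLinks`, so one explicit selector plus `--depth 0` still yields two levels.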
|
||||
|
||||
64
cmd/list.go
@@ -2,9 +2,11 @@ package cmd
|
||||
|
||||
import (
|
||||
"errors"
|
||||
"fmt"
|
||||
"log"
|
||||
urllib "net/url"
|
||||
"os"
|
||||
"strings"
|
||||
|
||||
"github.com/jedib0t/go-pretty/v6/table"
|
||||
cobra "github.com/spf13/cobra"
|
||||
@@ -12,9 +14,42 @@ import (
|
||||
"github.com/lapwat/papeer/book"
|
||||
)
|
||||
|
||||
type ListOptions struct {
|
||||
// url string
|
||||
|
||||
Selector []string
|
||||
depth int
|
||||
limit int
|
||||
offset int
|
||||
delay int
|
||||
threads int
|
||||
// includeUrl bool
|
||||
include bool
|
||||
useLinkName bool
|
||||
}
|
||||
|
||||
var listOpts *ListOptions
|
||||
|
||||
func init() {
|
||||
listOpts = &ListOptions{}
|
||||
|
||||
listCmd.Flags().StringSliceVarP(&listOpts.Selector, "selector", "s", []string{}, "table of contents CSS selector")
|
||||
listCmd.Flags().IntVarP(&listOpts.depth, "depth", "d", 0, "scraping depth")
|
||||
listCmd.Flags().IntVarP(&listOpts.limit, "limit", "l", -1, "limit number of chapters, use with depth/selector")
|
||||
listCmd.Flags().IntVarP(&listOpts.offset, "offset", "o", 0, "skip first chapters, use with depth/selector")
|
||||
listCmd.Flags().IntVarP(&listOpts.delay, "delay", "", -1, "time in milliseconds to wait before downloading next chapter, use with depth/selector")
|
||||
listCmd.Flags().IntVarP(&listOpts.threads, "threads", "t", -1, "download concurrency, use with depth/selector")
|
||||
listCmd.Flags().BoolVarP(&listOpts.include, "include", "i", false, "include URL as first chapter, use with depth/selector")
|
||||
listCmd.Flags().BoolVarP(&listOpts.useLinkName, "use-link-name", "", false, "use link name for chapter title")
|
||||
|
||||
rootCmd.AddCommand(listCmd)
|
||||
}
|
||||
|
||||
var listCmd = &cobra.Command{
|
||||
Use: "ls",
|
||||
Short: "Print table of content",
|
||||
Use: "list URL",
|
||||
Aliases: []string{"ls"},
|
||||
Short: "Print URL table of contents",
|
||||
Example: "papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'",
|
||||
Args: func(cmd *cobra.Command, args []string) error {
|
||||
if len(args) < 1 {
|
||||
return errors.New("requires a URL argument")
|
||||
@@ -22,16 +57,35 @@ var listCmd = &cobra.Command{
|
||||
return nil
|
||||
},
|
||||
Run: func(cmd *cobra.Command, args []string) {
|
||||
if len(listOpts.Selector) == 0 {
|
||||
listOpts.Selector = []string{""}
|
||||
}
|
||||
|
||||
base, err := urllib.Parse(args[0])
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
links := book.GetLinks(base, selector)
|
||||
links, path, _, err := book.GetLinks(base, listOpts.Selector[0], listOpts.limit, listOpts.offset, listOpts.include)
|
||||
if err != nil {
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
t := table.NewWriter()
|
||||
t.SetOutputMirror(os.Stdout)
|
||||
t.AppendHeader(table.Row{"#", "Name", "Url", "Class"})
|
||||
t.Style().Options.DrawBorder = false
|
||||
t.Style().Options.SeparateColumns = false
|
||||
t.Style().Options.SeparateHeader = false
|
||||
|
||||
// format selector path
|
||||
pathArray := strings.Split(path, "<")
|
||||
// reverse path
|
||||
for i, j := 0, len(pathArray)-1; i < j; i, j = i+1, j-1 {
|
||||
pathArray[i], pathArray[j] = pathArray[j], pathArray[i]
|
||||
}
|
||||
pathFormatted := strings.Join(pathArray, ">")
|
||||
|
||||
t.AppendHeader(table.Row{"#", "Name", fmt.Sprintf("Url [%s]", pathFormatted)})
|
||||
|
||||
for index, link := range links {
|
||||
u, err := base.Parse(link.Href())
|
||||
@@ -39,7 +93,7 @@ var listCmd = &cobra.Command{
|
||||
log.Fatal(err)
|
||||
}
|
||||
|
||||
t.AppendRow([]interface{}{index + 1, link.Text(), u.String(), link.Class()})
|
||||
t.AppendRow([]interface{}{index + 1, link.Text(), u.String()})
|
||||
}
|
||||
|
||||
t.Render()
|
||||
|
||||
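The `list` command formats the detected selector path by splitting it on `<`, reversing the parts, and joining them with `>`. A small sketch of that transformation is shown here. `formatPath` is a hypothetical helper name, and the sample input assumes `GetPath` returns the path leaf-first, which the diff implies but does not show.

```go
package main

import (
	"fmt"
	"strings"
)

// formatPath turns a leaf-first path such as "a<h2<article" into a
// root-first CSS child selector such as "article>h2>a".
func formatPath(path string) string {
	parts := strings.Split(path, "<")

	// reverse the parts in place
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}

	return strings.Join(parts, ">")
}

func main() {
	// hypothetical input; the real value comes from GetLinks
	fmt.Println(formatPath("a<h2<article<section")) // section>article>h2>a
}
```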
15
cmd/root.go
@@ -21,18 +21,3 @@ func Execute() {
os.Exit(1)
}
}

func init() {
rootCmd.PersistentFlags().StringVarP(&format, "format", "f", "md", "file format [md, epub, mobi]")
rootCmd.PersistentFlags().StringVarP(&output, "output", "o", "", "output file")
rootCmd.PersistentFlags().StringVarP(&selector, "selector", "s", "", "table of content CSS selector")
rootCmd.PersistentFlags().BoolVarP(&recursive, "recursive", "r", false, "create one chapter per navigation item")
rootCmd.PersistentFlags().BoolVarP(&include, "include", "i", false, "include URL as first chapter, in recursive mode")
rootCmd.PersistentFlags().BoolVarP(&quiet, "quiet", "q", false, "do not show logs")
rootCmd.PersistentFlags().BoolVarP(&stdout, "stdout", "", false, "print to standard output")
rootCmd.PersistentFlags().IntVarP(&limit, "limit", "l", -1, "limit number of chapters, in recursive mode")
rootCmd.PersistentFlags().IntVarP(&delay, "delay", "d", -1, "time to wait before downloading next chapter, in milliseconds")

rootCmd.AddCommand(getCmd)
rootCmd.AddCommand(listCmd)
}

@@ -1,5 +0,0 @@
package cmd

func getTableOfContent() {

}
@@ -14,6 +14,6 @@ var versionCmd = &cobra.Command{
	Use: "version",
	Short: "Print the version number of papeer",
	Run: func(cmd *cobra.Command, args []string) {
		fmt.Println("papeer v0.1.1")
		fmt.Println("papeer v0.4.1")
	},
}
9
go.mod
@@ -31,13 +31,18 @@ require (
	github.com/jedib0t/go-pretty/v6 v6.2.4 // indirect
	github.com/kennygrant/sanitize v1.2.4 // indirect
	github.com/mattn/go-isatty v0.0.14 // indirect
	github.com/mattn/go-runewidth v0.0.9 // indirect
	github.com/mattn/go-runewidth v0.0.13 // indirect
	github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db // indirect
	github.com/rivo/uniseg v0.2.0 // indirect
	github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca // indirect
	github.com/schollz/progressbar/v3 v3.8.3 // indirect
	github.com/sirupsen/logrus v1.8.1 // indirect
	github.com/spf13/pflag v1.0.5 // indirect
	github.com/temoto/robotstxt v1.1.2 // indirect
	golang.org/x/crypto v0.0.0-20210817164053-32db794688a5 // indirect
	golang.org/x/net v0.0.0-20210614182718-04defd469f4e // indirect
	golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c // indirect
	golang.org/x/sys v0.0.0-20210910150752-751e447fb3d0 // indirect
	golang.org/x/term v0.0.0-20210615171337-6886f2dfbf5b // indirect
	golang.org/x/text v0.3.6 // indirect
	google.golang.org/appengine v1.6.7 // indirect
	google.golang.org/protobuf v1.27.1 // indirect
16
go.sum
@@ -240,6 +240,7 @@ github.com/jstemmer/go-junit-report v0.0.0-20190106144839-af01ea7f8024/go.mod h1
github.com/jstemmer/go-junit-report v0.9.1/go.mod h1:Brl9GWCQeLvo8nXZwPNNblvFj/XSXhF0NWZEnDohbsk=
github.com/jtolds/gls v4.20.0+incompatible/go.mod h1:QJZ7F/aHp+rZTRtaJ1ow/lLfFfVYBRgL+9YlvaHOwJU=
github.com/julienschmidt/httprouter v1.2.0/go.mod h1:SYymIcj16QtmaHHD7aYtjjsJG7VTCxuUUipMqKk8s4w=
github.com/k0kubun/go-ansi v0.0.0-20180517002512-3bf9e2903213/go.mod h1:vNUNkEQ1e29fT/6vq2aBdFsgNPmy8qMdSay1npru+Sw=
github.com/kennygrant/sanitize v1.2.4 h1:gN25/otpP5vAsO2djbMhF/LQX6R7+O1TB4yv8NzpJ3o=
github.com/kennygrant/sanitize v1.2.4/go.mod h1:LGsjYYtgxbetdg5owWB2mpgUL6e2nfw2eObZ0u0qvak=
github.com/kisielk/errcheck v1.1.0/go.mod h1:EZBBE59ingxPouuu3KfxchcWSUPOHkagtvWXihfKN4Q=
@@ -259,9 +260,13 @@ github.com/mattn/go-isatty v0.0.14 h1:yVuAays6BHfxijgZPzw+3Zlu5yQgKGP2/hcQbHb7S9
github.com/mattn/go-isatty v0.0.14/go.mod h1:7GGIvUiUoEMVVmxf/4nioHXj79iQHKdU27kJ6hsGG94=
github.com/mattn/go-runewidth v0.0.9 h1:Lm995f3rfxdpd6TSmuVCHVb/QhupuXlYr8sCI/QdE+0=
github.com/mattn/go-runewidth v0.0.9/go.mod h1:H031xJmbD/WCDINGzjvQ9THkh0rPKHF+m2gUSrubnMI=
github.com/mattn/go-runewidth v0.0.13 h1:lTGmDsbAYt5DmK6OnoV7EuIF1wEIFAcxld6ypU4OSgU=
github.com/mattn/go-runewidth v0.0.13/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/matttproud/golang_protobuf_extensions v1.0.1/go.mod h1:D8He9yQNgCq6Z5Ld7szi9bcBfOoFv/3dc6xSMkL2PC0=
github.com/miekg/dns v1.0.14/go.mod h1:W1PPwlIAgtquWBMBEV9nkV9Cazfe8ScdGz/Lj7v3Nrg=
github.com/mitchellh/cli v1.0.0/go.mod h1:hNIlj7HEI86fIcpObd7a0FcrxTWetlwJDGcceTlRvqc=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db h1:62I3jR2EmQ4l5rM/4FEfDWcRD+abF5XlKShorW5LRoQ=
github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db/go.mod h1:l0dey0ia/Uv7NcFFVbCLtqEBQbrT4OCwCSKTEv6enCw=
github.com/mitchellh/go-homedir v1.0.0/go.mod h1:SfyaCUpYCn1Vlf4IUYiD9fPX4A5wJrkLzIz1N1q0pr0=
github.com/mitchellh/go-homedir v1.1.0/go.mod h1:SfyaCUpYCn1Vlf4IUYiD9fPX4A5wJrkLzIz1N1q0pr0=
github.com/mitchellh/go-testing-interface v1.0.0/go.mod h1:kRemZodwjscx+RGhAo8eIhFbs2+BFgRtFPeD/KE+zxI=
@@ -295,6 +300,8 @@ github.com/prometheus/common v0.4.0/go.mod h1:TNfzLD0ON7rHzMJeJkieUDPYmFC7Snx/y8
github.com/prometheus/procfs v0.0.0-20181005140218-185b4288413d/go.mod h1:c3At6R/oaqEKCNdg8wHV1ftS6bRYblBhIjjI8uT2IGk=
github.com/prometheus/procfs v0.0.0-20190507164030-5867b95ac084/go.mod h1:TjEm7ze935MbeOT/UhFTIMYKhuLP4wbCsTZCD3I8kEA=
github.com/prometheus/tsdb v0.7.1/go.mod h1:qhTCs0VvXwvX/y3TZrWD7rabWM+ijKTux40TwIPHuXU=
github.com/rivo/uniseg v0.2.0 h1:S1pD9weZBuJdFmowNwbpi7BJ8TNftyUImj/0WQi72jY=
github.com/rivo/uniseg v0.2.0/go.mod h1:J6wj4VEh+S6ZtnVlnTBMWIodfgj8LQOQFoIToxlJtxc=
github.com/rogpeppe/fastuuid v0.0.0-20150106093220-6724a57986af/go.mod h1:XWv6SoW27p1b0cqNHllgS5HIMJraePCO15w5zCzIWYg=
github.com/rogpeppe/fastuuid v1.2.0/go.mod h1:jVj6XXZzXRy/MSR5jhDC/2q6DgLz+nrA6LYCDYWNEvQ=
github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFRclV5y23lUDJ4=
@@ -302,6 +309,8 @@ github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQD
github.com/ryanuber/columnize v0.0.0-20160712163229-9b3edd62028f/go.mod h1:sm1tb6uqfes/u+d4ooFouqFdy9/2g9QGwK3SQygK0Ts=
github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca h1:NugYot0LIVPxTvN8n+Kvkn6TrbMyxQiuvKdEwFdR9vI=
github.com/saintfish/chardet v0.0.0-20120816061221-3af4cd4741ca/go.mod h1:uugorj2VCxiV1x+LzaIdVa9b4S4qGAcH6cbhh4qVxOU=
github.com/schollz/progressbar/v3 v3.8.3 h1:FnLGl3ewlDUP+YdSwveXBaXs053Mem/du+wr7XSYKl8=
github.com/schollz/progressbar/v3 v3.8.3/go.mod h1:pWnVCjSBZsT2X3nx9HfRdnCDrpbevliMeoEVhStwHko=
github.com/sean-/seed v0.0.0-20170313163322-e2103e2c3529/go.mod h1:DxrIzT+xaE7yg65j358z/aeFdxmN0P9QXhEzd20vsDc=
github.com/sebdah/goldie/v2 v2.5.1 h1:hh70HvG4n3T3MNRJN2z/baxPR8xutxo7JVxyi2svl+s=
github.com/sebdah/goldie/v2 v2.5.1/go.mod h1:oZ9fp0+se1eapSRjfYbsV/0Hqhbuu3bJVvKI/NNtssI=
@@ -380,6 +389,8 @@ golang.org/x/crypto v0.0.0-20190605123033-f99c8df09eb5/go.mod h1:yigFU9vqHzYiE8U
golang.org/x/crypto v0.0.0-20190820162420-60c769a6c586/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI=
golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
golang.org/x/crypto v0.0.0-20210817164053-32db794688a5 h1:HWj/xjIHfjYU5nVXpTM0s39J9CbLn7Cc5a7IC5rwsMQ=
golang.org/x/crypto v0.0.0-20210817164053-32db794688a5/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=
golang.org/x/exp v0.0.0-20190121172915-509febef88a4/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/exp v0.0.0-20190306152737-a1d7652674e8/go.mod h1:CJ0aWSM057203Lf6IL+f9T1iT9GByDxfZKAQTCR3kQA=
golang.org/x/exp v0.0.0-20190510132918-efd6b22b2522/go.mod h1:ZjyILWgesfNpC6sMxTJOJm9Kp84zZh5NQWvqDGG3Qr8=
@@ -533,9 +544,14 @@ golang.org/x/sys v0.0.0-20210403161142-5e06dd20ab57/go.mod h1:h1NjWce9XRLGQEsW7w
golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20210510120138-977fb7262007/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210514084401-e8d321eab015/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c h1:F1jZWGFhYfh0Ci55sIpILtKKK8p3i2/krTr0H1rg74I=
golang.org/x/sys v0.0.0-20210630005230-0f9fa26af87c/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/sys v0.0.0-20210910150752-751e447fb3d0 h1:xrCZDmdtoloIiooiA9q0OQb9r8HejIHYoHGhGCe1pGg=
golang.org/x/sys v0.0.0-20210910150752-751e447fb3d0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=
golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=
golang.org/x/term v0.0.0-20210615171337-6886f2dfbf5b h1:9zKuko04nR4gjZ4+DNjHqRlAJqbJETHwiNKDqTfOjfE=
golang.org/x/term v0.0.0-20210615171337-6886f2dfbf5b/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=
golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.1-0.20180807135948-17ff2d5776d2/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
26
release.sh
@@ -3,19 +3,31 @@
version=$1
platforms=("linux/amd64" "darwin/amd64" "windows/amd64")

if [ "$#" -ne 1 ]; then
    echo "Illegal number of parameters"
    echo "Usage: ./release.sh X.X.X"
    exit 1
fi

for platform in "${platforms[@]}"
do
    platform_split=(${platform//\// })
    GOOS=${platform_split[0]}
    GOARCH=${platform_split[1]}
    output_name='papeer-'$version'-'$GOOS'-'$GOARCH
    output_name=papeer

    if [ $GOOS = "windows" ]; then
        output_name+='.exe'
        env GOOS=$GOOS GOARCH=$GOARCH go build -o "$output_name.exe"
        zip "$output_name-v$version-$GOOS-$GOARCH.exe.zip" "$output_name.exe"
        rm "$output_name.exe"
    else
        env GOOS=$GOOS GOARCH=$GOARCH go build -o "$output_name"
        tar czvf "$output_name-v$version-$GOOS-$GOARCH.tar.gz" "$output_name"
        rm "$output_name"
    fi

    env GOOS=$GOOS GOARCH=$GOARCH go build -o $output_name
    if [ $? -ne 0 ]; then
        echo 'An error has occurred! Aborting the script execution...'
        exit 1
    fi
    # if [ $? -ne 0 ]; then
    # echo 'An error has occurred! Aborting the script execution...'
    # exit 1
    # fi
done