add reverse option

fix: selector depth
update installation instructions
2026-05-25 20:00:47 +00:00 · 2022-02-21 00:32:39 +01:00 · 2022-02-06 23:35:35 +01:00 · 2022-02-05 11:58:02 +01:00 · 2022-02-04 19:42:27 +01:00 · 2022-01-06 00:33:11 +01:00
14 changed files with 1218 additions and 277 deletions
--- a/14
+++ b/14
@@ -0,0 +1,14 @@
+format:
+	gofmt -s -w .
+
+test:
+	go test github.com/lapwat/papeer/book
+
+install:
+	go install
+
+clean:
+	find . -maxdepth 1 -not -name 'README.md' -name '*.md' -delete
+	find . -maxdepth 1 -name '*.epub' -delete
+	find . -maxdepth 1 -name '*.mobi' -delete
+	find . -maxdepth 1 -name 'papeer-v*' -delete
--- a/README.md
+++ b/README.md
@@ -1,17 +1,126 @@
+# Papeer
+
+Papeer is a powerful **ereader internet vacuum**. It can scrape any website, removing ads and keeping only the relevant content (formatted text and images). You can export the content to Markdown, EPUB or MOBI files.
+
+# Table of contents
+
+- [Usage](#usage)
+  * [Scrape a web page](#scrape-a-web-page)
+  * [Scrape a whole website](#scrape-a-whole-website)
+    + [`depth` option](#-depth--option)
+    + [`selector` option](#-selector--option)
+    + [Display the table of contents](#display-the-table-of-contents)
+    + [Scrape time](#scrape-time)
+- [Installation](#installation)
+  * [From source](#from-source)
+  * [From binary](#from-binary)
+    + [Linux / MacOS](#linux---macos)
+    + [Windows](#windows)
+  * [MOBI support](#mobi-support)
+- [Autocompletion](#autocompletion)
+- [Dependencies](#dependencies)
+
+# Usage
+
+## Scrape a web page
+
+The `get` command lets you retrieve the content of any web page.
+
 ```
-❯ papeer get --format=epub --recursive --delay=500 --limit=10 https://news.ycombinator.com/
-[===============================================>--------------------] Chapters 7 / 10
-[====================================================================] 1. Three ex-US intelligence officers admit hacking for UAE
-[====================================================================] 2. Show HN: Time Travel Debugger
-[====================================================================] 3. How much faster is Java 17?
-[====================================================================] 4. The First Webcam Was Invented to Keep an Eye on a Coffee Pot
-[====================================================================] 5. Nikon's 2021 Photomicrography Competition Winners
-[====================================================================] 6. HTTP Status 418 – I'm a teapot
-[====================================================================] 7. H3: Hexagonal hierarchical geospatial indexing system
-[--------------------------------------------------------------------] 8. Automatic cipher suite ordering in Go’s crypto/tls
-[--------------------------------------------------------------------] 9. Find engineering roles at over 800 YC-funded startups
-[--------------------------------------------------------------------] 10. Futarchy: Robin Hanson on prediction markets
-Ebook saved to "Hacker_News.epub"
+Scrape URL content
+
+Usage:
+  papeer get URL [flags]
+
+Examples:
+papeer get https://www.eff.org/cyberspace-independence
+
+Flags:
+  -a, --author string      book author
+      --delay int          time in milliseconds to wait before downloading next chapter, use with depth/selector (default -1)
+  -d, --depth int          scraping depth
+  -f, --format string      file format [stdout, md, epub, mobi] (default "md")
+  -h, --help               help for get
+      --images             retrieve images only
+  -i, --include            include URL as first chapter, use with depth/selector
+  -l, --limit int          limit number of chapters, use with depth/selector (default -1)
+  -n, --name string        book name (default: page title)
+  -o, --offset int         skip first chapters, use with depth/selector
+      --output string      file name (default: book name)
+  -q, --quiet              hide progress bar
+  -r, --reverse            reverse chapter order
+  -s, --selector strings   table of contents CSS selector
+  -t, --threads int        download concurrency, use with depth/selector (default -1)
+      --use-link-name      use link name for chapter title
+```
+
+## Scrape a whole website
+
+If a navigation menu is present on a website, you can scrape the content of each page.
+
+You can activate this mode by using the `depth` or `selector` options.
+
+### `depth` option
+
+This option defaults to 0, in which case `papeer` grabs only the main page.
+
+If you specify a value greater than 0, `papeer` will grab pages as deep as the value you specify.
+
+> Using the `include` option adds all intermediate levels to the book.
+
+### `selector` option
+
+If this option is not specified, `papeer` will grab only the main page.
+
+If this option is specified, `papeer` will select the links (`a` HTML tags) present on the main page, then grab each one of them.
+
+You can chain this option to grab several levels of pages, with a different selector for each level.
+
+### Display the table of contents
+
+Before actually scraping a whole website, it is a good idea to use the `list` command. This command is like a **dry run**, which lets you visualize the content before actually retrieving it. You can use several options to customize the table of contents extraction, such as `selector`, `limit`, `offset`, `reverse` and `include`. Type `papeer list --help` for more information about those options.
+
+```sh
+papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'
+```
+```
+ #  NAME                    URL                                    
+ 1  I. Codebase             https://12factor.net/codebase          
+ 2  II. Dependencies        https://12factor.net/dependencies      
+ 3  III. Config             https://12factor.net/config            
+ 4  IV. Backing services    https://12factor.net/backing-services  
+ 5  V. Build, release, run  https://12factor.net/build-release-run 
+ 6  VI. Processes           https://12factor.net/processes         
+ 7  VII. Port binding       https://12factor.net/port-binding      
+ 8  VIII. Concurrency       https://12factor.net/concurrency       
+ 9  IX. Disposability       https://12factor.net/disposability     
+10  X. Dev/prod parity      https://12factor.net/dev-prod-parity   
+11  XI. Logs                https://12factor.net/logs              
+12  XII. Admin processes    https://12factor.net/admin-processes
+```
+
+### Scrape time
+
+Once you are satisfied with the table of contents listed by the `list` command, you can actually scrape the content of those pages. You can use the same options that you specified for the `list` command. You can specify the `delay` and `threads` options when using the `selector` or `depth` options.
+
+```sh
+papeer get https://12factor.net/ --selector='section.concrete>article>h2>a'
+```
+```
+[======================================>-----------------------------] Chapters 7 / 12
+[====================================================================] 1. I. Codebase
+[====================================================================] 2. II. Dependencies
+[====================================================================] 3. III. Config
+[====================================================================] 4. IV. Backing services
+[====================================================================] 5. V. Build, release, run
+[====================================================================] 6. VI. Processes
+[====================================================================] 7. VII. Port binding
+[--------------------------------------------------------------------] 8. VIII. Concurrency
+[--------------------------------------------------------------------] 9. IX. Disposability
+[--------------------------------------------------------------------] 10. X. Dev/prod parity
+[--------------------------------------------------------------------] 11. XI. Logs
+[--------------------------------------------------------------------] 12. XII. Admin processes
+Markdown saved to "The_Twelve-Factor_App.md"
 ```

 # Installation
@@ -24,21 +133,29 @@ go get -u github.com/lapwat/papeer

 ## From binary

-### On Linux / MacOS
+### Linux / MacOS

 ```sh
+# use platform=darwin for MacOS
 platform=linux
-# platform=darwin for MacOS
-curl -L https://github.com/lapwat/papeer/releases/download/v0.2.0/papeer-v0.2.0-$platform-amd64 > papeer
-chmod +x papeer
+release=0.4.2
+
+# download and extract
+curl -L https://github.com/lapwat/papeer/releases/download/v$release/papeer-v$release-$platform-amd64.tar.gz > papeer.tar.gz
+tar xzvf papeer.tar.gz
+rm papeer.tar.gz
+
+# move to user binaries
 sudo mv papeer /usr/local/bin
 ```

-### On Windows
+### Windows

-Download [latest release](https://github.com/lapwat/papeer/releases/download/v0.2.0/papeer-v0.2.0-windows-amd64.exe).
+Download [latest release](https://github.com/lapwat/papeer/releases/download/v0.4.2/papeer-v0.4.2-windows-amd64.exe.zip).

-## Install kindlegen to export websites to MOBI (optional)
+## MOBI support
+
+Install kindlegen to export websites to MOBI (Linux only).

 ```sh
 TMPDIR=$(mktemp -d -t papeer-XXXXX)
@@ -49,38 +166,6 @@ sudo mv $TMPDIR/kindlegen /usr/local/bin
 rm -rf $TMPDIR
 ```

-# Usage
-
-```
-Browse the web in the eink era
-
-Usage:
-  papeer [flags]
-  papeer [command]
-
-Available Commands:
-  completion  generate the autocompletion script for the specified shell
-  get         Scrape URL content
-  help        Help about any command
-  ls          Print table of content
-  version     Print the version number of papeer
-
-Flags:
-  -d, --delay int         time to wait before downloading next chapter, in milliseconds (default -1)
-  -f, --format string     file format [md, epub, mobi] (default "md")
-  -h, --help              help for papeer
-      --images            retrieve images only
-  -i, --include           include URL as first chapter, in resursive mode
-  -l, --limit int         limit number of chapters, in recursive mode (default -1)
-  -o, --offset int        skip first chapters, in recursive mode
-      --output string     output file
-  -r, --recursive         create one chapter per natigation item
-  -s, --selector string   table of content CSS selector
-      --stdout            print to standard output
-
-Use "papeer [command] --help" for more information about a command.
-```
-
 # Autocompletion

 Execute this command in your current shell, or add it to your `.bashrc`.
@@ -89,10 +174,10 @@ Execute this command in your current shell, or add it to your `.bashrc`.
 . <(papeer completion bash)
 ```

-Type `papeer completion bash -h` for more information.
-
 You can replace `bash` by your own shell (zsh, fish or powershell).

+Type `papeer completion bash -h` for more information.
+
 # Dependencies

 - `cobra` command line interface
--- a/book/chapter.go
+++ b/book/chapter.go
@@ -1,13 +1,20 @@
 package book

 type chapter struct {
-	name    string
-	author  string
-	content string
+	body        string
+	name        string
+	author      string
+	content     string
+	subChapters []chapter
+	config      *ScrapeConfig
 }

-func NewChapter(name, author, content string) chapter {
-	return chapter{name, author, content}
+func NewChapter(body, name, author, content string, subChapters []chapter, config *ScrapeConfig) chapter {
+	return chapter{body, name, author, content, subChapters, config}
+}
+
+func (c chapter) Body() string {
+	return c.body
 }

 func (c chapter) Name() string {
@@ -21,3 +28,7 @@ func (c chapter) Author() string {
 func (c chapter) Content() string {
 	return c.content
 }
+
+func (c chapter) SubChapters() []chapter {
+	return c.subChapters
+}
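
Review note: with `subChapters`, a chapter becomes a tree, and the formatters in `book/format.go` walk it recursively. The sketch below shows that depth-first, pre-order walk in isolation. The `node` type and the `flatten` function are hypothetical stand-ins, not part of this diff, and the sketch ignores the `config.Include` check that the real formatters apply at each level.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the chapter struct in this diff:
// a name plus nested children.
type node struct {
	name     string
	children []node
}

// flatten returns the name of n followed by the names of all its
// descendants, in depth-first pre-order (parent before children).
func flatten(n node) []string {
	names := []string{n.name}
	for _, child := range n.children {
		names = append(names, flatten(child)...)
	}
	return names
}

func main() {
	book := node{"Home", []node{
		{"Part 1", []node{{"Chapter 1.1", nil}, {"Chapter 1.2", nil}}},
		{"Part 2", nil},
	}}

	for _, name := range flatten(book) {
		fmt.Println(name)
	}
}
```

A parent is always emitted before its children, so the exported file keeps the order of the table of contents.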
--- a/book/format.go
+++ b/book/format.go
@@ -0,0 +1,164 @@
+package book
+
+import (
+	"fmt"
+	"log"
+	"os"
+	"os/exec"
+	"strings"
+
+	md "github.com/JohannesKaufmann/html-to-markdown"
+	"github.com/PuerkitoBio/goquery"
+	epub "github.com/bmaupin/go-epub"
+)
+
+func Filename(name string) string {
+	filename := name
+
+	filename = strings.ReplaceAll(filename, " ", "_")
+	filename = strings.ReplaceAll(filename, "/", "")
+
+	return filename
+}
+
+func ToMarkdownString(c chapter) string {
+	markdown := ""
+
+	if c.config.Include {
+		// title
+		markdown += fmt.Sprintf("%s\n", c.Name())
+		markdown += fmt.Sprintf("%s\n\n", strings.Repeat("=", len(c.Name())))
+
+		// convert content to markdown
+		content, err := md.NewConverter("", true, nil).ConvertString(c.Content())
+		if err != nil {
+			log.Fatal(err)
+		}
+		markdown += fmt.Sprintf("%s\n\n\n", content)
+	}
+
+	for _, sc := range c.SubChapters() {
+		// subchapters content
+		markdown += fmt.Sprintf("%s\n\n\n", ToMarkdownString(sc))
+	}
+
+	return markdown
+}
+
+func ToMarkdown(c chapter, filename string) string {
+	if len(filename) == 0 {
+		filename = fmt.Sprintf("%s.md", Filename(c.Name()))
+	}
+
+	markdown := ToMarkdownString(c)
+
+	// write to file
+	f, err := os.Create(filename)
+	if err != nil {
+		log.Fatal(err)
+	}
+	_, err2 := f.WriteString(markdown)
+	if err2 != nil {
+		log.Fatal(err2)
+	}
+	f.Close()
+
+	return filename
+}
+
+func ToEpub(c chapter, filename string) string {
+	if len(filename) == 0 {
+		filename = fmt.Sprintf("%s.epub", Filename(c.Name()))
+	}
+
+	// init ebook
+	e := epub.NewEpub(c.Name())
+	e.SetAuthor(c.Author())
+
+	AppendToEpub(e, c)
+
+	err := e.Write(filename)
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	return filename
+}
+
+func AppendToEpub(e *epub.Epub, c chapter) {
+	content := ""
+
+	if c.config.Include {
+
+		if c.config.ImagesOnly == false {
+			content = c.Content()
+		}
+
+		// parse content
+		doc, err := goquery.NewDocumentFromReader(strings.NewReader(c.Content()))
+		if err != nil {
+			log.Fatal(err)
+		}
+
+		// download images and replace src in img tags of content
+		doc.Find("img").Each(func(i int, s *goquery.Selection) {
+			src, _ := s.Attr("src")
+			src = strings.Split(src, "?")[0] // remove query part
+			imagePath, _ := e.AddImage(src, "")
+
+			if c.config.ImagesOnly {
+				imageTag, _ := goquery.OuterHtml(s)
+				content += strings.Replace(imageTag, src, imagePath, 1)
+			} else {
+				content = strings.Replace(content, src, imagePath, 1)
+			}
+		})
+
+		html := ""
+		// add title only if ImagesOnly = false
+		if c.config.ImagesOnly == false {
+			html += fmt.Sprintf("<h1>%s</h1>", c.Name())
+		}
+		html += content
+
+		//  write to epub file
+		_, err = e.AddSection(html, c.Name(), "", "")
+		if err != nil {
+			log.Fatal(err)
+		}
+
+	}
+
+	for _, sc := range c.SubChapters() {
+		AppendToEpub(e, sc)
+	}
+}
+
+func ToMobi(c chapter, filename string) string {
+	if len(filename) == 0 {
+		filename = fmt.Sprintf("%s.mobi", Filename(c.Name()))
+	} else {
+
+		// add .mobi extension if not specified
+		if strings.HasSuffix(filename, ".mobi") == false {
+			filename = fmt.Sprintf("%s.mobi", filename)
+		}
+
+	}
+
+	filenameEPUB := strings.ReplaceAll(filename, ".mobi", ".epub")
+	ToEpub(c, filenameEPUB)
+
+	exec.Command("kindlegen", filenameEPUB).Run()
+	// the command always returns exit status 1, even when it succeeds
+	// if err != nil {
+	// 	log.Fatal(err)
+	// }
+
+	err := os.Remove(filenameEPUB)
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	return filename
+}
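
Review note: `ToMobi` discards the error from `kindlegen` entirely because, per the comment, the command exits with status 1 even on success. A possible alternative is to tolerate only that status and still report everything else, such as a missing binary. The sketch below assumes that status 1 is the only benign non-zero status; `runTolerating` is a hypothetical helper, and `sh -c "exit N"` stands in for `kindlegen` so the example runs without it.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runTolerating runs the command and treats the listed exit codes as
// success. Any other failure (unknown exit code, binary not found, ...)
// is returned as an error.
func runTolerating(name string, args []string, okCodes ...int) error {
	err := exec.Command(name, args...).Run()
	if err == nil {
		return nil
	}

	// *exec.ExitError is returned when the process started but exited
	// with a non-zero status.
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		for _, code := range okCodes {
			if exitErr.ExitCode() == code {
				return nil
			}
		}
	}

	return fmt.Errorf("%s failed: %w", name, err)
}

func main() {
	// Stand-ins for kindlegen: exit status 1 is tolerated, 2 is not.
	fmt.Println(runTolerating("sh", []string{"-c", "exit 1"}, 1))
	fmt.Println(runTolerating("sh", []string{"-c", "exit 2"}, 1))
}
```

In `ToMobi` this would let a missing `kindlegen` binary surface as an error instead of silently producing no `.mobi` file.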
--- a/book/format_test.go
+++ b/book/format_test.go
@@ -0,0 +1,127 @@
+package book
+
+import (
+	"errors"
+	"os"
+	"testing"
+)
+
+func TestFilename(t *testing.T) {
+
+	got := Filename("This is a chapter / book")
+	want := "This_is_a_chapter__book"
+
+	if got != want {
+		t.Errorf("got %q, wanted %q", got, want)
+	}
+
+}
+
+func TestToMarkdownString(t *testing.T) {
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+
+	got := ToMarkdownString(c)
+	want := "Books\n=====\n\n- [Discours de la Méthode](https://books.lapw.at/posts/ren%C3%A9-descartes-discours-de-la-m%C3%A9thode/)clock 98 min read -\n1637\n\n- [The Twelve-Factor App](https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/)clock 22 min read -\n2011\n\n\n"
+
+	if got != want {
+		t.Errorf("got %q, wanted %q", got, want)
+	}
+
+}
+
+func TestToMarkdown(t *testing.T) {
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMarkdown(c, "")
+
+	filename := "Books.md"
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToMarkdownFilename(t *testing.T) {
+
+	filename := "ebook.md"
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMarkdown(c, filename)
+
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToEpub(t *testing.T) {
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToEpub(c, "")
+
+	filename := "Books.epub"
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToEpubFilename(t *testing.T) {
+
+	filename := "ebook.epub"
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToEpub(c, filename)
+
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToMobi(t *testing.T) {
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMobi(c, "")
+
+	filename := "Books.mobi"
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
+
+func TestToMobiFilename(t *testing.T) {
+
+	filename := "ebook.mobi"
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+	ToMobi(c, filename)
+
+	if _, err := os.Stat(filename); errors.Is(err, os.ErrNotExist) {
+		t.Errorf("%s does not exist: %v", filename, err)
+	} else {
+		if err := os.Remove(filename); err != nil {
+			t.Errorf("cannot remove %v: %v", filename, err)
+		}
+	}
+
+}
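
Review note: every test in this file fetches `https://books.lapw.at/`, so the suite depends on the network and on that page staying unchanged. One way to make such tests hermetic is to serve a fixed page from a local `httptest` server. The sketch below only shows the serving and fetching side; `fetchFixture` is a hypothetical helper, and in a real test the server URL would be passed to `NewChapterFromURL` instead of `http.Get`.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchFixture serves html from a throwaway local server, fetches it over
// HTTP and returns the response body.
func fetchFixture(html string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		io.WriteString(w, html)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	body, err := fetchFixture("<html><head><title>Books</title></head><body></body></html>")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(len(body), "bytes fetched from the local fixture")
}
```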
--- a/book/progress.go
+++ b/book/progress.go
@@ -2,6 +2,7 @@ package book

 import (
 	"fmt"
+	"strings"

 	"github.com/gosuri/uiprogress"
 )
@@ -11,20 +12,22 @@ type progress struct {
 	individuals []*uiprogress.Bar
 }

-func NewProgress(links []link) progress {
+func NewProgress(links []link, parent string, depth int) progress {
 	uiprogress.Start()

 	global := uiprogress.AddBar(len(links))
+	indentGlobal := strings.Repeat("> ", depth)
 	global.AppendFunc(func(b *uiprogress.Bar) string {
-		return fmt.Sprintf("Chapters %d / %d", b.Current(), len(links))
+		return fmt.Sprintf("%v%v (%v / %v)", indentGlobal, parent, b.Current(), len(links))
 	})

-	individuals := []*uiprogress.Bar{}
 	// hide individual bars if more than 50 chapters
+	individuals := []*uiprogress.Bar{}
+	indent := strings.Repeat("- ", depth)
 	if len(links) <= 50 {
 		for index, link := range links {
 			bar := uiprogress.AddBar(1)
-			barText := fmt.Sprintf("%d. %s", index+1, link.text)
+			barText := fmt.Sprintf("%v#%v %v", indent, index+1, link.Text())
 			bar.AppendFunc(func(b *uiprogress.Bar) string {
 				return barText
 			})
@@ -35,13 +38,22 @@ func NewProgress(links []link) progress {
 	return progress{global, individuals}
 }

-func (p *progress) IncrGlobal() {
+func (p *progress) IncrementGlobal() {
 	p.global.Incr()
 }

-func (p *progress) Incr(index int) {
-	p.global.Incr()
+func (p *progress) Increment(index int) {
+	p.IncrementGlobal()
 	if len(p.individuals) > index {
 		p.individuals[index].Incr()
 	}
 }
+
+func (p *progress) UpdateName(index int, name string) {
+	if len(p.individuals) > index {
+		barText := name
+		p.individuals[index].AppendFunc(func(b *uiprogress.Bar) string {
+			return barText
+		})
+	}
+}
--- a/book/scraper.go
+++ b/book/scraper.go
@@ -1,9 +1,12 @@
 package book

 import (
+	"bytes"
 	"fmt"
+	"io"
 	"log"
 	"math"
+	"net/http"
 	urllib "net/url"
 	"strings"
 	"sync"
@@ -14,73 +17,298 @@ import (
 	colly "github.com/gocolly/colly/v2"
 )

-func NewBookFromURL(url, selector string, recursive, include, images bool, limit, offset, delay int) book {
-	if recursive {
-		chapters := tableOfContent(url, selector, limit, offset, delay, include, images)
+type ScrapeConfig struct {
+	Depth       int
+	Selector    string
+	Quiet       bool
+	Limit       int
+	Offset      int
+	Reverse     bool
+	Delay       int
+	Threads     int
+	Include     bool
+	ImagesOnly  bool
+	UseLinkName bool
+}

-		b := New(chapters[0].Name(), chapters[0].Author())
-		for _, c := range chapters {
-			b.AddChapter(c)
-		}
+func NewScrapeConfig() *ScrapeConfig {
+	return &ScrapeConfig{0, "", false, -1, 0, false, -1, -1, true, false, false}
+}

-		return b
+func NewScrapeConfigs(selectors []string) []*ScrapeConfig {
+	configs := []*ScrapeConfig{}
+
+	for _, s := range selectors {
+		config := NewScrapeConfig()
+		config.Selector = s
+
+		configs = append(configs, config)
+	}
+
+	return configs
+}
+
+func NewScrapeConfigsAjin() []*ScrapeConfig {
+	config0 := NewScrapeConfig()
+	config0.Depth = 0
+	config0.Selector = ".dt>a"
+	config0.Limit = 3
+	config0.Offset = 0
+	config0.Delay = 5000
+	config0.Include = false
+
+	config1 := NewScrapeConfig()
+	config1.Depth = 1
+	config1.Selector = ".nav_apb>a"
+	config1.Limit = 3
+	config1.Offset = 1
+	config1.Delay = 5000
+	config1.Include = false
+
+	config2 := NewScrapeConfig()
+	config2.Depth = 2
+	config2.ImagesOnly = true
+
+	return []*ScrapeConfig{config0, config1, config2}
+}
+
+func NewScrapeConfigsWikipedia() []*ScrapeConfig {
+	config0 := NewScrapeConfig()
+	config0.Depth = 0
+	config0.Threads = -1
+	config0.Include = true
+
+	config1 := NewScrapeConfig()
+	config1.Depth = 1
+	config1.Include = true
+
+	return []*ScrapeConfig{config0, config1}
+}
+
+func NewScrapeConfigFake() *ScrapeConfig {
+	config := NewScrapeConfig()
+	config.Include = false
+
+	return config
+}
+
+func NewBookFromURL(url string, selector []string, name, author string, include, ImagesOnly, useLinkName, quiet bool, limit, offset, delay, threads int) book {
+	config1 := NewScrapeConfig()
+	config1.ImagesOnly = ImagesOnly
+	config1.UseLinkName = useLinkName
+
+	var chapters []chapter
+	var home chapter
+
+	if len(selector) > 0 {
+		config2 := NewScrapeConfig()
+		config2.Selector = selector[0]
+		config2.Limit = limit
+		config2.Offset = offset
+		config2.Delay = delay
+		config2.Threads = threads
+		config2.Include = include
+		config2.ImagesOnly = ImagesOnly
+		config2.UseLinkName = useLinkName
+		chapters, home = tableOfContent(url, config2, config1, quiet)
 	} else {
-		c := NewChapterFromURL(url, images)
-		b := New(c.Name(), c.Author())
+		chapters = []chapter{NewChapterFromURL(url, "", []*ScrapeConfig{config1}, 0, func(index int, name string) {})}
+		home = chapters[0]
+	}
+
+	if len(name) == 0 {
+		name = home.Name()
+	}
+
+	if len(author) == 0 {
+		author = home.Author()
+	}
+
+	b := New(name, author)
+	for _, c := range chapters {
 		b.AddChapter(c)
-		return b
 	}
+
+	return b
 }

-func NewChapterFromURL(url string, images bool) chapter {
-	article, err := readability.FromURL(url, 30*time.Second)
-	if err != nil {
-		log.Fatalf("failed to parse %s, %v\n", url, err)
-	}
+func NewChapterFromURL(url, linkName string, configs []*ScrapeConfig, index int, updateProgressBarName func(index int, name string)) chapter {
+	config := configs[0]

-	content := strings.ReplaceAll(article.Content, "\n", "")
-
-	if images {
-		// Load the HTML document
-		doc, err := goquery.NewDocumentFromReader(strings.NewReader(content))
-		if err != nil {
-			log.Fatal(err)
-		}
-
-		// Find the review items
-		doc.Find("img").Each(func(i int, s *goquery.Selection) {
-			content, _ = goquery.OuterHtml(s)
-		})
-	}
-
-	return chapter{article.Title, article.Byline, content}
-}
-
-func tableOfContent(url, selector string, limit, offset, delay int, include, images bool) []chapter {
 	base, err := urllib.Parse(url)
 	if err != nil {
 		log.Fatal(err)
 	}

-	links, err := GetLinks(base, selector, limit, offset, include)
+	// get page body
+	response, err := http.Get(url)
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer response.Body.Close()
+
+	// duplicate response stream
+	readabilityReader := &bytes.Buffer{}
+	bodyReader := io.TeeReader(response.Body, readabilityReader)
+
+	// extract HTML body
+	body, err := io.ReadAll(bodyReader)
+
+	// extract article content and metadata
+	article, err := readability.FromReader(readabilityReader, base)
+	if err != nil {
+		log.Fatalf("failed to parse %s, %v\n", url, err)
+	}
+
+	name := linkName
+	if config.UseLinkName == false {
+		name = article.Title
+
+		// notify progressbar with new name
+		updateProgressBarName(index, name)
+	}
+
+	var subchapters []chapter
+	if len(configs) > 1 {
+
+		// retrieve links on page
+		links, _, _, err := GetLinks(base, config.Selector, config.Limit, config.Offset, config.Reverse, false)
+		if err != nil {
+			log.Fatal(err)
+		}
+
+		// init progress bar
+		var p progress
+		if config.Quiet == false {
+			p = NewProgress(links, name, config.Depth)
+		}
+
+		// init chapters list
+		subchapters = make([]chapter, len(links))
+
+		if config.Delay >= 0 {
+
+			// synchronous mode
+			for index, link := range links {
+				// and then use it to parse relative URLs
+				u, err := base.Parse(link.href)
+				if err != nil {
+					log.Fatal(err)
+				}
+
+				sc := NewChapterFromURL(u.String(), link.text, configs[1:], index, p.UpdateName)
+				subchapters[index] = sc
+				if config.Quiet == false {
+					p.Increment(index)
+				}
+
+				time.Sleep(time.Duration(config.Delay) * time.Millisecond)
+			}
+
+		} else {
+			// asynchronous mode
+			var wg sync.WaitGroup
+
+			threads := config.Threads
+			if threads == -1 {
+				threads = len(links)
+			}
+			semaphore := make(chan bool, threads)
+
+			for index, l := range links {
+
+				wg.Add(1)
+				semaphore <- true
+
+				go func(index int, l link) {
+					defer wg.Done()
+
+					// and then use it to parse relative URLs
+					u, err := base.Parse(l.href)
+					if err != nil {
+						log.Fatal(err)
+					}
+
+					sc := NewChapterFromURL(u.String(), l.text, configs[1:], index, p.UpdateName)
+					subchapters[index] = sc
+
+					if config.Quiet == false {
+						p.Increment(index)
+					}
+
+					<-semaphore
+				}(index, l)
+			}
+			wg.Wait()
+		}
+	}
+
+	content := ""
+	if config.Include {
+
+		// we care about the content only if:
+		// - we include this level
+		// - we use the page name
+		content = article.Content
+
+		// extract images
+		if config.ImagesOnly {
+
+			// parse HTML
+			doc, err := goquery.NewDocumentFromReader(strings.NewReader(content))
+			if err != nil {
+				log.Fatal(err)
+			}
+
+			// append every image to content
+			content = ""
+			doc.Find("img").Each(func(i int, s *goquery.Selection) {
+				imageTag, _ := goquery.OuterHtml(s)
+				imageTag = strings.ReplaceAll(imageTag, "\n", "")
+
+				content += imageTag
+			})
+
+		}
+	}
+
+	return chapter{string(body), name, article.Byline, content, subchapters, config}
+}
+
+func tableOfContent(url string, config *ScrapeConfig, subConfig *ScrapeConfig, quiet bool) ([]chapter, chapter) {
+	base, err := urllib.Parse(url)
+	if err != nil {
+		log.Fatal(err)
+	}
+
+	links, _, home, err := GetLinks(base, config.Selector, config.Limit, config.Offset, config.Reverse, config.Include)
 	if err != nil {
 		log.Fatal(err)
 	}

 	chapters := make([]chapter, len(links))
-	progress := NewProgress(links)
+	delay := config.Delay
+
+	var p progress
+	if quiet == false {
+		p = NewProgress(links, "", 0)
+	}

 	if delay >= 0 {
+		// synchronous mode

-		for index, link := range links {
+		for index, l := range links {
 			// and then use it to parse relative URLs
-			u, err := base.Parse(link.href)
+			u, err := base.Parse(l.href)
 			if err != nil {
 				log.Fatal(err)
 			}

-			chapters[index] = NewChapterFromURL(u.String(), images)
-			progress.Incr(index)
+			chapters[index] = NewChapterFromURL(u.String(), l.text, []*ScrapeConfig{subConfig}, 0, func(index int, name string) {})
+
+			if quiet == false {
+				p.Increment(index)
+			}

 			// short sleep for last chapter to let the progress bar update
 			if index == len(links)-1 {
@@ -91,10 +319,20 @@ func tableOfContent(url, selector string, limit, offset, delay int, include, ima
 		}

 	} else {
+		// asynchronous mode
 		var wg sync.WaitGroup
+
+		threads := config.Threads
+		if threads == -1 {
+			threads = len(links)
+		}
+		semaphore := make(chan bool, threads)
+
 		for index, l := range links {

 			wg.Add(1)
+			semaphore <- true
+
 			go func(index int, l link) {
 				defer wg.Done()

@@ -104,14 +342,19 @@ func tableOfContent(url, selector string, limit, offset, delay int, include, ima
 					log.Fatal(err)
 				}

-				chapters[index] = NewChapterFromURL(u.String(), images)
-				progress.Incr(index)
+				chapters[index] = NewChapterFromURL(u.String(), l.text, []*ScrapeConfig{subConfig}, 0, func(index int, name string) {})

+				if quiet == false {
+					p.Increment(index)
+				}
+
+				<-semaphore
 			}(index, l)
 		}
 		wg.Wait()
 	}
-	return chapters
+
+	return chapters, home
 }

 func GetPath(elm *goquery.Selection) string {
@@ -119,7 +362,7 @@ func GetPath(elm *goquery.Selection) string {

 	for {
 		selector := strings.ToLower(goquery.NodeName(elm))
-		if selector == "" {
+		if len(selector) == 0 {
 			break
 		}

@@ -131,18 +374,18 @@ func GetPath(elm *goquery.Selection) string {
 	return join
 }

-func GetLinks(url *urllib.URL, selector string, limit, offset int, include bool) ([]link, error) {
+func GetLinks(url *urllib.URL, selector string, limit, offset int, reverse, include bool) ([]link, string, chapter, error) {
 	selectorSet := true
-	if selector == "" {
+	if len(selector) == 0 {
 		selector = "a"
 		selectorSet = false
 	}

-	// visit and count link classes
 	pathLinks := map[string][]link{}
 	pathCount := map[string]int{}
 	pathMax := ""

+	// visit and count link classes
 	c := colly.NewCollector()
 	c.OnHTML(selector, func(e *colly.HTMLElement) {
 		href := e.Attr("href")
@@ -150,26 +393,40 @@ func GetLinks(url *urllib.URL, selector string, limit, offset int, include bool)
 		path := GetPath(e.DOM)
 		key := path

-		// include element class in key if selector is set
-		if !selectorSet {
-			class := e.Attr("class")
-			key = fmt.Sprintf("%s.%s", path, class)
-		}
+		if selectorSet {

-		if selectorSet || text != "" {
+			// if selector is set, we use the selector specified by the user
+
+			key = selector
 			pathLinks[key] = append(pathLinks[key], NewLink(href, text))
-			pathCount[key] += len(text)
+			pathCount[key] += 1
+			pathMax = key

-			if pathCount[key] > pathCount[pathMax] {
-				pathMax = key
+		} else {
+
+			// if selector is not set, we compute the selector ourselves
+
+			class := e.Attr("class")
+			// include the element class to make sure we have the same exact path for every link in the table of content
+			key = fmt.Sprintf("%s.%s", path, class)
+
+			// we count this key if the link text is not empty
+			if text != "" {
+				pathLinks[key] = append(pathLinks[key], NewLink(href, text))
+				pathCount[key] += len(text)
+
+				if pathCount[key] > pathCount[pathMax] {
+					pathMax = key
+				}
 			}
+
 		}
 	})
 	c.Visit(url.String())

 	links := pathLinks[pathMax]
 	if len(links) == 0 {
-		return []link{}, fmt.Errorf("no link found for selector: %s", selector)
+		return []link{}, pathMax, chapter{}, fmt.Errorf("no link found for selector: %s", selector)
 	}

 	end := len(links)
@@ -179,11 +436,20 @@ func GetLinks(url *urllib.URL, selector string, limit, offset int, include bool)

 	links = links[offset:end]

+	home := NewChapterFromURL(url.String(), "", []*ScrapeConfig{NewScrapeConfig()}, 0, func(index int, name string) {})
+
+	// include home page
 	if include {
-		c := NewChapterFromURL(url.String(), false)
-		l := NewLink(url.String(), c.Name())
+		l := NewLink(url.String(), home.Name())
 		links = append([]link{l}, links...)
 	}

-	return links, nil
+	// reverse links
+	if reverse {
+		for i, j := 0, len(links)-1; i < j; i, j = i+1, j-1 {
+			links[i], links[j] = links[j], links[i]
+		}
+	}
+
+	return links, pathMax, home, nil
 }
--- a/book/scraper_test.go
+++ b/book/scraper_test.go
@@ -0,0 +1,217 @@
+package book
+
+import (
+	"testing"
+	"time"
+)
+
+func TestBody(t *testing.T) {
+
+	config := NewScrapeConfig()
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
+
+	got := c.Body()
+	want := "<!doctype html>\n<html lang=\"en-us\">\n  <head>\n    <title>Books</title>\n    <link rel=\"shortcut icon\" href=\"/favicon.ico\" />\n    <meta charset=\"utf-8\" />\n    <meta name=\"generator\" content=\"Hugo 0.59.1\" />\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n    <meta name=\"author\" content=\"John Doe\" />\n    <meta name=\"description\" content=\" \" />\n    <link rel=\"stylesheet\" href=\"https://books.lapw.at/css/main.min.88e7083eff65effb7485b6e6f38d10afbec25093a6fac42d734ce9024d3defbd.css\" />\n\n    \n    <meta name=\"twitter:card\" content=\"summary\"/>\n<meta name=\"twitter:title\" content=\"Books\"/>\n<meta name=\"twitter:description\" content=\" \"/>\n\n    <meta property=\"og:title\" content=\"Books\" />\n<meta property=\"og:description\" content=\" \" />\n<meta property=\"og:type\" content=\"website\" />\n<meta property=\"og:url\" content=\"https://books.lapw.at/\" />\n\n\n\n  </head>\n  <body>\n    <header class=\"app-header\">\n      <a href=\"https://books.lapw.at/\"><img class=\"app-header-avatar\" src=\"/book.svg\" alt=\"John Doe\" /></a>\n      <h1>Books</h1>\n      <p> </p>\n      <div class=\"app-header-social\">\n        \n      </div>\n    </header>\n    <main class=\"app-container\">\n      \n  <article>\n    <h1>Books</h1>\n    <ul class=\"posts-list\">\n      \n        <li class=\"posts-list-item\">\n          <a class=\"posts-list-item-title\" href=\"https://books.lapw.at/posts/ren%C3%A9-descartes-discours-de-la-m%C3%A9thode/\">Discours de la Méthode</a>\n          <span class=\"posts-list-item-description\">\n            <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" class=\"icon icon-clock\">\n  <title>clock</title>\n  <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 98 min read -\n            1637\n          </span>\n        </li>\n      \n        <li class=\"posts-list-item\">\n          <a class=\"posts-list-item-title\" href=\"https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/\">The Twelve-Factor App</a>\n          <span class=\"posts-list-item-description\">\n            <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" class=\"icon icon-clock\">\n  <title>clock</title>\n  <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 22 min read -\n            2011\n          </span>\n        </li>\n      \n    </ul>\n    \n\n\n\n  </article>\n\n    </main>\n  </body>\n</html>\n"
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestName(t *testing.T) {
+
+	config := NewScrapeConfig()
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
+
+	got := c.Name()
+	want := "Books"
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestCustomName(t *testing.T) {
+
+	config := NewScrapeConfig()
+	config.UseLinkName = true
+	c := NewChapterFromURL("https://books.lapw.at/", "Custom Name", []*ScrapeConfig{config}, 0, func(index int, name string) {})
+
+	got := c.Name()
+	want := "Custom Name"
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestAuthor(t *testing.T) {
+
+	config := NewScrapeConfig()
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
+
+	got := c.Author()
+	want := "John Doe"
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestContent(t *testing.T) {
+
+	config := NewScrapeConfig()
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
+
+	got := c.Content()
+	want := "<div id=\"readability-page-1\" class=\"page\">\n    \n    <main>\n      \n  <article>\n    \n    <ul>\n      \n        <li>\n          <a href=\"https://books.lapw.at/posts/ren%C3%A9-descartes-discours-de-la-m%C3%A9thode/\">Discours de la Méthode</a>\n          <span>\n            <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\">\n  <title>clock</title>\n  <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 98 min read -\n            1637\n          </span>\n        </li>\n      \n        <li>\n          <a href=\"https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/\">The Twelve-Factor App</a>\n          <span>\n            <svg xmlns=\"http://www.w3.org/2000/svg\" width=\"24\" height=\"24\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\">\n  <title>clock</title>\n  <circle cx=\"12\" cy=\"12\" r=\"10\"></circle><polyline points=\"12 6 12 12 16 14\"></polyline>\n</svg> 22 min read -\n            2011\n          </span>\n        </li>\n      \n    </ul>\n    \n\n\n\n  </article>\n\n    </main>\n  \n\n</div>"
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestDelay(t *testing.T) {
+
+	config0 := NewScrapeConfig()
+	config0.Delay = 500
+
+	config1 := NewScrapeConfig()
+
+	start := time.Now()
+	NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
+	elapsed := time.Since(start)
+
+	got := elapsed
+	want := time.Duration(500) * time.Millisecond
+
+	if got < want {
+		t.Errorf("got %v, wanted min %v", got, want)
+	}
+
+}
+
+func TestContentImagesOnly(t *testing.T) {
+
+	config := NewScrapeConfig()
+	config.ImagesOnly = true
+
+	c := NewChapterFromURL("https://books.lapw.at/posts/adam-wiggins-the-twelve-factor-app/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
+
+	got := c.Content()
+	want := "<img src=\"https://books.lapw.at/images/codebase-deploys.png\" alt=\"One codebase maps to many deploys\"/><img src=\"https://books.lapw.at/images/attached-resources.png\" alt=\"A production deploy attached to four backing services.\"/><img src=\"https://books.lapw.at/images/release.png\" alt=\"Code becomes a build, which is combined with config to create a release.\"/><img src=\"https://books.lapw.at/images/process-types.png\" alt=\"Scale is expressed as running processes, workload diversity is expressed as process types.\"/>"
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestSubChapters(t *testing.T) {
+
+	config0 := NewScrapeConfig()
+	config1 := NewScrapeConfig()
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
+
+	got := len(c.SubChapters())
+	want := 2
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestSubChaptersSelector(t *testing.T) {
+
+	config0 := NewScrapeConfig()
+	config0.Selector = "section.concrete > article > h2 > a"
+
+	config1 := NewScrapeConfig()
+
+	c := NewChapterFromURL("https://12factor.net/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
+
+	got := len(c.SubChapters())
+	want := 12
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestSubChaptersLimit(t *testing.T) {
+
+	config0 := NewScrapeConfig()
+	config0.Limit = 1
+
+	config1 := NewScrapeConfig()
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
+
+	got := len(c.SubChapters())
+	want := 1
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestSubChaptersLimitOver(t *testing.T) {
+
+	config0 := NewScrapeConfig()
+	config0.Limit = 3
+
+	config1 := NewScrapeConfig()
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
+
+	got := len(c.SubChapters())
+	want := 2
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestReverse(t *testing.T) {
+
+	config0 := NewScrapeConfig()
+	config0.Reverse = true
+
+	config1 := NewScrapeConfig()
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config0, config1}, 0, func(index int, name string) {})
+
+	got := c.SubChapters()[0].Name()
+	want := "The Twelve-Factor App"
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
+
+func TestNotInclude(t *testing.T) {
+
+	config := NewScrapeConfig()
+	config.Include = false
+
+	c := NewChapterFromURL("https://books.lapw.at/", "", []*ScrapeConfig{config}, 0, func(index int, name string) {})
+
+	got := c.Content()
+	want := ""
+
+	if got != want {
+		t.Errorf("got %v, wanted %v", got, want)
+	}
+
+}
--- a/cmd/get.go
+++ b/cmd/get.go
@@ -3,172 +3,180 @@ package cmd
 import (
 	"errors"
 	"fmt"
-	"log"
-	"os"
-	"os/exec"
 	"strings"

-	md "github.com/JohannesKaufmann/html-to-markdown"
-	epub "github.com/bmaupin/go-epub"
-	cobra "github.com/spf13/cobra"
+	"github.com/spf13/cobra"

 	"github.com/lapwat/papeer/book"
 )

-var stdout, recursive, include, images bool
-var format, output, selector string
-var limit, offset, delay int
+type GetOptions struct {
+	// url string
+
+	name   string
+	author string
+	Format string
+	output string
+	images bool
+	// ImagesOnly bool
+	quiet bool
+
+	Selector []string
+	depth    int
+	limit    int
+	offset   int
+	reverse  bool
+	delay    int
+	threads  int
+	// includeUrl bool
+	include     bool
+	useLinkName bool
+}
+
+var getOpts *GetOptions
+
+func init() {
+	getOpts = &GetOptions{}
+
+	getCmd.PersistentFlags().StringVarP(&getOpts.name, "name", "n", "", "book name (default: page title)")
+	getCmd.PersistentFlags().StringVarP(&getOpts.author, "author", "a", "", "book author")
+	getCmd.PersistentFlags().StringVarP(&getOpts.Format, "format", "f", "md", "file format [stdout, md, epub, mobi]")
+	getCmd.PersistentFlags().StringVarP(&getOpts.output, "output", "", "", "file name (default: book name)")
+	getCmd.PersistentFlags().BoolVarP(&getOpts.images, "images", "", false, "retrieve images only")
+	getCmd.PersistentFlags().BoolVarP(&getOpts.quiet, "quiet", "q", false, "hide progress bar")
+
+	// common with list command
+	getCmd.Flags().StringSliceVarP(&getOpts.Selector, "selector", "s", []string{}, "table of contents CSS selector")
+	getCmd.Flags().IntVarP(&getOpts.depth, "depth", "d", 0, "scraping depth")
+	getCmd.Flags().IntVarP(&getOpts.limit, "limit", "l", -1, "limit number of chapters, use with depth/selector")
+	getCmd.Flags().IntVarP(&getOpts.offset, "offset", "o", 0, "skip first chapters, use with depth/selector")
+	getCmd.Flags().BoolVarP(&getOpts.reverse, "reverse", "r", false, "reverse chapter order")
+	getCmd.Flags().IntVarP(&getOpts.delay, "delay", "", -1, "time in milliseconds to wait before downloading next chapter, use with depth/selector")
+	getCmd.Flags().IntVarP(&getOpts.threads, "threads", "t", -1, "download concurrency, use with depth/selector")
+	getCmd.Flags().BoolVarP(&getOpts.include, "include", "i", false, "include URL as first chapter, use with depth/selector")
+	getCmd.Flags().BoolVarP(&getOpts.useLinkName, "use-link-name", "", false, "use link name for chapter title")
+
+	rootCmd.AddCommand(getCmd)
+}

 var getCmd = &cobra.Command{
-	Use:   "get",
-	Short: "Scrape URL content",
+	Use:     "get URL",
+	Short:   "Scrape URL content",
+	Example: "papeer get https://www.eff.org/cyberspace-independence",
 	Args: func(cmd *cobra.Command, args []string) error {
 		if len(args) < 1 {
 			return errors.New("requires an URL argument")
 		}

 		formatEnum := map[string]bool{
-			"md":   true,
-			"epub": true,
-			"mobi": true,
-		}
-		if formatEnum[format] != true {
-			return fmt.Errorf("invalid format specified: %s", format)
+			"stdout": true,
+			"md":     true,
+			"epub":   true,
+			"mobi":   true,
 		}

-		if format == "epub" || format == "mobi" {
-			if stdout {
-				return errors.New("cannot print EPUB/MOBI file to standard output")
+		if !formatEnum[getOpts.Format] {
+			return fmt.Errorf("invalid format specified: %s", getOpts.Format)
+		}
+
+		// add .mobi to filename if not specified
+		if getOpts.Format == "mobi" {
+			if len(getOpts.output) > 0 && !strings.HasSuffix(getOpts.output, ".mobi") {
+				getOpts.output = fmt.Sprintf("%s.mobi", getOpts.output)
 			}
 		}

-		if format == "mobi" {
-			if len(output) > 0 && strings.HasSuffix(output, ".mobi") == false {
-				output = fmt.Sprintf("%s.mobi", output)
-			}
+		if cmd.Flags().Changed("include") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
+			return errors.New("cannot use include option if depth/selector is not specified")
 		}

-		if cmd.Flags().Changed("selector") && recursive == false {
-			return errors.New("cannot use selector option if not in recursive mode")
+		if cmd.Flags().Changed("limit") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
+			return errors.New("cannot use limit option if depth/selector is not specified")
 		}

-		if cmd.Flags().Changed("include") && recursive == false {
-			return errors.New("cannot use include option if not in recursive mode")
+		if cmd.Flags().Changed("offset") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
+			return errors.New("cannot use offset option if depth/selector is not specified")
 		}

-		if cmd.Flags().Changed("limit") && recursive == false {
-			return errors.New("cannot use limit option if not in recursive mode")
+		if cmd.Flags().Changed("reverse") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
+			return errors.New("cannot use reverse option if depth/selector is not specified")
 		}

-		if cmd.Flags().Changed("offset") && recursive == false {
-			return errors.New("cannot use offset option if not in recursive mode")
+		if cmd.Flags().Changed("delay") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
+			return errors.New("cannot use delay option if depth/selector is not specified")
 		}

-		if cmd.Flags().Changed("delay") && recursive == false {
-			return errors.New("cannot use delay option if not in recursive mode")
+		if cmd.Flags().Changed("threads") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
+			return errors.New("cannot use threads option if depth/selector is not specified")
+		}
+
+		if cmd.Flags().Changed("use-link-name") && getOpts.depth == 0 && len(getOpts.Selector) == 0 {
+			return errors.New("cannot use use-link-name option if depth/selector is not specified")
+		}
+
+		if cmd.Flags().Changed("delay") && cmd.Flags().Changed("threads") {
+			return errors.New("cannot use delay and threads options at the same time")
 		}

 		return nil
 	},
 	Run: func(cmd *cobra.Command, args []string) {
 		url := args[0]
-		b := book.NewBookFromURL(url, selector, recursive, include, images, limit, offset, delay)

-		if len(output) == 0 {
-			// set default output
-			output = strings.ReplaceAll(b.Name(), " ", "_")
-			output = strings.ReplaceAll(output, "/", "")
-			output = fmt.Sprintf("%s.%s", output, format)
+		// fill selector array with empty selectors to match depth
+		getOpts.Selector = append(getOpts.Selector, "")
+		for len(getOpts.Selector) < getOpts.depth+1 {
+			getOpts.Selector = append(getOpts.Selector, "")
 		}

-		if format == "md" {
-			var f *os.File
-			var err error
+		// generate config for each level
+		configs := make([]*book.ScrapeConfig, len(getOpts.Selector))
+		for index, s := range getOpts.Selector {
+			config := book.NewScrapeConfig()
+			config.Selector = s
+			config.Quiet = getOpts.quiet
+			config.Limit = getOpts.limit
+			config.Offset = getOpts.offset
+			config.Reverse = getOpts.reverse
+			config.Delay = getOpts.delay
+			config.Threads = getOpts.threads
+			config.ImagesOnly = getOpts.images
+			config.Include = getOpts.include
+			config.UseLinkName = getOpts.useLinkName

-			if !stdout {
-				f, err = os.Create(output)
-				if err != nil {
-					log.Fatal(err)
-				}
-				defer f.Close()
+			// do not use link name for root level as there is no parent link
+			if index == 0 {
+				config.UseLinkName = false
 			}

-			for _, c := range b.Chapters() {
-				content, err := md.NewConverter("", true, nil).ConvertString(c.Content())
-				if err != nil {
-					log.Fatal(err)
-				}
-
-				text := fmt.Sprintf("%s\n%s\n\n%s\n\n\n", c.Name(), strings.Repeat("=", len(c.Name())), content)
-
-				if stdout {
-					fmt.Println(text)
-				} else {
-					_, err := f.WriteString(text)
-					if err != nil {
-						log.Fatal(err)
-					}
-
-				}
+			// always include last level by default
+			if index == len(getOpts.Selector)-1 {
+				config.Include = true
 			}

-			if stdout == false {
-				fmt.Printf("Markdown saved to \"%s\"\n", output)
-			}
+			configs[index] = config
 		}

-		if format == "epub" {
-			e := epub.NewEpub(b.Name())
-			e.SetAuthor(b.Author())
+		c := book.NewChapterFromURL(url, "", configs, 0, func(index int, name string) {})

-			for _, c := range b.Chapters() {
-				if images {
-					e.AddSection(c.Content(), "", "", "")
-				} else {
-					html := fmt.Sprintf("<h1>%s</h1>%s", c.Name(), c.Content())
-
-					_, err := e.AddSection(html, c.Name(), "", "")
-					if err != nil {
-						log.Fatal(err)
-					}
-				}
-			}
-
-			err := e.Write(output)
-			if err != nil {
-				log.Fatal(err)
-			}
-
-			fmt.Printf("Ebook saved to \"%s\"\n", output)
+		if getOpts.Format == "stdout" {
+			markdown := book.ToMarkdownString(c)
+			fmt.Println(markdown)
 		}

-		if format == "mobi" {
-			e := epub.NewEpub(b.Name())
-			e.SetAuthor(b.Author())
+		if getOpts.Format == "md" {
+			filename := book.ToMarkdown(c, getOpts.output)
+			fmt.Printf("Markdown saved to \"%s\"\n", filename)
+		}

-			for _, chapter := range b.Chapters() {
-				e.AddSection(chapter.Content(), chapter.Name(), "", "")
-			}
+		if getOpts.Format == "epub" {
+			filename := book.ToEpub(c, getOpts.output)
+			fmt.Printf("Ebook saved to \"%s\"\n", filename)
+		}

-			outputEPUB := strings.ReplaceAll(output, ".mobi", ".epub")
-
-			err := e.Write(outputEPUB)
-			if err != nil {
-				log.Fatal(err)
-			}
-
-			exec.Command("kindlegen", outputEPUB).Run()
-			// exec command always return status 1 even if it fails
-			// if err != nil {
-			// 	log.Fatal(err)
-			// }
-
-			fmt.Printf("Ebook saved to \"%s\"\n", output)
-
-			err2 := os.Remove(outputEPUB)
-			if err2 != nil {
-				log.Fatal(err2)
-			}
+		if getOpts.Format == "mobi" {
+			filename := book.ToMobi(c, getOpts.output)
+			fmt.Printf("Ebook saved to \"%s\"\n", filename)
 		}
 	},
 }
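
Note on the selector padding in `Run`: one empty selector is always appended after the user-supplied ones, then more are added until the list has `depth+1` entries, so each scraping level gets its own config. The sketch below isolates that loop. `padSelectors` is a hypothetical helper name used only for illustration, and it works on a copy rather than mutating `getOpts.Selector`.

```go
package main

import "fmt"

// padSelectors is a hypothetical helper mirroring the loop in the get
// command: a trailing empty selector is always appended, then empty
// selectors are added until there are at least depth+1 levels.
func padSelectors(selectors []string, depth int) []string {
	// Work on a copy so the input slice is not modified.
	padded := append([]string{}, selectors...)

	// The last level never needs a selector of its own.
	padded = append(padded, "")

	for len(padded) < depth+1 {
		padded = append(padded, "")
	}

	return padded
}

func main() {
	// No selector, depth 0: a single level (the page itself).
	fmt.Printf("%q\n", padSelectors(nil, 0))

	// One selector, depth 0: two levels, even though depth was not set.
	fmt.Printf("%q\n", padSelectors([]string{"section.concrete>article>h2>a"}, 0))

	// No selector, depth 2: three levels with automatic link detection.
	fmt.Printf("%q\n", padSelectors(nil, 2))
}
```

One consequence visible in the second call: passing a single `--selector` already yields two levels, which is presumably why the flag checks above accept either `depth` or `selector`.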
--- a/cmd/list.go
+++ b/cmd/list.go
@@ -2,9 +2,11 @@ package cmd

 import (
 	"errors"
+	"fmt"
 	"log"
 	urllib "net/url"
 	"os"
+	"strings"

 	"github.com/jedib0t/go-pretty/v6/table"
 	cobra "github.com/spf13/cobra"
@@ -12,9 +14,45 @@ import (
 	"github.com/lapwat/papeer/book"
 )

+type ListOptions struct {
+	// url string
+
+	Selector []string
+	depth    int
+	limit    int
+	offset   int
+	reverse  bool
+	delay    int
+	threads  int
+	// includeUrl bool
+	include     bool
+	useLinkName bool
+}
+
+var listOpts *ListOptions
+
+func init() {
+	listOpts = &ListOptions{}
+
+	// common with get command
+	listCmd.Flags().StringSliceVarP(&listOpts.Selector, "selector", "s", []string{}, "table of contents CSS selector")
+	listCmd.Flags().IntVarP(&listOpts.depth, "depth", "d", 0, "scraping depth")
+	listCmd.Flags().IntVarP(&listOpts.limit, "limit", "l", -1, "limit number of chapters, use with depth/selector")
+	listCmd.Flags().IntVarP(&listOpts.offset, "offset", "o", 0, "skip first chapters, use with depth/selector")
+	listCmd.Flags().BoolVarP(&listOpts.reverse, "reverse", "r", false, "reverse chapter order")
+	listCmd.Flags().IntVarP(&listOpts.delay, "delay", "", -1, "time in milliseconds to wait before downloading next chapter, use with depth/selector")
+	listCmd.Flags().IntVarP(&listOpts.threads, "threads", "t", -1, "download concurrency, use with depth/selector")
+	listCmd.Flags().BoolVarP(&listOpts.include, "include", "i", false, "include URL as first chapter, use with depth/selector")
+	listCmd.Flags().BoolVarP(&listOpts.useLinkName, "use-link-name", "", false, "use link name for chapter title")
+
+	rootCmd.AddCommand(listCmd)
+}
+
 var listCmd = &cobra.Command{
-	Use:   "ls",
-	Short: "Print table of content",
+	Use:     "list URL",
+	Aliases: []string{"ls"},
+	Short:   "Print URL table of contents",
+	Example: "papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'",
 	Args: func(cmd *cobra.Command, args []string) error {
 		if len(args) < 1 {
 			return errors.New("requires an URL argument")
@@ -22,12 +60,16 @@ var listCmd = &cobra.Command{
 		return nil
 	},
 	Run: func(cmd *cobra.Command, args []string) {
+		if len(listOpts.Selector) == 0 {
+			listOpts.Selector = []string{""}
+		}
+
 		base, err := urllib.Parse(args[0])
 		if err != nil {
 			log.Fatal(err)
 		}

-		links, err := book.GetLinks(base, selector, limit, offset, include)
+		links, path, _, err := book.GetLinks(base, listOpts.Selector[0], listOpts.limit, listOpts.offset, listOpts.reverse, listOpts.include)
 		if err != nil {
 			log.Fatal(err)
 		}
@@ -37,7 +79,16 @@ var listCmd = &cobra.Command{
 		t.Style().Options.DrawBorder = false
 		t.Style().Options.SeparateColumns = false
 		t.Style().Options.SeparateHeader = false
-		t.AppendHeader(table.Row{"#", "Name", "Url"})
+
+		// format selector path
+		pathArray := strings.Split(path, "<")
+		// reverse path
+		for i, j := 0, len(pathArray)-1; i < j; i, j = i+1, j-1 {
+			pathArray[i], pathArray[j] = pathArray[j], pathArray[i]
+		}
+		pathFormatted := strings.Join(pathArray, ">")
+
+		t.AppendHeader(table.Row{"#", "Name", fmt.Sprintf("Url [%s]", pathFormatted)})

 		for index, link := range links {
 			u, err := base.Parse(link.Href())
--- a/cmd/root.go
+++ b/cmd/root.go
@@ -21,19 +21,3 @@ func Execute() {
 		os.Exit(1)
 	}
 }
-
-func init() {
-	rootCmd.PersistentFlags().StringVarP(&format, "format", "f", "md", "file format [md, epub, mobi]")
-	rootCmd.PersistentFlags().StringVarP(&output, "output", "", "", "output file")
-	rootCmd.PersistentFlags().StringVarP(&selector, "selector", "s", "", "table of content CSS selector, in resursive mode")
-	rootCmd.PersistentFlags().BoolVarP(&recursive, "recursive", "r", false, "create one chapter per natigation item")
-	rootCmd.PersistentFlags().BoolVarP(&include, "include", "i", false, "include URL as first chapter, in resursive mode")
-	rootCmd.PersistentFlags().BoolVarP(&stdout, "stdout", "", false, "print to standard output")
-	rootCmd.PersistentFlags().BoolVarP(&images, "images", "", false, "retrieve images only")
-	rootCmd.PersistentFlags().IntVarP(&limit, "limit", "l", -1, "limit number of chapters, in recursive mode")
-	rootCmd.PersistentFlags().IntVarP(&offset, "offset", "o", 0, "skip first chapters, in recursive mode")
-	rootCmd.PersistentFlags().IntVarP(&delay, "delay", "d", -1, "time to wait before downloading next chapter, in milliseconds")
-
-	rootCmd.AddCommand(getCmd)
-	rootCmd.AddCommand(listCmd)
-}
--- a/cmd/utils.go
+++ b/cmd/utils.go
@@ -1,5 +0,0 @@
-package cmd
-
-func getTableOfContent() {
-
-}
--- a/cmd/version.go
+++ b/cmd/version.go
@@ -14,6 +14,6 @@ var versionCmd = &cobra.Command{
 	Use:   "version",
 	Short: "Print the version number of papeer",
 	Run: func(cmd *cobra.Command, args []string) {
-		fmt.Println("papeer v0.1.1")
+		fmt.Println("papeer v0.4.2")
 	},
 }
--- a/release.sh
+++ b/release.sh
@@ -1,5 +1,11 @@
 #!/usr/bin/env bash

+if [ "$#" -ne 1 ]; then
+    echo "Illegal number of parameters"
+    echo "Usage: ./release.sh X.X.X"
+    exit 1
+fi
+
 version=$1
 platforms=("linux/amd64" "darwin/amd64" "windows/amd64")

@@ -8,14 +14,15 @@ do
    platform_split=(${platform//\// })
    GOOS=${platform_split[0]}
    GOARCH=${platform_split[1]}
-    output_name='papeer-'$version'-'$GOOS'-'$GOARCH
-    if [ $GOOS = "windows" ]; then
-        output_name+='.exe'
-    fi
+    output_name=papeer

-    env GOOS=$GOOS GOARCH=$GOARCH go build -o $output_name
-    if [ $? -ne 0 ]; then
-        echo 'An error has occurred! Aborting the script execution...'
-        exit 1
+    if [ $GOOS = "windows" ]; then
+        env GOOS=$GOOS GOARCH=$GOARCH go build -o "$output_name.exe"
+        zip "$output_name-v$version-$GOOS-$GOARCH.exe.zip" "$output_name.exe"
+        rm "$output_name.exe"
+    else
+        env GOOS=$GOOS GOARCH=$GOARCH go build -o "$output_name"
+        tar czvf "$output_name-v$version-$GOOS-$GOARCH.tar.gz" "$output_name"
+        rm "$output_name"
    fi
 done
Author	SHA1	Message	Date
lapwat	be69854b17	add reverse option	2022-02-21 00:32:39 +01:00
lapwat	d8a3cc027f	fix: selector depth	2022-02-06 23:35:35 +01:00
lapwat	be45a8f744	update installation instructions	2022-02-05 11:58:02 +01:00
lapwat	4b760c9562	chain selctors, depth & quiet options, split main commands	2022-02-04 19:42:27 +01:00
lapwat	26b144fb73	Update README.md	2022-01-06 00:33:11 +01:00
lapwat	900ee8c5d7	Update README.md	2022-01-02 15:13:37 +01:00
lapwat	008e4ebd7a	feat: quiet option	2022-01-02 14:58:55 +01:00
lapwat	5e735f9c52	refacto get command, fix: images option	2022-01-02 02:16:45 +01:00
lapwat	29008185a8	test tomobi, update progress style	2021-12-27 13:01:45 +01:00
lapwat	ff3d09c727	add test suites, scrape config	2021-12-22 22:28:19 +01:00
lapwat	0009435769	bump version in readme	2021-12-19 18:27:17 +01:00
lapwat	4e9b0611e8	add book name and author options	2021-12-19 18:20:54 +01:00
lapwat	e7ffd8c66c	fix: display images in epub	2021-12-19 18:15:22 +01:00
lapwat	84e6ad8585	bump version in readme	2021-12-19 18:15:13 +01:00
lapwat	d593a74e6e	fix: display images in epub	2021-12-09 23:40:55 +01:00
lapwat	d5971a2819	add threads option	2021-10-10 22:02:39 +02:00