add proxy feature, update readme

2026-05-25 04:17:19 +00:00 · 2023-03-07 22:44:16 +01:00
parent 21d17274ce
commit 6e55e9a994
7 changed files with 333 additions and 235 deletions
--- a/README.md
+++ b/README.md
@@ -1,91 +1,98 @@
-# Papeer
+<h1 align="center">
+  <img src="logo.png" alt="Papeer">
+  <br>
+  Papeer
+</h1>

-Papeer is a powerful **ereader internet vacuum**. It can scrape any website, removing ads and keeping only the relevant content (formatted text and images). You can export the content to Markdown, HTML, EPUB or MOBI files.
+<h4 align="center">Web scraper for ereaders</h4>

-# Table of contents
+<p align="center">
+  <a href="#features">Features</a> •
+  <a href="#installation">Installation</a> •
+  <a href="#how-to-use">How To Use</a>
+</p>

- [Usage](#usage)
-  * [Scrape a web page](#scrape-a-web-page)
-  * [Scrape a whole website](#scrape-a-whole-website)
-    + [`depth` option](#-depth--option)
-    + [`selector` option](#-selector--option)
-    + [Display the table of contents](#display-the-table-of-contents)
-    + [Scrape time](#scrape-time)
- [Installation](#installation)
-  * [From source](#from-source)
-  * [From binary](#from-binary)
-    + [Linux / MacOS](#linux---macos)
-    + [Windows](#windows)
-  * [MOBI support](#mobi-support)
- [Autocompletion](#autocompletion)
- [Dependencies](#dependencies)
+  <img src="terminal.gif" alt="Papeer">

-# Usage

-## Scrape a web page
+## Features

-The `get` command lets you retrieve the content of any web page or RSS feed.
+* Scrape websites and RSS feeds
+* Keep relevant content only
+  - Formatted text (bold, italic, links)
+  - Images
+* Save websites as Markdown, HTML, EPUB or MOBI files
+* Use it as an HTTP proxy
+* Cross platform
+  - Windows, MacOS and Linux ready

-```
-Scrape URL content
+# Installation

-Usage:
-  papeer get URL [flags]
-
-Examples:
-papeer get https://www.eff.org/cyberspace-independence
-
-Flags:
-  -a, --author string      book author
-      --delay int          time in milliseconds to wait before downloading next chapter, use with depth/selector (default -1)
-  -d, --depth int          scraping depth
-  -f, --format string      file format [md, html, epub, mobi] (default "md")
-  -h, --help               help for get
-      --images             retrieve images only
-  -i, --include            include URL as first chapter, use with depth/selector
-  -l, --limit int          limit number of chapters, use with depth/selector (default -1)
-  -n, --name string        book name (default: page title)
-  -o, --offset int         skip first chapters, use with depth/selector
-      --output string      file name (default: book name)
-  -q, --quiet              hide progress bar
-  -r, --reverse            reverse chapter order
-  -s, --selector strings   table of contents CSS selector
-      --stdout             print to standard output
-  -t, --threads int        download concurrency, use with depth/selector (default -1)
-      --use-link-name      use link name for chapter title
-```
-
-## Scrape a whole website
-
-If a navigation menu is present on a website, you can scrape the content of each page.
-
-You can activate this mode by using the `depth` or `selector` options.
-
-### `depth` option
-
-This option defaults to 0, `papeer` will grab only the main page.
-
-This option defaults to 1 if the `limit` option is specified.
-
-If you specify a value greater than 0, `papeer` will grab pages as deep as the value you specify.
-
-> Using `include` option will include all intermediary levels into the book.
-
-### `selector` option
-
-If this option is not specified, `papeer` will grab only the one page.
-
-If this option is specified, `papeer` will select the links (a HTML tag) present on the main page, then grab each one of them.
-
-You can chain this option to grab several level of pages with diferent selectors for each level.
-
-### Display the table of contents
-
-Before actually scraping a whole website, it is a good idea to use the `list` command. This command is like a **dry run**, which lets you vizualize the content before actually retrieving it. You can use several options to customize the table of contents extraction, such as `selector`, `limit`, `offset`, `reverse` and `include`. Type `papeer list --help` for more information about those options.
+## From source

 ```sh
-papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'
+go install github.com/lapwat/papeer@latest
 ```
+
+## From binary
+
+Download the [latest release](https://github.com/lapwat/papeer/releases/latest) for Windows, MacOS (darwin) or Linux.
+
+## MOBI support
+
+Install kindlegen to convert websites to MOBI (Linux only).
+
+```sh
+TMPDIR=$(mktemp -d -t papeer-XXXXX)
+curl -L https://github.com/lapwat/papeer/releases/download/kindlegen/kindlegen_linux_2.6_i386_v2_9.tar.gz > $TMPDIR/kindlegen.tar.gz
+tar xzvf $TMPDIR/kindlegen.tar.gz -C $TMPDIR
+chmod +x $TMPDIR/kindlegen
+sudo mv $TMPDIR/kindlegen /usr/local/bin
+rm -rf $TMPDIR
+```
+
+Now you can use `--format=mobi` in your `get` command.
+
+## How To Use
+
+### Scrape a single page
+
+```sh
+papeer get URL
+```
+
+The `get` command lets you retrieve the content of a web page.
+
+It removes ads and menus with `go-readability`, keeping only formatted text and images.
+
+You can pass several URLs in a single command.
+
+**Options**
+
+```sh
+-a, --author string      book author
+-f, --format string      file format [md, html, epub, mobi] (default "md")
+-h, --help               help for get
+    --images             retrieve images only
+-n, --name string        book name (default: page title)
+    --output string      file name (default: book name)
+    --stdout             print to standard output
+```
+
+### Scrape a whole website recursively
+
+**Display the table of contents**
+
+Before scraping a whole website, it is a good idea to use the `list` command. This command is like a _dry run_, **which lets you visualize the content before retrieving it**.
+
+You can use several options to customize the table of contents extraction, such as `selector`, `limit`, `offset`, `reverse` and `include`. Type `papeer list --help` for more information about those options.
+
+The `selector` option should point to **`<a>` HTML tags**. If you don't specify it, the selector is determined automatically from the links present on the page.
+
+```sh
+papeer list https://12factor.net/ --selector='section.concrete>article>h2>a'
+```
+
 ```
 #  NAME                    URL                                    
 1  I. Codebase             https://12factor.net/codebase          
@@ -102,71 +109,81 @@ papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'
 12  XII. Admin processes    https://12factor.net/admin-processes
 ```

-### Scrape time
+**Scrape the content**

-Once you are satisfied with the table of contents listed by the `ls` command, you can actually scrape the content of those pages. You can use the same options that you specified for the `ls` command. You can specify `delay` and `threads` options when using `selector` or `depth` options.
+Once you are satisfied with the table of contents listed by the `list` command, you can scrape the content of those pages with the `get` command. You can use the same options that you specified for the `list` command.

 ```sh
 papeer get https://12factor.net/ --selector='section.concrete>article>h2>a'
 ```
+
 ```
-[======================================>-----------------------------] Chapters 7 / 12
-[====================================================================] 1. I. Codebase
-[====================================================================] 2. II. Dependencies
-[====================================================================] 3. III. Config
-[====================================================================] 4. IV. Backing services
-[====================================================================] 5. V. Build, release, run
-[====================================================================] 6. VI. Processes
-[====================================================================] 7. VII. Port binding
-[--------------------------------------------------------------------] 8. VIII. Concurrency
-[--------------------------------------------------------------------] 9. IX. Disposability
-[--------------------------------------------------------------------] 10. X. Dev/prod parity
-[--------------------------------------------------------------------] 11. XI. Logs
-[--------------------------------------------------------------------] 12. XII. Admin processes
+[===>-----------------------------] Chapters 7 / 12
+[=================================] 1. I. Codebase
+[=================================] 2. II. Dependencies
+[=================================] 3. III. Config
+[=================================] 4. IV. Backing services
+[=================================] 5. V. Build, release, run
+[=================================] 6. VI. Processes
+[=================================] 7. VII. Port binding
+[---------------------------------] 8. VIII. Concurrency
+[---------------------------------] 9. IX. Disposability
+[---------------------------------] 10. X. Dev/prod parity
+[---------------------------------] 11. XI. Logs
+[---------------------------------] 12. XII. Admin processes
 Markdown saved to "The_Twelve-Factor_App.md"
 ```

-# Installation
+**Recursive mode options**

-## From source
+If a navigation menu is present on a website, you can scrape the content of each subpage.
+
+You can activate this mode by using the `depth` or `selector` options.
+
+**`depth`**
+
+This option defaults to 0, in which case `papeer` grabs only the main page.
+
+This option defaults to 1 if the `limit` option is specified.
+
+If you specify a value greater than 0, `papeer` will grab pages as deep as the value you specify.
+
+**`selector`**
+
+If this option is not specified, `papeer` grabs only the main page.
+
+If this option is specified, `papeer` selects the links (`<a>` HTML tags) present on the main page, then grabs each one of them.
+
+You can chain this option to grab several levels of pages, with a different selector for each level.
+
+**`include`**
+
+Using this option includes all intermediary levels in the book.
+
+**`delay` `threads`**
+
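+By default, `papeer` grabs all the pages asynchronously.
+
+Use these options to throttle scrape requests: `delay` sets the time in milliseconds to wait before downloading the next chapter, and `threads` sets the download concurrency.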
+
+**Automatic table of contents extraction**
+
+If you set a `depth` greater than 1 with no `selector`, the selector is determined automatically from the links present on the parent page.
+
+# Proxy
+
+You can use the `proxy` command to run `papeer` as an HTTP proxy. It serves HTML or Markdown content depending on the `--output` option.

 ```sh
-go install github.com/lapwat/papeer@latest
+papeer proxy --output=md
+# Proxy listening on port 8080...
 ```

-## From binary
-
-### Linux / MacOS
+You can call the endpoint with `curl` and the `--proxy` option.

 ```sh
-# use platform=darwin for MacOS
-platform=linux
-release=0.6.3
-
-# download and extract
-curl -L https://github.com/lapwat/papeer/releases/download/v$release/papeer-v$release-$platform-amd64.tar.gz > papeer.tar.gz
-tar xzvf papeer.tar.gz
-rm papeer.tar.gz
-
-# move to user binaries
-sudo mv papeer /usr/local/bin
-```
-
-### Windows
-
-Download [latest release](https://github.com/lapwat/papeer/releases/download/v0.6.3/papeer-v0.6.3-windows-amd64.zip).
-
-## MOBI support
-
-Install kindlegen to convert websites, Linux only
-
-```sh
-TMPDIR=$(mktemp -d -t papeer-XXXXX)
-curl -L https://github.com/lapwat/papeer/releases/download/kindlegen/kindlegen_linux_2.6_i386_v2_9.tar.gz > $TMPDIR/kindlegen.tar.gz
-tar xzvf $TMPDIR/kindlegen.tar.gz -C $TMPDIR
-chmod +x $TMPDIR/kindlegen
-sudo mv $TMPDIR/kindlegen /usr/local/bin
-rm -rf $TMPDIR
+curl --insecure --location --proxy localhost:8080 http://www.brainjar.com/java/host/test.html
+# This is a very simple HTML file.
 ```

 # Autocompletion