mirror of
https://github.com/NohamR/phpBB-forum-scraper.git
synced 2026-02-22 02:25:43 +00:00
Uploading revised phpBB forum scraping code.
This commit is contained in:
46
README.md
Normal file
@@ -0,0 +1,46 @@
# phpBB Forum Scraper

Python-based scraper for phpBB forums.

The code requires:

1. The Python scraping library <a href="http://scrapy.org/" target="_blank">Scrapy</a>.

2. The Python HTML parsing library <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup</a>.

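As a quick sanity check before running the scraper, the sketch below reports which of the two required packages cannot be imported. It assumes the import names `scrapy` and `bs4` (the import name of the BeautifulSoup 4 package). The helper `missing_packages` is hypothetical and not part of this repository.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names: "scrapy" for Scrapy, "bs4" for BeautifulSoup 4.
    missing = missing_packages(["scrapy", "bs4"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```
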
## Scraper Output

Scrapes the following information from forum posts:

1. Username

2. User post count

3. Post date & time

4. Post text

5. Quoted text

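To make the five fields concrete, one scraped post could be represented as a flat record like the one sketched below. The field names (`username`, `post_count`, `post_time`, `post_text`, `quote_text`) and the sample values are illustrative assumptions. The actual names are whatever the spider in `phpBB.py` defines.

```python
from dataclasses import dataclass, asdict


@dataclass
class ForumPost:
    """One scraped post; field names are hypothetical, not taken from phpBB.py."""
    username: str
    post_count: str  # kept as text, exactly as it appears on the page
    post_time: str   # raw date & time string as shown on the page
    post_text: str
    quote_text: str  # empty string when the post quotes nothing


# Made-up sample values, for illustration only.
example = ForumPost(
    username="example_user",
    post_count="42",
    post_time="2020-01-01 12:00",
    post_text="I agree with this.",
    quote_text="Original statement.",
)

# A flat dict like this maps directly onto one CSV row.
row = asdict(example)
print(list(row))
```
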
Edit `phpBB.py` and specify:

1. `allowed_domains`

2. `start_urls`

3. `username` & `password`

4. `form_login=False` or `form_login=True`, depending on whether the forum requires a login

These correspond to the following lines in `phpBB.py`, shown here with empty placeholder values:

    allowed_domains = ['']
    start_urls = ['']
    username = ''
    password = ''
    form_login = False

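A filled-in configuration might look like the sketch below. The domain `forum.example.com` and the start URL are placeholders, and `check_config` is a hypothetical helper, not part of the scraper. It checks each start URL's host against `allowed_domains`, which Scrapy expects to contain bare domains rather than full URLs.

```python
from urllib.parse import urlparse

# Placeholder values; replace them with the forum you want to crawl.
allowed_domains = ["forum.example.com"]
start_urls = ["https://forum.example.com/viewforum.php?f=1"]
username = ""
password = ""
form_login = False  # assumed meaning: True only if the forum requires a login


def check_config(allowed_domains, start_urls):
    """Return a list of problems found in the two settings (empty if none)."""
    problems = []
    for domain in allowed_domains:
        # A bare domain has no scheme or path, so it contains no slash.
        if not domain or "/" in domain:
            problems.append(f"allowed_domains entry {domain!r} should be a bare domain")
    for url in start_urls:
        host = urlparse(url).hostname or ""
        # Accept the domain itself or any of its subdomains.
        in_scope = any(
            host == d or host.endswith("." + d) for d in allowed_domains if d
        )
        if not in_scope:
            problems.append(f"start URL {url!r} is outside allowed_domains")
    return problems


print(check_config(allowed_domains, start_urls))
```

Running the same check on the empty defaults (`['']` for both settings) reports one problem per setting, which is a reminder that both must be filled in before crawling.
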
## Instructions:

From within `/phpBB_scraper/`:

`scrapy crawl phpBB` to launch the crawler.

`scrapy crawl phpBB -o posts.csv` to launch the crawler and save results to CSV.

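Once `posts.csv` exists, it can be inspected with the standard library. The sketch below makes no assumption about column names and simply reports whichever header row the export wrote. `summarize_csv` is a hypothetical helper, not part of the scraper.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # csv handles quoted multi-line fields, so each post counts once.
        row_count = sum(1 for _ in reader)
        return reader.fieldnames or [], row_count


if __name__ == "__main__":
    path = Path("posts.csv")
    if path.exists():
        columns, count = summarize_csv(path)
        print(f"{count} posts with columns: {', '.join(columns)}")
    else:
        print("posts.csv not found; run the crawl first.")
```
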