爬虫概述

作者：游鱼思

概述

库

BeautifulSoup：用于解析HTML文档，提取数据。
Selenium：用来模拟用户在浏览器中执行的操作，如点击链接、填写表单、执行JavaScript脚本等。Selenium支持多种浏览器，包括Chrome、Firefox、Safari等。它的主要优势在于它可以在不同的浏览器、操作系统和设备上运行，这使得它成为自动化测试的首选工具。

框架

深度剖析4款Python爬虫框架，构建你的数据收割机

Scrapy 生态

Scrapy：完整的爬虫框架，包括了调度、下载、解析、存储等模块。

https://docs.scrapy.org/en/latest/index.html

gerapy VS scrapyweb : 都在Scrapy基础上提供了WebUI，gerapy提供了项目编辑功能，sccrapyweb没有。

scrapyweb：基于Scrapy的Web UI，可视化管理爬虫。
https://github.com/my8100/scrapydweb

gerapy：基于Scrapy的爬虫框架，提供可视化的爬虫管理界面。

https://github.com/Gerapy/Gerapy
介绍 https://cuiqingcai.com/4959.html

Gerapy Auto Extractor 是Gerapy中的一个功能，可以自动提取列表页、详情页中的标题、正文、发布时间。

https://github.com/Gerapy/GerapyAutoExtractor

介绍 https://www.v2ex.com/t/687948

安装scrapyd

https://hub.docker.com/r/vimagick/scrapyd

下载 docker-compose.yml，同一目录下运行：

$ docker-compose up -d scrapyd

$ docker-compose logs -f scrapyd

$ docker cp scrapyd_scrapyd_1:/var/lib/scrapyd/items .

运行 gerapy ，工作目录绑定到 /www/wwwroot/gerapy ，主机端口绑定到90。

>docker run -d -v <workspace>:/home/gerapy -p <public_port>:<container_port> germey/gerapy

设置镜像自动启动：

>docker update --restart=always gerapy

Pyspider 生态

Pyspider：基于Scrapy的分布式爬虫框架，支持分布式、分布式集群、分布式任务调度、分布式数据管理、分布式Web UI等功能。
https://github.com/binux/pyspider

2018年之后没再更新了，放弃。

结论：Scrapy + gerapy

其它框架

http://github.com/omkarcloud/botasaurus

Most Stealthiest Framework LITERALLY : Based on the benchmarks, which we encourage you to read here, our framework stands as the most stealthy in both the JS and Python universes. It is more stealthy than the popular Python library undetected-chromedriver and the well-known JavaScript library puppeteer-stealth. Botasaurus can easily visit websites like https://nowsecure.nl/. With Botasaurus, you don't need to waste time finding ways to unblock a website. For usage, see this FAQ.
Access Cloudflare Websites with Simple HTTP Requests: We can access Cloudflare-protected pages using simple HTTP requests. Saving you both time and money spent on proxies. For usage, see this FAQ.
SSL Support for Authenticated Proxy: We are the first and only Python Web Scraping Framework as of writing to offer SSL support for authenticated proxies. No other browser automation libraries be it seleniumwire, puppeteer, playwright offers this important web scraping feature, this feautre enables you to easily access Cloudflare protected websites when using authenticated proxies, which would otherwise be blocked if you used only the bare authenticated proxy.
Use Any Chrome Extension with Just 1 Line of Code: Easily integrate any Chrome extension, be it a Captcha Solving Extension, Adblocker, or any other from the Chrome Web Store, with just one line of code.. Say Sayonara, to the manual process of downloading, unzipping, configuring, and loading extensions.
Sitemap Support: With just one line of code, you can get all links for a website.
Data Cleaners: Make your scrapers robust by cleaning data with expert created data cleaners.
Debuggability: When a crash occurs due to an incorrect selector, etc., Botasaurus pauses the browser instead of closing it, facilitating painless on-the-spot debugging.
Caching: Botasaurus allows you to cache web scraping results, ensuring lightning-fast performance on subsequent scrapes.
Easy Configuration: Easily save hours of Development Time with easy parallelization, profile, and proxy configuration. We make asynchronous, parallel scraping a child's play.
Build Robust Scrapers: Easily configure retry on exceptions to ensure no errors comes in between you and the data
Time-Saving Selenium Shortcuts: Botasaurus comes with numerous Selenium shortcuts to make web scraping incredibly easy.