Notes

Comparison of Python scraping libraries

Pythonで始めるウェブスクレイピング実践入門 (A Practical Introduction to Web Scraping with Python)

speakerdeck.com

	tag	css	xpath	text	javascript
beautifulsoup	yes	yes	no	yes	partial
scrapy	no	yes	yes	no	partial
pyppeteer	no	yes	yes	no	yes
requests-html	yes	yes	yes	yes	yes

beautifulsoup

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Combined with selenium, it can also scrape JavaScript-rendered pages
Can scrape even complex HTML
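
As a minimal sketch of the tag, css and text lookups marked for beautifulsoup in the table, the snippet below parses a small inline document. The HTML, the ids and the class names are made up for illustration, and the standard-library "html.parser" backend is an arbitrary choice.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup, used only for this illustration.
html = """
<div id="main">
  <ul id="langs">
    <li class="lang"><a href="/python">Python</a></li>
    <li class="lang"><a href="/ruby">Ruby</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# tag: look up the first element by tag name and read an attribute
first_link = soup.find("a")["href"]

# css: select elements with a CSS selector
names = [a.get_text() for a in soup.select("ul#langs > li.lang > a")]

# text: find a text node by its exact content, then go up to its tag
ruby_href = soup.find(string="Ruby").parent["href"]

print(first_link, names, ruby_href)
```

For JavaScript-rendered pages the same calls should work on the HTML string that selenium exposes as driver.page_source, although that combination is not exercised here.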

scrapy

https://docs.scrapy.org/en/latest/index.html

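Combined with scrapy-splash, it can also scrape JavaScript-rendered pages
Handy for collecting data from regularly structured sites such as blogs and news sites
Can crawl
Many export formats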

pyppeteer

https://miyakogi.github.io/pyppeteer/

Python port of puppeteer (Node.js)
Supports browser automation
I could not work out how to get the contents of a table
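
One possible way to read a table, sketched under assumptions: run a small JavaScript function in the page with page.evaluate and let it return the cell texts row by row. The function names, the selector 'table tr' (which takes every table on the page) and the header handling are illustrative choices, not something taken from the pyppeteer documentation.

```python
# JavaScript evaluated inside the page: one list of cell texts per <tr>.
EXTRACT_ROWS_JS = """() => Array.from(
    document.querySelectorAll('table tr'),
    tr => Array.from(tr.querySelectorAll('th, td'), cell => cell.textContent.trim())
)"""


async def fetch_table_rows(url):
    """Open url in headless Chromium and return its table rows as lists of strings."""
    # Imported here so that rows_to_dicts below stays usable without pyppeteer.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.evaluate(EXTRACT_ROWS_JS)
    finally:
        await browser.close()


def rows_to_dicts(rows):
    """Treat the first row as the header and map every later row onto it."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```

A call would look like asyncio.get_event_loop().run_until_complete(fetch_table_rows(url)), with the result passed to rows_to_dicts when the first row is a header row.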

requests-html

https://html.python-requests.org/

Roughly beautifulsoup + pyppeteer (for javascript) in one package
Struggles with HTML in which tags are omitted
Sometimes fails with the error "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
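
That message comes from lxml, which requests-html uses for parsing: it refuses an already decoded str that still starts with an XML declaration naming an encoding. A possible workaround, as the message itself suggests, is to hand over text without the declaration. The helper below is a hypothetical sketch using only the standard library; whether it covers every page that triggers the error is an assumption.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
_XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>", re.IGNORECASE)


def strip_xml_declaration(markup: str) -> str:
    """Return markup without its leading XML declaration, if it has one."""
    return _XML_DECLARATION.sub("", markup, count=1)


# Made-up document that starts with an encoding declaration.
doc = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body><p>hello</p></body></html>'
cleaned = strip_xml_declaration(doc)
print(cleaned.strip())
```

The cleaned string could then be parsed with requests_html.HTML(html=cleaned) instead of the original response text.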