Comparing Python scraping libraries

Practical Introduction to Web Scraping with Python

speakerdeck.com

tag css xpath text javascript
beautifulsoup ×
scrapy × ×
pyppeteer × ×
requests-html

beautifulsoup

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

scrapy

https://docs.scrapy.org/en/latest/index.html

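  • Can scrape JavaScript-rendered pages when combined with scrapy-splash
  • Handy for collecting data from sites with a regular structure, such as blogs and news sites
  • Supports crawling
  • Offers many export formats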

pyppeteer

https://miyakogi.github.io/pyppeteer/

  • Python port of puppeteer (Node.js)
  • Supports automation
  • I could not work out how to extract the contents of a table
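On the table question above, one possible approach is to let the browser do the extraction: pass a JavaScript function to page.evaluate that walks every row of the table and returns the cell text as nested arrays. The sketch below assumes it is given a pyppeteer Page object. The helper name table_rows and the selector in the usage comment are made up for illustration.

```python
# JavaScript run inside the page: collect the text of every cell, row by row.
ROWS_JS = """(selector) => {
    const table = document.querySelector(selector);
    if (!table) { return []; }
    return Array.from(table.querySelectorAll('tr'), (tr) =>
        Array.from(tr.querySelectorAll('th, td'), (cell) => cell.textContent.trim()));
}"""


async def table_rows(page, selector):
    """Return the table matched by `selector` as a list of rows of cell text.

    `page` is expected to be a pyppeteer Page; `table_rows` itself is a
    hypothetical helper, not part of pyppeteer.
    """
    return await page.evaluate(ROWS_JS, selector)


# Usage inside a coroutine such as main() (the selector is an assumption):
#     rows = await table_rows(page, 'table')
#     for row in rows:
#         print(row)
```

Returning plain arrays keeps the result JSON-serializable, which is what page.evaluate needs to hand a value back to Python.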

requests-html

https://html.python-requests.org/

  • Combines the features of beautifulsoup and pyppeteer (JavaScript)
  • Struggles with HTML in which tags are omitted
  • Sometimes fails with the error "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
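The ValueError quoted above is raised by lxml, which requests-html uses for parsing, when it is handed a Python str that starts with an XML declaration carrying an encoding attribute. One possible workaround, sketched here with only the standard library, is to strip that declaration before parsing (passing bytes instead of str is the other option the message itself suggests). The helper name strip_xml_declaration is hypothetical.

```python
import re

# Matches a leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
_XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)


def strip_xml_declaration(text):
    """Remove a leading XML declaration from `text` (hypothetical helper)."""
    return _XML_DECL.sub('', text, count=1)


page = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body><p>ok</p></body></html>'
print(strip_xml_declaration(page).strip())
```

The cleaned string could then be parsed again, for example with requests_html.HTML(html=cleaned), although whether that is enough depends on the page in question.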

Scraping with pyppeteer

Pyppeteer’s documentation — Pyppeteer 0.0.24 documentation

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = await browser.newPage()
    await page.goto('http://www.nikkei.com/markets/kabu/')

    element = await page.querySelector('span.mkc-stock_prices')
    title = await page.evaluate('(element) => element.textContent', element)

    print(title)

    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Scraping with requests-html

requests-html

https://html.python-requests.org/

!pip install requests-html
!pip install retry

Scraping the Nikkei average

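from requests.exceptions import ConnectionError, TooManyRedirects, HTTPError
from requests_html import HTMLSession
from retry import retry

# tries: 3, initial delay: 2 s, backoff multiplier: 2
# The wait doubles after each failure: 2 s, then 4 s (3 tries means at most 2 retries)
@retry(tries=3, delay=2, backoff=2)
def get_resp(url):
    try:
        session = HTMLSession()
        r = session.get(url)
        # Raise HTTPError on 4xx/5xx so the handler below can actually fire
        r.raise_for_status()
        return r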
    except ConnectionError:
        print('Network Error')
        raise
    except TooManyRedirects:
        print('TooManyRedirects')
        raise
    except HTTPError:
        print('BadResponse')
        raise

try:
    r = get_resp('http://www.nikkei.com/markets/kabu/')
except Exception:
    print('Response not found')
else:
    print(r.html.find('span.mkc-stock_prices', first=True).text)
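To see what tries=3, delay=2, backoff=2 means in practice, the wait before each retry can be worked out by hand: the retry package sleeps for delay seconds after the first failure and multiplies the wait by backoff after each later one. The function below is a standalone sketch of that arithmetic, not code from the retry package, and the name retry_waits is made up.

```python
def retry_waits(tries, delay, backoff):
    """Seconds slept between attempts for the given retry settings.

    With `tries` attempts there are at most `tries - 1` waits, because the
    last failure is re-raised instead of being followed by a sleep.
    """
    return [delay * backoff ** i for i in range(tries - 1)]


print(retry_waits(tries=3, delay=2, backoff=2))  # [2, 4]
```

So the settings used above wait 2 seconds and then 4 seconds before giving up, 6 seconds in total.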

Switching Hyper-V from HvSocket back to VMBus

# Check the VM name
Get-VM

# Check the current setting
(Get-VM -VMName "Ubuntu 18.04.1 LTS").EnhancedSessionTransportType

# Set to VMBus
Set-VM -VMName "Ubuntu 18.04.1 LTS" -EnhancedSessionTransportType VMBus

# Set to HvSocket
Set-VM -VMName "Ubuntu 18.04.1 LTS" -EnhancedSessionTransportType HvSocket

Displaying a CSV file with Mapbox (color coding and icons)

Displaying a CSV file on a map - Our Open Data

www.mapbox.com

Sample

HTML

http://imabari.jpn.org/map/hospital_map.html

CSV

http://imabari.jpn.org/map/hospital.csv

Source

html

<html>

<head>
  <meta charset=utf-8 />
  <title>Hospitals in Imabari</title>
  <meta name='viewport' content='initial-scale=1,maximum-scale=1,user-scalable=no' />
  <script src='https://api.mapbox.com/mapbox.js/v3.1.1/mapbox.js'></script>
  <link href='https://api.mapbox.com/mapbox.js/v3.1.1/mapbox.css' rel='stylesheet' />
  <style>
    body {
      margin: 0;
      padding: 0;
    }

    #map {
      position: absolute;
      top: 0;
      bottom: 0;
      width: 100%;
    }
  </style>
</head>

<body>
  <script src='https://api.mapbox.com/mapbox.js/plugins/leaflet-omnivore/v0.2.0/leaflet-omnivore.min.js'></script>

  <div id='map'></div>

  <script>
    L.mapbox.accessToken = 'enter your access token here';
    var map = L.mapbox.map('map', 'mapbox.streets')
      .setView([34.06611111, 132.99777777], 12);

    // omnivore will AJAX-request this file behind the scenes and parse it:
    // note that there are considerations:
    // - The CSV file must contain latitude and longitude values, in column
    //   named roughly latitude and longitude
    // - The file must either be on the same domain as the page that requests it,
    //   or both the server it is requested from and the user's browser must
    //   support CORS.

    // The omnivore functions take three arguments:
    //
    // - a URL of the file to fetch
    // - options to the parser
    // - a custom layer
    //
    // And they return the custom layer, which is by default an L.geoJson layer.
    //
    // The second two arguments are each optional. In this case we're supplying
    // no arguments to the parser (null), but supplying a custom layer:
    // an instance of L.mapbox.featureLayer
    // This means that rows with simplestyle properties will be styled as they
    // would be in GeoJSON and elsewhere.
    omnivore.csv('hospital.csv', null, L.mapbox.featureLayer()).addTo(map);
  </script>
</body>

</html>

CSV

Adding a "marker-color" column (color) and a "marker-symbol" column (icon) to the CSV header makes each marker appear in that color with that icon.

www.mapbox.com

Save the file as UTF-8.

Header name    Data
id             ID
title          Title
description    Description
address        Address
tel            Phone number
latitude       Latitude
longitude      Longitude
marker-size    Size (e.g. large)
marker-color   Color (e.g. #C0C0C0)
marker-symbol  Icon name
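A CSV with the columns listed above can be produced from Python with the standard csv module. This is a minimal sketch: the single row is made-up sample data (the coordinates are simply the map center used in the HTML above), and the 'hospital' icon name is an assumption to be checked against the Maki icon list.

```python
import csv

FIELDS = ['id', 'title', 'description', 'address', 'tel',
          'latitude', 'longitude',
          'marker-size', 'marker-color', 'marker-symbol']

# Made-up sample row; replace with real data.
rows = [{
    'id': 1,
    'title': 'Sample Hospital',
    'description': 'Internal medicine',
    'address': 'Imabari, Ehime',
    'tel': '0000-00-0000',
    'latitude': 34.06611111,
    'longitude': 132.99777777,
    'marker-size': 'large',
    'marker-color': '#C0C0C0',
    'marker-symbol': 'hospital',  # assumed Maki icon name
}]

# newline='' lets the csv module control line endings; the file is UTF-8.
with open('hospital.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Using DictWriter with a fixed field list keeps the header order stable even if the row dictionaries are built in a different order.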

Icon types

Maki Icons | By Mapbox

List of Mapbox (v3) / Maki icons