2019-01-07

Listing only the vertically merged cells in an Excel file

from openpyxl import load_workbook

wb = load_workbook(filename='XXXXXX.xlsx')

for sheet in wb.sheetnames:

    print('-' * 20)
    print(sheet)

    ws = wb[sheet]

    for i in ws.merged_cells.ranges:
        if i.min_row != i.max_row:
            print(i)
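The check `min_row != max_row` above also matches ranges that span several columns as well as several rows. A minimal sketch of a stricter classification follows, assuming the ranges expose `min_row`, `max_row`, `min_col`, and `max_col` the way openpyxl's merged ranges do. `Bounds` and `classify_merge` are hypothetical names; `Bounds` is only a stand-in so the example runs without a workbook.

```python
from collections import namedtuple

# Hypothetical stand-in for an openpyxl merged range (same attribute names).
Bounds = namedtuple("Bounds", "min_row max_row min_col max_col")


def classify_merge(rng):
    """Return 'vertical', 'horizontal', or 'both' for a merged range."""
    spans_rows = rng.min_row != rng.max_row
    spans_cols = rng.min_col != rng.max_col
    if spans_rows and spans_cols:
        return "both"
    # A merged range always covers more than one cell, so if it does not
    # span rows it must span columns.
    return "vertical" if spans_rows else "horizontal"


samples = {
    "A1:A3": Bounds(1, 3, 1, 1),
    "B2:D2": Bounds(2, 2, 2, 4),
    "E5:F6": Bounds(5, 6, 5, 6),
}
for ref, rng in samples.items():
    print(ref, classify_merge(rng))
```

With a real workbook, passing each item of `ws.merged_cells.ranges` to `classify_merge` and keeping only the `'vertical'` results should give the strictly vertical merges.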

2019-01-07

seaborn

qiita.com

2019-01-06

Minecraft Nintendo Switch Edition

Bought a strategy guide for the first time in a while

Minecraft Nintendo Switch Edition

Genre: Software
Shop: Rakuten Books
Price: 3,542 yen

Minecraft (マインクラフト) - Switch

Publisher/Manufacturer: Microsoft Japan
Release date: 2018/06/21
Media: Video Game

www.atmarkit.co.jp

studio.code.org

2019-01-06

Pythonスクレイピングの基本と実践データサイエンティストのためのWebデータ収集術

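Might be a good book for learning the basics of scraping.

Compared with other books I have read, the explanations are long and the material is not organized into tables, so it is hard to read.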

book.impress.co.jp

Pythonスクレイピングの基本と実践データサイエンティストのためのWebデータ収集術（impress　top　gear） [ Seppe　vanden　Broucke ]

Genre: Books, Magazines & Comics > PC & System Development > Other
Shop: Rakuten Books
Price: 3,564 yen

Pythonスクレイピングの基本と実践データサイエンティストのためのWebデータ収集術 (impress top gear)

Author: Seppe vanden Broucke, Bart Baesens, 株式会社トップスタジオ
Publisher/Manufacturer: Impress
Release date: 2018/12/17
Media: Book (softcover)

github.com

2019-01-04

Converting PDF tables to Excel with the camelot command line (CSV, XLSX)

PDF

Camelot: PDF Table Extraction for Humans — Camelot 0.8.2 documentation

Installation

Installation of dependencies — Camelot 0.8.2 documentation

apt install python3-tk ghostscript
pip install camelot-py[cv]

# Add the install location to PATH
export PATH=$PATH:/home/imabari/.local/bin

# Convert
camelot -p 2-end -o black.xlsx -f excel -split lattice 180928.pdf

# Plot the detected line joints
camelot -p 2 lattice -plot joint 180928.pdf

# For tables with short lines, add -scale 40
camelot -p all -o black.xlsx -f excel -split lattice -scale 40 180928.pdf

# Strip spaces and newlines from the text with -strip ' \n'
camelot -p all -o black.xlsx -f excel -split -strip ' \n' lattice 180928.pdf

# Copy text across spanning cells (vertical direction)
camelot -p all -o data.csv -f csv -strip ' .\n' -split lattice -scale 40 -copy v data.pdf
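The last command can probably be expressed through the Python API as well. This is a rough sketch, assuming the CLI flags `-scale`, `-strip`, `-split`, and `-copy` map to the `read_pdf` keyword arguments `line_scale`, `strip_text`, `split_text`, and `copy_text`; `lattice_options` and `pdf_to_csv` are hypothetical helper names, not part of camelot.

```python
def lattice_options(line_scale=40, strip_text=" .\n", copy_direction="v"):
    """Collect keyword arguments mirroring -scale, -strip, -split and -copy."""
    return {
        "flavor": "lattice",
        "split_text": True,             # -split
        "strip_text": strip_text,       # -strip
        "line_scale": line_scale,       # -scale
        "copy_text": [copy_direction],  # -copy
    }


def pdf_to_csv(pdf_path, csv_path, pages="all"):
    """Extract tables from pdf_path and export them as CSV."""
    # Imported lazily so lattice_options() can be used without camelot.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, **lattice_options())
    tables.export(csv_path, f="csv")
    return tables
```

Calling `pdf_to_csv('data.pdf', 'data.csv')` should roughly correspond to the last command above. As far as I know, camelot writes one CSV file per detected table, with the page and table number added to the file name.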

Command line

Command-Line Interface — Camelot 0.8.2 documentation

Usage: camelot [OPTIONS] COMMAND [ARGS]...

  Camelot: PDF Table Extraction for Humans

Options:
  --version                       Show the version and exit.
  -q, --quiet TEXT                Suppress logs and warnings.
  -p, --pages TEXT                Comma-separated page numbers. Example: 1,3,4
                                  or 1,4-end.
  -pw, --password TEXT            Password for decryption.
  -o, --output TEXT               Output file path.
  -f, --format [csv|json|excel|html]
                                  Output file format.
  -z, --zip                       Create ZIP archive.
  -split, --split_text            Split text that spans across multiple cells.
  -flag, --flag_size              Flag text based on font size. Useful to
                                  detect super/subscripts.
  -strip, --strip_text TEXT       Characters that should be stripped from a
                                  string before assigning it to a cell.
  -M, --margins <FLOAT FLOAT FLOAT>...
                                  PDFMiner char_margin, line_margin and
                                  word_margin.
  --help                          Show this message and exit.

Commands:
  lattice  Use lines between text to parse the table.
  stream   Use spaces between text to parse the table.
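To illustrate the page notation accepted by `-p` (for example `1,3,4`, `1,4-end`, or `all` as used in the commands earlier), here is a small sketch that expands such a string into page numbers. It only illustrates the notation and is not camelot's own parser; `expand_pages` and its `last_page` parameter are hypothetical.

```python
def expand_pages(spec, last_page):
    """Expand a spec such as '1,4-end' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if part == "all":
            pages.update(range(1, last_page + 1))
        elif "-" in part:
            start, end = part.split("-")
            # 'end' stands for the last page of the document.
            stop = last_page if end == "end" else int(end)
            pages.update(range(int(start), stop + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1,4-end", 6))  # [1, 4, 5, 6]
print(expand_pages("2-end", 5))    # [2, 3, 4, 5]
```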

2019-01-03

Converting PDF tables to Excel with camelot (CSV, TSV, XLSX)

PDF

Converting the Ministry of Health, Labour and Welfare's "black company" list to TSV

imabari.hateblo.jp

My previous attempt with tabula failed, so I am trying again with camelot.

Camelot: PDF Table Extraction for Humans — Camelot 0.7.3 documentation

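From the Ministry of Health, Labour and Welfare's page on efforts to reduce long working hours: www.mhlw.go.jp

Download the published cases of violations of labor standards laws: https://www.mhlw.go.jp/kinkyu/dl/180928.pdf

Only the table portion can be extracted, so the labour bureau name and the last-updated date are missing.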

apt install ghostscript
pip install camelot-py[cv]

import re
import pandas as pd
import matplotlib.pyplot as plt

import camelot

# Page 2 through the last page, split text at cell separators, strip newlines
tables = camelot.read_pdf('180928.pdf', pages='2-end', split_text=True, strip_text='\n')

dfs = []

# Convert each table to a DataFrame, drop the header row, set column names
for table in tables:
    df = table.df
    df.drop(0, inplace=True)
    df.columns = ['企業・事業場名称', '所在地', '公表日', '違反法条', '事案概要', 'その他参考事項']
    dfs.append(df)

# Concatenate the pages
df_black = pd.concat(dfs)

# Insert commas between items (raw strings avoid invalid-escape warnings)
def ihan_conv(temp):
    result = re.sub(r'条(の\d{1,3})?', lambda m: m.group(0) + ', ', temp).rstrip(', ')
    return result

def sonota_conv(temp):
    result = re.sub(r'H\d{1,2}\.\d{1,2}\.\d{1,2}', lambda m: ', ' + m.group(0), temp).lstrip(', ')
    return result

df_black['違反法条'] = df_black['違反法条'].apply(ihan_conv)
df_black['その他参考事項'] = df_black['その他参考事項'].apply(sonota_conv)

# Write to a CSV file
df_black.to_csv('black.csv')

# Write to a TSV file
df_black.to_csv('black.tsv', sep='\t')

# Write to an Excel file
with pd.ExcelWriter('black.xlsx') as writer:
    df_black.to_excel(writer, sheet_name='sheet1')
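To see what the two comma-insertion helpers above do, they can be tried on short strings. This is a small sketch that repeats the two functions so it runs on its own; the sample strings are made up for illustration and are not taken from the PDF.

```python
import re


def ihan_conv(temp):
    # Append ', ' after each article reference ("条" or "条の<number>"),
    # then drop the trailing separator.
    return re.sub(r'条(の\d{1,3})?', lambda m: m.group(0) + ', ', temp).rstrip(', ')


def sonota_conv(temp):
    # Prepend ', ' before each Heisei-era date such as H30.1.15,
    # then drop the leading separator.
    return re.sub(r'H\d{1,2}\.\d{1,2}\.\d{1,2}', lambda m: ', ' + m.group(0), temp).lstrip(', ')


# Made-up sample strings, not real rows from the list.
print(ihan_conv('労働安全衛生法第20条労働安全衛生規則第101条'))
# 労働安全衛生法第20条, 労働安全衛生規則第101条
print(sonota_conv('H29.10.2送検H30.1.15送検'))
# H29.10.2送検, H30.1.15送検
```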

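camelot seems to handle tables well when every cell is fully enclosed by lines. It has trouble with Excel-style layouts where two rows form one cell: if one cell has its border but the other's border is hidden by overflowing text, the text gets attached to the neighboring cell or the cell above.

Dotted lines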

needtec.sakura.ne.jp

2018-12-28

GIS

Map