Converting PDF tables to Excel from the command line with camelot (CSV, XLSX)

Camelot: PDF Table Extraction for Humans — Camelot 0.8.2 documentation

Installation

Installation of dependencies — Camelot 0.8.2 documentation

apt install python3-tk ghostscript
pip install "camelot-py[cv]"
# Add the pip install directory to PATH
export PATH=$PATH:/home/imabari/.local/bin

# Convert
camelot -p 2-end -o black.xlsx -f excel -split lattice 180928.pdf

# Display (plot the detected joints)
camelot -p 2 lattice -plot joint 180928.pdf

# For tables with short lines, add -scale 40
camelot -p all -o black.xlsx -f excel -split lattice -scale 40 180928.pdf

# Strip spaces and newlines from the text: -strip ' \n'
camelot -p all -o black.xlsx -f excel -split -strip ' \n' lattice 180928.pdf

# Copy text across spanning cells (-copy v copies vertically)
camelot -p all -o data.csv -f csv -strip ' .\n' -split lattice -scale 40 -copy v data.pdf
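
When there are several PDFs to convert, the commands above can be wrapped in a small shell function. This is only a sketch: convert_pdf and convert_all are hypothetical helper names, and the flags are the same ones used in the short-line example above (adjust -scale or add -strip as needed).

```shell
# Sketch: convert each PDF to an .xlsx file with the same base name.
# convert_pdf / convert_all are hypothetical names, not part of camelot.

convert_pdf() {
    local pdf="$1"
    local out="${pdf%.pdf}.xlsx"    # 180928.pdf -> 180928.xlsx
    # Global options go before "lattice", lattice-specific ones (-scale) after it.
    camelot -p all -o "$out" -f excel -split lattice -scale 40 "$pdf"
}

convert_all() {
    local pdf
    for pdf in "$@"; do
        # Report a failure on stderr and continue with the next file.
        convert_pdf "$pdf" || echo "failed: $pdf" >&2
    done
}
```

Usage: source the file and run `convert_all *.pdf` in the directory that holds the PDFs.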

Command line

Command-Line Interface — Camelot 0.8.2 documentation

Usage: camelot [OPTIONS] COMMAND [ARGS]...

  Camelot: PDF Table Extraction for Humans

Options:
  --version                       Show the version and exit.
  -q, --quiet TEXT                Suppress logs and warnings.
  -p, --pages TEXT                Comma-separated page numbers. Example: 1,3,4
                                  or 1,4-end.
  -pw, --password TEXT            Password for decryption.
  -o, --output TEXT               Output file path.
  -f, --format [csv|json|excel|html]
                                  Output file format.
  -z, --zip                       Create ZIP archive.
  -split, --split_text            Split text that spans across multiple cells.
  -flag, --flag_size              Flag text based on font size. Useful to
                                  detect super/subscripts.
  -strip, --strip_text TEXT       Characters that should be stripped from a
                                  string before assigning it to a cell.
  -M, --margins <FLOAT FLOAT FLOAT>...
                                  PDFMiner char_margin, line_margin and
                                  word_margin.
  --help                          Show this message and exit.

Commands:
  lattice  Use lines between text to parse the table.
  stream   Use spaces between text to parse the table.
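
If it is unclear which of the two commands suits a given PDF, one option is to run both on the same page and compare the results. A minimal sketch, assuming a hypothetical helper named compare_flavors; it uses only the options listed in the help output above.

```shell
# Sketch: extract the same page with both parsers for a side-by-side comparison.
# compare_flavors is a hypothetical name; the page defaults to 1 when omitted.

compare_flavors() {
    local pdf="$1"
    local page="${2:-1}"
    local base="${pdf%.pdf}"
    local flavor
    for flavor in lattice stream; do
        # One CSV output path per flavor, e.g. data_lattice.csv and data_stream.csv.
        camelot -p "$page" -o "${base}_${flavor}.csv" -f csv "$flavor" "$pdf"
    done
}
```

Usage: `compare_flavors data.pdf 2`. As far as I know camelot appends the page and table number to each CSV file name, so expect names along the lines of data_lattice-page-2-table-1.csv rather than exactly the path given to -o.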