Shell script for ADF scanner Fujitsu SP-1120

Posted by Eric Scheibler at July 3, 2021

Recently I bought a Fujitsu SP-1120 to replace my rather slow and old flatbed scanner. The Fujitsu is an an automatic document feeder (ADF) with duplex support. It’s faster, scans front and back pages in one go and produces a much better image quality for ocr.

This article describes the installation under Debian Linux and provides a simple scan-to-pdf shell script.

Install the scanner

First download the Linux driver package - at least it works under Ubuntu and Debian.

Then install:

# dpkg -i pfusp-ubuntu18.04_2.1.1_amd64.deb

And create the udev rule 99-fujitsu-sp-1120.rules under /etc/udev/rules.d with the following content:

# Fujitsu ScanSnap SP-1120
ATTRS{idVendor}=="04c5", ATTRS{idProduct}=="1473", MODE="0664", GROUP="scanner", ENV{libsane_matched}="yes"

Adapt the product id, if you bought a SP-1125 or SP-1130. Afterwards reload the udev rules:

# udevadm control --reload
# udevadm trigger

Lastly install the sane package (Scanner Access Now Easy) and turn on the scanner and check, if the scanner was installed correctly:

# apt install sane
# scanadf --list-devices

Expected output:

device `pfusp:SP1120:001:093' is a FUJITSU SP1120 scanner

Scan script

Download the shell script below, make it executable and adjust some defaults to your liking. For example you may change the pdf viewer variable. I use another shell script for that, which you can finde here.

To start scanning, put some document into your scanner and type:

adf

Use the -p parameter, to preview the ocr result of the first page and -h for help.

Download: adf

#!/bin/bash

# Script to control an ADF scanner
# - start scanning and create a single pdf file
# - with empty page and orientation detection
# - tested with Fujitsu SP-1120
#
# ... excessively borrowed from https://github.com/rocketraman/sane-scan-pdf
#
# Version: 0.1
# Date:    2021-06-16
# License: GNU General Public License
# Author:  Eric Scheibler
# E-Mail:  email [at] eric-scheibler [dot] de
# URL:     http://eric-scheibler.de/en/blog/2015/04/script-to-extract-text-from-images-and-scanned-pdf-files/
#
# Install:
#   sudo apt install imagemagick poppler-utils sane tesseract-ocr tesseract-ocr-deu tesseract-ocr-eng unpaper

OUTPUT="scan.pdf"
TEXT_EDITOR="/usr/bin/vim"
PDF_VIEWER="$HOME/bin/ocr"
HELP=0
VERBOSE=0

# scanner params
DEVICE=pfusp
RESOLUTION=400
MODE=Lineart

# ocr params
OCR_LANGUAGE=deu
OCR_PREVIEW_FIRST_PAGE=0
OVERWRITE_OUTPUT_FILE=0


#####

TMP_DIR=$(mktemp -d -p "" scan.XXXXXXXXXX)
cleanup() {
    rm -rf "$TMP_DIR"
}
trap cleanup EXIT

function yes_or_no {
    while true; do
        read -p "$* [y/n]: " yn
        case $yn in
            [Yy]*) return 0  ;;
            [Nn]*) echo "Aborted" ; return  1 ;;
        esac
    done
}


# Parse command-line options
while [[ $# > 0 ]]; do
    case "$1" in
    -h|--help) HELP=1 ;;
    -v|--verbose) VERBOSE=1 ;;
    -o|--output) shift; OUTPUT="$1" ;;
    -x|--device) shift; DEVICE=$1;;
    -m|--mode) shift; MODE=$1 ;;
    -r|--resolution) shift; RESOLUTION=$1 ;;
    -l|--language) shift; OCR_LANGUAGE=$1 ;;
    -p|--preview-first-page) OCR_PREVIEW_FIRST_PAGE=1 ;;
    -w|--overwrite-output-file) OVERWRITE_OUTPUT_FILE=1 ;;
    esac
    shift     # next option
done

if [[ $HELP == 1 ]]; then
    echo "$(basename $0) [OPTIONS]... [OUTPUT]"
    echo ""
    echo "OPTIONS"
    echo " -x, --device"
    echo "    Override scanner device name, defaulting to \"pfusp\""
    echo " -m, --mode"
    echo "     Mode e.g. Lineart (default), Halftone, Gray, Color, etc."
    echo " -r, --resolution"
    echo "     Resolution e.g 300 (default)"
    echo " -l, --language <lang>"
    echo "     which language to use for OCR"
    echo " -p, --preview-first-page"
    echo "     OCR first page and preview in $TEXT_EDITOR"
    echo ""
    echo "OUTPUT"
    echo " -o, --output <outputfile>"
    echo "     Output to named file default=scan.pdf"
    echo " -w, --overwrite-output-file"
    echo "     Overwrite the output pdf file, if it already exists"
    echo " -v, --verbose"
    exit 0
fi

if [[ $VERBOSE == 0 ]]; then
    quiet_paran="--quiet"
    suppress_error_messages="2> /dev/null"
fi

if [[ "$OUTPUT" == "" ]]; then
    echo >&2 "Output file must be specified. Aborting."
    exit 1
fi

if [[ -f "$OUTPUT" ]]; then
    if [[ $OVERWRITE_OUTPUT_FILE == 0 ]]; then
        echo >&2 "Output file $OUTPUT already exists. Aborting."
        exit 1
    else
        rm "$OUTPUT"
    fi
fi


echo >&2 "Scanning..."
scanadf --device-name "$DEVICE" --source Adf-duplex --resolution $RESOLUTION --mode $MODE -o $TMP_DIR/scan-%04d
if [[ $? != 0 ]]; then
    exit 1
fi
echo ""


shopt -s extglob nullglob
image_files=($TMP_DIR/scan-[0-9]*)
num_scans=${#image_files[@]}
if [[ $num_scans > 0 ]]; then

    if [[ $OCR_PREVIEW_FIRST_PAGE == 1 ]]; then
        echo "Creating preview..."
        preview_image_file="${image_files[0]}"
        preview_text_file="$TMP_DIR/preview_first_page.txt"
        # ocr
        eval tesseract $preview_image_file ${preview_text_file%.*} -l $OCR_LANGUAGE --psm 12 $suppress_error_messages
        # show
        $TEXT_EDITOR $preview_text_file
        if ! yes_or_no "Proceed?"; then
            exit 0
        fi
        # remove preview text file
        rm $preview_text_file
        echo ""
    fi

    echo "Processing $num_scans pages"
    for image_file in ${image_files[@]}; do
        echo "Process $(basename $image_file)"

        # unpaper
        eval unpaper $quiet_paran --overwrite --dpi $RESOLUTION $image_file $image_file $suppress_error_messages

        # convert to tiff
        convert -density ${RESOLUTION}x${RESOLUTION} -units PixelsPerInch $image_file ${image_file}.tiff
        rm $image_file

        # orientation detection
        orientation_result=$(eval tesseract ${image_file}.tiff - --psm 0 $suppress_error_messages) || orientation_result=
        if [[ $orientation_result == *"Rotate: 180"* ]]; then
            echo "Image orientation is upside down, rotate"
            convert -rotate 180 ${image_file}.tiff ${image_file}.tiff
        fi

        # empty page detection
        percentage_white=$(convert ${image_file}.tiff -fuzz 0% -negate -threshold 0 -negate -format "%[fx:100*mean]" info:) || percentage_white=0
        is_empty_page=$(echo "$percentage_white >= 99.8" | bc -l)
        if [[ $is_empty_page == 1 && $orientation_result == "" ]]; then
            echo "Empty page removed"

        else
            eval tesseract ${image_file}.tiff $image_file -l $OCR_LANGUAGE pdf $suppress_error_messages
            rm ${image_file}.tiff
        fi

        echo ""
    done

    # rename or unite created pdf(s)
    pdf_files=($TMP_DIR/scan-[0-9]*.pdf)
    num_pdf_files=${#pdf_files[@]}
    if [[ $num_pdf_files == 1 ]]; then
        echo "Renaming..."
        mv $TMP_DIR/scan-0*.pdf "$OUTPUT"
    elif [[ $num_pdf_files > 1 ]]; then
        echo "Concatenating pdfs..."
        pdfunite "${pdf_files[@]}" "$OUTPUT"
    fi
fi

if [[ -f "$OUTPUT" ]]; then
    echo "Done."
    if [[ $PDF_VIEWER != "" ]]; then
        if yes_or_no "Open ${OUTPUT}?"; then
            $PDF_VIEWER $OUTPUT
        fi
    fi
else
    echo "No scans found."
fi