pdfalto is a command line executable for parsing PDF files and producing structured XML representations of the PDF content in ALTO format, capturing in particular all the layout and style information of the PDF.
pdfalto started as a fork of pdf2xml, developed at XRCE, with modifications for robustness, additional features, improved detection of layout elements, and an enhanced output format in ALTO (including in particular space information, useful for instance for further machine learning processing). It is based on the Xpdf library.
Versions are provided in the CHANGELOG.md and in the GitHub release panel on the right side of the main page of this repository.
An Arch Linux package for pdfalto is available here, thanks to @andreasbaumann. The build process described below will create a portable standalone pdfalto executable that can be packaged with other tools without further installation requirements for the end-user.
- compilers : clang > 5 or gcc > 7, C++17 required
- makefile generator : cmake >= 3.10.0
- fetching dependencies : wget
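Before building, it can help to confirm that the tools listed above are available. The following is a minimal Python sketch, not part of pdfalto: the function name `find_missing_tools` is hypothetical, and it only checks that each tool is on the PATH, not that it meets the minimum versions listed.

```python
import shutil

# Tool names taken from the requirements list above; only presence on PATH
# is checked, not the minimum versions (clang > 5, gcc > 7, cmake >= 3.10.0).
REQUIRED_TOOLS = ["cmake", "wget"]
COMPILERS = ["clang", "gcc"]  # any one of these is enough


def find_missing_tools(which=shutil.which):
    """Return a list of human-readable names for missing build prerequisites."""
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    if all(which(compiler) is None for compiler in COMPILERS):
        missing.append("clang or gcc")
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    print("missing:", ", ".join(missing) if missing else "nothing")
```

The `which` parameter defaults to `shutil.which` and exists only so the lookup can be replaced when testing.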
General usage is as follows:
Usage: pdfalto [options] <PDF-file> [<xml-file>]
-f <int> : first page to convert
-l <int> : last page to convert
-verbose : display pdf attributes
-noImage : deprecated, use -onlyGraphsCoord instead
-onlyGraphsCoord : only extract image coordinates, do not dump image files
-skipGraphs : skip all graphics processing (bitmap and vectorial)
-outline : create an outline file xml
-annotation : create an annotations file xml
-noLineNumbers : do not output line numbers added in manuscript-style textual documents
-readingOrder : blocks follow the reading order
-noText : do not extract textual objects (might be useful, but non-valid ALTO)
-charReadingOrderAttr : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
-fullFontName : fonts names are not normalized
-nsURI <string> : add the specified namespace URI
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-filesLimit <int> : limit of asset files to be extracted
-q : don't print any messages or errors
-v : print version info
-h : print usage information
-help : print usage information
--help : print usage information
-? : print usage information
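When pdfalto is called from another program, the command line above can be assembled programmatically. The sketch below is one possible Python wrapper, not an official API: the function names and the default executable path `./pdfalto` are assumptions, and only a few of the options listed above are covered.

```python
import subprocess


def build_pdfalto_command(pdf_path, xml_path=None, first_page=None, last_page=None,
                          only_graphs_coord=False, reading_order=False,
                          executable="./pdfalto"):
    """Assemble the argument list: pdfalto [options] <PDF-file> [<xml-file>]."""
    command = [executable]  # assumed location of the built executable
    if first_page is not None:
        command += ["-f", str(first_page)]
    if last_page is not None:
        command += ["-l", str(last_page)]
    if only_graphs_coord:
        command.append("-onlyGraphsCoord")
    if reading_order:
        command.append("-readingOrder")
    command.append(pdf_path)
    if xml_path is not None:
        command.append(xml_path)
    return command


def run_pdfalto(*args, **kwargs):
    """Run pdfalto; subprocess raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_pdfalto_command(*args, **kwargs), check=True)
```

For example, `run_pdfalto("paper.pdf", "paper.xml", first_page=1, last_page=3, only_graphs_coord=True)` would convert the first three pages without dumping image files.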
In addition to the ALTO file describing the PDF content, the following files are generated:
- _metadata.xml file containing the PDF file metadata (generated as a separate XML file because the ALTO schema does not support it).
- _annot.xml file containing a description of the annotations in the PDF (e.g. GOTO, external http links, ...), obtained with the -annotation option.
- _outline.xml file containing the PDF-embedded table of contents (aka outline), if any, obtained with the -outline option.
- .xml_data/ subdirectory containing the vectorial (.vec) and bitmap (.png) images embedded in the PDF. It is generated by default. This extraction slows down the process very significantly, so if no image files are required, use -onlyGraphsCoord (or the deprecated alias -noImage) to keep extracting image coordinates without dumping the image files. To skip all graphics processing (bitmap and vectorial), use -skipGraphs.
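A calling program may want to know where these side files end up. The Python sketch below rests on an assumption that the list above suggests but does not state: that the suffixes are attached to the name of the ALTO output file (for example `paper.xml` giving `paper_metadata.xml` and `paper.xml_data/`). The function name `expected_side_outputs` is hypothetical, and the result should be checked against a real run.

```python
from pathlib import Path


def expected_side_outputs(alto_xml_path):
    """Map each side output to the path it would have under the assumed naming."""
    alto = Path(alto_xml_path)
    base = alto.with_suffix("")  # "out/paper.xml" -> "out/paper"
    return {
        "metadata": Path(f"{base}_metadata.xml"),
        "annotations": Path(f"{base}_annot.xml"),  # only with -annotation
        "outline": Path(f"{base}_outline.xml"),    # only with -outline
        "assets": Path(f"{alto}_data"),            # "out/paper.xml_data"
    }
```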
The goal of pdfalto is to extract all the content of a PDF, not just text, but also layout, style, font, vector
graphics, embedded bitmap, annotation, metadata, and outline information. For convenience and debugging, we provide a
simple XSLT to extract only the text content from the produced ALTO XML file. For instance, using the xsltproc command
line tool, the following outputs the text content only:
xsltproc schema/alto2txt.xsl alto_file.xml
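Where xsltproc is not available, a similar text-only view can be approximated in Python. This is a rough sketch, not a port of alto2txt.xsl: it relies only on the standard ALTO elements TextLine, String (with its CONTENT attribute) and SP, ignores hyphenation and block structure, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO namespace version is accepted."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(xml_string):
    """Return one output line per TextLine, built from String and SP children."""
    root = ET.fromstring(xml_string)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        parts = []
        for child in element:
            name = local_name(child.tag)
            if name == "String":
                parts.append(child.get("CONTENT", ""))
            elif name == "SP":
                parts.append(" ")  # SP marks a space between two strings
        lines.append("".join(parts))
    return "\n".join(lines)


# Hand-written sample for illustration, not actual pdfalto output.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>
    <TextLine><String CONTENT="Second"/><SP/><String CONTENT="line"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    print(alto_to_text(SAMPLE))
```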
Dependencies can be recompiled by running this script
./install_deps.sh
The script will download and build the dependencies under libs/ and the additional language support packages for xpdf
under languages/.
If necessary, see compiling dependencies procedures for further details.
(issue 41) might occur while building; in this case you'll need to compile the dependencies before building pdfalto.
- NOTE for Windows : it's recommended to use Cygwin and install standard libraries (either for clang or gcc)
git clone https://github.com/kermitt2/pdfalto.git && cd pdfalto
- Xpdf-4.03 is shipped as a git submodule; to download it:
git submodule update --init --recursive
- Build pdfalto:
cmake .
make
The executable pdfalto is generated in the root directory. Additionally, this will create a static library for
xpdf-4.03 at xpdf-4.03/build/xpdf/lib/libxpdf.a, along with the other libraries in their respective
subdirectories.
To use the additional xpdf language support packages, the executable pdfalto comes with a config file xpdfrc and
language resources installed under languages/. Both xpdfrc and languages/ must be alongside the executable
pdfalto to be used. To add pdfalto with these additional resources to a third party application (e.g. GROBID), move
the executable together with these files:
lopez@work:~$ ls my_pdfalto/
languages pdfalto xpdfrc
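Copying the three items by hand is easy to get wrong when scripting a deployment. The Python sketch below shows one way to keep them together. The function name `bundle_pdfalto` and the directory arguments are hypothetical, and it assumes the build directory contains pdfalto, xpdfrc and languages/ as in the listing above.

```python
import shutil
from pathlib import Path

# Names taken from the directory listing above.
BUNDLE_ITEMS = ["pdfalto", "xpdfrc", "languages"]


def bundle_pdfalto(build_dir, target_dir):
    """Copy the executable, xpdfrc and languages/ side by side into target_dir."""
    build_dir, target_dir = Path(build_dir), Path(target_dir)
    missing = [name for name in BUNDLE_ITEMS if not (build_dir / name).exists()]
    if missing:
        raise FileNotFoundError(f"missing in {build_dir}: {', '.join(missing)}")
    target_dir.mkdir(parents=True, exist_ok=True)
    for name in BUNDLE_ITEMS:
        source = build_dir / name
        if source.is_dir():
            # dirs_exist_ok requires Python 3.8 or later
            shutil.copytree(source, target_dir / name, dirs_exist_ok=True)
        else:
            shutil.copy2(source, target_dir / name)  # also copies permission bits
    return sorted(path.name for path in target_dir.iterdir())
```

Failing early when an item is missing avoids shipping an executable that cannot find its language resources.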
(issue #135) on macOS "fontconfig.h file not found" might occur while building, see described workaround.
- Block element characters (https://unicode.org/charts/PDF/U2B00.pdf) appearing in the extracted text are placeholders for unknown Unicode values, used instead of the characters one would expect when visually inspecting the text. These values remain unresolved because the actual characters are glyphs embedded in the PDF document that use the free Unicode range for embedded fonts rather than the correct Unicode values. The only way to extract the valid text for those special characters is to use OCR at glyph level. This is our main targeted future enhancement, relying on a custom Deep Learning approach.
- map special characters in secondary fonts to their expected unicode
- try to optimize speed and memory
- see the issue tracker for further tasks
All changes are in the CHANGELOG.md
To release pdfalto you need bump-my-version.
Create and activate a virtual environment and install the tool:
python3 -m venv venv
source venv/bin/activate
pip install bump-my-version
Then you can run show-bump to see the planned version updates:
bump-my-version show-bump
0.5.1 ── bump ─┬─ major ─ 1.0.0
               ├─ minor ─ 0.6.0
               ╰─ patch ─ 0.5.2
and make the new release by:
bump-my-version bump patch|minor|major
and git push --tags
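The version plan above follows the usual MAJOR.MINOR.PATCH increments. The short Python sketch below reproduces that arithmetic for illustration only. It is not how bump-my-version is implemented, and the function name `next_versions` is hypothetical.

```python
def next_versions(current):
    """Return the major, minor and patch successors of a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in current.split("."))
    return {
        "major": f"{major + 1}.0.0",             # lower components reset to 0
        "minor": f"{major}.{minor + 1}.0",
        "patch": f"{major}.{minor}.{patch + 1}",
    }
```

Called with "0.5.1", it yields 1.0.0, 0.6.0 and 0.5.2, matching the plan shown by show-bump.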
Contact: Patrice Lopez (patrice.lopez@science-miner.com)
pdfalto is developed by Patrice Lopez (patrice.lopez@science-miner.com) and Achraf Azhar (achraf.azhar@inria.fr).
pdf2xml was originally written by Hervé Déjean, Sophie Andrieu, Jean-Yves Vion-Dury and Emmanuel Giguet (XRCE) under GPL2 license.
Xpdf is developed by Glyph & Cog, LLC (1996-2017) and distributed under GPL2 or GPL3 license.
The Windows version was originally built by @pboumenot and ported to Windows 7 for 64 bit, then to Windows (native and Cygwin) by @lfoppiano and @flydutch.
Like the original pdf2xml and its main dependency Xpdf, pdfalto is distributed under the GPL2 license.
Some tools for converting ALTO into other formats: