PDF Converting, Linux

Table of Contents

Basic Info
Help

cmd:pdftohtml

Basic Info

Usage

pdftohtml -c -noframes test.pdf test.html
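
The same command can be applied to several files in one go. The following is a minimal sketch, assuming a POSIX shell; the function names html_name_for and convert_all are made up for illustration and are not part of pdftohtml.

```shell
# Derive the output name by swapping a trailing .pdf for .html
# (hypothetical helper, not part of pdftohtml).
html_name_for() {
    printf '%s\n' "${1%.pdf}.html"
}

# Convert every PDF passed as an argument, using the same flags
# as the usage line above. Stops at the first failed conversion.
convert_all() {
    for pdf in "$@"; do
        pdftohtml -c -noframes "$pdf" "$(html_name_for "$pdf")" || return 1
    done
}
```

For example, `convert_all *.pdf` would convert every PDF in the current directory.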

Info

Translates PDF documents into HTML format.

Source

http://pdftohtml.sourceforge.net/

Description

Translates PDF files into HTML or XML formats, combined with PNG images. Supports encrypted PDF files.

Tag: interface::commandline, role::sw:utility, use::converting, works-with::text:html, works-with::text:pdf

pdftohtml (http://www.ra.informatik.uni-stuttgart.de/gosho/pdftohtml/) takes another path. It analyses the PDF data and converts it into an HTML text file. It also detects and converts links in PDF files. Pictures are likewise extracted from the file and placed at the appropriate position in the HTML file. The visual and layout quality is modest (formatting information is ignored), but the generated file can at least serve as a starting point for subsequent fine-tuning. One drawback is that the tool tends to put only one word on each line of the HTML code.

Help

Support

mailing list

Quick Help

pdftohtml version 0.36 http://pdftohtml.sourceforge.net/, based on Xpdf version 2.02

Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
 -f <int>          : first page to convert
 -l <int>          : last page to convert
 -q                : don't print any messages or errors
 -h                : print usage information
 -help             : print usage information
 -p                : exchange .pdf links by .html
 -c                : generate complex document
 -i                : ignore images
 -noframes         : generate no frames
 -stdout           : use standard output
 -zoom <fp>        : zoom the pdf document (default 1.5)
 -xml              : output for XML post-processing
 -hidden           : output hidden text
 -nomerge          : do not merge paragraphs
 -enc <string>     : output text encoding name
 -dev <string>     : output device name for Ghostscript (png16m, jpeg etc)
 -v                : print copyright and version info
 -opw <string>     : owner password (for encrypted files)
 -upw <string>     : user password (for encrypted files)
 -nodrm            : override document DRM settings
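
Several of the options listed above can be combined. The following is only a sketch: the function names are placeholders invented for illustration, and the flags are taken from the usage text above as printed by version 0.36, so other versions may behave differently.

```shell
# Convert only pages $1..$2 of the PDF $3 into the single HTML file $4,
# skipping images (-i) and suppressing messages (-q).
convert_page_range() {
    pdftohtml -q -i -f "$1" -l "$2" -noframes "$3" "$4"
}

# Produce output for XML post-processing (-xml) from the PDF $1,
# using $2 as the output name.
convert_to_xml() {
    pdftohtml -q -xml "$1" "$2"
}
```

For example, `convert_page_range 2 5 test.pdf pages.html` would convert pages 2 to 5 of test.pdf.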

Detail Help

Without the -c option ("generate complex document"), conversion is about 10 times faster, but all typesetting information is lost; only the bare content and links remain.
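
The speed difference can be checked on a given file by timing both modes. This is a rough sketch, assuming a POSIX shell and a `date` that supports `+%s`; time_convert is a hypothetical helper, and whole-second resolution is only meaningful for larger documents.

```shell
# Print the wall-clock seconds one pdftohtml run takes.
# Usage: time_convert <pdf> <html> [extra pdftohtml flags...]
# (hypothetical helper, not part of pdftohtml)
time_convert() {
    pdf=$1
    html=$2
    shift 2
    start=$(date +%s)
    # Remaining arguments (for example -c) are passed through unchanged.
    pdftohtml -q "$@" -noframes "$pdf" "$html" || return 1
    end=$(date +%s)
    echo $((end - start))
}
```

Running `time_convert test.pdf complex.html -c` and then `time_convert test.pdf simple.html` gives the two durations to compare.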