Translates PDF files into HTML or XML, with images extracted as PNG files. Supports encrypted PDF files.
Tag: interface::commandline, role::sw:utility, use::converting, works-with::text:html, works-with::text:pdf
http://www.linux-magazine.com/issue/16/ConvertingToHTML.pdf
The pdftohtml tool (http://www.ra.informatik.uni-stuttgart.de/gosho/pdftohtml/) takes another path. Here, the PDF data is analysed and converted into an HTML text file. pdftohtml also detects and converts links in PDF files. Pictures are likewise extracted from the file and built into the appropriate place in the HTML file. The visual and layout quality is not amazing (any formatting information is ignored), but at least the created file can serve as a starting point for subsequent fine-tuning. The only problem with this is that the tool has a strange habit of putting only one word on each line of the HTML code.
pdftohtml version 0.36 http://pdftohtml.sourceforge.net/, based on Xpdf version 2.02
Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
  -f <int>       : first page to convert
  -l <int>       : last page to convert
  -q             : don't print any messages or errors
  -h             : print usage information
  -help          : print usage information
  -p             : exchange .pdf links by .html
  -c             : generate complex document
  -i             : ignore images
  -noframes      : generate no frames
  -stdout        : use standard output
  -zoom <fp>     : zoom the pdf document (default 1.5)
  -xml           : output for XML post-processing
  -hidden        : output hidden text
  -nomerge       : do not merge paragraphs
  -enc <string>  : output text encoding name
  -dev <string>  : output device name for Ghostscript (png16m, jpeg etc)
  -v             : print copyright and version info
  -opw <string>  : owner password (for encrypted files)
  -upw <string>  : user password (for encrypted files)
  -nodrm         : override document DRM settings
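As a rough illustration of how the options in the usage text above combine, here is a minimal Python sketch that assembles a pdftohtml argument list. The helper name build_pdftohtml_cmd and its parameter names are hypothetical and not part of pdftohtml. Only flags that appear in the usage text are used, and the file names are placeholders. The sketch builds the command but does not run it.

```python
from typing import List, Optional


def build_pdftohtml_cmd(
    pdf_path: str,
    out_base: Optional[str] = None,
    first: Optional[int] = None,
    last: Optional[int] = None,
    xml: bool = False,
    noframes: bool = False,
    ignore_images: bool = False,
    user_password: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list from flags listed in the usage text.

    Hypothetical helper for illustration; it only builds the list.
    """
    cmd = ["pdftohtml"]
    if first is not None:
        cmd += ["-f", str(first)]        # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]         # last page to convert
    if xml:
        cmd.append("-xml")               # output for XML post-processing
    if noframes:
        cmd.append("-noframes")          # generate no frames
    if ignore_images:
        cmd.append("-i")                 # ignore images
    if user_password is not None:
        cmd += ["-upw", user_password]   # user password (encrypted files)
    cmd.append(pdf_path)                 # options come before the PDF file
    if out_base is not None:
        cmd.append(out_base)             # optional output file name
    return cmd


if __name__ == "__main__":
    # "manual.pdf" and "manual" are placeholder names.
    cmd = build_pdftohtml_cmd("manual.pdf", "manual", first=1, last=3, noframes=True)
    print(" ".join(cmd))
    # prints: pdftohtml -f 1 -l 3 -noframes manual.pdf manual
    # To run the conversion (assumes pdftohtml is installed and on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a file name or password contains spaces.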