pstotext 

*Tags*: ps to text, ps2text

Usage 

pstotext ~/dl/mustH_b/wp/latex/learn/epslatex.ps > ~/tmp/ps/txt/epslatex.txt

Info 

Description 

pstotext is a program that works with Ghostscript to extract plain text from PostScript and PDF files.

Features 

pstotext works by sending a library, followed by the PostScript file, to the Ghostscript interpreter. The library intercepts the text rendering operators and sends information about the text back to pstotext. This information includes character metrics and encoding vectors, so in most situations pstotext can reconstruct the plain text (converted to ISO Latin 1 encoding), with correct word breaks and good guesses about line breaks. It even works for rotated text.
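
Since the extracted text is ISO Latin 1, accented characters may look wrong on a UTF-8 terminal or in a UTF-8 editor. A minimal sketch of a re-encoding step, assuming iconv is installed; the function name latin1_to_utf8 and the file names are placeholders of mine, not part of pstotext:

```shell
#!/bin/sh
# Hypothetical helper (not part of pstotext): re-encode ISO Latin 1
# text read from stdin as UTF-8 on stdout, using iconv.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# Example use (paper.ps and paper.txt are placeholder names):
#   pstotext paper.ps | latin1_to_utf8 > paper.txt
```

The conversion is kept as a separate filter so that the pstotext invocation itself stays unchanged.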

Source 

http://www.research.compaq.com/SRC/virtualpaper/pstotext.html
http://www.research.compaq.com/SRC/virtualpaper/cgi-bin/nph-download.tcl/pstotext.tar.Z?object=pstotext

Comments 

  • The '-output' option doesn't work here; redirect stdout to a file instead.
  • Tested on KDD2000PostWkshp.ps: the result is great.
  • Tested on www10_sarwar.pdf: the result is garbage.
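
The redirect workaround can be wrapped in a small helper so that a failed conversion does not leave a half-written text file behind. This is only a sketch: the function name ps_to_txt and the PSTOTEXT override variable are my own inventions, and pstotext itself does not read that variable.

```shell
#!/bin/sh
# Hypothetical wrapper around the "redirect stdout" workaround.
# PSTOTEXT is an assumed override for the command name (default: pstotext).
PSTOTEXT="${PSTOTEXT:-pstotext}"

# ps_to_txt SRC DST: convert SRC to plain text in DST.
ps_to_txt() {
    src=$1
    dst=$2
    # Write to a temporary file first, then rename it on success,
    # so a failed run never leaves a partial DST.
    tmp="$dst.tmp.$$"
    if "$PSTOTEXT" "$src" > "$tmp"; then
        mv "$tmp" "$dst"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example use (placeholder names):
#   ps_to_txt epslatex.ps epslatex.txt
```

Note that the exit status only shows whether the program ran without error; garbage output like the www10_sarwar.pdf case still has to be checked by eye.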

Help 

Quick Help 

 Usage: pstotext [option|file]...
Options:
  -cork            assume Cork encoding for dvips output
  -landscape       rotate 270 degrees
  -landscapeOther  rotate 90 degrees
  -portrait        don't rotate (default)
  -bboxes          output one word per line with bounding box
  -debug           show Ghostscript output and error messages
  -gs "command"    Ghostscript command
  -                read from stdin (default if no files specified)
  -output file     output results to "file" (default is stdout)

Version 1.8g of 5 February 2000 

Build & Installation 

Steps 

make

cp pstotext /opt/bin/
cp pstotext.1 /opt/man/man1

chmod 711 /opt/bin/pstotext
chmod 444 /opt/man/man1/pstotext.1

documented on: 2001.06.03 Sun 18:12:23