html2text -o outfile.txt <file>
lynx <url> | html2text -ascii -nobs -width 76 -style pretty | less
html2text [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ] \
          [ -rcfile <file> ] [ -style ( compact | pretty ) ] [ -width <w> ] \
          [ -o <file> ] [ -nobs ] [ -ascii ] [ <input-url> ] ...

Formats HTML document(s) read from <input-url> or STDIN and generates ASCII text.
-rcfile <file>   Read <file> instead of "$HOME/.html2textrc"
-style compact   Create a "compact" output format (default)
-style pretty    Insert some vertical space for nicer output
-width <w>       Optimize for screen widths other than 79
-o <file>        Redirect output into <file>
-nobs            Do not use backspaces for boldface and underlining
-ascii           Use plain ASCII for output instead of ISO-8859-1
documented on: 2005.02.21
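The pipeline form shown in the usage examples can be wrapped in a small shell function. This is only a sketch: the function name url2text and the optional second argument for the width are assumptions made for illustration, while the html2text options are the ones listed above and "lynx -source" is the fetch command mentioned in the exchange below.

```shell
# url2text is a hypothetical helper name, not part of html2text.
# Usage: url2text <url> [width]
url2text() {
    # lynx -source writes the raw HTML of the page to stdout;
    # html2text then reads it from STDIN and formats it as plain ASCII.
    lynx -source "$1" |
        html2text -ascii -nobs -width "${2:-76}" -style pretty
}
```

For example, "url2text <url> | less" pages the formatted output, and "url2text <url> 60" formats it for a narrower screen.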
Hi, I like your program very much!
> One feature request: Could you add some URL handling features? The
> html2textrc(5) feature is really nice, but it lacks the ability to handle
> URLs. I.e., I'm hoping to use this tool to convert web pages to wiki format,
> which means I want to be able to define that URLs be translated into plain
> text also. E.g., I hope I can use html2text to translate URLs into wiki
> format like, !http://www.google.com[] This Link points to google?, or in
> DokuWiki format -- [[http://www.google.com|This Link points to google]].
> Thanks.
Think of html2text as a filter. It aims to show you what you would see if you loaded the file in a graphical browser, without being a browser. The attributes of HTML elements are not interpreted, with the exception of "IMG ALT", which is used to give a good substitute for images that cannot be represented in plain text media. While I understand that it might be desirable for Wiki source code pre-processing to have the "A HREF" attribute contents displayed verbatim, this would completely break with the idea of a filter. html2text is expected not to bother about markup as long as it does not contain any structural information (so-called logical markup, think of headings, lists and so on). "A" does not.
> One suggestion: according to the manual, "html2text will not follow
> redirections (HTTP 301/307). Proxy servers are not supported." This makes
> html2text of very limited use. Why not just remove all URL fetching code,
> and rely on cat/lynx to feed HTML files to html2text? I.e., do it the Unix
> way -- "do only one thing, but do it the best". In fact, in my alias, I
> always use "lynx -source" to pipe stuff to html2text -- I don't want my
> code to work sometimes and fail at other times (I think redirections happen
> quite often on the web).
As already stated in the documentation, the HTTP implementation in html2text is rather basic: all it does is more or less issue a "GET" request. It's more a gimmick than a core function. But that is not sufficient reason to remove it completely and disappoint all the other users who might find it useful. Thus, if the HTTP engine in html2text does not fit your needs, just don't use it.
MartinBayer
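
Since html2text will not print "A HREF" contents itself, the wiki use case from the exchange above could be approached with a separate filter placed in front of it, in line with the pipeline idea. The following is a rough sketch only, not something the author proposes: href2wiki is a hypothetical name, the DokuWiki link form [[url|text]] is assumed as the target, and the sed pattern only handles lowercase <a ...> tags with a double-quoted href and plain-text link content, all on one line. It also assumes html2text passes the rewritten text through unchanged.

```shell
# href2wiki is a hypothetical pre-filter, not part of html2text.
# It rewrites <a href="URL">text</a> into the DokuWiki form [[URL|text]]
# so that the link target survives as ordinary text.
href2wiki() {
    sed -E 's#<a [^>]*href="([^"]*)"[^>]*>([^<]*)</a>#[[\1|\2]]#g'
}
```

It would be used as one more stage in the pipeline, for example: lynx -source <url> | href2wiki | html2text -ascii -nobs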