HTML/XML related File processing

Debian HTML Converting packages

html2text - An advanced HTML to text converter

markdown - Text-to-HTML conversion tool http://daringfireball.net/projects/markdown/

parsewiki - Documentation System Based on ASCII Text. No English documentation found on the web; very few people use it.

txt2html - Text to HTML converter. This package has been orphaned since 2004-03-13 (at v2.23-1).

txt2tags - a Python conversion tool generating HTML/SGML/LaTeX/man/MoinMoin/mgp/PageMaker files

unhtml - Remove the markup tags from an HTML file

w3m - WWW browsable pager with excellent tables/frames support

documented on: 2005.02.20

dpkg:html2text

Usage

html2text -o outfile.txt <file>
lynx -source <url> | html2text -ascii -nobs -width 76 -style pretty | less

Info

An advanced HTML to text converter

Source

http://userpage.fu-berlin.de/~mbayer/tools/html2text.html

Help

Support

http://www.mbayer.de/wiki/index.php?HtmlToText

Quick Help

  html2text [ -unparse | -check ] [ -debug-scanner ] [ -debug-parser ] \
     [ -rcfile <file> ] [ -style ( compact | pretty ) ] [ -width <w> ] \
     [ -o <file> ] [ -nobs ] [ -ascii ] [ <input-url> ] ...
Formats HTML document(s) read from <input-url> or STDIN and generates ASCII
text.

-rcfile <file> Read <file> instead of "$HOME/.html2textrc"
-style compact Create a "compact" output format (default)
-style pretty  Insert some vertical space for nicer output
-width <w>     Optimize for screen widths other than 79
-o <file>      Redirect output into <file>
-nobs          Do not use backspaces for boldface and underlining
-ascii         Use plain ASCII for output instead of ISO-8859-1

documented on: 2005.02.21

enhance request

Hi, I like your program very much!

> One feature request: Could you add some URL handling features? The
> html2textrc(5) feature is really nice, but it lacks the ability to handle
> urls. I.e., I'm hoping to use this tool to convert web pages to wiki format,
> which means I want to be able to define that urls be translated in plain
> text also. E.g., I hope I can use html2text to translate urls into wiki
> format like, !http://www.google.com[] This Link points to google?, or in
> DokuWiki format -- [[ http://www.google.com | This Link points to google
> ]]. Thanks.

Think of html2text as a filter. It aims to show you what you would see if you loaded the file in a graphical browser, without being a browser. The arguments of HTML elements are not interpreted, with the exception of "IMG ALT", which is used to give a good substitute for images that cannot be represented in plain text media. While I understand that it might be desirable for Wiki source code pre-processing to have the "A HREF" argument contents displayed verbatim, this would completely break with the idea of a filter. html2text is expected not to bother about markup as long as it does not contain any structural information (so called logical markup, think of headings, lists and so on). "A" does not.

> One suggestion: according to the manual, "html2text will not follow
> redirections (HTTP 301/307). Proxy servers are not supported." This turns
> html2text to a very limited use. Why not just remove all url fetching code,
> and rely on cat/lynx to feed html files to html2text? I.e., do it the unix
> way -- "do only one thing, but do it the best". In fact, in my alias, I
> always use "lynx -source" to pipe stuff to html2text -- I don't want that my
> code sometimes work, but sometimes don't (I think redirections happen quite
> often on the web).

As already stated in the documentation, the HTTP implementation in html2text is rather basic: all it does is more or less issue a "GET" request. It's more a gimmick than a core function. But that's not sufficient reason for removing it completely and disappointing all of the other users who might find it useful. Thus, if the HTTP engine in html2text does not fit your needs, just don't use it.

MartinBayer
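
Since html2text deliberately drops the "A HREF" contents, one workaround is to rewrite the anchors before the HTML reaches the filter. The following is only a minimal sketch: the function name anchors_to_dokuwiki is made up for illustration, and it assumes lowercase tag names, double-quoted href values, and anchors that sit on a single line without nested tags. Real-world pages will often break these assumptions.

```shell
# Hypothetical helper (not part of html2text): rewrite
#   <a href="URL">TEXT</a>
# as the DokuWiki-style link [[URL|TEXT]], so that the link target
# survives a later HTML-to-text pass.
# Assumptions: lowercase <a> tags, double-quoted href values,
# no nested tags inside the anchor, anchor does not span lines.
anchors_to_dokuwiki() {
    sed -E 's#<a [^>]*href="([^"]*)"[^>]*>([^<]*)</a>#[[\1|\2]]#g'
}
```

Following the "lynx -source" habit described in the thread, a possible pipeline would be: lynx -source <url> | anchors_to_dokuwiki | html2text -nobs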

Can anyone recommend an HTML 'beautifier'?

I am looking for a utility which will turn all of my lower case HTML (written in, say, notepad) into upper case and perhaps indent it too. Does anyone know if such a program exists?

Can anyone recommend an HTML 'beautifier'?

You can use Dave Raggett's "tidy" to force all tags into upper case, or lower case. Actually, this tool can do so much more to beautify your code; for me it's a must-have.

It's a command-line tool (great for mass manipulation), but it has also been integrated in a number of GUI programs. See http://www.w3.org/People/Raggett/tidy/ for more info.

Matthias

Can anyone recommend an HTML 'beautifier'?

> I would just like to thank Matthias and all the others who have pointed me
> in the direction of 'Tidy'.  From what I have read, it does exactly the
> sort of thing I am looking for.  My problem now is figuring out how it
> works!  Excuse me for being thick, but the blurb on www.w3.org about the
> program is virtually incomprehensible for a novice!  All that stuff about
> stderr and stdout doesn't mean a thing to me.

Download tidy, and save it in one of the directories which are on your default path.

Open an MS-DOS window.

In it, connect to the directory where your HTML files are by typing

cd <path to directory>

Now, for each file you want to work on, type

tidy -i -u -m <filename>

Where -i means 'indent'; -u means 'upper-case tags', and -m means 'modify in place'.

Simon Brooke

Can anyone recommend an HTML 'beautifier'?

> works!  Excuse me for being thick, but the blurb on www.w3.org about the
> program is virtually incomprehensible for a novice!  All that stuff about
> stderr and stdout doesn't mean a thing to me.

If you really want to do more of this web stuff, you might as well get used to the technobabble right now. There is more to come :-).

However… HTML-Kit ( http://www.chami.com/html-kit/ ) has a nice graphic frontend for tidy, including the lowercase/uppercase thing. Programs for other platforms are mentioned at the "tidy" homepage.

Matthias

Can anyone recommend an HTML 'beautifier'?

http://groups.google.com/groups?selm=8EF099725maddogonlineno%40127.0.0.1

Newsgroups: comp.infosystems.www.authoring.html
Date: 2000/03/07

>I am looking for a utility which will turn all of my lower case HTML
>(written in, say, notepad) into upper case and perhaps indent it too.
>Does anyone know if such a program exists?

With the upcoming shift from HTML to XML/XHTML I would advise against converting your elements to upper case. XHTML is case sensitive, and only lower-case is allowed.

However, if you find documents hard to read with only lowercase, I would advise that you get an editor with source highlighting. One such freeware program I recommend for the Windows platform is 1stPage 2000, downloadable from <URL:http://www.evrsoft.com/>. It also has a tool called HTML tidy which will convert case and fix indentation for your HTML documents.

If you're just after tidy, it can be downloaded for several platforms from <URL: http://www.w3.org/People/Raggett/tidy/>. It can fix other problems with the HTML, such as nesting errors and Word-generated HTML.

Arve Bersvendsen
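
The lower-case rule for XHTML can be illustrated with a small filter. This is a rough sketch only, not a replacement for tidy: the function name lowercase_tags is hypothetical, it relies on GNU sed (the \L and \E case conversions are GNU extensions), and it lowercases element names only, leaving attribute names and values untouched. It does not parse HTML, so a "<" followed by a letter inside comments or scripts would be affected too.

```shell
# Hypothetical helper: lowercase element names in start and end tags.
# Assumes GNU sed; \L starts lowercase conversion, \E ends it.
# Attribute names/values and declarations such as <!DOCTYPE ...>
# are not changed.
lowercase_tags() {
    sed -E 's#<(/?)([A-Za-z][A-Za-z0-9]*)#<\1\L\2\E#g'
}
```

For example, a file could be passed through it with: lowercase_tags < old.html > new.html (both file names are placeholders).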

HTML Beautifier

http://groups.google.com/groups?selm=38831fd4.439649587%40news.ukgateway.net

Newsgroups: alt.html.writers
Date: 2000/01/17

>I am searching for an HTML source code beautifier. "tidy", which I got from
>the W3C site, doesn't really work. If the program also has a function to
>check the code, that would be fine.

Perhaps if you could tell us which aspect of tidy doesn't work for you we might be able to offer some suggestions.

I find Tidy very useful.

Calum

p.s. We're discussing http://www.w3.org/People/Raggett/tidy/

HTML Beautifier

Hi. Request canceled, I just found a tool that fits my needs perfectly. Thank you for your attendance.

BTW : If you are interested … http://freshmeat.net/projects/htmltidy/

Martin

HTML Beautifier

Note, htmltidy in freshmeat.net *is* Raggett's tidy.

Actually, the most up to date development branch is at http://tidy.sourceforge.net/

Tong

Convert HTML codes to upper-ASCII characters

Faster lex-based html2uml translator http://webglimpse.net/allusers/htuml2txt.lex.tar.gz

slower htuml2txt.pl perl filter http://webglimpse.net/docs/htuml2txt.pl

txtfmt

txtfmt is an ASCII text formatter utility which formats XML documents into ASCII text. It is most useful for formatting e-mail messages. It handles paragraphs, bullets, tables, and more.

Source

http://www.tun.com/software/txtfmt/

Related Urls

http://freshmeat.net/projects/txtfmt/

Comments

cmd:Tidy

Matthew Campbell - April 16th 1999, 03:20 EST

Usage

rm -f err; ls *htm | doeach.pl 'tidy -wrap 72 -raw @_' @g ~+1/@_ 2@g@g err

tidy --show-warnings no --force-output yes -quiet -indent -raw -wrap ${hfrm:-5000} -asxml --write-back yes

tidy --write-back yes --show-warnings no --clean yes --force-output yes -wrap 5000 l-grub-1-1.html
tidy -quiet -upper -asxml -numeric -indent --show-warnings no --clean yes --force-output yes -wrap 5000 /home/tong/try/grub/l-grub/l-grub-1-1.html > $tf.grub.htm

Description

HTML TIDY is a free utility to fix mistakes made while editing HTML and to automatically tidy up sloppy editing into nicely laid out markup. It also works great on the atrociously hard to read markup generated by specialized HTML editors and conversion tools, and can help you identify where you need to pay further attention to making your pages more accessible to people with disabilities.

Source

http://tidy.sourceforge.net/

The maintenance of Tidy has now been taken over by a group of enthusiastic volunteers at Source Forge, see http://tidy.sourceforge.net.

http://www.w3.org/People/Raggett/tidy/

Build & Installation

make
mv tidy ~/local/bin

Run

rm -f err; ls *htm | doeach.pl 'tidy -wrap 72 -raw @_' @g ~+1/@_ 2@g@g err

tidy nowhere.htm
tidy 4dos650.htm
tidy -f err 4dos650.htm

documented on: Sat 11-14-98 22:38:32

GNU help files to HTML

Texi2html

texi2html is a Perl script that converts GNU's Texinfo files to HTML.

The program takes Texinfo files (and not info ones) and produces a set of HTML files. The quality of the output is close to the printed output and is much better than an info->HTML gateway.

old page

http://wwwinfo.cern.ch/dis/texi2html/ isn't maintained anymore.

new page

http://www.mathematik.uni-kl.de/~obachman/Texi2html/

Texi2html's current Homepage

Texi2html is a Perl script that converts Texinfo to HTML. http://texinfo.org/ http://w3c.org/MarkUp/

Lionel Cons's Texi2html page has much more information. http://wwwinfo.cern.ch/dis/texi2html/

You can download the latest release version from http://www.mathematik.uni-kl.de/~obachman/Texi2html/

Last modified: Mon Nov 22 14:06:17 MET 1999

documented on: 1999.11.23 Tue 10:02:01

linux xml viewer

Linux native solution

GTK+ XML viewer

.. gxmlviewer is an xml viewer … License: GNU General Public License (GPL);

Natural Language: English; Operating System: Linux; …

http://sourceforge.net/projects/gxmlviewer/

Adobe SVG Viewer 3.0 betas for Linux and Solaris

http://www.xml.com/cs/user/view/cs_msg/323

New Linux tools: Adobe SVG viewer and Quick … Users of the Linux platform will be glad to hear of the release of Adobe's SVG viewer: also JXML's Quick Java/XML toolkit now has explicit Linux support. …

http://www.xmlhack.com/read.php?item=1479

Java solution: XML Viewer

alphaWorks … UNIX scripts are included. What is XML Viewer for Java TM ? XML Viewer for Java is a Java application …

http://www.alphaworks.ibm.com/tech/xmlviewer

Java solution: XML Viewer

XML tools by category http://www.garshol.priv.no/download/xmltools/cat_ix.html

XML tools by platform http://www.garshol.priv.no/download/xmltools/plat_ix.html

XML viewer

XML Viewer review

http://www.garshol.priv.no/download/xmltools/prod/XMLViewer.html

By:         IBM alphaWorks
Version:    15.Sep.99 release
Platforms:  Java
Category:   XML browsers
Info:       http://www.alphaworks.ibm.com/tech/xmlviewer

XML Viewer is a simple Java application that can display both raw XML source and a tree view of any well-formed XML document. XML Viewer is also DTD-aware and can show DTDs as well and show the declaration of any element or attribute.

Free XML Tools. http://www.garshol.priv.no/download/xmltools/

XML Viewer

http://www.alphaworks.ibm.com/tech/xmlviewer
http://www.alphaworks.ibm.com/aw.nsf/download/xmlviewer

XML and DB

http://www.xml.com/lpt/a/2000/12/13/perlxmldb.html

http://www.linuxworld.com/linuxworld/lw-1999-03/lw-03-xml_p.html
http://www.linuxworld.com/linuxworld/lw-1999-09/lw-09-xml2_p.html
http://www.linuxworld.com/linuxworld/lw-2000-08/lw-08-xml2tools_p.html
http://www.linuxworld.com/linuxworld/lw-2001-02/lw-02-xml3databases_p.html

http://www.rpbourret.com/xml/XMLAndDatabases.htm
http://www.rpbourret.com/xml/XMLDatabaseProds.htm

http://www.webtechniques.com/archives/1999/06/david/

XML Parsing using Perl

Expat

Info

Release 1.95.1

Source

http://sourceforge.net/projects/expat/
http://download.sourceforge.net/expat/expat-1.95.1.tar.gz

Related Urls

A reference manual is available as doc/reference.html in this distribution.

Discussion related to the direction of future expat development takes place on expat-discuss@lists.sourceforge.net. Archives of this list may be found at http://www.geocrawler.com/redir-sf.php3?list=expat-discuss.

Build, Test run & Installation

Steps

./configure --prefix=/opt

make

pkg=expat-1.95.1
make -n install | tee /export/pub/installs/logs/$pkg.log.0
make install | tee /export/pub/installs/logs/$pkg.log.1

documented on: 2000.12.21 Thu 20:24:24