DebianLinux.Net :: Text Management 

http://debianlinux.net/text_management.html

Table of Contents

  1. Text Management Portals

  2. Unicode Text Tools

  3. Web based Text Editors

  4. Collaborative Text Editors

  5. Screen Text Editors

  6. Stream Text Editors

  7. Screen XML Editors

  8. Stream XML Editors

  9. Screen HTML Editors

  10. Stream HTML Editors

  11. Binary Editors

  12. Text Comparison

  13. Text Conversion

  14. TypeSetting & PostScript Tools

  15. Text Synthesis & Recognition

documented on: 2006.06.10

File cutting 

File cutting summary 

line oriented 

byte oriented 

column oriented 

use cut.

documented on: 2007.04.11

cmd:head 

head
head -c 50m

Help 

-n, --lines=[-]N         print the first N lines instead of the first 10;
                           with the leading `-', print all but the last
                           N lines of each file
-c, --bytes=[-]N         print the first N bytes of each file;
                           with the leading `-', print all but the last
                           N bytes of each file
SIZE  may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.
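
The options quoted above can be tried on throwaway input. This is only a small sketch, assuming GNU head (the negative counts are a GNU extension) and that seq and paste are available; the variable names are made up for illustration.

```shell
# Negative line count: "-n -2" prints everything except the last 2 lines.
all_but_last_two=$(seq 1 5 | head -n -2 | paste -s -d ' ' -)

# "-c 3" keeps only the first 3 bytes of the input.
first_three_bytes=$(printf 'abcdef' | head -c 3)

# Negative byte count: "-c -2" prints everything except the last 2 bytes.
all_but_last_two_bytes=$(printf 'abcdef' | head -c -2)

echo "$all_but_last_two"
echo "$first_three_bytes"
echo "$all_but_last_two_bytes"
```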

head or tail 

> > Is there any ready-made tool to print lines from a file *after* a given
> > line number?

Well, tail does that just fine; to skip the first 10 lines of "file":

tail -n +11 file

Another option is sed:

sed 1,10d file

The sed approach generalizes better; to print lines 11-20 of a file:

sed -e 1,10d -e 20q file

-Ken Pizzini
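
Both approaches from the reply can be checked on a generated file. A minimal sketch, assuming seq, mktemp and paste are available; the sample file and the variable names are illustrative only.

```shell
# Build a 30-line sample file (one number per line).
sample=$(mktemp)
seq 1 30 > "$sample"

# "tail -n +11" starts output at line 11, i.e. skips the first 10 lines.
first_after_skip=$(tail -n +11 "$sample" | head -n 1)

# Delete lines 1-10, then quit after line 20 has been printed.
lines_11_to_20=$(sed -e 1,10d -e 20q "$sample" | paste -s -d ' ' -)

echo "first line after skipping 10: $first_after_skip"
echo "lines 11-20: $lines_11_to_20"
rm -f "$sample"
```

The 20q part also stops sed from reading the rest of the file, which matters on large inputs.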

head or tail 

All lines after line 10:

sed -n '11,$ p' <infile

Ken

documented on: 07-19-99

cmd:cut 

Usage 

ff -l . | cut -c1-12,29-
ls -l | cut -c30-42,56-
head /usr/X11R6/lib/X11/rgb.txt | cut -f 3
# ls -l | cut -d ' ' -f 1,9   # not working: ls pads its columns with runs of spaces

Help 

-c list The list following -c  specifies  character  posi-
         tions  (for  instance, -c1-72 would pass the first
         72 characters of each line).
Note: positions are numbered starting from 1.
-f, --fields field-list
      Print only the fields listed in field-list.  Fields are  sepa-
      rated by a TAB by default.
-d, --delimiter delim
      For  -f,  fields are separated by the first character in delim
      instead of by TAB.
Note: -d 'x' only takes effect together with -f.

choosing the fields 

cut -d':' -f 2
$ cut -d: -f1,5 /etc/passwd | head
root:Super-User
daemon:
adm:Admin
lp:Line Printer Admin
smtp:Mail Daemon User
uucp:uucp Admin
Note: -d ' ' is not good.

cut treats every single space as a delimiter, so it cannot imitate awk's whitespace-based field selection. Better to use -c to pick out the column range if you can.
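
The difference is easy to see on one padded line. A small sketch; the sample string and variable names are invented for the demonstration.

```shell
# "alpha" and "beta" are separated by three spaces.
padded='alpha   beta gamma'

# cut counts every single space as a delimiter, so field 2 is the
# empty string between the first and second space.
cut_field=$(printf '%s\n' "$padded" | cut -d ' ' -f 2)

# awk splits on runs of whitespace, so field 2 is the second word.
awk_field=$(printf '%s\n' "$padded" | awk '{ print $2 }')

echo "cut field 2: [$cut_field]"
echo "awk field 2: [$awk_field]"
```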

How to eliminate 1 column 

Newsgroups: comp.unix.shell
> > I need to eliminate the second column of a certain file.
>
> Sounds like a job for cut.

Only if the columns are delimited by *exactly one* space. If they look like:

Margolin    Barry    Other stuff
Doherty     John     More other stuff

then cut by itself is useless. You could, however, use sed to collapse all the spaces into a single space and then pipe that to cut.

Barry Margolin
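
One way to do the collapsing step is sketched here. It uses tr -s rather than the sed suggested in the reply, since squeezing repeated characters is exactly what tr -s does; the variable names are illustrative.

```shell
# Two rows padded with runs of spaces, as in the example above.
rows='Margolin    Barry    Other stuff
Doherty     John     More other stuff'

# tr -s squeezes each run of spaces into one space, so cut sees clean
# single-space delimiters. "-f 1,3-" keeps field 1 plus everything
# from field 3 onward, which drops the second column.
without_second=$(printf '%s\n' "$rows" | tr -s ' ' | cut -d ' ' -f 1,3-)

printf '%s\n' "$without_second"
```

Note that the original column alignment is lost, because the padding is gone after squeezing.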

reproduce "head -c" behaviour with dd 

Newsgroups:  comp.unix.shell
Date:        Tue, 5 Dec 2006 12:23:46 -0500
> #-- extract gzip for binaries.out
> head -c +$GZIPBYTE binaries.out >gzip 2>$NUL
> [ $? -eq 0 -a -f gzip ] || { NO; EXIT $BINARIES_EXTRACTION_FAILED; }
>
> The problem is that the script need to be portable on Linux and most of
> Unix flavor (HP-UX, AIX, SCO, Solaris, UnixWare) and "head -c" is not
> supported everywhere (not on SCO, UnixWare, Solaris).
>
> Is it possible to reproduce the head -c behaviour with dd command (or
> with another unix commands ?)
dd bs=$GZIPBYTE count=1 if=binaries.out of=gzip

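For very large files you might need to keep bs smaller than system RAM and set count to $GZIPBYTE / bs (this is exact only when bs divides $GZIPBYTE evenly; otherwise the remainder has to be copied separately).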

Bill Marcum
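
Both variants can be sketched as follows, assuming the input is a regular file (so each dd read returns a full block). The file names, nbytes (standing in for $GZIPBYTE) and the 512-byte block size are assumptions chosen for the demonstration.

```shell
src=$(mktemp); one_shot=$(mktemp); chunked=$(mktemp)
seq 1 3000 > "$src"        # sample input, larger than nbytes
nbytes=5000                # stand-in for $GZIPBYTE

# Variant 1: one block of exactly nbytes bytes.
dd if="$src" of="$one_shot" bs="$nbytes" count=1 2>/dev/null

# Variant 2: small blocks. The whole blocks are copied first, then the
# remainder; both dd calls share the same open file, so the second one
# continues where the first one stopped.
bs=512
{
  dd bs="$bs" count=$((nbytes / bs)) 2>/dev/null
  if [ $((nbytes % bs)) -gt 0 ]; then
    dd bs=$((nbytes % bs)) count=1 2>/dev/null
  fi
} < "$src" > "$chunked"

one_shot_size=$(wc -c < "$one_shot" | tr -d ' ')
outputs=differ
cmp -s "$one_shot" "$chunked" && outputs=identical

echo "size: $one_shot_size, outputs: $outputs"
rm -f "$src" "$one_shot" "$chunked"
```

When reading from a pipe instead of a file, dd may get short reads, so this sketch should not be relied on there without further care.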

cmd:fold 

Usage 

fold -s -w 132 bigfile | lp

Comments 

Use fmt instead for more advanced controls!

Info 

The fold utility is a filter that will fold lines  from  its
input  files,  breaking the lines to have a maximum of width
column positions (or bytes, if the -b option is  specified).

Help 

-b, --bytes         count bytes rather than columns
-s, --spaces        break at spaces
-w, --width=WIDTH   use WIDTH columns instead of 80
 `-s'
`--spaces'
     Break at word boundaries: the line is broken after the last blank
     before the maximum line length.  If the line contains no such
     blanks, the line is broken at the maximum line length as usual.

Comments 

fold and cut(1) can be used to  create  text  files  out  of
files with arbitrary line lengths.  fold should be used when
the contents of long lines need to be kept contiguous.   cut
should  be  used when the number of lines (or records) needs
to remain constant.

echo "\
Updated ${PKG_INSTALL_ROOT}/etc/inet/services with new netbios and swat \
names and made backup of original ${PKG_INSTALL_ROOT}/etc/inet/services \
as ${PKG_INSTALL_ROOT}/etc/inet/services:presamba." | fold -s -w 60 | \
while read line; do
        echo "postinstall: $line"
done
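
The contrast between fold and cut described above can be checked on a single long line. A minimal sketch; the sample text and variable names are illustrative.

```shell
long_line='aaaa bbbb cccc dddd eeee ffff'

# fold keeps every character and only inserts newlines, so the number
# of lines grows while the content stays intact.
folded_count=$(printf '%s\n' "$long_line" | fold -s -w 10 | wc -l | tr -d ' ')
rejoined=$(printf '%s\n' "$long_line" | fold -s -w 10 | tr -d '\n')

# cut keeps one output line per input line and discards the rest.
cut_count=$(printf '%s\n' "$long_line" | cut -c 1-10 | wc -l | tr -d ' ')
cut_text=$(printf '%s\n' "$long_line" | cut -c 1-10)

echo "fold: $folded_count lines, cut: $cut_count line"
echo "cut kept: [$cut_text]"
```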

documented on: 2002.12.10

cmd:fmt 

Basic Info 

Usage 

fmt -t -c -w 80000
$ file /usr/bin/a2ps | fmt -w 40
/usr/bin/a2ps: ELF 32-bit LSB
executable, Intel 80386, version 1,
dynamically linked (uses shared libs),
stripped

Info 

*Tags*: word wrap, :wordwrap, :formatter

fmt - simple optimal text formatter, Reformat each paragraph in the FILE(s)

Comments 

Help 

Support 

Quick Help 

-c, --crown-margin        preserve indentation of first two lines
-p, --prefix=STRING       combine only lines having STRING as prefix
-s, --split-only          split long lines, but do not refill
-t, --tagged-paragraph    indentation of first line different from second
-u, --uniform-spacing     one space between words, two after sentences
-w, --width=WIDTH         maximum line width (default of 75 columns)

Detail Help 

Trying history 1 

converted by "fmt -t -c -w 80000"

weird paragraph break 

From

      Data standards make sure that the terms people use mean the same thing. The
International Classification of Diseases (ICD) is such an example. Canada is in the
process of upgrading ICD from the old version of ICD-9 to the new version of ICD-10
nationwide (ICD-10, 2005; Healthcare Financial Management Association, 2004). The US
however, falls behind the whole world in adopting the new ICD-10 standard. They are
still using ICD-9. Even their latest research papers focus on the old ICD-9 (Glance,
Laurent, Dick, Andrew, Osler, Turner, & Mukamel, Dana, 2006; Bazarian, Jeffrey, Veazie,
Peter, Mookerjee, Sohug, & Lerner,, 2006; Williams, Charles, Hauser, Kimberlea, Correia,
Jane, & Frias, Jaime, 2005).

to

      Data standards make sure that the terms people use mean the same thing. The International Classification of Diseases (ICD) is such an example. Canada is in the process of upgrading ICD from the old version of ICD-9 to the new version of ICD-10 nationwide (ICD-10, 2005; Healthcare Financial Management Association, 2004). The US however, falls behind the
whole world in adopting the new ICD-10 standard. They are still using ICD-9. Even their
latest research papers focus on the old ICD-9 (Glance, Laurent, Dick, Andrew, Osler, Turner, & Mukamel, Dana, 2006; Bazarian, Jeffrey, Veazie, Peter, Mookerjee, Sohug, & Lerner,, 2006; Williams, Charles, Hauser,
Kimberlea, Correia, Jane, & Frias, Jaime, 2005).

the -t is useful 

From

      In US, although Open Source health care software have been actively developed, for
example OpenEMR (2006), they have not received the adequate attention yet. This is
because of the private and proprietary nature of the US Health industry.
      However, not all institutes or organizations in Canada have fully understood the
damage that private and proprietary bring to the pan-Canadian interoperable EHR
system, even after Infoway has taken the Open Source initiative. For example, Ontario's
ePhysician Project is a pay-per-month web portal software, contracted to GE Healthcare
for 15 years (Hamilton, 2005). The solution is both proprietary and exclusive. The Ontario
government managed to fund $128 million, but that only covers about "10 per cent of
what it would cost" for it to be fully accessible for all physicians (Hamilton, 2005, p. 1).

to

      In US, although Open Source health care software have been actively developed, for example OpenEMR (2006), they have not received the adequate attention yet. This is because of the private and proprietary nature of
the US Health industry.
      However, not all institutes or organizations in Canada have fully understood the damage that private and proprietary bring to the pan-Canadian interoperable EHR system, even after Infoway has taken the Open Source initiative. For example, Ontario's ePhysician Project is a pay-per-month web portal software, contracted to GE Healthcare for 15 years (Hamilton,
2005). The solution is both proprietary and exclusive. The Ontario government managed to fund $128 million, but that only covers about "10 per cent of what it would cost" for it to be fully accessible for all physicians
(Hamilton, 2005, p. 1).

As shown, the -t switch (indentation of first line different from second) works great. Were it not for the weird paragraph break problem, it could be a very good paragraph reformatter.
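
The basic unwrapping idea (joining each paragraph into one long line) can be sketched on a tiny input. This simplified version uses plain fmt without -t or -c and a more modest width than the 80000 used above, since some fmt builds may reject very large widths; the sample text and variable names are made up.

```shell
# Two short paragraphs separated by a blank line, wrapped by hand.
wrapped='alpha beta
gamma delta

epsilon zeta
eta theta'

# With a width far larger than any paragraph, fmt joins the lines of
# each paragraph into one line and keeps the blank line between them.
unwrapped=$(printf '%s\n' "$wrapped" | fmt -w 2000)
line_count=$(printf '%s\n' "$unwrapped" | wc -l | tr -d ' ')

printf '%s\n' "$unwrapped"
echo "lines: $line_count"
```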

par 

Info 

Description 

Par is a paragraph reformatter, similar to the standard Unix fmt filter, but better. It uses a dynamic programming algorithm, which produces much better-looking line breaks than the greedy algorithm used by fmt. It can also deal correctly with a variety of quotation and comment conventions.

Features 

Source 

http://www.cs.berkeley.edu/~amc/Par/

Related Urls 

http://freshmeat.net/projects/par/

cmd:split 

Info 

split - split a file into pieces

Working examples 

Usage 

Usage2 
ls thefile
split -b 500k !$ !$.
split -b 500k !$ !$.split.
!! | xargsi -t split {} ~+1/{}. -d -a 3 -b 500k/10m
# space ok!
Usage1 
 !! | split -l 1000 - tmp.split.
 perl -e '$lmt=30; $pre="tmp.split."; $si="aa"; foreach (1..$lmt){ print "$pre$si\n"; $si++} '

 ls tmp.split.?? | doeach.pl fileh ftt0 @~cat @_@~
 rm tmp.split.??
Usage0 
$ jot 10 | split -l 3 -
$ ls xa? | doeach.pl echo @~cat @_@~

echo `cat xaa`
1 2 3

echo `cat xab`
4 5 6

echo `cat xac`
7 8 9

echo `cat xad`
10

Help 

quick help 
$ split --help
 Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
PREFIX is `x'.  With no INPUT, or when INPUT is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.
man 
-d, --numeric-suffixes
       use numeric suffixes instead of alphabetic
-a, --suffix-length=N
       use suffixes of length N (default 2)
-b, --bytes=SIZE
       put SIZE bytes per output file

enumeration 

jot 30 | split -l 1 - tmp.split.

$ echo tmp.split.* | fold -sw 68
tmp.split.aa tmp.split.ab tmp.split.ac tmp.split.ad tmp.split.ae
tmp.split.af tmp.split.ag tmp.split.ah tmp.split.ai tmp.split.aj
tmp.split.ak tmp.split.al tmp.split.am tmp.split.an tmp.split.ao
tmp.split.ap tmp.split.aq tmp.split.ar tmp.split.as tmp.split.at
tmp.split.au tmp.split.av tmp.split.aw tmp.split.ax tmp.split.ay
tmp.split.az tmp.split.ba tmp.split.bb tmp.split.bc tmp.split.bd

$ perl -e '$lmt=30; $pre="tmp.split."; $si="aa"; foreach (1..$lmt){ print "$pre$si\n"; $si++} ' | xargs | fold -sw 68
tmp.split.aa tmp.split.ab tmp.split.ac tmp.split.ad tmp.split.ae
tmp.split.af tmp.split.ag tmp.split.ah tmp.split.ai tmp.split.aj
tmp.split.ak tmp.split.al tmp.split.am tmp.split.an tmp.split.ao
tmp.split.ap tmp.split.aq tmp.split.ar tmp.split.as tmp.split.at
tmp.split.au tmp.split.av tmp.split.aw tmp.split.ax tmp.split.ay
tmp.split.az tmp.split.ba tmp.split.bb tmp.split.bc tmp.split.bd
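
Because the suffixes sort in creation order, a shell glob is enough to put the pieces back together. A small sketch of the round trip, using seq in place of jot; the work directory and the part. prefix are illustrative.

```shell
workdir=$(mktemp -d)
seq 1 10 > "$workdir/input"

# 3 lines per piece; pieces are named part.aa, part.ab, ...
split -l 3 "$workdir/input" "$workdir/part."
pieces=$(cd "$workdir" && ls part.* | paste -s -d ' ' -)

# The glob expands in sorted order, which matches the split order,
# so cat restores the original file.
cat "$workdir"/part.* > "$workdir/rejoined"
roundtrip=differs
cmp -s "$workdir/input" "$workdir/rejoined" && roundtrip=identical

echo "pieces: $pieces"
echo "round trip: $roundtrip"
rm -rf "$workdir"
```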

documented on: 1999.10.26

cmd:csplit 

Usage 

csplit /tmp/PHoss.log '/^>>*$/' '{*}'
csplit -f chp11_ chap11.lst '/^listing /' '{*}'

Info 

The csplit program splits a file according to context. It's part of the GNU textutils (since merged into GNU coreutils).

Help 

$ csplit --help
 Usage: csplit [OPTION]... FILE PATTERN...
Output pieces of FILE separated by PATTERN(s) to files `xx01', `xx02', ...,
and output byte counts of each piece to standard output.

  -b, --suffix-format=FORMAT use sprintf FORMAT instead of %d
  -f, --prefix=PREFIX        use PREFIX instead of `xx'
  -k, --keep-files           do not remove output files on errors
  -n, --digits=DIGITS        use specified number of digits instead of 2
  -s, --quiet, --silent      do not print counts of output file sizes
  -z, --elide-empty-files    remove empty output files
      --help                 display this help and exit
      --version              output version information and exit

Read standard input if FILE is -.  Each PATTERN may be:

  INTEGER            copy up to but not including specified line number
  /REGEXP/[OFFSET]   copy up to but not including a matching line
  %REGEXP%[OFFSET]   skip to, but not including a matching line
  {INTEGER}          repeat the previous pattern specified number of times
  {*}                repeat the previous pattern as many times as possible

A line OFFSET is a required `+' or `-' followed by a positive integer.
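
The /REGEXP/ and {*} patterns can be tried on a tiny file. A minimal sketch, assuming GNU csplit ({*} is a GNU extension); the sample content, the sec_ prefix and the variable names are invented for the demonstration.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'listing 1' 'int a;' 'listing 2' 'int b;' 'int c;' \
  > "$workdir/all.lst"

# Start a new piece at every line beginning with "listing ".
# -s suppresses the byte counts; -z removes the empty first piece that
# would otherwise appear because line 1 already matches.
csplit -s -z -f "$workdir/sec_" "$workdir/all.lst" '/^listing /' '{*}'

piece_count=$(ls "$workdir"/sec_* | wc -l | tr -d ' ')
first_lines=$(for f in "$workdir"/sec_*; do head -n 1 "$f"; done \
  | paste -s -d ',' -)

echo "pieces: $piece_count"
echo "first lines: $first_lines"
rm -rf "$workdir"
```

Each piece begins with the line that matched, so the matching line itself stays in the output.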

Working History 

Using it to split the tycpp samples. (Teach Yourself C++, http://web30.eppg.com/program/zip/tycpp.zip)

csplit -f chp11_ chap11.lst '/^listing /' '{*}'

Perfect. For this particular case, though, the listing … line at the top of the cc files still needs to be removed.