File/Text Processing Tools

DebianLinux.Net :: Text Management

http://debianlinux.net/text_management.html

Table of Contents

Text Management Portals
Unicode Text Tools
Web based Text Editors
Collaborative Text Editors
Screen Text Editors
Stream Text Editors
Screen XML Editors
Stream XML Editors
Screen HTML Editors
Stream HTML Editors
Binary Editors
Text Comparison
Text Conversion
TypeSetting & PostScript Tools
Text Synthesis & Recognition

documented on: 2006.06.10

File cutting

File cutting summary

line oriented

to keep the first 10 lines of "file":
```
head -10 file
```
to skip the first 10 lines of "file":
```
tail -n +11 file
```
to skip the last 10 lines of "file":
```
head -n -10 file
```
to keep the last 10 lines of "file":
```
tail -10 file
```
to print lines 11-20 of a file:
```
sed -e 1,10d -e 20q file
```
to cut lines by criteria, use 'grep', 'grep -v' or 'sed'.
to cut a file into pieces, use split
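
The line-oriented recipes above can be checked on numbered input. This is a small sketch, assuming a scratch file of 30 numbered lines made with seq; it uses the `-n` spellings of the head and tail options.

```shell
# Scratch input: 30 lines, each holding its own line number.
tmp=$(mktemp)
seq 1 30 > "$tmp"

first10=$(head -n 10 "$tmp")          # lines 1-10
from11=$(tail -n +11 "$tmp")          # lines 11-30
last10=$(tail -n 10 "$tmp")           # lines 21-30
mid=$(sed -e 1,10d -e 20q "$tmp")     # lines 11-20

# Show the middle slice on one line.
echo $mid
rm -f "$tmp"
```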

byte oriented

to keep the first 50 MB:
```
head -c 50m
```

to skip the first several bytes using dd

$ seq 5 | dd ibs=1 skip=6
4
5
4+0 records in
0+1 records out

to skip the last 50 MB (a negative count needs GNU head):
```
head -c -50m
```
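
The byte-oriented cuts can be tried on the same `seq 5` input used in the dd example. This sketch assumes `tail -c +N` (start output at byte N) as an alternative to dd for skipping leading bytes, and GNU head for the negative count.

```shell
# seq 5 emits "1\n2\n3\n4\n5\n", 10 bytes in total.
skip6=$(seq 5 | tail -c +7)     # skip the first 6 bytes, output starts at byte 7
keep4=$(seq 5 | head -c 4)      # keep the first 4 bytes
drop4=$(seq 5 | head -c -4)     # GNU head only: drop the last 4 bytes

printf '%s\n' "$skip6"
```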

column oriented

use cut.

documented on: 2007.04.11

cmd:head

head

head -c 50m

Help

-n, --lines=[-]N         print the first N lines instead of the first 10;
                           with the leading `-', print all but the last
                           N lines of each file
-c, --bytes=[-]N         print the first N bytes of each file;
                           with the leading `-', print all but the last
                           N bytes of each file

SIZE  may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.

head or tail

> > Is there any ready-made tool to print lines from a file *after* a given
> > line number?

Well, tail does that just fine; to skip the first 10 lines of "file":

tail +11 file

Another option is sed:

sed 1,10d file

The sed approach generalizes better; to print lines 11-20 of a file:

sed -e 1,10d -e 20q file

-Ken Pizzini

head or tail

All lines after line 10:

sed -n '11,$ p' <infile

Ken

documented on: 07-19-99

cmd:cut

Usage

ff -l . | cut -c1-12,29-

ls -l | cut -c30-42,56-

head /usr/X11R6/lib/X11/rgb.txt | cut -f 3
# X ls -l | cut -d ' ' -f 1,9 # not working!

Help

-c list The list following -c specifies character positions
         (for instance, -c1-72 would pass the first
         72 characters of each line).

Starting from 1.

-f, --fields field-list
      Print only the fields listed in field-list.  Fields are
      separated by a TAB by default.

-d, --delimiter delim
      For  -f,  fields are separated by the first character in delim
      instead of by TAB.

-d 'x' is normally used together with -f.

choosing the fields

cut -d':' -f 2

$ cut -d: -f1,5 /etc/passwd | head
root:Super-User
daemon:
adm:Admin
lp:Line Printer Admin
smtp:Mail Daemon User
uucp:uucp Admin

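-d ' ' rarely works well: cut treats every single space as a delimiter, so a run of spaces produces empty fields.

It can't imitate awk's field selection, which splits on runs of whitespace. Better to use -c to pick out the character range if you can.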
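
To make the difference concrete, the sketch below runs both tools on one line padded with runs of spaces, the way `ls -l` output usually is. The sample line is made up for illustration.

```shell
line='-rw-r--r--  1 user  group   120 Jan  1 12:00 notes.txt'

# cut: each single space is a delimiter, so the empty fields between
# consecutive spaces are counted and field 9 lands on the size column.
printf '%s\n' "$line" | cut -d ' ' -f 1,9

# awk: fields are split on runs of blanks, so $9 is the file name.
printf '%s\n' "$line" | awk '{ print $1, $9 }'
```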

How to eliminate 1 column

Newsgroups: comp.unix.shell

> > I need to eliminate the second column of a certain file.
>
> Sounds like a job for cut.

Only if the columns are delimited by *exactly one* space. If they look like:

Margolin    Barry    Other stuff
Doherty     John     More other stuff

then cut by itself is useless. You could, however, use sed to collapse all the spaces into a single space and then pipe that to cut.

Barry Margolin
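
The suggestion above can be sketched with the two sample lines from the post. `tr -s ' '` is used here as a stand-in for the sed step (`sed 's/  */ /g'` would do the same job), and `drop_second_column` is a hypothetical helper name.

```shell
# Squeeze runs of spaces to a single space, then keep field 1 and fields 3 onward.
drop_second_column() {
    tr -s ' ' | cut -d ' ' -f 1,3-
}

printf '%s\n' \
    'Margolin    Barry    Other stuff' \
    'Doherty     John     More other stuff' | drop_second_column
# Margolin Other stuff
# Doherty More other stuff
```

Note that the original column alignment is lost, since every run of spaces is squeezed.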

reproduce "head -c" behaviour with dd

Newsgroups:  comp.unix.shell
Date:        Tue, 5 Dec 2006 12:23:46 -0500

> #-- extract gzip for binaries.out
> head -c +$GZIPBYTE binaries.out >gzip 2>$NUL
> [ $? -eq 0 -a -f gzip ] || { NO; EXIT $BINARIES_EXTRACTION_FAILED; }
>
> The problem is that the script need to be portable on Linux and most of
> Unix flavor (HP-UX, AIX, SCO, Solaris, UnixWare) and "head -c" is not
> supported everywhere (not on SCO, UnixWare, Solaris).
>
> Is it possible to reproduce the head -c behaviour with dd command (or
> with another unix commands ?)

dd bs=$GZIPBYTE count=1 if=binaries.out of=gzip

For very large files you might need to make bs <= system RAM and count=$GZIPBYTE / bs

Bill Marcum
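
A minimal sketch of the dd approach wrapped as a function. `first_bytes` is a hypothetical helper name, not part of any package. It uses bs=1 so that count is an exact byte count even when reading from a pipe, at the cost of speed on large inputs; the `bs=$GZIPBYTE count=1` form above is faster for regular files.

```shell
# first_bytes N [FILE]: print the first N bytes of FILE (or of stdin),
# roughly what "head -c N" does where it is available.
first_bytes() {
    n=$1
    if [ $# -ge 2 ]; then
        dd if="$2" bs=1 count="$n" 2>/dev/null
    else
        dd bs=1 count="$n" 2>/dev/null
    fi
}

printf 'abcdefghij' | first_bytes 4; echo    # abcd
```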

cmd:fold

Usage

fold -s -w 132 bigfile | lp

Comments

Use fmt instead for more advanced controls!

Info

The fold utility is a filter that will fold lines  from  its
input  files,  breaking the lines to have a maximum of width
column positions (or bytes, if the -b option is  specified).

Help

-b, --bytes         count bytes rather than columns
-s, --spaces        break at spaces
-w, --width=WIDTH   use WIDTH columns instead of 80

 `-s'
`--spaces'
     Break at word boundaries: the line is broken after the last blank
     before the maximum line length.  If the line contains no such
     blanks, the line is broken at the maximum line length as usual.

Comments

fold and cut(1) can be used to  create  text  files  out  of
files with arbitrary line lengths.  fold should be used when
the contents of long lines need to be kept contiguous.   cut
should  be  used when the number of lines (or records) needs
to remain constant.
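
The contrast can be seen on a single 48-character line. This is an illustrative sketch with made-up text: fold keeps every character and adds lines, while cut keeps one line and drops the rest.

```shell
text='one two three four five six seven eight nine ten'

# fold: nothing is lost, but the one input line becomes several lines
# of at most 20 columns, broken after blanks because of -s.
printf '%s\n' "$text" | fold -s -w 20

# cut: still exactly one line, but everything past column 20 is discarded.
printf '%s\n' "$text" | cut -c 1-20
```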

echo "\
Updated ${PKG_INSTALL_ROOT}/etc/inet/services with new netbios and swat \
names and made backup of original ${PKG_INSTALL_ROOT}/etc/inet/services \
as ${PKG_INSTALL_ROOT}/etc/inet/services:presamba." | fold -s -w 60 | \
while read -r line; do
        echo "postinstall: $line"
done

documented on: 2002.12.10

cmd:fmt

Basic Info

Usage

fmt -t -c -w 80000

$ file /usr/bin/a2ps | fmt -w 40
/usr/bin/a2ps: ELF 32-bit LSB
executable, Intel 80386, version 1,
dynamically linked (uses shared libs),
stripped

Info

*Tags*: word wrap, :wordwrap, :formatter

fmt - simple optimal text formatter, Reformat each paragraph in the FILE(s)

Comments

Help

Support

Quick Help

-c, --crown-margin        preserve indentation of first two lines
-p, --prefix=STRING       combine only lines having STRING as prefix
-s, --split-only          split long lines, but do not refill
-t, --tagged-paragraph    indentation of first line different from second
-u, --uniform-spacing     one space between words, two after sentences
-w, --width=WIDTH         maximum line width (default of 75 columns)

Detail Help

Trying history 1

converted by "fmt -t -c -w 80000"

weird paragraph break

From

      Data standards make sure that the terms people use mean the same thing. The
International Classification of Diseases (ICD) is such an example. Canada is in the
process of upgrading ICD from the old version of ICD-9 to the new version of ICD-10
nationwide (ICD-10, 2005; Healthcare Financial Management Association, 2004). The US
however, falls behind the whole world in adopting the new ICD-10 standard. They are
still using ICD-9. Even their latest research papers focus on the old ICD-9 (Glance,
Laurent, Dick, Andrew, Osler, Turner, & Mukamel, Dana, 2006; Bazarian, Jeffrey, Veazie,
Peter, Mookerjee, Sohug, & Lerner,, 2006; Williams, Charles, Hauser, Kimberlea, Correia,
Jane, & Frias, Jaime, 2005).

To

      Data standards make sure that the terms people use mean the same thing. The International Classification of Diseases (ICD) is such an example. Canada is in the process of upgrading ICD from the old version of ICD-9 to the new version of ICD-10 nationwide (ICD-10, 2005; Healthcare Financial Management Association, 2004). The US however, falls behind the
whole world in adopting the new ICD-10 standard. They are still using ICD-9. Even their
latest research papers focus on the old ICD-9 (Glance, Laurent, Dick, Andrew, Osler, Turner, & Mukamel, Dana, 2006; Bazarian, Jeffrey, Veazie, Peter, Mookerjee, Sohug, & Lerner,, 2006; Williams, Charles, Hauser,
Kimberlea, Correia, Jane, & Frias, Jaime, 2005).

the -t is useful

From

      In US, although Open Source health care software have been actively developed, for
example OpenEMR (2006), they have not received the adequate attention yet. This is
because of the private and proprietary nature of the US Health industry.
      However, not all institutes or organizations in Canada have fully understood the
damage that private and proprietary bring to the pan-Canadian interoperable EHR
system, even after Infoway has taken the Open Source initiative. For example, Ontario's
ePhysician Project is a pay-per-month web portal software, contracted to GE Healthcare
for 15 years (Hamilton, 2005). The solution is both proprietary and exclusive. The Ontario
government managed to fund $128 million, but that only covers about "10 per cent of
what it would cost" for it to be fully accessible for all physicians (Hamilton, 2005, p. 1).

To

      In US, although Open Source health care software have been actively developed, for example OpenEMR (2006), they have not received the adequate attention yet. This is because of the private and proprietary nature of
the US Health industry.
      However, not all institutes or organizations in Canada have fully understood the damage that private and proprietary bring to the pan-Canadian interoperable EHR system, even after Infoway has taken the Open Source initiative. For example, Ontario's ePhysician Project is a pay-per-month web portal software, contracted to GE Healthcare for 15 years (Hamilton,
2005). The solution is both proprietary and exclusive. The Ontario government managed to fund $128 million, but that only covers about "10 per cent of what it would cost" for it to be fully accessible for all physicians
(Hamilton, 2005, p. 1).

As shown, the -t switch (indentation of first line different from second) works well. Were it not for the weird paragraph break problem, fmt could be a very good paragraph reformatter.
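
For comparison, here is a small sketch of plain refilling at a modest width, where fmt treats blank-line-separated blocks as separate paragraphs. The sample text is shortened from the examples above, and `refill` is a hypothetical helper name.

```shell
# refill [WIDTH]: reformat stdin with fmt (default width 40 for this demo).
refill() {
    fmt -w "${1:-40}"
}

sample=$(printf '%s\n' \
    'Data standards make sure that the terms' \
    'people use' \
    'mean the same thing.' \
    '' \
    'The solution is both proprietary and' \
    'exclusive.')

# Ragged lines within each paragraph are joined and re-broken;
# the blank line between the paragraphs is kept.
printf '%s\n' "$sample" | refill 40
```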

par

Info

Description

Par is a paragraph reformatter, similar to the standard Unix fmt filter, but better. It uses a dynamic programming algorithm, which produces much better-looking line breaks than the greedy algorithm used by fmt. It can also deal correctly with a variety of quotation and comment conventions.

Features

Source

http://www.cs.berkeley.edu/~amc/Par/

Related Urls

http://freshmeat.net/projects/par/

cmd:split

Info

split - split a file into pieces

Working examples

Usage

Usage2

ls thefile
split -b 500k !$ !$.
split -b 500k !$ !$.split.
!! | xargsi -t split {} ~+1/{}. -d -a 3 -b 500k/10m
# space ok!

Usage1

 !! | split -l 1000 - tmp.split.
 perl -e '$lmt=30; $pre="tmp.split."; $si="aa"; foreach (1..$lmt){ print "$pre$si\n"; $si++} '

 ls tmp.split.?? | doeach.pl fileh ftt0 @~cat @_@~
 rm tmp.split.??

Usage0

$ jot 10 | split -l 3 -
$ ls xa? | doeach.pl echo @~cat @_@~

echo `cat xaa`
1 2 3

echo `cat xab`
4 5 6

echo `cat xac`
7 8 9

echo `cat xad`
10

Help

quick help

$ split --help
 Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
PREFIX is `x'.  With no INPUT, or when INPUT is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix: b for 512, k for 1K, m for 1 Meg.

man

-d, --numeric-suffixes
       use numeric suffixes instead of alphabetic

-a, --suffix-length=N
       use suffixes of length N (default 2)

-b, --bytes=SIZE
       put SIZE bytes per output file

enumeration

jot 30 | split -l 1 - tmp.split.

$ echo tmp.split.* | fold -sw 68
tmp.split.aa tmp.split.ab tmp.split.ac tmp.split.ad tmp.split.ae
tmp.split.af tmp.split.ag tmp.split.ah tmp.split.ai tmp.split.aj
tmp.split.ak tmp.split.al tmp.split.am tmp.split.an tmp.split.ao
tmp.split.ap tmp.split.aq tmp.split.ar tmp.split.as tmp.split.at
tmp.split.au tmp.split.av tmp.split.aw tmp.split.ax tmp.split.ay
tmp.split.az tmp.split.ba tmp.split.bb tmp.split.bc tmp.split.bd

$ perl -e '$lmt=30; $pre="tmp.split."; $si="aa"; foreach (1..$lmt){ print "$pre$si\n"; $si++} ' | xargs | fold -sw 68
tmp.split.aa tmp.split.ab tmp.split.ac tmp.split.ad tmp.split.ae
tmp.split.af tmp.split.ag tmp.split.ah tmp.split.ai tmp.split.aj
tmp.split.ak tmp.split.al tmp.split.am tmp.split.an tmp.split.ao
tmp.split.ap tmp.split.aq tmp.split.ar tmp.split.as tmp.split.at
tmp.split.au tmp.split.av tmp.split.aw tmp.split.ax tmp.split.ay
tmp.split.az tmp.split.ba tmp.split.bb tmp.split.bc tmp.split.bd
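
Because the suffixes sort in the order split writes them, the pieces can be put back together with a plain cat. A minimal sketch, assuming seq in place of jot and a throwaway directory from mktemp; the `piece.` prefix is illustrative.

```shell
dir=$(mktemp -d)
seq 1 10 > "$dir/input"

# Three lines per piece: piece.aa, piece.ab, piece.ac, piece.ad
split -l 3 "$dir/input" "$dir/piece."
pieces=$(ls "$dir" | grep -c '^piece\.')
echo "$pieces pieces"

# The shell expands piece.* in sorted order, so cat restores the original.
if cat "$dir"/piece.* | cmp -s - "$dir/input"; then
    result="identical"
else
    result="different"
fi
echo "$result"
rm -rf "$dir"
```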

documented on: 1999.10.26

cmd:csplit

Usage

csplit /tmp/PHoss.log '/^>>*$/' '{*}'
csplit -f chp11_ chap11.lst '/^listing /' '{*}'

Info

The csplit program splits a file according to context. It's part of the GNU textutils (since merged into GNU coreutils).

Help

$ csplit --help
 Usage: csplit [OPTION]... FILE PATTERN...
Output pieces of FILE separated by PATTERN(s) to files `xx01', `xx02', ...,
and output byte counts of each piece to standard output.

  -b, --suffix-format=FORMAT use sprintf FORMAT instead of %d
  -f, --prefix=PREFIX        use PREFIX instead of `xx'
  -k, --keep-files           do not remove output files on errors
  -n, --digits=DIGITS        use specified number of digits instead of 2
  -s, --quiet, --silent      do not print counts of output file sizes
  -z, --elide-empty-files    remove empty output files
      --help                 display this help and exit
      --version              output version information and exit

Read standard input if FILE is -.  Each PATTERN may be:

  INTEGER            copy up to but not including specified line number
  /REGEXP/[OFFSET]   copy up to but not including a matching line
  %REGEXP%[OFFSET]   skip to, but not including a matching line
  {INTEGER}          repeat the previous pattern specified number of times
  {*}                repeat the previous pattern as many times as possible

A line OFFSET is a required `+' or `-' followed by a positive integer.

Working History

Using it to split the tycpp samples. (Teach Yourself C++, http://web30.eppg.com/program/zip/tycpp.zip)

csplit -f chp11_ chap11.lst '/^listing /' '{*}'

Perfect. For this particular case, the "listing …" line at the top of the cc files still needs to be removed, though.
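
A sketch of the same split on made-up input, including the leftover cleanup step. The file names and contents are illustrative, not the tycpp samples; -s and -z are the options listed in the help above, and `{*}` is a GNU csplit feature.

```shell
dir=$(mktemp -d)
printf '%s\n' \
    'listing 1' 'int a;' \
    'listing 2' 'int b;' 'int c;' \
    'listing 3' 'int d;' > "$dir/chap.lst"

# -s: no byte counts; -z: drop the empty piece before the first match;
# -f: output prefix. Each piece starts with its "listing ..." line.
csplit -s -z -f "$dir/chp_" "$dir/chap.lst" '/^listing /' '{*}'

# Strip that first line from every piece, writing PIECE.cc next to it.
for f in "$dir"/chp_*; do
    sed 1d "$f" > "$f.cc"
done
ls "$dir"
```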