A Unix Cook-Booklet: Processing files

3. Processing files

3.1 Textual processing (awk, sed, tr)

The command awk (named after its authors Aho, Weinberger, and Kernighan) combines pattern matching with a C-like language to process text files line by line and field by field. For example, to print the second column (by default, columns are whitespace-separated substrings), do


awk <inputfile '{print $2}' >outputfile

The string passed to awk usually consists of pattern-action pairs. To sum the numbers in the first column, ignoring lines starting with '#', do


awk 'BEGIN {s=0} /^#/ {next} /.*/ {s+=$1} END {print s}' <inputfile >outputfile

The special patterns BEGIN and END match before the first input line is read and after the last one has been processed, respectively. The other patterns are regular expressions; see man regexp (or man regex) on most systems.
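As a small worked example of the two awk commands above, the following sketch runs them on three lines of sample input. The sample data and the function names (sample_input, second_column, sum_first_column) are made up for illustration. The sum uses an action without a pattern, which awk applies to every line, so it behaves like the /.*/ pattern above.

```shell
# Hypothetical sample input: one comment line and two data lines.
sample_input() {
    printf '%s\n' '# count name' '3 apples' '4 pears'
}

# Print the second whitespace-separated field of every line.
second_column() {
    awk '{print $2}'
}

# Sum the first field, skipping lines that start with '#'.
# An action with no pattern runs on every remaining line.
sum_first_column() {
    awk 'BEGIN {s=0} /^#/ {next} {s+=$1} END {print s}'
}

sample_input | second_column       # prints count, apples, pears (one per line)
sample_input | sum_first_column    # prints 7
```

Note that the comment line is not skipped by second_column, so its second field (count) is printed as well.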

sed is a stream editor that can, among other things, substitute strings. For example, to rename each .cpp file in the current directory to a .C file, do


for i in *.cpp                # (Bourne-shell)
do
    # "$i" is quoted so that names with spaces survive; $ anchors .cpp to the end of the name
    mv "$i" "`echo "$i" | sed -e 's/\.cpp$/.C/' | tr -d '\012'`"
done

echo writes the current filename (for example myfile.cpp) to stdout for further processing by sed and tr. The general form of the sed substitute command is s/oldstring/newstring/g. Note that oldstring is a regular expression, so the dot is preferably escaped with a backslash; otherwise it matches any character. If the trailing g is given, every match on the line is replaced rather than only the first (not needed in the above example). The final character-translate command (tr) removes the newline character. This is not strictly needed, because the shell strips trailing newlines from the output of a command substitution, but it is a good idiom to learn.
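The effect of the trailing g and of the escaped dot can be checked on a few strings without touching any files. This is only a sketch; the function names rename_first, rename_all and rename_unescaped and the test strings are made up for illustration.

```shell
# Replace only the first match of the regular expression \.cpp with .C.
rename_first() {
    printf '%s\n' "$1" | sed -e 's/\.cpp/.C/'
}

# With the trailing g, every match on the line is replaced.
rename_all() {
    printf '%s\n' "$1" | sed -e 's/\.cpp/.C/g'
}

# Unescaped dot: '.' matches any character, so the leading "xcpp" matches first.
rename_unescaped() {
    printf '%s\n' "$1" | sed -e 's/.cpp/.C/'
}

rename_first a.cpp.cpp        # prints a.C.cpp
rename_all a.cpp.cpp          # prints a.C.C
rename_unescaped xcpp.cpp     # prints .C.cpp
```

The last call shows why the dot should be escaped: the unescaped pattern matches the first four characters of the name instead of the extension.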

The above example works in Bourne-type shells (sh, ksh, bash). In C-shell (csh, tcsh) there is a simpler way, and the syntax of the for-statement is different:


foreach i (*.cpp)             # (C-shell)
mv $i $i:r.C
end

The :r modifier "goes backward" one dot-separated word, that is, it removes the extension. Thus if $i is myfile.cpp, $i:r becomes myfile and $i:r.C is myfile.C.

One can also use basename to strip a known suffix (the 'type' field) from a filename: basename myfile.cpp .cpp prints myfile.
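A minimal sketch of the basename approach follows. The helper name cpp_to_C and the path /home/user/src are hypothetical; basename only manipulates the string, so the path does not need to exist. The $(...) form used here is equivalent to the backquotes used earlier.

```shell
# basename removes the directory part and, if a second argument is
# given, that trailing suffix as well.
basename /home/user/src/myfile.cpp .cpp    # prints myfile

# Hypothetical helper: map a .cpp filename to the matching .C filename.
cpp_to_C() {
    echo "$(basename "$1" .cpp).C"
}

cpp_to_C myfile.cpp                        # prints myfile.C
```

Because basename also drops the directory part, this helper is only suitable for files in the current directory, as in the rename loops above.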
