- Tools for Extracting Text
- Tools for Analyzing Text
- Tools for Manipulating Text
Tools for Extracting Text
- File Contents:
- File Excerpts:
- Extract by Keyword:
- Extract by Column or Field:
1. Viewing File Contents
cat: Dump one or more files to STDOUT
- The cat command is most useful for viewing short files
- Multiple files are concatenated together
-A: Show all characters, including control characters and non-printing characters
-s: Squeeze multiple adjacent blank lines into a single blank line
-b: Number each non-blank line of output
NOTE!: If you dump the contents of a binary file to a terminal with cat, the terminal may become garbled and unusable. You can use the reset command to clean it up and carry on. When you type reset, it may not be echoed correctly.
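The cat options above can be tried on a throwaway file. This is a minimal sketch, assuming GNU cat (the -A option is not part of POSIX); the file name notes.txt and its contents are made up for illustration.

```shell
# Work in a scratch directory so nothing real is touched.
tmpdir=$(mktemp -d)
printf 'first\n\n\n\nsecond\tline\n' > "$tmpdir/notes.txt"

# -s squeezes the three adjacent blank lines into one.
squeezed=$(cat -s "$tmpdir/notes.txt")
printf '%s\n' "$squeezed"

# -b numbers only the non-blank lines.
cat -b "$tmpdir/notes.txt"

# -A makes the tab (shown as ^I) and each line end (shown as $) visible.
visible=$(cat -A "$tmpdir/notes.txt")
printf '%s\n' "$visible"

rm -rf "$tmpdir"
```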
less: View a file or STDIN one page at a time.
- The less command is more useful for viewing larger files.
Navigating Text with less
Space: Moves ahead one full screen
b: Moves back one full screen
Enter: Moves ahead one line
k: Moves back one line
g: Moves to the top of the file
G: Moves to the bottom of the file
/text: Searches for text
n: Repeats the last search
N: Repeats the last search, but in the opposite direction
v: Opens the file in an editor (vi by default)
2. Viewing File Excerpts
head: Display the first 10 lines of a file
-n: Specify the number of lines to display
tail: Display the last 10 lines of a file
-n: Specify the number of lines to display
-f: Follow subsequent additions to the file
- Continue to display the file in REAL TIME
- Very useful for monitoring log files!
- System Administrators use this feature to keep an eye on the system log.
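As a small illustration of head and tail, the sketch below builds a hypothetical 20-line file named app.log (not a real system log) and pulls lines from both ends. tail -f is shown only as a comment because it keeps running until interrupted.

```shell
tmpdir=$(mktemp -d)
seq 1 20 > "$tmpdir/app.log"   # hypothetical log: one number per line

# -n overrides the default of 10 lines.
first_three=$(head -n 3 "$tmpdir/app.log")
last_two=$(tail -n 2 "$tmpdir/app.log")
printf '%s\n' "$first_three" "$last_two"

# To watch a growing log in real time, run this in a terminal
# and stop it with Ctrl-C:
#   tail -f "$tmpdir/app.log"

rm -rf "$tmpdir"
```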
3. Extracting Text by Keyword - grep command
grep: Prints lines of files or STDIN where the pattern is matched
- Patterns may contain regular expression metacharacters, so it is considered good practice to always quote your regular expressions.
-i: Search case insensitively
-n: Print line numbers of matches
-v: Print lines that do not contain the pattern
-Ax: Include x lines after each match
-Bx: Include x lines before each match
-r: Recursively search a directory
-c: Counts the number of lines where the pattern matched
-l: Only return the names of files that have at least one line containing the pattern
--color=auto: Highlight the match in color
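The grep options listed above might be combined as follows. The file services.txt and its contents are invented for this sketch; note that every pattern is quoted.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/services.txt" <<'EOF'
sshd running
httpd stopped
SSHD-agent running
crond running
EOF

# -i ignores case, -n prefixes each match with its line number.
matches=$(grep -in 'sshd' "$tmpdir/services.txt")
printf '%s\n' "$matches"

# -v inverts the match; -c counts matching lines instead of printing them.
not_running=$(grep -v 'running$' "$tmpdir/services.txt")
running_count=$(grep -c 'running$' "$tmpdir/services.txt")
printf '%s\n%s\n' "$not_running" "$running_count"

# -r searches a directory recursively; -l prints only the matching file names.
files=$(grep -rl 'httpd' "$tmpdir")
printf '%s\n' "$files"

rm -rf "$tmpdir"
```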
4. Extract by Column or Field - cut command
cut: Display specific columns of file or STDIN
-d: Specify the column delimiter (Default is TAB)
-f: Specify the column to print
-c: Cut by characters
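A short sketch of cut on colon-separated data. The two records below are made up and only imitate the layout of /etc/passwd.

```shell
# Two invented records in a colon-separated layout.
records='root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/zsh'

# -d sets the delimiter, -f picks fields 1 and 7 (login name and shell).
name_shell=$(printf '%s\n' "$records" | cut -d: -f1,7)
printf '%s\n' "$name_shell"

# -c cuts by character position instead: the first four characters of each line.
prefix=$(printf '%s\n' "$records" | cut -c1-4)
printf '%s\n' "$prefix"
```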
Tools for Analyzing Text
- Text Stats:
- Sorting Text:
- Comparing Files:
- Spell Check:
wc: Counts the number of lines, words, bytes and/or characters in a file or STDIN.
- On traditional UNIX systems, every character in a text file took up exactly 1 byte.
- However, with the advent of internationalization and larger character sets like Unicode some characters can take up to 4 bytes.
-l: Only for line count
-w: Only for word count
-c: Only for byte count (this equals the character count only when every character is a single byte)
-m: Get an accurate character count
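The difference between bytes and characters can be seen with a single accented letter. This sketch writes the UTF-8 bytes explicitly with octal escapes so that it does not depend on how the script itself is encoded; the result of -m depends on the active locale, so it is printed but not relied upon.

```shell
# "caf" + e-acute + newline: the accented letter is stored as two bytes in UTF-8.
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"

lines=$(wc -l < "$sample")
words=$(wc -w < "$sample")
bytes=$(wc -c < "$sample")
printf 'lines=%s words=%s bytes=%s\n' "$lines" "$words" "$bytes"

# -m counts characters; in a UTF-8 locale it should report one fewer
# than -c for this file, while in the C locale the two agree.
wc -m < "$sample"

rm -f "$sample"
```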
sort: Sorts text to STDOUT; the original file is unchanged
-r: Perform a Reverse (Descending) sort
-n: Perform a Numerical sort
-f: Ignore (fold) the case of characters in strings
-u: Unique (Remove duplicate lines in output)
-t: Specify the column delimiter
-k: Specify the column (key) to sort by
NOTE!: The argument to the -k option can be two numbers separated by a dot. In this case, the number before the dot is the field number, and the number after the dot is the character within that field at which to begin the sort.
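The sort options above, including the dotted form of -k, can be sketched on a few invented name:score records:

```shell
# Made-up name:score records, with one duplicate line.
scores='carol:9
alice:10
bob:9
alice:10'

# -t sets the delimiter; -k2,2n sorts numerically on the second field only.
by_score=$(printf '%s\n' "$scores" | sort -t: -k2,2n)
printf '%s\n\n' "$by_score"

# -u drops duplicate lines from the sorted output.
unique=$(printf '%s\n' "$scores" | sort -u)
printf '%s\n\n' "$unique"

# -k1.2 starts the sort key at the second character of field 1,
# so "carol" is ordered by "arol", "alice" by "lice", "bob" by "ob".
by_second_char=$(printf '%s\n' "$scores" | sort -t: -k1.2)
printf '%s\n' "$by_second_char"
```

Without -n, the scores would be compared as text, and 10 would sort before 9.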
Eliminating Duplicate Lines
sort -u: Removes duplicate lines from input
uniq: Removes duplicate adjacent lines from input
- To remove all duplicate lines from a file, the input to uniq must first be sorted.
-c: Produce a frequency listing (count the number of occurrences). Each line is prefixed with a number indicating how many times it appears in the input.
-d: Print one copy of the lines that are repeated in the input.
-u: Output only the lines that are truly unique (occurring only once in the input).
-f n: Skip the first n fields of each line when comparing.
-s n: Skip the first n characters of each line when comparing.
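Because uniq only looks at adjacent lines, it is usually fed from sort. A minimal sketch with an invented list of shell names:

```shell
# Made-up input: one shell name per line, unsorted.
shells='bash
zsh
bash
sh
bash
zsh'

# Sort first so that identical lines become adjacent, then count them.
counts=$(printf '%s\n' "$shells" | sort | uniq -c)
printf '%s\n' "$counts"

# -d prints one copy of each line that is repeated.
repeated=$(printf '%s\n' "$shells" | sort | uniq -d)

# -u prints only the lines that occur exactly once.
once=$(printf '%s\n' "$shells" | sort | uniq -u)
printf 'repeated: %s\nonce: %s\n' "$(printf '%s' "$repeated" | tr '\n' ' ')" "$once"
```

Piping the output of uniq -c into sort -rn is a common way to rank lines by frequency.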
diff: Compare two files for differences.
gvimdiff: For a graphical diff; provided by the vim-X11 package.
- Suppose a service on station1 is malfunctioning but the same service works on station2.
- Thanks to diff and the use of simple, text-based configuration files, we can easily compare the working and non-working configurations.
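The station1/station2 comparison could look like the sketch below. The file contents (Port, Hostname, Timeout) are invented for illustration; only the two file names come from the scenario described here.

```shell
tmpdir=$(mktemp -d)
# Hypothetical configuration files from the two stations.
printf 'Port=80\nHostname=station1\nTimeout=30\n' > "$tmpdir/file.conf-station1"
printf 'Port=80\nHostname=station2\nTimeout=60\n' > "$tmpdir/file.conf-station2"

# diff exits with status 0 when the files match and 1 when they differ,
# so the status is captured instead of being treated as an error.
status=0
changes=$(diff "$tmpdir/file.conf-station1" "$tmpdir/file.conf-station2") || status=$?
printf '%s\n' "$changes"
printf 'exit status: %s\n' "$status"

rm -rf "$tmpdir"
```

Lines starting with < come from the first file and lines starting with > come from the second.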
Duplicating File Changes
diff -u: Unified diff (an alternate way of displaying the same information); best for the patch utility.
patch: Duplicate changes in other files (use with care!)
-b: Automatically backup changed files.
- To use patch, simply store the output of diff -u in a file.
- Then apply that file with patch, which makes file.conf-station1 look like file.conf-station2.
- Do you actually want all of the changes in the diff to be made?
- It would be advisable to first edit the stored diff and remove the two lines describing the Hostname variable, since those should remain different between systems.
- In case anything terrible happens, patch -b automatically creates a backup of each file it changes.
- backups are given the
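The diff -u and patch workflow might be sketched as follows. The file contents and the name changes.patch are assumptions made for this example, and the Hostname lines are simply left out of the sample files rather than edited out of the diff. patch is not always installed, so the sketch checks for it first.

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Hypothetical configuration files that differ in one setting.
printf 'Port=80\nTimeout=30\n' > file.conf-station1
printf 'Port=80\nTimeout=60\n' > file.conf-station2

# Store the unified diff in a file (diff exits 1 when the files differ).
diff -u file.conf-station1 file.conf-station2 > changes.patch || [ "$?" -eq 1 ]

# Review or edit changes.patch at this point, then apply it.
# -b keeps a backup of the file that gets changed.
if command -v patch >/dev/null 2>&1; then
    patch -b file.conf-station1 changes.patch
fi

# List the directory to see the patched file and its backup.
ls "$tmpdir"
```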
aspell: An interactive spell checker.
- It offers suggestions for corrections via a simple menu-driven interface.
Tools for Manipulating Text
tr: Translate (Alter) Characters.
- Only reads data from STDIN.
- Converts characters in one set to corresponding characters in another set.
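Since tr takes no file arguments, input has to arrive through a pipe or a redirect. A minimal sketch with made-up input strings:

```shell
# Each character in the first set is replaced by the character
# at the same position in the second set.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')
printf '%s\n' "$upper"

# Single-character sets work the same way: every colon becomes a space.
spaced=$(printf 'root:x:0\n' | tr ':' ' ')
printf '%s\n' "$spaced"
```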
sed: Stream Editor
- Performs search/replace operations on a stream of data
As with grep, it is considered good practice to always quote sed’s search/replace string
- By default, sed makes at most one change per line
If you want to make multiple changes per line, append g (global) to the end of the search/replace pattern.
- sed searches are case-sensitive
If you want to search case-insensitively, append i (case insensitive) to the pattern.
- By default, sed operates on every line of the file
An address can be supplied to limit sed to particular lines.
- Normally does not alter the source file
Use -i.bak to back up and alter the source file.
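The sed behaviours described above can be compared side by side. This sketch assumes GNU sed (the i flag on a substitution is a GNU extension) and uses an invented file named pets.txt.

```shell
tmpdir=$(mktemp -d)
printf 'cat Cat cat\ndog cat\n' > "$tmpdir/pets.txt"

# Default: only the first match on each line is replaced.
first_only=$(sed 's/cat/bird/' "$tmpdir/pets.txt")

# g replaces every match on the line.
all=$(sed 's/cat/bird/g' "$tmpdir/pets.txt")

# Adding i also ignores case (GNU sed extension).
any_case=$(sed 's/cat/bird/gi' "$tmpdir/pets.txt")

# An address limits the command to certain lines: here, line 2 only.
line_two=$(sed '2s/cat/bird/' "$tmpdir/pets.txt")
printf '%s\n\n' "$first_only" "$all" "$any_case" "$line_two"

# -i.bak edits the file in place and keeps the original as pets.txt.bak.
sed -i.bak 's/dog/fox/' "$tmpdir/pets.txt"
edited=$(cat "$tmpdir/pets.txt")
backup=$(cat "$tmpdir/pets.txt.bak")

rm -rf "$tmpdir"
```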
regex: Regular Expressions
- For more details, see man 7 regex
/-----------------------------------------------------------------------\
| Metacharacter | Meaning                                               |
|-----------------------------------------------------------------------|
| ^             | Beginning of line                                     |
| $             | End of line                                           |
| [xyz]         | A character that is x, y or z                         |
| [^xyz]        | A character that is not x, y or z                     |
\-----------------------------------------------------------------------/
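Each metacharacter in the table can be tried with grep on a short, made-up word list:

```shell
# Invented word list, one word per line.
words='cat
bat
rat
cart
scat'

# ^ anchors the match at the beginning of the line.
starts_c=$(printf '%s\n' "$words" | grep '^c')

# $ anchors the match at the end of the line.
ends_rt=$(printf '%s\n' "$words" | grep 'rt$')

# [cb] matches a single character that is c or b.
in_set=$(printf '%s\n' "$words" | grep '^[cb]at$')

# [^cb] matches a single character that is neither c nor b.
not_in_set=$(printf '%s\n' "$words" | grep '^[^cb]at$')

printf '%s\n\n' "$starts_c" "$ends_rt" "$in_set" "$not_in_set"
```

The patterns are single-quoted so that the shell passes ^, $ and the brackets to grep unchanged.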