3. Text-Fu
This is the third chapter for learning Linux on Let’s Learn Linux.
Learn basic text manipulation and navigation.
By now, we’ve become familiar with many commands and their output, and that brings us to our next subject: I/O (input/output) streams. Let’s run the following command and then we’ll discuss how it works.
$ echo Hello World > peanuts.txt
What just happened? Well, check the directory where you ran that command and, lo and behold, you should see a file called peanuts.txt. Look inside that file and you should see the text Hello World. A lot happened in one command, so let’s break it down.
First let’s break down the first part:
$ echo Hello World
We know this prints out Hello World to the screen, but how? Processes use I/O streams to receive input and return output. By default, the echo command takes its input (standard input, or stdin) from the keyboard and returns its output (standard output, or stdout) to the screen. That’s why when you type echo Hello World in your shell, you get Hello World on the screen. However, I/O redirection allows us to change this default behavior, giving us greater flexibility.
Let’s proceed to the next part of the command:
>
The > is a redirection operator that allows us to change where standard output goes. It allows us to send the output of echo Hello World to a file instead of the screen. If the file does not already exist, it will be created for us. However, if it does exist, it will be overwritten (many shells have an option to prevent this, as shown below).
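For example, in bash you can turn on the noclobber option, which makes > refuse to overwrite an existing file (this is a bash-specific sketch; other shells have similar options):

$ set -o noclobber
$ echo Hello World > peanuts.txt
bash: peanuts.txt: cannot overwrite existing file

You can turn the option back off with set +o noclobber, and the >| operator will overwrite the file even while noclobber is set.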
And that’s basically how stdout redirection works!
Well, let’s say I didn’t want to overwrite my peanuts.txt. Luckily, there is a redirection operator for that as well: >>
$ echo Hello World >> peanuts.txt
This will append Hello World to the end of the peanuts.txt file. If the file doesn’t already exist, it will be created for us, just like with the > redirector!
Try a couple of commands:
$ ls -l /var/log > myoutput.txt
$ echo Hello World > rm
$ > somefile.txt
In the previous lesson we learned that we can send stdout to different places, such as a file or the screen. Well, there are also different sources of standard input (stdin) we can use. We know that stdin can come from devices like the keyboard, but it can also come from files, the output of other processes, and the terminal. Let’s see an example.
Let’s use the peanuts.txt file in the previous lesson for this example, remember it had the text Hello World in it.
$ cat < peanuts.txt > banana.txt
Just like we had > for stdout redirection, we can use < for stdin redirection.
Normally you pass cat a file as an argument; in this case, we redirected peanuts.txt to be its stdin instead. Then the output of cat (which would be Hello World) gets redirected to another file called banana.txt.
Try out a couple of commands:
$ echo < peanuts.txt > banana.txt
$ ls < peanuts.txt > banana.txt
$ pwd < peanuts.txt > banana.txt
Let’s try something a little different now: let’s list the contents of a directory that doesn’t exist on your system and redirect the output to the peanuts.txt file again.
$ ls /fake/directory > peanuts.txt
What you should see is:
ls: cannot access /fake/directory: No such file or directory
Now you’re probably thinking, shouldn’t that message have been sent to the file? There is actually another I/O stream in play here called standard error (stderr). By default, stderr sends its output to the screen as well, but it’s a completely different stream than stdout, so you’ll need to redirect its output a different way.
Unfortunately, the redirector is not as nice as using < or >, but it’s pretty close. We will have to use file descriptors. A file descriptor is a non-negative number used to access a file or stream. We will go more in depth on this later, but for now know that the file descriptors for stdin, stdout, and stderr are 0, 1, and 2, respectively.
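In fact, the > we have been using all along is just shorthand for 1>, a redirection of file descriptor 1 (stdout). These two commands are equivalent:

$ ls /var/log > myoutput.txt
$ ls /var/log 1> myoutput.txt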
So now if we want to redirect our stderr to the file we can do this:
$ ls /fake/directory 2> peanuts.txt
You should see just the stderr messages in peanuts.txt.
Now what if I wanted to see both stderr and stdout in the peanuts.txt file? It’s possible to do this with file descriptors as well:
$ ls /fake/directory > peanuts.txt 2>&1
This sends the results of ls /fake/directory to the peanuts.txt file and then redirects stderr to stdout via 2>&1. The order of operations here matters: 2>&1 sends stderr to whatever stdout is pointing to at that moment. In this case, stdout is already pointing to the file, so 2>&1 sends stderr to the same file. So if you open up peanuts.txt you should see both stderr and stdout; in our case, the above command only produces stderr output.
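To see why order matters, here is a small sketch of the same command with the redirections swapped:

$ ls /fake/directory 2>&1 > peanuts.txt

At the moment 2>&1 is processed, stdout is still pointing at the screen, so the error message goes to the screen, and only stdout (which is empty here) ends up in peanuts.txt.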
There is a shorter way to redirect both stdout and stderr to a file:
$ ls /fake/directory &> peanuts.txt
Now what if I don’t want any of that cruft and want to get rid of stderr messages completely? Well, you can also redirect output to a special file called /dev/null, which discards anything written to it.
$ ls /fake/directory 2> /dev/null
What is the following command doing?
$ ls /fake/directory >> /dev/null 2>&1
Let’s get into some plumbing now, not really but kinda. Let’s try a command:
$ ls -la /etc
You should see a very long list of items; it’s actually a little hard to read. Instead of redirecting this output to a file, wouldn’t it be nice if we could just view the output in another command like less? Well, we can!
$ ls -la /etc | less
The pipe operator |, represented by a vertical bar, allows us to take the stdout of one command and make it the stdin of another process. In this case, we took the stdout of ls -la /etc and piped it to the less command. The pipe is extremely useful and we will continue to use it for all eternity.
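Pipes can also be chained together. As a small example, this filters the listing for lines containing conf and then pages through the matches:

$ ls -la /etc | grep conf | less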
Well what if I wanted to write the output of my command to two different streams? That’s possible with the tee command:
$ ls | tee peanuts.txt
You should see the output of ls on your screen and if you open up the peanuts.txt file you should see the same information!
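Like >, tee overwrites the file by default. If you want to append instead, tee has an -a flag:

$ ls | tee -a peanuts.txt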
Try the following command:
$ ls | tee peanuts.txt banana.txt
Run the following command:
$ echo $HOME
You should see the path to your home directory, mine looks like /home/pete.
What about this command?
$ echo $USER
You should see your username!
Where is this information coming from? It’s coming from your environment variables. You can view these by typing
$ env
This outputs a whole lot of information about the environment variables you currently have set. These variables contain useful information that the shell and other processes can use.
Here is a short example:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/bin
PWD=/home/user
USER=pete
One particularly important variable is the PATH variable. You can access these variables by sticking a $ in front of the variable name, like so:
$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/bin
This returns a colon-separated list of paths that your system searches when it runs a command. Let’s say you manually download and install a package from the internet, put it into a non-standard directory, and want to run it. You type $ coolcommand and the prompt says command not found. Well, that’s silly; you’re looking at the binary in a folder and know it exists. What’s happening is that the directory isn’t in your $PATH variable, so the shell doesn’t know to look there and throws an error.
Let’s say you had tons of binaries you wanted to run out of that directory. You can just modify your PATH environment variable to include that directory, as shown below.
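For example, assuming your binaries live in a hypothetical directory called /home/pete/mybin, you could add it to your PATH for the current session like this:

$ export PATH=$PATH:/home/pete/mybin

To make the change permanent, you would add that line to a shell startup file such as ~/.bashrc (if you are using bash).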
What does the following output? Why?
$ echo $HOME
$ env
We’re gonna learn a couple of useful commands that you can use to process text. Before we get started, let’s create a file that we’ll be working with. Copy and paste the following command; once you do that, add a TAB in between lazy and dog (press Ctrl-v and then TAB).
$ echo 'The quick brown; fox jumps over the lazy dog' > sample.txt
The first command we’ll learn about is the cut command. It extracts portions of text from a file.
To extract contents by a list of characters:
$ cut -c 5 sample.txt
This outputs the 5th character in each line of the file. In this case it is “q”; note that a space also counts as a character.
To extract the contents by a field, we’ll need to do a little modification:
$ cut -f 2 sample.txt
The -f or field flag cuts text based on fields; by default it uses TABs as delimiters, so everything separated by a TAB is considered a field. You should see “dog” as your output.
You can combine the field flag with the delimiter flag to extract the contents by a custom delimiter:
$ cut -f 1 -d ";" sample.txt
This changes the delimiter from TAB to “;”, and since we are cutting the first field, the result should be “The quick brown”.
What do the following commands do? Why?
$ cut -c 5-10 sample.txt
$ cut -c 5- sample.txt
$ cut -c -5 sample.txt
The paste command is similar to the cat command; it merges lines of files together. Let’s create a new file with the following contents:
sample2.txt
The
quick
brown
fox
Let’s combine all these lines into one line:
$ paste -s sample2.txt
The default delimiter for paste is TAB, so now there is one line with TABs separating each word.
Let’s change this delimiter (-d) to something a little more readable:
$ paste -d ' ' -s sample2.txt
Now everything should be on one line delimited by spaces.
Try pasting multiple files together. What happens?
Let’s say we have a very long file; in fact, we have many to choose from. Go ahead and cat /var/log/syslog. You should see pages upon pages of text. What if I just wanted to see the first couple of lines in this text file? Well, we can do that with the head command; by default, head will show you the first 10 lines of a file.
$ head /var/log/syslog
You can also modify the line count to whatever you choose, let’s say I wanted to see the first 15 lines instead.
$ head -n 15 /var/log/syslog
The -n flag stands for number of lines.
What does the following command do and why?
$ head -c 15 /var/log/syslog
Similar to the head command, the tail command lets you see the last 10 lines of a file by default.
$ tail /var/log/syslog
As with head, you can change the number of lines you want to see.
$ tail -n 10 /var/log/syslog
Another great option you can use is the -f (follow) flag, which will follow the file as it grows. Give it a try and see what happens.
$ tail -f /var/log/syslog
Your syslog file will be continually changing while you interact with your system and using tail -f you can see everything that is getting added to that file.
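One way to see this in action: leave tail -f running and, from a second terminal, write a test message into syslog with the logger command (available on most Linux systems):

$ logger "hello from the other terminal"

You should see the message show up in your tail -f output almost immediately.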
Look at the man page of tail and read about some of the other options we didn’t discuss.
$ man tail
In our lesson on the cut command, our sample.txt file contained a TAB. A TAB normally shows up as a noticeable chunk of whitespace, but some text files don’t display it clearly, and TABs may not be the spacing you want. To change your TABs to spaces, use the expand command.
$ expand sample.txt
The command above will print output with each TAB converted into a group of spaces. To save this output in a file, use output redirection like below.
$ expand sample.txt > result.txt
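By default, each TAB becomes 8 spaces. With GNU expand you can change that with the -t flag, which sets how far apart the tab stops are:

$ expand -t 4 sample.txt > result.txt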
To do the opposite of expand, we can convert each group of spaces back into a TAB with the unexpand command:
$ unexpand -a result.txt
What happens if you just type expand with no file input?
The join command allows you to join multiple files together by a common field:
Let’s say I had two files that I wanted to join together:
file1.txt
1 John
2 Jane
3 Mary
file2.txt
1 Doe
2 Doe
3 Sue
$ join file1.txt file2.txt
1 John Doe
2 Jane Doe
3 Mary Sue
See how it joined together my files? By default, files are joined by the first field, and the fields have to be identical; if they are not, you can sort them first. In this case, the files are joined via the fields 1, 2, 3.
How would we join the following files?
file1.txt
John 1
Jane 2
Mary 3
file2.txt
1 Doe
2 Doe
3 Sue
To join these files, you need to specify which fields you are joining. In this case, we want field 2 of file1.txt and field 1 of file2.txt, so the command would look like this:
$ join -1 2 -2 1 file1.txt file2.txt
1 John Doe
2 Jane Doe
3 Mary Sue
The -1 flag picks the join field for file1.txt and -2 picks it for file2.txt, so -1 2 -2 1 joins on the second field of the first file and the first field of the second. Pretty neat. You can also split a file up into different files with the split command:
$ split somefile
This will split the file into different pieces; by default it splits them every 1000 lines. The resulting files are named xaa, xab, xac, and so on.
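As a small sketch, splitting our earlier four-line sample2.txt into two-line pieces with the -l flag looks like this:

$ split -l 2 sample2.txt
$ ls
sample2.txt  xaa  xab

xaa contains the first two lines and xab the last two.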
Join two files with a different number of lines in each file. What happens?
The sort command is useful for sorting lines.
file1.txt
dog
cow
cat
elephant
bird
$ sort file1.txt
bird
cat
cow
dog
elephant
You can also do a reverse sort:
$ sort -r file1.txt
elephant
dog
cow
cat
bird
And also sort via numerical value:
$ sort -n file1.txt
bird
cat
cow
dog
elephant
Since none of these lines begin with a number, -n has nothing to compare numerically, so the result matches a plain sort; the -n flag matters when lines start with numbers, as in the sketch below.
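To see where -n actually makes a difference, imagine a hypothetical numbers.txt containing the lines 10, 9, and 100:

$ sort numbers.txt
10
100
9
$ sort -n numbers.txt
9
10
100

A plain sort compares the lines character by character, so 9 sorts after 10 and 100; -n compares them as numbers.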
The real power of sort comes from its ability to be combined with other commands. Try the following command and see what happens:
$ ls /etc | sort -rn
The tr (translate) command allows you to translate a set of characters into another set of characters. Let’s try an example of translating all lower case characters to uppercase characters.
$ tr a-z A-Z
hello
HELLO
As you can see, we translated the range a-z into A-Z, and any lowercase text we type gets uppercased.
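Since tr reads from stdin, you can also feed it a file with the < redirection we learned earlier; this should print the entire contents of sample.txt in uppercase:

$ tr a-z A-Z < sample.txt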
Try the following command. What happens?
$ tr -d ello
hello
The uniq (unique) command is another useful tool for parsing text.
Let’s say you had a file with lots of duplicates:
reading.txt
book
book
paper
paper
article
article
magazine
And you wanted to remove the duplicates; well, you can use the uniq command:
$ uniq reading.txt
book
paper
article
magazine
Let’s get the count of how many occurrences there are of each line:
$ uniq -c reading.txt
2 book
2 paper
2 article
1 magazine
Let’s just get unique values:
$ uniq -u reading.txt
magazine
Let’s just get duplicate values:
$ uniq -d reading.txt
book
paper
article
Note: uniq does not detect duplicate lines unless they are adjacent. For example:
Let’s say you had a file with duplicates which are not adjacent:
reading.txt
book
paper
book
paper
article
magazine
article
$ uniq reading.txt
book
paper
book
paper
article
magazine
article
The result returned by uniq contains all the entries, unlike the very first example.
To overcome this limitation of uniq we can use sort in combination with uniq:
$ sort reading.txt | uniq
article
book
magazine
paper
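As a side note, sort can do this de-duplication itself with the -u flag, so this one command gives the same result:

$ sort -u reading.txt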
What result would you get if you tried uniq -uc?
The wc (word count) command shows the counts of lines, words, and bytes in a file.
$ wc /etc/passwd
96 265 5925 /etc/passwd
The three numbers are the number of lines, number of words, and number of bytes, respectively.
To see just the count of a certain field, use the -l, -w, or -c flag, respectively.
$ wc -l /etc/passwd
96 /etc/passwd
Another command you can use to number the lines of a file is the nl (number lines) command.
file1.txt
i
like
turtles
$ nl file1.txt
     1  i
     2  like
     3  turtles
How would you get the total count of lines using the nl command, without searching through the entire output? Hint: use some of the other commands you learned in this course.
The grep command is quite possibly the most common text processing command you will use. It allows you to search files for text that matches a certain pattern. What if you wanted to know if a file existed in a certain directory, or whether a string was found in a file? You certainly wouldn’t dig through every line of text yourself; you would use grep!
Let’s use our sample.txt file as an example:
$ grep fox sample.txt
You should see that grep found fox in the sample.txt file.
You can also make grep searches case insensitive with the -i flag:
$ grep -i somepattern somefile
To get even more flexible with grep you can combine it with other commands with |.
$ env | grep -i User
As you can see grep is pretty versatile. You can even use regular expressions in your pattern:
$ ls /somedir | grep '\.txt$'
This should return all files ending with .txt in somedir (the backslash escapes the dot, since a bare . matches any character in a regular expression).
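Another handy regular expression character is ^, which matches the beginning of a line. For example, this should match our sample.txt, since its line starts with The:

$ grep '^The' sample.txt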
You may have heard of egrep or fgrep; these are deprecated grep invocations that have since been replaced by grep -E and grep -F. Read the grep man page to learn more.