If you all want to learn more about the command line, one thing we didn't really touch on is input/output redirection, which is ultimately what makes the command line so powerful. On the command line, you can take the output of one program and feed it as input to another program.
The power of the command line lies in its composability. The command line has general mechanisms for taking existing programs and combining them to produce new programs.
Think of this as a system-wide API that you get "for free" by using the command line. You can chain a sequence of programs together, each one feeding its output as the input to the next one in the chain, to produce a "new" program. None of the programs involved need to know they're being used in this way.
This kind of "composability" is much harder with a GUI, so programs tend to be monolithic and only interact with other programs in pre-defined, pre-sanctioned ways.
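For instance, here's a tiny, made-up example of composition: neither ls nor wc knows anything about the other, yet piping them together produces a program that counts the entries in the current directory:

ls | wc -l   # ls prints one entry per line; wc -l counts the lines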
Below is a program that prints out the 40 most frequently used words in Moby Dick. You can change the URL passed to curl to analyze a different text, and the number passed to head -n to change how many lines are displayed. Project Gutenberg is a great source of books in the public domain (and therefore free to use).
Here's the program:
curl -s https://www.gutenberg.org/files/2701/2701-0.txt | tr -c -s A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -r -n | head -n 40
Here's a commented version of the same program:
curl -s https://www.gutenberg.org/files/2701/2701-0.txt | # Print out the contents of the URL
tr -c -s A-Za-z '\n' | # TRanslate runs of non-alphabetic characters into a single newline
tr A-Z a-z | # TRanslate all uppercase letters to lowercase letters
sort | # Sort the output (alphabetically from A to Z)
uniq -c | # Remove duplicates, prepending each line with a count of duplicates
sort -r -n | # Sort the output (numerically from largest to smallest)
head -n 40 # Print out the first 40 lines
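If the counting steps feel opaque, here's the same sort / uniq -c / sort -r -n sequence run on a small, made-up input:

printf 'cat\ndog\ncat\ncat\ndog\nbird\n' | sort | uniq -c | sort -r -n

which prints something like:

   3 cat
   2 dog
   1 bird

Note that uniq -c only collapses adjacent duplicate lines, which is why the words have to be sorted first.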
One thing you might notice is that we're combining many small, single-purpose programs. The tr command, for example, does nothing more than replace characters in one group with characters in another group. On its own we might think "That's not very useful," but its simplicity and single-purpose nature means it can be combined with many other programs. That's where its usefulness lives.
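For example, here's tr all by itself, outside of any pipeline (a throwaway example):

echo 'AHOY THERE' | tr A-Z a-z   # prints: ahoy there

Not much on its own, but as part of the pipeline above it's what normalizes the text so that "Whale" and "whale" count as the same word.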
None of these programs were written with the idea of printing the 40 most common words in mind. The first versions of most of these programs were created before the internet existed. The only exception is curl, which is a program used to make web requests via the command line. Even then, curl doesn't "know" that its output might be passed specifically to tr.
Words like "the", "of", "in", "and", etc. are called stop words. Here are the 10 most frequent words in Moby Dick:
14715 the
6742 of
6517 and
4805 a
4707 to
4241 in
3100 that
2536 it
2532 his
2127 i
Of course the word "the" is the most common word in Moby Dick. It's probably the most common word in any piece of English text. When trying to analyze text we sometimes want to filter out these stop words.
I found this list of stop words. Let's create a local copy:
curl -s https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c > stop-words
Like |, the > symbol is another way of redirecting output. While | passes output from one program to another, > funnels / redirects the output of a program to a file. So after running the above command we now have a file called stop-words that contains the contents of the URL (fetched via curl).
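As a quick, made-up illustration of the difference (greeting is a throwaway file name):

echo 'ahoy' > greeting   # redirect output into a file named greeting
cat greeting             # prints: ahoy

The first command writes into the file instead of the terminal; the second reads the file back out.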
Now let's add a step to our pipeline that filters out all the stop-words:
curl -s https://www.gutenberg.org/files/2701/2701-0.txt | tr -c -s A-Za-z '\n' | tr A-Z a-z | sort | fgrep -v -f stop-words | uniq -c | sort -r -n | head -n 40
Rather than piping the output of sort directly to uniq -c in order to generate the count of each distinct word, we now pass the output of sort to the following command first:

fgrep -v -f stop-words

The fgrep command reads input and prints out all lines that match ANY of the entries in stop-words. The -v argument tells fgrep to invert the match, so that it prints out all lines that DO NOT match any of the entries. (Strictly speaking, fgrep matches fixed strings rather than patterns, and a match can occur anywhere in a line, so a short stop word can also knock out longer words that contain it.) The net effect is to first sort all the words in Moby Dick, remove all the stop words, and then use uniq -c to count what remains.
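Here's a small sketch of that filtering behavior on its own, using throwaway file names (tiny-stop-words is made up for this example; it's not the file we downloaded above):

printf 'the\nof\nand\n' > tiny-stop-words                      # a three-word stop list
printf 'the\nwhale\nof\nsea\n' | fgrep -v -f tiny-stop-words   # keep only non-matching lines

which prints:

whale
sea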
The new Moby Dick Top 40 looks much more...er...nautical:
473 ye
452 old
432 would
253 queequeg
247 round
224 much
217 could
216 good
206 never
206 ever
205 look
199 deck
194 go
193 even
178 pequod
157 oh
153 god
141 crew
136 full
118 found
115 ll
110 body
103 end
102 new
99 hold
91 leg
88 eye
79 ere
77 d
76 open
73 peleg
71 keep
70 deep
65 boy
64 looked
63 mr
62 blood
60 rope
60 book
57 room
Queequeg is the name of an important character in Moby Dick and also the name of Dana Scully's dog in The X-Files (woof woof).
Visit https://share.getcloudapp.com/Z4ukZlP8 to take a gander.