@jfarmer
Last active February 15, 2024 19:23
A demonstration of the power of the command line

The Power Of The Command Line

If you all want to learn more about the command line, one thing we didn't really touch on is input/output redirection, which is ultimately what makes the command line so powerful. On the command line, you can take the output of one program and feed it as input to another program.

The power of the command line lies in its composability. The command line has general mechanisms for taking existing programs and combining them to produce new programs.

Think of this as a system-wide API that you get "for free" by using the command line. You can chain a sequence of programs together, each one feeding its output as the input to the next one in the chain, to produce a "new" program. None of the programs involved need to know they're being used in this way.
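As a minimal sketch of that chaining, here is a toy pipeline on made-up input. The fruit names and the first_sorted function name are illustrative and not part of the original program. Neither sort nor head knows about the other; the pipe is all that connects them.

```shell
# Hypothetical helper: a "new" program built from two existing ones.
first_sorted() {
  sort | head -n 1
}

# sort orders the lines, head keeps only the first one.
printf 'pear\napple\nfig\n' | first_sorted
```

This prints apple, the alphabetically first line of the input.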

This kind of "composability" is much harder with a GUI, so programs tend to be monolithic and only interact with other programs in pre-defined, pre-sanctioned ways.

Below is a program that prints out the 40 most frequently used words in Moby Dick. You can change the URL passed to curl to analyze a different text, and the number passed to head -n to change how many lines are displayed. Project Gutenberg is a great source of books in the public domain (and therefore free to use).

The Program

Click here for a copy of this program outside of the README, or look at the following:

curl -s https://www.gutenberg.org/files/2701/2701-0.txt | tr -c -s A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -r -n | head -n 40
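One detail worth seeing in isolation is why sort comes before uniq -c: uniq only collapses duplicate lines that sit next to each other. The sketch below uses made-up three-line input rather than the book.

```shell
# Made-up sample input: "b" appears twice, but not on adjacent lines.

# Without sort, uniq sees no adjacent duplicates: three lines, each counted once.
printf 'b\na\nb\n' | uniq -c

# With sort, the two "b" lines end up adjacent and are counted together.
printf 'b\na\nb\n' | sort | uniq -c
```

The exact column padding in front of each count varies between uniq implementations.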

The Program, Commented

Click here for a copy of the commented version of the program outside of the README, or look at the following:

curl -s https://www.gutenberg.org/files/2701/2701-0.txt | # Print out the contents of the URL
tr -c -s A-Za-z '\n' |                                    # TRanslate multiple non-alphabetic characters into a new line
tr A-Z a-z |                                              # TRanslate all uppercase letters to lowercase letters
sort |                                                    # Sort the output (alphabetically from A to Z)
uniq -c |                                                 # Remove duplicates, prepending each line with a count of duplicates
sort -r -n |                                              # Sort the output (numerically from largest to smallest)
head -n 40                                                # Print out the first 40 lines
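If you want to try the counting steps without downloading anything, one option is to wrap everything after curl in a shell function. This is only a sketch: the name top_words and the sample sentence are assumptions made for illustration, not part of the original program.

```shell
# Hypothetical wrapper around the pipeline above, minus curl:
# reads text on standard input and prints the N most frequent words (default 40).
top_words() {
  tr -c -s A-Za-z '\n' |
    tr A-Z a-z |
    sort |
    uniq -c |
    sort -r -n |
    head -n "${1:-40}"
}

# Made-up sample text instead of a download; ask for the top 2 words.
printf 'The cat and the dog and THE bird\n' | top_words 2
```

Here "the" is counted 3 times and "and" twice, so those are the two lines printed. To run it on the book, pipe the curl command from the original program into top_words instead of the printf.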

Small, Single-Purpose Programs

One thing you might notice is that we're combining many small, single-purpose programs. The tr command, for example, does nothing more than replace characters in one group with characters in another group. On its own we might think "That's not very useful", but its simplicity and single-purpose nature mean it can be combined with many other programs. That's where its usefulness lives.
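To see tr on its own, here are the two invocations from the pipeline applied to a single sample line (the sample sentence is chosen here for illustration).

```shell
# Replace each uppercase letter with its lowercase counterpart.
printf 'Call me Ishmael.\n' | tr A-Z a-z

# -c selects the complement of A-Za-z (everything that is not a letter);
# -s squeezes each run of replaced characters into a single newline.
printf 'Call me Ishmael.\n' | tr -c -s A-Za-z '\n'
```

The first command prints the line in lowercase with its punctuation intact; the second prints Call, me, and Ishmael on separate lines.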

None of these programs were written with the idea of printing the 40 most common words in mind. The first versions of most of these programs were created before the internet existed. The only exception is curl, which is a program used to make web requests via the command line.

Even then, curl doesn't "know" that its output might be passed specifically to tr.

Stopwords, Shmopwords

Words like "the", "of", "in", "and", etc. are called stop words. Here are the 10 most frequent words in Moby Dick:

14715 the
6742 of
6517 and
4805 a
4707 to
4241 in
3100 that
2536 it
2532 his
2127 i

Of course the word "the" is the most common word in Moby Dick. It's probably the most common word in any piece of English text. When trying to analyze text we sometimes want to filter out these stop words.

I found this list of stop words. Let's create a local copy:

curl -s https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c > stop-words

Like |, the > symbol is another way of redirecting output. While | passes output from one program to another, > funnels / redirects the output of a program to a file. So after running the above command we now have a file called stop-words that contains the contents of the URL (fetched via curl).
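A small sketch of the difference, using a temporary file and made-up input so nothing is downloaded or left behind (the variable name tmp is arbitrary). It also uses <, which reads a program's input from a file.

```shell
# Temporary file so the sketch cleans up after itself.
tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT

# > sends the output to a file, so nothing appears on the screen.
printf 'the\nof\nand\n' > "$tmp"

# < feeds the file back in as input to sort.
sort < "$tmp"

# | hands the output straight to sort, with no file in between; same result.
printf 'the\nof\nand\n' | sort
```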

Now let's add a step to our pipeline that filters out all the stop-words:

curl -s https://www.gutenberg.org/files/2701/2701-0.txt | tr -c -s A-Za-z '\n' | tr A-Z a-z | sort | fgrep -v -f stop-words | uniq -c | sort -r -n | head -n 40

Rather than piping the output of sort to uniq -c in order to generate the count of each distinct word, we now pass the output of sort to the following command:

fgrep -v -f stop-words

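The fgrep command reads its input and prints every line that contains ANY of the fixed strings listed in stop-words. The -v argument tells fgrep to invert the match, so that it prints only the lines that DO NOT contain any of them. The net effect is to first sort all the words in Moby Dick, remove the stop words, and then use uniq -c to count what remains. One caveat: fgrep matches a string anywhere in a line, so a one-letter entry in the list (such as "a" or "i", if present) also removes every word that contains that letter, "whale" included. Adding -x, which matches whole lines only, limits the filter to exact stop words.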
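Because fgrep (equivalent to grep -F) looks for each pattern anywhere in a line, it is worth comparing it with whole-line matching via -x. The sketch below uses a made-up three-entry stop list and a made-up word list; the real stop-words file and the book's words are not touched.

```shell
# Made-up stop list, written to a temporary file.
stops=$(mktemp)
trap 'rm -f "$stops"' EXIT
printf 'a\ni\nthe\n' > "$stops"

# Made-up word list standing in for the sorted words of the book.
words() { printf 'a\ni\nthe\nwhale\nship\ndeck\n'; }

# Substring matching: "whale" contains "a" and "ship" contains "i",
# so both are dropped along with the stop words themselves.
words | grep -F -v -f "$stops"

# Whole-line matching (-x): only lines that equal a stop word are dropped.
words | grep -F -x -v -f "$stops"
```

The first command prints only deck; the second prints whale, ship, and deck.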

The new Moby Dick Top 40 looks much more...er...nautical:

 473 ye
 452 old
 432 would
 253 queequeg
 247 round
 224 much
 217 could
 216 good
 206 never
 206 ever
 205 look
 199 deck
 194 go
 193 even
 178 pequod
 157 oh
 153 god
 141 crew
 136 full
 118 found
 115 ll
 110 body
 103 end
 102 new
  99 hold
  91 leg
  88 eye
  79 ere
  77 d
  76 open
  73 peleg
  71 keep
  70 deep
  65 boy
  64 looked
  63 mr
  62 blood
  60 rope
  60 book
  57 room

Queequeg is the name of an important character in Moby Dick and also the name of Dana Scully's dog in The X-Files (woof woof).

Screenshot

Visit https://share.getcloudapp.com/Z4ukZlP8 or take a gander:

40 most frequent words in Moby Dick
