A Few Bits More Advanced

UNIX III Tutorial

This tutorial is the last in a two- or three-part series of UNIX tutorials that I have prepared for the ACM. It is targeted at CSE students who have a solid grasp of UNIX at the most basic levels and who have had some experience with some of its more advanced features.

The purpose of this tutorial is to get you excited and educated about two important features in UNIX: pipes and versatile utilities. We'll first go through a warm up exercise, then we'll march through a series of revisions on a data mining tool. By the end, I hope that we'll have reached our goal.

A warm up

Near the end of the previous tutorial we discussed some tools for disk usage management on the UNIX system. Let's pick up at that point.

The quota program (invoked using /usr/sbin/quota) shows you how much disk space you're using. For my directory, I have:

Disk quotas for user corin (uid 11035): 
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
      /Sucia/u4    2650   60000   80000             295       0       0

On these Alphas, blocks are 1KB, so I see that I'm using a bit less than 3 MB of space out of my total allotment of 60MB.

Now, let's say that, instead of using less than 5% of your quota, you're up to nearly 95% of your quota, and you'd like to find where all your space has gone. A tool that will come in handy here is the du command (short for disk usage). My usage on the IWS machines isn't very interesting, so I'll show you what I see on my GWS machine:

tobiko% du -k
[...]
120     ./142/section
12      ./142/old-RCS
204     ./142/mac-mw
596     ./142
288     ./558/proj1/fig
776     ./558/proj1
16      ./558/proj2/lsys/CVS
448     ./558/proj2/lsys
16      ./558/proj2/smooth/CVS
3600    ./558/proj2/smooth
4052    ./558/proj2
5012    ./558
4       ./tmp
105952  .
tobiko%

Of course, I omitted a lot of output here. Each line shows a directory and the size, in KB, of the files in and below that directory (that's what the -k option flag told du; the default is to report blocks of 512 bytes). The last line shows the total usage of all files in and below the current directory. Here, you see that I'm using not quite 106MB of space.

Now is the time to be introduced to the first major emphasis of this tutorial: the power of pipes. We'll talk about pipes now, and the power will become evident by the end of the tutorial (I hope...). Anyway, let's run du again, but this time, let's see it a page at a time:

du -k | more

We talked about more in the first UNIX tutorial, so this isn't very exciting. What would be neat would be to see our disk usage sorted in order of greatest to least. We can do just that using the sort utility.

du -k | sort -rn | more

The -n flag tells sort to sort the entries numerically. The -r says to output in reverse (descending) order. The first several entries that I see now are:

105952  .
35932   ./archives
23232   ./www
20660   ./archives/research
13332   ./archives/courses
9640    ./557
8264    ./mail
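To see why the -n flag matters, here is a small sketch using three made-up lines in du's format (the sizes are borrowed from above, but the experiment itself is mine). Without -n, sort compares lines as text, one character at a time, so a line starting with 9 lands ahead of a line starting with 1 no matter how long the numbers are.

```shell
# Three sample lines in "size path" form, for illustration only.
sizes='9640 ./557
105952 .
23232 ./www'

# Text sort, reversed: compares character by character, so "9..." comes first.
text_order=$(printf '%s\n' "$sizes" | sort -r)

# Numeric sort, reversed: compares the leading numbers as numbers.
numeric_order=$(printf '%s\n' "$sizes" | sort -rn)

printf 'sort -r:\n%s\n\nsort -rn:\n%s\n' "$text_order" "$numeric_order"
```

Only the second ordering puts the 105952 line on top, which is what we want when hunting for the biggest directories.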

I can immediately see that about a third of my space is dedicated to holding old coursework and research, and a bit under a quarter holds my web pages. The next two entries, however, aren't really useful. My home directory space is tree-like, and I'm really only concerned about finding how the space is distributed one level down -- I don't care about how the archive space is split between research and courses. A solution to this problem is to select only the lines in du's output that contain a string of a certain type -- a regular expression. We'll use the egrep tool (cousin to grep) to do just that:

tobiko% du -k | egrep '\./[^/]*$' | sort -rn | head
35932   ./archives
23232   ./www
9640    ./557
8264    ./mail
6960    ./acm
5872    ./research
5012    ./558
2780    ./elisp
1860    ./.netscape
856     ./sw
tobiko%

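I now see just the top-level directories that are taking up most of the space in my home directory. Note, also, that I'm using the head command to grab just the first 10 lines of output, which is its default. To ask for a different count, use head -n N, which retrieves the first N lines (many systems also accept the older spelling head -N, as in head -5).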
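The pattern '\./[^/]*$' is worth taking apart. The \. matches a literal dot, the / matches the slash right after it, [^/]* matches any run of characters that are not slashes, and $ pins the whole thing to the end of the line. A line therefore survives only if nothing but a single slash-free name follows the "./". Here is a minimal sketch on four sample lines shaped like du's output. I use grep -E, which is the standard spelling of egrep.

```shell
# Sample lines in the same "size<TAB>path" shape that du -k prints.
sample=$(printf '105952\t.\n35932\t./archives\n20660\t./archives/research\n23232\t./www\n')

# Keep only lines that end in "./name" with no further slash.
top_level=$(printf '%s\n' "$sample" | grep -E '\./[^/]*$')

printf '%s\n' "$top_level"
```

The ./archives and ./www lines pass. The ./archives/research line has a second slash, and the total line (just ".") has no slash at all, so both are dropped.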

I think that we're pretty well warmed up now. Let's move on to some more interesting applications, including more fun UNIX utilities and really cool pipelines.

This next section uses data log combing as a motivating example. The data logs that we'll be combing here are the department's web logs for Tuesday, April 14th, 1998, filtered to list (more or less) only the accesses to departmental course web pages.

Yeah, I know, some of you might not get really excited by this sort of motivator. But I think that it's cool. You can learn a lot from web logs. Besides information about what pages are being accessed and when, the web logs also contain info about who's accessing the page, from where they came, how long they stayed at that page, and where else they went into the department. Anyway, I think it's cool, and I am the one who's leading this tutorial. So there! 8)

The departmental web logs aren't usually accessible on the IWS machines, but I've put yesterday's logs in my home directory, in ~corin/acm/log. Don't copy it to your home directory! There's no reason that everyone needs a 5 MB log file hanging around anywhere. If it's easier for you, you can always just make a symbolic link to the file:

cd ~
ln -s ~corin/acm/log

Now, then, let's do some data mining!

Number of hits to web pages

The first thing we'll try is to find out how many hits the department's course web pages had yesterday. Each entry in the log file is a separate hit, so we'll just count the lines. UNIX's wc tool is the best bet here. The -l option asks wc (word count) to report the number of lines in a file (the default, with no option, reports the number of lines, words, and characters).

tobiko% wc -l log
18784 log
tobiko%
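If you want to get a feel for wc's counters before pointing it at a big log, here is a small sketch on a throwaway three-line file (the file and its contents are made up). Reading from standard input with < keeps the file name out of the output, and tr strips the padding that some versions of wc add.

```shell
# A tiny three-line sample file (hypothetical contents).
tmpfile=$(mktemp)
printf 'one two\nthree\nfour five six\n' > "$tmpfile"

lines=$(wc -l < "$tmpfile" | tr -d ' ')   # number of newline characters
words=$(wc -w < "$tmpfile" | tr -d ' ')   # number of whitespace-separated words
bytes=$(wc -c < "$tmpfile" | tr -d ' ')   # number of bytes

echo "lines=$lines words=$words bytes=$bytes"
rm -f "$tmpfile"
```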

Number of hits to HTML pages

Instead of counting all hits, let's just count hits to HTML pages. To filter through just lines that contain the string "html", we'll use grep.

tobiko% grep html log | wc -l
11095
tobiko%

Unfortunately, this method still counts hits to non-HTML pages. In particular, if you look at a single line in the access log, you'll see that there are two URLs listed -- the destination, and the referring URL. If a web page has an inlined image, then that web page is listed as the referring URL for the access of the image. We don't want to count that page twice. What do we do? I'm glad you ask!

List of machine, referrer, and destination

The solution to the dilemma above is to select only certain fields of the access log to consider. For the moment, let's select the host that accessed the page, the page accessed, and the referring URL. The tool that we'll use here is awk.

tobiko% awk '{ print $1 "\t" $11 "\t->\t" $7 }' log | head

orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0003.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0004.html
tide16.microsoft.com "-" -> /education/courses/401/98sp/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
dhcp133i.ee.washington.edu "http://www.cs.washington.edu/education/courses/143/98sp/homework/hw1/solution/index.html" -> /education/courses/143/98sp/homework/hw1/solution/lmatrix.cpp
orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0004.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0005.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/index.html" -> /education/courses/551/CurrentQtr/Papers/paper_9_index.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9/
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/" -> /education/courses/551/CurrentQtr/Messages/paper9/0000.html

tobiko%
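Where do the numbers 1, 7, and 11 come from? By default awk splits each line on whitespace and numbers the pieces $1, $2, and so on. A quick way to find the field you want is to have awk print every field next to its number. The log line below is made up, and I am assuming the common "combined" access-log layout, which is consistent with the fields used above but is not copied from the real log.

```shell
# A made-up log line in the "combined" access-log layout (assumed, for illustration).
line='host.example.edu - - [14/Apr/1998:10:00:00 -0700] "GET /education/courses/401/98sp/ HTTP/1.0" 200 1234 "http://www.example.edu/start.html" "Mozilla/4.0"'

# Print every whitespace-separated field with its number.
printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }'

# Pick out the three fields used above.
host=$(printf '%s\n' "$line" | awk '{ print $1 }')
page=$(printf '%s\n' "$line" | awk '{ print $7 }')
referrer=$(printf '%s\n' "$line" | awk '{ print $11 }')
```

Note that the timestamp and the quoted request each contain spaces, so they are spread over several fields. That is why the page ends up as $7 rather than something smaller.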

Number of hits to HTML pages

Now that we know how to select only the fields that we want, we can grep for HTML pages in just the accessed pages. Let's do that.

tobiko% awk '{print $7}' log | grep html | wc -l
5464
tobiko%
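The drop from the earlier count is the referrer effect at work. Here is a small sketch that reproduces it with two made-up log lines (again assuming the "combined" layout): one real hit on an HTML page, and one hit on an image whose referrer happens to be an HTML page. I use grep -c, which counts matching lines directly, in place of grep | wc -l.

```shell
# Two made-up log lines: an HTML page, and an image whose referrer is an HTML page.
log=$(mktemp)
cat > "$log" <<'EOF'
a.example.edu - - [14/Apr/1998:10:00:00 -0700] "GET /courses/x/index.html HTTP/1.0" 200 100 "-" "Mozilla/4.0"
a.example.edu - - [14/Apr/1998:10:00:01 -0700] "GET /courses/x/logo.gif HTTP/1.0" 200 200 "http://www.example.edu/courses/x/index.html" "Mozilla/4.0"
EOF

whole_line=$(grep -c html "$log")                      # matches both lines
page_only=$(awk '{ print $7 }' "$log" | grep -c html)  # matches only the real page hit

echo "whole line: $whole_line, page field only: $page_only"
rm -f "$log"
```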

Number of hits to HTML pages, including directory hits

We now have a new problem. Recall that there are two ways to access the main index.html file in a directory:

http://www.cs/people/acm/
http://www.cs/people/acm/index.html

When counting hits to HTML pages, we'd like to also count hits to URLs that are just the directory, as above. What we need is some way to canonicalize the two URLs into the same form. We'll use the sed (stream editor) tool to add an index.html string to any URLs that end in a / (i.e., are directories).

tobiko% awk '{print $7}' log | sed 's:/$:/index.html:' | grep html | wc -l
8880
tobiko%
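A word on that sed command. It is an ordinary substitution, s:pattern:replacement:, written with colons as the delimiter so that the slashes inside it need no escaping. The pattern /$ matches a slash only when it is the last character on the line. A minimal sketch on three made-up paths:

```shell
# Two spellings of the same page, plus an ordinary file (made-up paths).
urls='/people/acm/
/people/acm/index.html
/people/acm/logo.gif'

# Append index.html wherever the path ends in a slash; leave other lines alone.
canonical=$(printf '%s\n' "$urls" | sed 's:/$:/index.html:')
printf '%s\n' "$canonical"
```

The first two lines come out identical, and the image path passes through untouched because it does not end in a slash.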

Pages, referrers, and hosts, formatted better

Before we go much further, let's jump back a few steps, to when we first used awk to select out a few fields of the web log. When we did that, the output wasn't as nice as it could have been. In particular, there wasn't any control over the amount of space between the three printed fields. We can use the printf function in awk to have more control over output, just as when using printf() in C.

tobiko% head -10 log | awk '{printf "%-40s %-100s -> %-60s\n", $1, $11, $7}'

orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0003.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0004.html
tide16.microsoft.com "-" -> /education/courses/401/98sp/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
dhcp133i.ee.washington.edu "http://www.cs.washington.edu/education/courses/143/98sp/homework/hw1/solution/index.html" -> /education/courses/143/98sp/homework/hw1/solution/lmatrix.cpp
orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0004.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0005.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/index.html" -> /education/courses/551/CurrentQtr/Papers/paper_9_index.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9/
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/" -> /education/courses/551/CurrentQtr/Messages/paper9/0000.html

tobiko%
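The format string works the same way as in C: %-40s means "print a string, left-justified (the minus sign), padded out to 40 columns". Here is a narrower sketch that is easier to read on a normal terminal. The host/page pairs and the 20-column width are made up for illustration.

```shell
# Made-up host/page pairs, one pair per line.
pairs='orcas.example.edu /courses/551/index.html
pml6.example.edu /courses/401/'

# %-20s pads to 20 columns, left-justified (the "-"); %s prints as-is.
formatted=$(printf '%s\n' "$pairs" | awk '{ printf "%-20s -> %s\n", $1, $2 }')
printf '%s\n' "$formatted"
```

Because both host names are padded to the same width, the arrows line up in a single column even though the names differ in length.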

Pages and hosts, formatted another way

The downside to the format given above is that it's really wide. An alternative format would list each page in a separate stanza, and then list all the unique accessors of that page below and indented a bit. We'll again use awk, sed, and our other friends for this task. But, in this example, we'll see a new aspect of awk: the input program file.

In previous examples, we've given awk's input program on the command line. The 'program' wasn't very complex -- it was just a print line. However, for this example, the program is a bit long, and much easier to enter as a file, which awk reads when given the -f option. The file visitors.awk is reproduced below:

# Assume that our input is a list of page/visitor pairs.
# Also, assume that the input is sorted by page.  What we wish
# to do is to print the name of the visited page once, and
# then list all the visitors out on separate lines following
# the page name.

{
  # Do we have a new page yet?
  if ($1 != current_page) { 
    # Yep, new page.  List out all the visitors for the old
    # current_page. 
    print_visitors();
    
    # Reset the list of visitors and continue.
    num_visitors = 0;
    delete visitors;
    printf "%s\n", $1;
    current_page = $1;
  }
}

{ 
  # Record this visit.
  visitors[num_visitors] = $2
  num_visitors++;
}

END {
  # Be sure to list the visitors for the last page, too.
  print_visitors();
}

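function print_visitors(   i) {
  # We want to print each visitor only once.  Walk the list by index
  # rather than with "for (v in visitors)", whose order is unspecified,
  # so that visitors come out in the same (sorted) order as the input.
  # The extra parameter i is just a local variable.
  for (i = 0; i < num_visitors; i++) {
    if (!printed[visitors[i]]) {
      printf "\t\t%s\n", visitors[i];
      printed[visitors[i]] = 1;
    }
  }
  delete printed
}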

tobiko% grep courses/401 log | head -100 | awk '{print $7 " " $1}' | sed 's:/$:/index.html:' | grep html | sort | awk -f visitors.awk

/education/courses/401/98sp/admin/instructors.html
		cs214-62.student.washington.edu
/education/courses/401/98sp/details/assignments/assign02.html
		cs210-95.student.washington.edu
		kochanski.cs.washington.edu
/education/courses/401/98sp/details/lex_orig.html
		kochanski.cs.washington.edu
/education/courses/401/98sp/details/project.html
		kochanski.cs.washington.edu
/education/courses/401/98sp/help/fib.html
		kochanski.cs.washington.edu
/education/courses/401/98sp/help/printing.html
		kochanski.cs.washington.edu
/education/courses/401/98sp/index.html
		cs210-95.student.washington.edu
		kochanski.cs.washington.edu
tobiko%
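The two ingredients that make visitors.awk tick, a rule that runs on every input line and an END rule that runs once at the finish, can be tried out on a much smaller scale. The sketch below is not visitors.awk itself. The program, the sample pairs, and the names seen and count are all made up for illustration. It shows a program file being loaded with -f, and the array-as-a-set trick for noticing repeats.

```shell
# Write a tiny awk program to a temporary file (hypothetical example program).
prog=$(mktemp)
cat > "$prog" <<'EOF'
# Runs once per input line: remember each visitor the first time it appears.
{ if (!seen[$2]) { seen[$2] = 1; count++ } }
# Runs once, after the last line has been read.
END { print count " unique visitors" }
EOF

# Made-up page/visitor pairs; hostA shows up twice.
pairs='/a.html hostA
/a.html hostB
/a.html hostA'

summary=$(printf '%s\n' "$pairs" | awk -f "$prog")
echo "$summary"
rm -f "$prog"
```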

HTML pages hit, sorted by hit frequency

Okay, now for some statistics fun. Let's see how many people hit the various course pages yesterday. We'll again use the methods from above to select out the accessed HTML pages and canonicalize them. Then, we'll use a three-step sort | uniq | sort pipe that will (1) group all hits to the same page together; (2) count the number of lines in each group; and (3) report those stats back in descending order.

tobiko% cat log | awk '{print $7}' | sed 's:/$:/index.html:' | grep html | sort | uniq -c | sort -rn | head
264 /education/courses/142/98sp/index.html
216 /education/courses/142/index.html
216 /education/courses/142/98sp/homework/hw2/index.html
202 /education/courses/143/CurrentQtr/index.html
202 /education/courses/142/98sp/homework/index.html
187 /education/courses/143/CurrentQtr/homework/hw2/index.html
168 /education/courses/142/CurrentQtr/index.html
164 /education/courses/143/CurrentQtr/homework/index.html
145 /education/courses/142/CurrentQtr/homework/hw2/index.html
138 /education/courses/142/98sp/homework/hw2/instructions.html
tobiko%
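The first sort in that pipe is not optional. uniq only merges lines that are adjacent, so the duplicates have to be brought together before uniq -c can count them. Here is a small sketch on six made-up hits, with and without the first sort:

```shell
# Made-up page hits, one per line, in arrival order.
hits='/a/index.html
/b/index.html
/a/index.html
/c/index.html
/a/index.html
/b/index.html'

# Group identical lines, count each group, then rank by count.
ranked=$(printf '%s\n' "$hits" | sort | uniq -c | sort -rn)
printf '%s\n' "$ranked"

# Without the first sort, no two identical lines are adjacent here,
# so uniq -c reports every line as its own group.
unsorted_groups=$(printf '%s\n' "$hits" | uniq -c | wc -l | tr -d ' ')
echo "groups without sorting first: $unsorted_groups"
```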

HTML pages hit, sorted by hit frequency, omitting 142, 143

Wouldn't you know, 142 and 143 were the most popular pages to visit. But we're all majors now, so let's see how our classes fare against each other, ignoring 142 and 143. We'll use egrep with the -v flag, which inverts the match, to filter out all instances of 142 and 143 course pages.

tobiko% egrep -v 'courses/14[23]' log | awk '{print $7}' | sed 's:/$:/index.html:' | grep html | sort | uniq -c | sort -rn | head
105 /education/courses/490ca/CurrentQtr/final/timeline.html
59 /education/courses/551/CurrentQtr/Messages/paper9/index.html
58 /education/courses/490csw/CurrentQtr/Groups/Group1/index.html
56 /education/courses/341/CurrentQtr/index.html
54 /education/courses/490csw/CurrentQtr/Groups/Group2/status.html
54 /education/courses/451/CurrentQtr/index.html
48 /education/courses/401/98sp/details/index.html
40 /education/courses/cse591/CurrentQtr/index.html
38 /education/courses/401/98sp/index.html
36 /education/courses/cse591/CurrentQtr/projects.html
tobiko%
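In the pattern 'courses/14[23]', the bracket expression [23] matches exactly one character, either a 2 or a 3, and -v keeps the lines that do NOT match. Here is a small sketch on four made-up paths (the 144 course is invented, just to show that it is left alone). I use grep -E, the standard spelling of egrep.

```shell
# Made-up paths; only the 142 and 143 lines should be dropped.
lines='/education/courses/142/index.html
/education/courses/143/index.html
/education/courses/144/index.html
/education/courses/401/index.html'

# -v inverts the match: print the lines that do not contain courses/142 or courses/143.
kept=$(printf '%s\n' "$lines" | grep -E -v 'courses/14[23]')
printf '%s\n' "$kept"
```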

Hosts, sorted by hit frequency

Not surprisingly, we can do something similar with the host hits. By selecting the visiting machine name, we find that sanjuan and orcas are the two machines most frequently used to surf the department's pages. Curiously, though, pml6 and ipl32 were pretty close behind orcas.

tobiko% cat log | awk '{print $1}' | sort | uniq -c | sort -rn | head
566 sanjuan.cs.washington.edu
295 orcas.cs.washington.edu
220 pml6.cs.washington.edu
197 ipl32.cs.washington.edu
190 166.104.221.182
188 pml10.cs.washington.edu
169 pennzoil.cs.washington.edu
158 ipl38.cs.washington.edu
152 ipl11.cs.washington.edu
150 cs214-36.student.washington.edu
tobiko%

Domains, sorted by hit frequency

Okay, machines were neat, but what about domains, that is, grouping all machines from a domain together? We'll again use awk, but in this case, we'll use it to remove the machine name. In this example, we'll also again use an input file, called domains.awk.

{ printf "%s", $2 }
{ for (n = 3; n <= NF; n = n + 1)
    printf ".%s", $n  }
{ printf "\n" }

tobiko% cat log | awk '{print $1}' | awk -F. -f domains.awk | sort | uniq -c | sort -rn | head
7195 cs.washington.edu
3313 student.washington.edu
1307 dhcp.washington.edu
1072 dhcp2.washington.edu
472 microsoft.com
448 u.washington.edu
321 boeing.com
250 ee.washington.edu
198 dyn.cs.washington.edu
190 104.221.182
tobiko%
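The -F. option tells awk to split fields on dots instead of whitespace, so $1 is the machine name and $2 through $NF are the pieces of the domain. Here is a small sketch that runs the same three rules from domains.awk, given inline rather than with -f, on three sample host names. The two named hosts and the numeric address appear in the output above; picking these three is my own choice.

```shell
# Three sample hosts: two names and one numeric address.
hosts='orcas.cs.washington.edu
tide16.microsoft.com
166.104.221.182'

domains=$(printf '%s\n' "$hosts" | awk -F. '
  { printf "%s", $2 }                         # second piece starts the domain
  { for (n = 3; n <= NF; n = n + 1)           # append the remaining pieces
      printf ".%s", $n }
  { printf "\n" }')
printf '%s\n' "$domains"
```

The program has no idea what a numeric address is, so it chops the first number off of 166.104.221.182 just as it would a machine name. That is where the odd-looking 104.221.182 "domain" in the last line of the real output comes from.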

Whew! That's about all that I have presently. Originally, I had intended to discuss various aspects of shell programming as well. That didn't happen, however. I do have some web pages that describe, very briefly, the highlights of how to do simple things in both bash and tcsh. Simple things include flow control, iteration, and setting and using variables. Enjoy!

corin@cs.washington.edu