This tutorial is the last in a two- or three-part series of UNIX tutorials that I have prepared for the ACM. It is targeted at CSE students who have a solid grasp of UNIX at the most basic levels and who have had some experience with some of its more advanced features.
The purpose of this tutorial is to get you excited and educated about two important features in UNIX: pipes and versatile utilities. We'll first go through a warm up exercise, then we'll march through a series of revisions on a data mining tool. By the end, I hope that we'll have reached our goal.
Near the end of the previous tutorial we discussed some tools for disk usage management on the UNIX system. Let's pick up at that point.
The quota program (invoked using /usr/sbin/quota) shows you how much disk space you're using. For my directory, I have:
Disk quotas for user corin (uid 11035):
Filesystem   blocks   quota   limit   grace   files   quota   limit   grace
/Sucia/u4      2650   60000   80000             295       0       0
On these Alphas, blocks are 1KB, so I see that I'm using a bit less than 3 MB of space out of my total allotment of 60MB.
Now, let's say that, instead of using less than 5% of your quota, you're up to nearly 95% of it, and you'd like to find out where all your space has gone. A tool that will come in handy here is the du command (short for disk usage). My usage on the IWS machines isn't very interesting, so I'll show you what I see on my GWS machine:
tobiko% du -k
[...]
120 ./142/section
12 ./142/old-RCS
204 ./142/mac-mw
596 ./142
288 ./558/proj1/fig
776 ./558/proj1
16 ./558/proj2/lsys/CVS
448 ./558/proj2/lsys
16 ./558/proj2/smooth/CVS
3600 ./558/proj2/smooth
4052 ./558/proj2
5012 ./558
4 ./tmp
105952 .
tobiko%
Of course, I omitted a lot of output here. Each line shows a directory and the size, in KB, of the files in that directory (that's what the -k option flag told du; the default is to report blocks of 512 bytes). The last line shows the total usage of all files in and below the current directory. Here, you see that I'm using not quite 106MB of space.
Now is the time to be introduced to the first major emphasis of this tutorial: the power of pipes. We'll talk about pipes now, and the power will become evident by the end of the tutorial (I hope...). Anyway, let's run du again, but this time, let's see it a page at a time:
du -k | more
We talked about more in the first UNIX tutorial, so this isn't very exciting. What would be neat would be to see our disk usage sorted in order of greatest to least. We can do just that using the sort utility.
du -k | sort -rn | more
The -n flag tells sort to sort the entries numerically. The -r says to output in reverse (descending) order. The first several entries that I see now are:
105952 .
35932 ./archives
23232 ./www
20660 ./archives/research
13332 ./archives/courses
9640 ./557
8264 ./mail
I can immediately see that about a third of my space is dedicated to holding old coursework and research, and nearly another quarter holds my web pages. The next two entries, however, aren't really useful. My home directory space is tree-like, and I'm really only concerned about finding how the space is distributed one level down -- I don't care about how the archive space is split between research and courses. A solution to this problem is to select only the lines in du's output that match a certain pattern -- a regular expression. We'll use the egrep tool (cousin to grep) to do just that:
tobiko% du -k | egrep '\./[^/]*$' | sort -rn | head
35932 ./archives
23232 ./www
9640 ./557
8264 ./mail
6960 ./acm
5872 ./research
5012 ./558
2780 ./elisp
1860 ./.netscape
856 ./sw
tobiko%
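I now see just the top-level directories that are taking up most of the space in my home directory. The pattern matches a ./ followed by no further slashes before the end of the line, so deeper entries such as ./archives/research are dropped. Note, also, that I'm using the head command to grab just the first 10 lines of output. head -N retrieves the first N lines of output (POSIX spells this head -n N).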
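If you'd like to poke at the filter without waiting for du to walk your whole home directory, here is a small sketch that feeds it a few of the lines from my listing above. It uses grep -E, which accepts the same patterns as egrep; the variable names are just ones I made up for the example.

```shell
# A few size/path pairs copied from the du listing in this tutorial,
# separated by a tab the way du prints them.
du_output=$(printf '%s\t%s\n' \
    105952 . \
    35932 ./archives \
    23232 ./www \
    20660 ./archives/research \
    13332 ./archives/courses \
    9640 ./557 \
    8264 ./mail)

# Keep only entries one level down, largest first, top three.
top_level=$(printf '%s\n' "$du_output" | grep -E '\./[^/]*$' | sort -rn | head -n 3)
printf '%s\n' "$top_level"
```

The total line (.) and the two ./archives/... lines drop out, leaving ./archives, ./www, and ./557.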
I think that we're pretty well warmed up now. Let's move on to some more interesting applications, including more fun UNIX utilities and really cool pipelines.
This next section uses data log combing as a motivating example. The data logs that we'll be combing here are the department's web logs for Tuesday, April 14th, 1998, filtered to list (more or less) only the accesses to departmental course web pages.
Yeah, I know, some of you might not get really excited by this sort of motivator. But I think that it's cool. You can learn a lot from web logs. Besides information about what pages are being accessed and when, the web logs also contain info about who's accessing the page, from where they came, how long they stayed at that page, and where else they went into the department. Anyway, I think it's cool, and I am the one who's leading this tutorial. So there! 8)
The departmental web logs aren't usually accessible on the IWS machines, but I've put yesterday's logs in my home directory, in ~corin/acm/log. Don't copy it to your home directory! There's no reason that everyone needs a 5 MB log file hanging around anywhere. If it's easier for you, you can always just make a symbolic link to the file:
cd ~
ln -s ~corin/acm/log
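In case the one-argument form of ln -s is new to you, here is a throwaway sketch of what it does. The temporary directories and the two-line stand-in file are made up for the demonstration; nothing here touches the real log.

```shell
# Build a scratch "shared" directory holding a file and a scratch "home".
workdir=$(mktemp -d)
mkdir "$workdir/shared" "$workdir/home"
printf 'line one\nline two\n' > "$workdir/shared/log"

# With a single argument, ln -s creates a link with the same base name
# in the current directory.
cd "$workdir/home" || exit 1
ln -s "$workdir/shared/log"

# Reading through the link reads the original file; no data was copied.
line_count=$(wc -l < log | awk '{ print $1 }')
echo "log is a link with $line_count lines behind it"
```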
Now, then, let's do some data mining!
The first thing we'll try is to find out how many hits the department's course web pages had yesterday. Each entry in the log file is a separate hit, so we'll just count the lines. UNIX's wc tool is the best bet here. The -l option asks wc (word count) to report the number of lines in a file (the default, with no option, reports the number of lines, words, and bytes).
tobiko% wc -l log
18784 log
tobiko%
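Here is a quick check of what wc counts, run on a two-line string rather than the real log. The awk step is only there to squeeze out the column padding, which differs from one wc to another.

```shell
# Two lines, three words, fourteen bytes (the newlines count as bytes).
counts=$(printf 'one two\nthree\n' | wc | awk '{ print $1, $2, $3 }')

# With -l, only the line count is reported.
lines_only=$(printf 'one two\nthree\n' | wc -l | awk '{ print $1 }')

echo "default: $counts"
echo "with -l: $lines_only"
```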
Instead of counting all hits, let's just count hits to HTML pages. To let through only the lines that contain the string "html", we'll use grep.
tobiko% grep html log | wc -l
11095
tobiko%
Unfortunately, this method still counts hits to non-HTML pages. In particular, if you look at a single line in the access log, you'll see that there are two URLs listed -- the destination, and the referring URL. If a web page has an inlined image, then that web page is listed as the referring URL for the access of the image. We don't want to count that page twice. What do we do? I'm glad you asked!
The solution to the dilemma above is to select only certain fields of the access log to consider. For the moment, let's select the host that accessed the page, the page accessed, and the referring URL. The tool that we'll use here is awk.
tobiko% awk '{ print $1 "\t" $11 "\t->\t" $7 }' log | head
orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0003.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0004.html
tide16.microsoft.com "-" -> /education/courses/401/98sp/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
dhcp133i.ee.washington.edu "http://www.cs.washington.edu/education/courses/143/98sp/homework/hw1/solution/index.html" -> /education/courses/143/98sp/homework/hw1/solution/lmatrix.cpp
orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0004.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0005.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/index.html" -> /education/courses/551/CurrentQtr/Papers/paper_9_index.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9/
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/" -> /education/courses/551/CurrentQtr/Messages/paper9/0000.html
tobiko%
Now that we know how to select only the fields that we want, we can grep for HTML pages in just the accessed pages. Let's do that.
tobiko% awk '{print $7}' log | grep html | wc -l
5464
tobiko%
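To see the double-counting problem in miniature, here is a sketch with three made-up log lines. I'm assuming the log follows the usual combined layout, which is what the field numbers above suggest (field 7 is the requested path, field 11 the referring URL); the dates, sizes, and browser strings are invented for the example.

```shell
# Hypothetical sample log: an HTML page, an image inlined in that page,
# and a text file.
log=$(mktemp)
cat > "$log" <<'EOF'
orcas.cs.washington.edu - - [14/Apr/1998:10:00:00 -0700] "GET /education/courses/401/98sp/index.html HTTP/1.0" 200 2048 "-" "Mozilla/4.0"
orcas.cs.washington.edu - - [14/Apr/1998:10:00:01 -0700] "GET /education/courses/401/98sp/logo.gif HTTP/1.0" 200 512 "http://www.cs.washington.edu/education/courses/401/98sp/index.html" "Mozilla/4.0"
sanjuan.cs.washington.edu - - [14/Apr/1998:10:05:00 -0700] "GET /education/courses/401/98sp/notes.txt HTTP/1.0" 200 128 "-" "Mozilla/4.0"
EOF

# Whole-line match: the image hit is counted too, because its referrer
# contains "html".
naive=$(grep -c html "$log")

# Match only against the requested path (field 7).
by_field=$(awk '{ print $7 }' "$log" | grep -c html)

echo "whole-line: $naive, path-only: $by_field"
rm -f "$log"
```

The whole-line grep reports 2 hits, while the field-based version reports the 1 hit that really was to an HTML page.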
We now have a new problem. Recall that there are two ways to access the main index.html file in a directory:
http://www.cs/people/acm/
http://www.cs/people/acm/index.html
When counting hits to HTML pages, we'd like to also count hits to URLs that are just the directory, as above. What we need is some way to canonicalize the two URLs into the same form. We'll use the sed (stream editor) tool to add in an index.html string to any URLs that end in a / (i.e., are directories).
tobiko% awk '{print $7}' log | sed 's:/$:/index.html:' | grep html | wc -l
8880
tobiko%
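Here is the sed step on its own, applied to three sample paths that I made up in the style of the log. Note that sed lets you pick the delimiter for s; using : instead of / saves us from escaping every slash in the pattern.

```shell
# A directory URL, the same page spelled out, and a non-HTML file.
paths='/education/courses/401/98sp/
/education/courses/401/98sp/index.html
/education/courses/401/98sp/logo.gif'

# Append index.html to anything that ends in a slash.
canonical=$(printf '%s\n' "$paths" | sed 's:/$:/index.html:')

# Both spellings of the index page now count as HTML hits.
html_hits=$(printf '%s\n' "$canonical" | grep -c html)

printf '%s\n' "$canonical"
echo "HTML hits: $html_hits"
```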
Before we go much further, let's jump back a few steps, to when we first used awk to select out a few fields of the web log. When we did that, the output wasn't as nice as it could have been. In particular, there wasn't any control over the amount of space between the three printed fields. We can use the printf function in awk to have more control over output, just as when using printf() in C.
tobiko% head -10 log | awk '{printf "%-40s %-100s -> %-60s\n", $1, $11, $7}'
orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0003.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0004.html
tide16.microsoft.com "-" -> /education/courses/401/98sp/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
ohaton.cs.ualberta.ca "-" -> /education/courses/401/CurrentQtr/
dhcp133i.ee.washington.edu "http://www.cs.washington.edu/education/courses/143/98sp/homework/hw1/solution/index.html" -> /education/courses/143/98sp/homework/hw1/solution/lmatrix.cpp
orcas.cs.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/0004.html" -> /education/courses/551/CurrentQtr/Messages/paper9/0005.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/index.html" -> /education/courses/551/CurrentQtr/Papers/paper_9_index.html
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Papers/paper_9_index.html" -> /education/courses/551/CurrentQtr/Messages/paper9/
cs210-81.student.washington.edu "http://www.cs.washington.edu/education/courses/551/CurrentQtr/Messages/paper9/" -> /education/courses/551/CurrentQtr/Messages/paper9/0000.html
tobiko%
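If the width specifiers are unfamiliar, a tiny sketch may help. The two host names and counts below are borrowed from the host table later in this tutorial; the widths themselves are arbitrary choices for the example. A leading - left-justifies a field, and the number sets its minimum width.

```shell
# %-10s pads the name on the right to 10 columns;
# %6d right-justifies the number in 6 columns.
formatted=$(printf '%s\n' 'orcas 295' 'sanjuan 566' |
    awk '{ printf "%-10s %6d\n", $1, $2 }')

printf '%s\n' "$formatted"
```

Both output lines come out 17 characters long, so the numbers line up no matter how long the names are (as long as they fit in the field).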
The downside to the three-column printf format above is that it's really wide. An alternative format would list each page in a separate stanza, and then list all the unique accessors of that page below and indented a bit. We'll again use awk, sed, and our other friends for this task. But, in this example, we'll see a new aspect of awk: the input program file.
In previous examples, we've given awk's input program on the command line. The 'program' wasn't very complex -- it was just a print line. However, for this example, the program is a bit long, and much easier to enter as a file. The file visitors.awk is reproduced below:
# Assume that our input is a list of page/visitor pairs.
# Also, assume that the input is sorted by page. What we wish
# to do is to print the name of the visited page once, and
# then list all the visitors out on separate lines following
# the page name.
{
    # Do we have a new page yet?
    if ($1 != current_page) {
        # Yep, new page. List out all the visitors for the old
        # current_page.
        print_visitors();
        # Reset the list of visitors and continue.
        num_visitors = 0;
        delete visitors;
        printf "%s\n", $1;
        current_page = $1;
    }
}
{
    # Record this visit.
    visitors[num_visitors] = $2;
    num_visitors++;
}
END {
    # Be sure to list the visitors for the last page, too.
    print_visitors();
}
# i is declared as an extra parameter so that it is local.
function print_visitors(i) {
    # We want to print each visitor only once. Walking the array by
    # index keeps the visitors in input order; for (v in visitors)
    # would visit them in an unspecified order.
    for (i = 0; i < num_visitors; i++) {
        if (!printed[visitors[i]]) {
            printf "\t\t%s\n", visitors[i];
            printed[visitors[i]] = 1;
        }
    }
    delete printed;
}
tobiko% grep courses/401 log | head -100 | awk '{print $7 " " $1}' \
  | sed 's:/$:/index.html:' | grep html | sort | awk -f visitors.awk
/education/courses/401/98sp/admin/instructors.html
                cs214-62.student.washington.edu
/education/courses/401/98sp/details/assignments/assign02.html
                cs210-95.student.washington.edu
                kochanski.cs.washington.edu
/education/courses/401/98sp/details/lex_orig.html
                kochanski.cs.washington.edu
/education/courses/401/98sp/details/project.html
                kochanski.cs.washington.edu
/education/courses/401/98sp/help/fib.html
                kochanski.cs.washington.edu
/education/courses/401/98sp/help/printing.html
                kochanski.cs.washington.edu
/education/courses/401/98sp/index.html
                cs210-95.student.washington.edu
                kochanski.cs.washington.edu
tobiko%
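As an aside, here is one alternative you might try: let sort -u throw away the duplicate page/visitor pairs before awk ever sees them, which shrinks the awk program to two rules. This is only a sketch of a different approach, not the program above, and the page and host names in it are placeholders I invented.

```shell
# Unsorted page/visitor pairs with one duplicate.
pairs='/a.html hostB
/a.html hostA
/b.html hostA
/a.html hostA'

# sort -u both groups the pages and removes repeated pairs.
report=$(printf '%s\n' "$pairs" | LC_ALL=C sort -u | awk '
    $1 != page { print $1; page = $1 }   # new page: print its name once
    { printf "\t\t%s\n", $2 }            # each remaining pair is unique
')

printf '%s\n' "$report"
```

The trade-off is that all of the bookkeeping moves into sort; the longer program in visitors.awk keeps the duplicate check inside awk itself.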
Okay, now for some statistics fun. Let's see how many people hit the various course pages yesterday. We'll again use the methods from above to select out the accessed HTML pages and canonicalize them. Then, we'll use a three-step sort | uniq | sort pipe that will (1) group all hits to the same page together; (2) count the number of lines in each group; and (3) report those stats back in descending order.
tobiko% cat log | awk '{print $7}' | sed 's:/$:/index.html:' \
  | grep html | sort | uniq -c | sort -rn | head
264 /education/courses/142/98sp/index.html
216 /education/courses/142/index.html
216 /education/courses/142/98sp/homework/hw2/index.html
202 /education/courses/143/CurrentQtr/index.html
202 /education/courses/142/98sp/homework/index.html
187 /education/courses/143/CurrentQtr/homework/hw2/index.html
168 /education/courses/142/CurrentQtr/index.html
164 /education/courses/143/CurrentQtr/homework/index.html
145 /education/courses/142/CurrentQtr/homework/hw2/index.html
138 /education/courses/142/98sp/homework/hw2/instructions.html
tobiko%
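The first sort in that pipe matters more than it might look: uniq only collapses lines that are next to each other. Here is a small sketch with six made-up hits that shows the difference.

```shell
# Six hits to three hypothetical pages, in arrival order.
hits='/142/index.html
/143/index.html
/142/index.html
/401/index.html
/142/index.html
/143/index.html'

# Without the first sort, no two equal lines are adjacent, so uniq
# reports six separate groups.
unsorted_groups=$(printf '%s\n' "$hits" | uniq -c | wc -l | awk '{ print $1 }')

# Group, count, then rank by count.
ranked=$(printf '%s\n' "$hits" | sort | uniq -c | sort -rn)
top=$(printf '%s\n' "$ranked" | head -n 1 | awk '{ print $1, $2 }')

printf '%s\n' "$ranked"
echo "groups without sorting first: $unsorted_groups"
```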
Wouldn't you know, 142 and 143 were the most popular pages to visit. But we're all majors now, so let's see how our classes fare against each other, ignoring 142 and 143. We'll use egrep to filter out instances of 142 and 143 course pages.
tobiko% egrep -v 'courses/14[23]' log | awk '{print $7}' \
  | sed 's:/$:/index.html:' | grep html | sort | uniq -c | sort -rn | head
105 /education/courses/490ca/CurrentQtr/final/timeline.html
59 /education/courses/551/CurrentQtr/Messages/paper9/index.html
58 /education/courses/490csw/CurrentQtr/Groups/Group1/index.html
56 /education/courses/341/CurrentQtr/index.html
54 /education/courses/490csw/CurrentQtr/Groups/Group2/status.html
54 /education/courses/451/CurrentQtr/index.html
48 /education/courses/401/98sp/details/index.html
40 /education/courses/cse591/CurrentQtr/index.html
38 /education/courses/401/98sp/index.html
36 /education/courses/cse591/CurrentQtr/projects.html
tobiko%
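Here is the inverted match by itself, on four made-up paths. The -v flag keeps the lines that do not match, and [23] stands for a single character that is either 2 or 3. I've written it with grep -E, which takes the same patterns as egrep.

```shell
# Only the 142 and 143 paths should be dropped; 144 and 401 stay.
kept=$(printf '%s\n' \
    courses/142/a.html \
    courses/143/b.html \
    courses/401/c.html \
    courses/144/d.html | grep -E -v 'courses/14[23]')

printf '%s\n' "$kept"
```

Keep in mind that the pattern isn't anchored on the right, so a path such as courses/1421 would be filtered out as well.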
Not surprisingly, we can do something similar with the host hits. By selecting the visiting machine name, we find that sanjuan and orcas are the two machines that hit the department's pages most often. Curiously, though, pml6 and ipl32 were pretty close.
tobiko% cat log | awk '{print $1}' | sort \
  | uniq -c | sort -rn | head
566 sanjuan.cs.washington.edu
295 orcas.cs.washington.edu
220 pml6.cs.washington.edu
197 ipl32.cs.washington.edu
190 166.104.221.182
188 pml10.cs.washington.edu
169 pennzoil.cs.washington.edu
158 ipl38.cs.washington.edu
152 ipl11.cs.washington.edu
150 cs214-36.student.washington.edu
tobiko%
Okay, machines were neat, but what about domains, that is, grouping all machines from a domain together? We'll again use awk, but in this case, we'll use it to remove the machine name. In this example, we'll once again put the awk program in a file, called domains.awk.
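# Print every dot-separated field except the first (the machine name),
# rejoined with dots. Run with -F. so that awk splits on ".".
{ printf "%s", $2 }
{
    for (n = 3; n <= NF; n = n + 1)
        printf ".%s", $n
}
{ printf "\n" }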
tobiko% cat log | awk '{print $1}' | awk -F. -f domains.awk | sort \
  | uniq -c | sort -rn | head
7195 cs.washington.edu
3313 student.washington.edu
1307 dhcp.washington.edu
1072 dhcp2.washington.edu
472 microsoft.com
448 u.washington.edu
321 boeing.com
250 ee.washington.edu
198 dyn.cs.washington.edu
190 104.221.182
tobiko%
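Here is the same idea squeezed onto the command line and run on four host names taken from the tables above. The inline program is my own compact restatement for the example (it builds the result in a variable I've called out), not a copy of domains.awk.

```shell
# -F. makes awk split each host name on dots; everything after the
# first field is glued back together.
domains=$(printf '%s\n' \
    orcas.cs.washington.edu \
    sanjuan.cs.washington.edu \
    cs214-36.student.washington.edu \
    166.104.221.182 |
    awk -F. '{ out = $2; for (n = 3; n <= NF; n++) out = out "." $n; print out }' |
    sort | uniq -c | sort -rn)

top_domain=$(printf '%s\n' "$domains" | head -n 1 | awk '{ print $1, $2 }')
printf '%s\n' "$domains"
```

Notice that the bare IP address loses its first number and shows up as 104.221.182, which is exactly the odd-looking last entry in my results above.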
Whew! That's about all that I have presently. Originally, I had intended to discuss various aspects of shell programming as well. That didn't happen, however. I do have some web pages that describe, very briefly, the highlights of how to do simple things in both bash and tcsh. Simple things include flow control, iteration, and setting and using variables. Enjoy!