Awk amazingness

So, you may or may not be aware, I am really excited about awk at the moment. Also find.

I learned a particular bit of magic on Sunday. I used my most complicated awk script yet: it uses an END block!

root@host [/home]# find /home/back/ -user <user> -exec ls -ld {} \; | awk '{total=total+$5}; END {print total}'

Here is the context: on a cPanel server, a user was looking at their disk usage and saw a lot of space listed under “Other Usage”. They wanted to know what was taking up that space. Here is what the cPanel documentation says about that page (cPanel Docs: Disk Space Usage):

This feature also displays disk space usage summaries for:

  • Files contained within your home directory
  • Files in hidden subdirectories
  • Mailing lists in Mailman
  • Files not contained within your home directory (see Other Usage bar)

So, now let’s break down this command I wrote and analyze what it does.

find

find is a utility that searches the filesystem for files for which certain conditions are true. I find the manpage for this utility very useful because it has informative examples. I usually use (and currently have bookmarked) the one here, so I can see it in my browser: man find

find /home/back/

The path argument tells find where it is looking. I had noticed that there was a folder in /home owned by root. This seemed a likely place to look in this case.

find /home/back/ -user <user>

I only wanted to find the things in that directory that were owned by that particular user. There were a lot of things in the folder, so I didn’t wanna check myself if all of them were owned by that user. So I made find do it for me! 🙂
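If you want to try the -user test somewhere harmless first, here is a minimal sketch. The temporary directory and file names are made up for illustration, and it uses your own numeric user ID (from id -u) in place of <user>; find accepts either a login name or a numeric ID there.

```shell
# Build a throwaway directory so the example does not touch real data.
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt"

# Everything just created belongs to the current user, so the directory
# itself and both files should match.
matches=$(find "$demo" -user "$(id -u)" | wc -l | tr -d ' ')
echo "$matches"

rm -rf "$demo"
```

The count includes the starting directory, because find tests the path you give it as well as everything underneath.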

find /home/back/ -user <user> -exec ls -ld {} \;

Once I’d found the files I wanted, I needed to find out how much space they were taking. I figured a good way to do this was to pass each file through ls -l so I could grab the number of bytes from that listing. For the directories, if I didn’t have the -d flag in there, ls would also list all the files inside each folder when it got to it, which was not the desired behavior. Another thing I could have tried, instead of using the -exec action in find, was to pipe the results through xargs ls -ld like this:

find /home/back -user <user> | xargs ls -ld

However, this would confuse ls if any file or folder name contained a space. Without the -0 flag, xargs splits its input on whitespace (blanks as well as newlines), so a name with a space in it arrives as two separate arguments. I could fix it by using the -0 flag in xargs and the -print0 action in find, like this:

find /home/back -user <user> -print0 | xargs -0 ls -ld

However, if I’m adding a command to find anyway, why not save the pipe and xargs by just using -exec? So that’s why I did that.
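To see the difference between the three variants, here is a small sketch you can run in a scratch directory. The file name "my file.txt" is just an example, and the sketch assumes a find and xargs that support -print0 and -0 (GNU and BSD versions both do).

```shell
demo=$(mktemp -d)
start=$PWD
cd "$demo" || exit 1
printf 'hello' > 'my file.txt'   # one file whose name contains a space

# Default xargs splits "./my file.txt" into "./my" and "file.txt".
# Neither exists, so ls prints only errors; xargs then exits non-zero,
# which is expected here.
plain=$(find . -type f | xargs ls -ld 2>/dev/null | wc -l | tr -d ' ') || true

# NUL-delimited names survive the trip through the pipe intact.
nul=$(find . -type f -print0 | xargs -0 ls -ld | wc -l | tr -d ' ')

# -exec hands each name to ls directly, with no parsing step in between.
direct=$(find . -type f -exec ls -ld {} \; | wc -l | tr -d ' ')

echo "plain=$plain nul=$nul direct=$direct"

cd "$start" || exit 1
rm -rf "$demo"
```

The plain pipe should list nothing, while the -print0 and -exec versions each list the one file.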

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{print $0}'

This is a testing version. I built the command up in small steps, changing as little as I could from the one before, so I knew exactly what each change did. {print $0} is already the default action of awk, but an awk rule needs a pattern, an action, or both. Since I want it to work on all the lines, I give no pattern, so I have to spell out the action.
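A quick way to convince yourself of the pattern/action rule is to compare the two forms on the same input. This is only an illustration; the sample lines are made up.

```shell
# Action with no pattern: the action runs on every line.
with_action=$(printf 'one\ntwo\n' | awk '{print $0}')

# Pattern with no action: 1 is always true, and the default action
# prints the whole line.
with_pattern=$(printf 'one\ntwo\n' | awk '1')

echo "$with_action"
echo "$with_pattern"
```

Both commands print the input unchanged.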

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{print $0; print $5}'

This one is a sanity-check to make sure field #5 is the one with the file sizes in it.
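The same check can be done on a file whose size you already know. This sketch uses a made-up 10-byte file and assumes the usual ls -l layout, where owner and group names contain no spaces (otherwise the field numbers shift).

```shell
demo=$(mktemp -d)
printf '0123456789' > "$demo/ten.bytes"   # exactly 10 bytes

# Field 5 of a long listing is the size in bytes.
size=$(ls -ld "$demo/ten.bytes" | awk '{print $5}')
echo "$size"

# With -d, a directory argument produces one line about the directory
# itself instead of one line per entry inside it.
dir_lines=$(ls -ld "$demo" | wc -l | tr -d ' ')
echo "$dir_lines"

rm -rf "$demo"
```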

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{print $0; print $5; total=total+$5; print total}'

This tests that my variable total is working right, that it is adding up the file sizes as it goes.
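Here is the running total on a tiny hand-written listing, so the arithmetic is easy to follow. The three lines are fake; only field 5 matters.

```shell
# Three made-up listing lines with sizes 100, 250 and 4096.
listing='-rw-r--r-- 1 user user 100 Jan  1 00:00 a
-rw-r--r-- 1 user user 250 Jan  1 00:00 b
drwxr-xr-x 2 user user 4096 Jan  1 00:00 c'

# total is never initialized; awk treats an unset variable as 0 in arithmetic.
running=$(printf '%s\n' "$listing" | awk '{total=total+$5; print total}')
echo "$running"
```

The output should be 100, then 350, then 4446, one value per input line.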

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{print $0; print $5; total=total+$5; print total}; END {print "Total: ",total}'

This tests that the END block works correctly and that the total will be printed correctly at the end of the script. Since it works and is giving me the information I want, I can now modify it to remove the pieces I don’t want. Since I’m being extra careful, I take one piece out at a time.

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{print $5; total=total+$5; print total}; END {print "Total: ",total}'

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{total=total+$5; print total}; END {print "Total: ",total}'

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{total=total+$5}; END {print "Total: ",total}'

find /home/back/ -user <user> -exec ls -ld {} \; | awk '{total=total+$5}; END {print total}'

And now we have it. We found everything in that folder owned by that user and added up how much space it takes in bytes. To change to a more useful unit like megabytes (counting 1024 × 1024 bytes each), you can use this nifty trick:

echo $((<total>/1024/1024))
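One thing to keep in mind is that shell arithmetic only does whole numbers, so the fraction is dropped. Here is a small sketch with a made-up total that shows this, plus one way to keep the decimals by letting awk do the division instead.

```shell
total=5767168   # hypothetical byte count: exactly 5.5 * 1024 * 1024

# $(( )) is integer-only, so 5.5 becomes 5.
whole=$((total / 1024 / 1024))
echo "$whole"

# awk works in floating point and can keep the fraction.
precise=$(awk -v bytes="$total" 'BEGIN {printf "%.2f", bytes / 1024 / 1024}')
echo "$precise"
```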

In retrospect, looking at the documentation, it appears the sizes of the folders themselves are not counted.

The disk space usage information contained in this feature does not indicate how much space the directory itself uses. It only displays disk usage information about the directory’s contents. Typically, directories themselves occupy a negligible amount of disk space.

So to match that, we would want to make our script count only the files.

find /home/back/ -user <user> -type f -exec ls -ld {} \; | awk '{total=total+$5}; END {print total}'
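To check the files-only version on known data, here is a sketch with two made-up files and one subdirectory. It also swaps in two small variations that are not part of the original command: ending -exec with + instead of \; so find passes many names to each ls run, and printing total + 0 so that an empty result shows up as 0 rather than a blank line.

```shell
demo=$(mktemp -d)
mkdir "$demo/sub"
printf '0123456789' > "$demo/a"       # 10 bytes
printf '01234'      > "$demo/sub/b"   # 5 bytes

# -type f skips the directories, so only the two files are summed.
files_only=$(find "$demo" -type f -exec ls -ld {} + |
  awk '{total += $5} END {print total + 0}')
echo "$files_only"

rm -rf "$demo"
```

The result should be 15, the combined size of the two files, with nothing added for the directories.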

Then what remains is finding where else needs looking. 🙂