A Little Huffman Coding with Java Tricks

Monday, May 16, 2005

A Huffman code is a way to utilize a binary tree to construct a minimal-length encoding for messages where certain characters or groups of characters have known frequencies. The tree used for such an operation called a Huffman tree. Huffman codes are the most efficient compression method for random data and are often found as steps in other compression algorithms such as JPEG and Deflate (ZIP). Building such trees is a very common exercise for Computer Science and Math classes so I can skip the details.

To construct a Huffman tree, first initialize a series of nodes, which are simply data structures holding the represented character and its frequency within the message. The series is sorted by frequency and the two lowest frequency nodes are picked from the list and combined into a composite node, which is given the sum of its child node frequencies as its frequency. The composite node is added back to the sorted list and the process is repeated until there is only one node in the list, which is the head of the Huffman tree.

In Java, the treeset data structure maintains a sorted list of comparable objects and can be used for an implementation of the Huffman code algorithm in the last paragraph. One caveat is that a treeset is a set, which means that duplicate nodes will simply disappear. The work around is to rig the compareTo method, which is implemented by all comparable objects, so that equality is never returned. In this implementation getCount returns the frequency of the node.

public int compareTo(Object a) {
HuffmanNode ha = (HuffmanNode) a;
if (this.getCount() == ha.getCount()) {
return -1;
} else {
return this.getCount() - ha.getCount();
}
} // end compareTo
The treeset is an ideal data structure for helping build the Huffman coding tree because it maintains a sorted list with a "guaranteed log(n) time cost for the basic operations (add, remove and contains)." My construction method for the Huffman tree uses two removes and one add each time nodes are combined so the worst running time is log^3(n), which is better than sorting algorithms that take n*log(n) or even n^2 time.

public int buildTree() {
tsNodes.clear();
tsNodes.addAll(htNodes.values());

while (tsNodes.size() > 1) {
Iterator it = tsNodes.iterator();
HuffmanNode a = (HuffmanNode)it.next();
it.remove();
HuffmanNode b = (HuffmanNode)it.next();
it.remove();

// Combine node a and node b into a composite node
HuffmanNode combinedNode = new HuffmanNode(a, b);
tsNodes.add(combinedNode);
} // end while tree > 1

huffmanTreeRoot = (HuffmanNode) tsNodes.first();
...
} // end buildTree

The compression rates achieved by this simple algorithm are pretty good even though they are not nearly as good as more specific ones in common use, such as JPEG, gzip, b2zip, etc.

Posted by Frank Rietta at 7:00 AM 2 comments links to this post

First Weekend in Catalunya

Saturday, May 14, 2005

Since arriving in Spain on Tuesday, I have more or less settled into life in Barcelona. I understand that this is the sixth year of the Georgia Tech College of Computing program in partnership with the Universitat Politècnica de Catalunya. This week has been filled with a lot of little things such as learning how to get to class, a university tour, and two days of actual classes. There has been a lot of time this week to explore the city and get adjusted to the new time. I can more or less go anywhere I need and get along pretty well. The primary language in Barcelona is Catalan, which is similar to Spanish but with French influences.

Working with money has been interesting because for the most part the price in euros is similar to the price in dollars. However, 1 euro is about $1.30 at this time, which makes things more expensive. It is pretty easy to get around the city using the metro system for which I purchased a monthly pass for 41.75 euros. One of the little things I have noticed is that commas and periods with respect to numbers are reversed, so the price of the monthly pass is shown as 41,75 and 1000 would be written as 1.000.

I am officially taking 9 hours of classes and intend to take the Spanish class without going through Georgia Tech. I really need the class for personal learning but do not need the hours at Tech since I have already taken Spanish there. It will be nice to be able to work on the language without having to worry much about the grade. Otherwise, I am taking Cognitive Science, Computing in Society (Ethics), and Computational Photography.

I have started a photo gallery to which I will post the pictures as they are taken. It can be accessed at the following URL:
http://gamma.rietta.com/~frank/gallery/bcn2005

More to come later.

Posted by Frank Rietta at 5:02 PM 0 comments links to this post

Re: David Bartosik: Why Robots.txt by Matt Benya

Saturday, May 07, 2005

I came across a blog article, David Bartosik: Why Robots.txt by Matt Benya, which happens to mention RoboGen, a program for editing robots.txt files that I wrote nearly six years ago! I do enjoy finding references to my previous work. Mr. Benya’s explanation on of the robots.txt file reminds me of a situation I came across a few weeks ago.

I had logged into one of the web servers and noticed the system was not responding as snappily as usual. I turns out the load average was at 15%, caused by a large number of instances of a customer CGI script. Fortunately, these scripts were being run by a particular user so I was able to find and inspect the tail end of the log file and determined that ZyBorg, from wisenutbot.com, was rapidly accessing the dynamically generated site by a CGI-interface Perl script. In order to get the server load under control, I created a robots.txt for the site and blocked the ZyBorg user-agent from indexing the Perl scripts for the site. Fortunately the robot did comply with the exclusion standard and the rapid-fire crawling stopped.

While this story has nothing to do with RoboGen, I used Vim in the SSH session, it does show one concrete example of the continued applicability of the robot exclusion standard.

Posted by Frank Rietta at 7:38 PM 0 comments links to this post

Improving my personal efficiency with KDE.

I have recently been struck with the "how can a good graphical environment help me be more efficient" bug. Many of the people I work with have migrated to Apple laptops for various reasons. It seams like the majority opinion is that Mac OS combines the best of the UNIX world with a very productive, HCI-friendly user interface and more applications. However, I do not own an Apple laptop so I figured this was a good opportunity to play around with the new KDE 3.4 on my FreeBSD 5.3-powered Toshiba Satellite. To sum it up, I like it and here is a screen shot of my current configuration with some random KDE-centric applications opened:

KDE Development Test

This is pretty much the stock KDE with very few tweaks to the interface. I moved the taskbar to the left side of the screen and set it to be partially transparent and to act a little more like a cross between the CDE and Mac OS toolbars. I turned on the global menu option and dragged several of the toolbar applets to the top-right of the screen.

I installed Kompose from the FreeBSD ports collection, which allowed me to assign hot corners so that by simply hovering the mouse pointer over either the top or bottom right-hand-side of the screen I can see all of the windows open across all virtual desktops. The Mac OS users will instantly recognize this as similar to Expose and it is indeed a good way to visually track windows on the system.

Kompose Shows Open Windows

This is not a review of KDE, just me stating that I like it. I expect to spend some more time playing around over the next few months. Perhaps I will end up with an Apple laptop sometime next year. The long battery life and UNIX environment would be very helpful for me.

I am really look forward to finding better ways to organize and work with my data. Project Tenor seams very interesting indeed. The OS News article is a good place to start to look at it. The prospect of not only being able to find data but to automatically discover functional relationships among data is a very interesting idea.

Well, that is enough for now. Back to work.

Posted by Frank Rietta at 11:12 AM 0 comments links to this post

Heading to Barcelona, Spain

Friday, May 06, 2005

I will be living in Barcelona, Spain, until July. I really look forward for the chance to learn more about Spain and what it is like to live in Europe for a while. Our group from Georgia Tech will be guests at the Universitat Politècnica de Catalunya. I will be taking a number of classes including Introduction to Cognitive Science, which should be interesting. From time to time I will post updated tidbits about the trip to this site and even link to some pictures which will be posted in a photo gallery which I will maintain.

The web hosting operations will be managed by Jay Johnson and Jonathan Cullifer. I have spent nearly six years on call as the primary upper tier support representative and am very much looking forward to the time to be able to focus primarily on taking some interesting classes, experiencing another country, and perhaps working a little on some of my research and writing programs.

This weekend is proving to be crazy. I had hoped to post some information about my experimentations with Huffman coding but do not think it will be up until next week. I have been to other places before, and even outside of the United States, but this will by far be longest time I have ever been anywhere since my family moved to Atlanta when I was five-years-old.

I will keep everyone posted.

Posted by Frank Rietta at 1:33 PM 0 comments links to this post

"Whenever you find a man who says he doesn't believe in a real Right and Wrong, you will find the same man going back on this a moment later."
-- C.S. Lewis, The Case for Christianity

Recent Posts

Archives

April 2005 / May 2005 / June 2005 / July 2005 / August 2005 / November 2005 / April 2006 / June 2006 / August 2006 / September 2006 / November 2006 / December 2006 / January 2007 / January 2008 /

My Photo
Name: Frank Rietta
Location: Atlanta, Georgia, United States

I am a software developer who has been marketing on the internet since 1999. I hold an MS in Information Security from the Georgia Institute of Technology, from where I previously earned a BS in Computer Science in 2005. I ran an Atlanta-based web hosting business from 1999 until I sold it in 2005.


Home | Product List | Privacy | Contact