In lecture we discussed Huffman coding and the construction of prefix code trees. We have also covered a variety of data structures this semester: lists, trees, hash tables, and heaps. Several of these may be useful for this lab. In addition, in this lab you are required to think about the underlying representation and efficiency of these structures.
There is no tutorial for this lab.
For spring 2022, there is an extension to the in-lab due date; see the daily announcements for details.
The post-lab is cancelled for the spring 2022 semester.
There are a few points that apply to the Huffman implementation as a whole that are important to keep in mind.
Your Huffman implementation should:
Write the 0s and 1s of the encoded output as ASCII characters rather than as actual bits – so a character such as t might be encoded into the five characters 01011. This means the “compressed” file will actually be larger than the original (each “bit” takes a full byte), but it will simplify the lab considerably.
Use the binary heap that you implement yourself (in heap.cpp/h) rather than the STL priority_queue class.
You will need to select several different data structures to implement Huffman compression and decompression. Don’t get more complicated than is necessary, but do keep efficiency in mind. Consider questions such as: how many items will each structure need to hold? Which operations (insert, remove, look-up) must it support, and how often will each be performed? How fast do those operations need to be?
Use the answers to these questions to guide your selection. A small part of your grade will be based on how efficient your solution is (both in terms of time and space); very inefficient solutions will lose a few points. Be sure you have a good explanation for your implementation choices.
Whatever your implementation, you should be able to accurately describe its worst case Big-Theta performance both in running time and space (memory) usage.
One thing we have not dealt with in previous labs is serializing various data structures (writing them to a file or standard output, and reading them back in). You will need to do this with your prefix code tree. The fact that you must read this data structure from a file and write to standard output may affect how you choose to represent it in your program. Keep in mind that the format for how to write it to a file is fixed, as described in the pre-lab section. You will need to re-create your Huffman coding tree from the first part of that file format.
For the pre-lab, you will implement, in huffmanenc.cpp, the Huffman encoding algorithm using a binary heap that you implement in heap.cpp/h.
The basic steps for compression are:
1. Read the input file and count how often each character occurs.
2. Insert a node for each character that occurs into a binary heap, keyed on frequency.
3. Build the Huffman coding tree by repeatedly removing the two lowest-frequency nodes from the heap and inserting a new internal node that combines them, until only the root remains.
4. Traverse the finished tree to determine the prefix code for each character.
5. Print the prefix codes, the encoded message, and the statistics in the file format described below.
Your program must take in a single command-line parameter, which is the name of the file whose contents will be encoded. We have some sample plain text and encoded text files (in the labs/lab10/examples/ directory) – a description of these files is in the in-lab section.
Your program should output to the standard output
(i.e. cout
) the exact file format described below, and
nothing else.
As part of the file format described below, your program will need to print out the compression ratio and the Huffman tree cost. The compression ratio is defined as: (size of the original file in bits) / (size of the compressed file in bits). You should exclude the size of the prefix code tree from the compression ratio, and assume that the 0’s and 1’s you generated for the compressed file count as one bit each (rather than as an 8-bit character each). The Huffman tree cost is the same as described in lecture.
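As a minimal sketch of those two calculations, here is hypothetical code using the numbers from the sample output later in this page (the variable names are illustrative only, and the tree cost is computed here as the average number of bits per encoded character, which matches that sample – double-check it against the definition from lecture):

#include <iomanip>
#include <iostream>
using namespace std;

int main() {
    // Illustrative values, taken from the example output shown later.
    int totalSymbols = 7;     // number of characters that were encoded
    int compressedBits = 13;  // number of 0s and 1s in the encoded message

    // Each character of the original file counts as 8 bits.
    int originalBits = 8 * totalSymbols;                    // 56
    double ratio = (double) originalBits / compressedBits;  // 4.30769...
    // Tree cost: average number of bits used per encoded character.
    double cost = (double) compressedBits / totalSymbols;   // 1.85714...

    cout << fixed << setprecision(5);
    cout << "This gives a compression ratio of " << ratio << "." << endl;
    cout << "The cost of the Huffman tree is " << cost
         << " bits per character." << endl;
    return 0;
}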
To read in the input file character by character, see the fileio.cpp (src) file.
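The provided file shows the intended approach; purely as an illustration, a minimal stand-alone sketch of reading a file character by character and tallying frequencies (the std::map here is just one possible choice of structure) could look like:

#include <fstream>
#include <iostream>
#include <map>
using namespace std;

int main(int argc, char** argv) {
    if (argc != 2) {
        cerr << "Usage: " << argv[0] << " <input file>" << endl;
        return 1;
    }
    ifstream in(argv[1]);
    if (!in.is_open()) {
        cerr << "Could not open " << argv[1] << endl;
        return 1;
    }
    map<char, int> frequencies;  // character -> number of occurrences
    char c;
    while (in.get(c))            // get() keeps whitespace, unlike >>
        frequencies[c]++;
    for (auto& entry : frequencies)
        cout << entry.first << " occurs " << entry.second << " time(s)" << endl;
    return 0;
}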
In an effort to simplify both the in-lab implementation and the
grading, there is a very strict file format for a Huffman encoded
message for this lab. This allows the two parts to be developed,
implemented, and tested separately. Although real Huffman encoding uses
bits, we will write these bits to a file using the characters
0
and 1
, as that will make it easier to check
and debug our code.
There are three sections to an encoded file, each separated by a specific separator.
The first section of the file lists the ASCII characters that are
encoded, each followed by its bit encoding. Only one such encoding per
line, with no blank lines allowed. The format for a line is the ASCII
character, a single space, and then the 1
and
0
characters that are the encoding for that character,
followed by a newline. The order of the lines does not matter. If the
character being written is the space character itself, you should write
the word space instead of an actual space character; thus a line for the
space character might look like: space 101010
. Keep in mind that the first
character on this type of line cannot be a newline, tab, or a
non-printable character. No line in this part can be more than 256
characters. You can safely assume that the data provided will be a valid
Huffman coding tree (i.e. there won’t be an internal node with one
child, etc.).
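As a tiny illustration of this line format, a hypothetical helper (the function name and the way the code is passed in are assumptions for this sketch) might print one such line as follows:

#include <iostream>
#include <string>
using namespace std;

// Print one line of the first section: the character, a single space,
// then its bit code.  The space character is written as the word "space".
void printCodeLine(char c, const string& code) {
    if (c == ' ')
        cout << "space " << code << endl;
    else
        cout << c << " " << code << endl;
}

int main() {
    printCodeLine('a', "0");       // prints: a 0
    printCodeLine(' ', "101010");  // prints: space 101010
    return 0;
}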
Following that is a separator line: a single line containing 40 dashes and no spaces.
The second section is the encoded message, using the characters
0
and 1
. You may optionally separate each
encoded character by any form of whitespace. For example, to help
debugging, you might want to separate each encoded letter by a space and
place each encoded word on its own line.
The next line is another separator: again a single line containing 40 dashes and no spaces.
The last section of the file reports several statistics about the compression. You should output the total number of symbols encoded, how many unique symbols are used, the number of bits in the original and compressed files, the compression ratio to five decimal places, and the cost of the Huffman tree to five decimal places. The output format should look exactly as shown below. This information is required, as we will be grading your submissions with these results. All floating point results must be reported to five decimal places.
The following is the Huffman file format for the example in the slide set that has the characters ‘a’, ‘b’, ‘c’, and ‘d’.
a 0
b 100
c 101
d 11
----------------------------------------
11 100 0 101 0 0 11
----------------------------------------
There are a total of 7 symbols that are encoded.
There are 4 distinct symbols used.
There were 56 bits in the original file.
There were 13 bits in the compressed file.
This gives a compression ratio of 4.30769.
The cost of the Huffman tree is 1.85714 bits per character.
The Huffman tree that this forms is the same as the one shown in the slide set.
Note that your encoding does not have to match these examples exactly – in particular, the specific bits that your program assigns to each character will depend on the implementation of your heap. So while the bit patterns may differ, the number of bits for each character should NOT differ from the examples given. For example, in the output above, if b were given the prefix code 101 and c were given 100 (i.e., b and c have their codes swapped), that is fine; the statistics in the third section of the output will be the same.
A few hints from experience with this lab in previous semesters:
Just copying the supplied code verbatim will not help you, as you need to understand what is going on – the code does not work “out of the box.” Either make sure you understand it, or write your own code.
It will be far easier to put HuffmanNode pointers in
your heap, rather than HuffmanNode objects (or whatever your Huffman
tree nodes are called). If you put the actual objects in, then you are
going to run into problems with scoping issues, inadvertent calls to
operator=()
, etc. Trust us, stick with the pointers.
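As a minimal sketch of what that looks like (the HuffmanNode fields and the comparison function here are assumptions – your own node class and heap interface may well differ):

// One possible shape for a Huffman tree node; adjust to your own design.
struct HuffmanNode {
    char symbol;       // meaningful only for leaf nodes
    int frequency;     // count for this symbol, or sum of the children
    HuffmanNode* left;
    HuffmanNode* right;
    HuffmanNode(char s, int f)
        : symbol(s), frequency(f), left(nullptr), right(nullptr) {}
};

// Inside the heap, compare the nodes that the pointers refer to, not the
// pointer values themselves.
bool lessThan(const HuffmanNode* a, const HuffmanNode* b) {
    return a->frequency < b->frequency;
}

With pointers, the heap only ever swaps small addresses around; the nodes themselves, and the subtrees hanging off of them, stay put in memory.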
You will need some secondary data structures in order to hold all the information necessary for compression. You’ll need some way of holding the frequencies of each character, as well as each character’s prefix codes – what would be efficient data structures to store and retrieve that information?
The best way to generate the prefix codes is recursively. Once you have built your Huffman tree, you can recurse down both sides of the tree, building up the bit string as you go, and record (or print) each character’s code when you reach a leaf.
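A sketch of that traversal, assuming a node shape like the one in the earlier hint and a std::map to hold the codes (both are illustrative choices, not requirements):

#include <map>
#include <string>
using namespace std;

struct HuffmanNode {
    char symbol;
    int frequency;
    HuffmanNode* left;
    HuffmanNode* right;
};

// Walk the tree, building up the bit string along the way, and record a
// code each time a leaf is reached.
void collectCodes(const HuffmanNode* node, const string& prefix,
                  map<char, string>& codes) {
    if (node == nullptr)
        return;
    if (node->left == nullptr && node->right == nullptr) {
        codes[node->symbol] = prefix;  // leaf: this prefix is the full code
        return;
    }
    collectCodes(node->left, prefix + "0", codes);   // left child adds a 0
    collectCodes(node->right, prefix + "1", codes);  // right child adds a 1
}

Calling collectCodes(root, "", codes) fills the map, which can then be used both to print the first section of the output file and to translate each character of the input into its bit code.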
For the in-lab, you will implement, in huffmandec.cpp, the Huffman decoding algorithm.
The basic steps for decompression are:
First, make sure that your code can read in encoded files – you can download the inlab-skeleton.cpp file, which can properly read in the encoded input files.
Next, create the Huffman coding tree from the prefix codes you read in from the input file. Not creating a Huffman tree from the file will result in zero credit for the in-lab. The whole point of this part is to create the tree!
Lastly, read in the second part of the file, traverse your Huffman tree, and output a character whenever you reach a leaf node. For this lab, you must only print out the decoded message and nothing else.
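A sketch of that last step, again assuming the illustrative node shape from the pre-lab hints:

#include <iostream>
#include <string>
using namespace std;

struct HuffmanNode {
    char symbol;
    HuffmanNode* left;
    HuffmanNode* right;
};

// Decode a string of '0'/'1' characters by walking the tree: go left on a
// 0, right on a 1, and emit a symbol every time a leaf is reached.
void decode(const HuffmanNode* root, const string& bits) {
    const HuffmanNode* current = root;
    for (char bit : bits) {
        if (bit != '0' && bit != '1')
            continue;  // the encoded section may contain whitespace; skip it
        current = (bit == '0') ? current->left : current->right;
        if (current->left == nullptr && current->right == nullptr) {
            cout << current->symbol;  // leaf: output the decoded character
            current = root;           // start over for the next code
        }
    }
}

Called with the root of your re-created tree and the 0/1 characters from the second section of the file, this prints exactly the decoded message and nothing else.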
We provide a number of sample files for you to test your code with. A
brief description of each is given here. The “normal” files are the
English input. The “encoded” files are the Huffman encoded files,
following the file format described above. Except where indicated, the
second section of each encoded file (the digits 0
and
1
) has a space inserted between each letter from the
original file, so that you can see which letter is encoded as which
bitcode.
One of the examples encodes the very short message dbacaad; its Huffman tree can be viewed here. (As noted above, not every encoded example has spaces between the 0s and 1s of its second section.)

Like the pre-lab, you’ll need some way to store the prefix codes for each character. Unlike the pre-lab, though, you don’t need to deal with frequencies anymore. The prefix codes alone are enough to re-create the Huffman tree, which you can then use to decode the input file.
As you are (recursively) creating each node in the tree, you know the
prefix code to get to that node (remember that following a left
child pointer generates a 0
, and following a right child
pointer generates a 1
). If that prefix is one of the listed
bitcodes in the first part of the file, then we know we are at a leaf
(remember that all characters in a Huffman tree are leaves). Otherwise,
we are dealing with an internal node and you will have to create left
and right child nodes, calling yourself recursively on each one. Keep in
mind that when you call yourself recursively on the left child, you have
to add 0
to the end of the prefix; likewise, you have to
add 1
to the prefix for the right child. This algorithm
will require you to search through the bit codes that were read in from
the first part of the file. Also keep in mind that the size of the input
here (the number of characters) is very small (only 80 or so) – which
means that if you choose a linear time complexity data structure
(vector, for example), your code will run just fine.
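Here is one possible sketch of that recursive construction. It assumes the bit codes from the first section of the file have already been read into a vector of (character, bit code) pairs, and that '\0' never appears as an encoded character; both are assumptions made only for this illustration.

#include <string>
#include <utility>
#include <vector>
using namespace std;

struct HuffmanNode {
    char symbol;
    HuffmanNode* left;
    HuffmanNode* right;
    HuffmanNode() : symbol('\0'), left(nullptr), right(nullptr) {}
};

// Return the character whose bit code equals 'prefix', or '\0' if no code
// matches (meaning this prefix belongs to an internal node).
char findSymbol(const vector<pair<char, string>>& codes, const string& prefix) {
    for (const auto& entry : codes)
        if (entry.second == prefix)
            return entry.first;
    return '\0';
}

// Build the subtree whose path from the root is 'prefix'.
HuffmanNode* buildTree(const vector<pair<char, string>>& codes,
                       const string& prefix) {
    HuffmanNode* node = new HuffmanNode();
    char symbol = findSymbol(codes, prefix);
    if (symbol != '\0') {
        node->symbol = symbol;  // the prefix is a complete code: a leaf
        return node;
    }
    // Otherwise this is an internal node: build both children, adding a 0
    // to the prefix for the left child and a 1 for the right child.
    node->left = buildTree(codes, prefix + "0");
    node->right = buildTree(codes, prefix + "1");
    return node;
}

buildTree(codes, "") then returns the root of the re-created tree. The linear search in findSymbol is fine here precisely because the number of bit codes is so small.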
The post-lab is cancelled for the spring 2021 semester.
There are two parts to this post-lab: the time and space
complexity analysis and submitting all your (working) code
again.
For the post-lab, we want you to think about the time and space
complexity analysis of your compression and decompression code. You
aren’t required to submit anything for this part, but you should think
carefully about the following points, as you may need to know them on
exams. See below for a discussion about the space/time
complexity.
Worst case running time – for this be sure to consider all steps
of the compression and decompression. You can leave off the cost of
calculating the compression ratio, printing the cost of the tree, and
printing a listing of the bit code for each character that was asked for
in the pre-lab. Refer to the list of steps given earlier in the
lab.
Space complexity – for this, you should calculate the number of
bytes that are used by each data structure in your implementation. The
easiest way to do this is to step through your code, just as you have
done for the worst case running time, and make a note each time you use
a new data structure. You do not need to take into account scalar
variables (loop counters and other singleton variables); instead, focus on the data
structures whose size depends on values such as the total number of
characters and the total number of unique characters, and use those
values in your answer.
Again, you aren’t required to submit anything for this section,
but you should use these points as a way to reflect on the work you have
done for the pre-lab and in-lab.
The purpose of this part of the post-lab is to clean up your
code from the pre-lab and in-lab, and submit all of it together. If your
pre-lab and in-lab code work properly, then there is no further clean-up
to do; however, you must still submit the files along with a
new Makefile.
When we run make
, the code should be compiled into
two executables: encoder
and decoder
, which
are the pre-lab and in-lab code bases, respectively. Unlike the pre-lab
and in-lab, you should NOT name your
executables a.out
! After compiling your code with
make
, we will test it as follows:
./encoder testfile.txt > encoded.txt
./decoder encoded.txt > output.txt
diff testfile.txt output.txt
This encodes a sample text file, then decodes it. Both the
original file (testfile.txt
) and the final file
(output.txt
) should be the same, which is what the
diff
command checks. If there are no differences between
the two files, then diff
will not print any
output.
Up until now, all of our Makefiles have generated a single
a.out
from the inputs given through OBJECTS
.
Now that we have two separate executables we need to create, you may
need some way to differentiate between the object files for each
one…
Remember the -o
flag for clang++? :)