Go up to the Labs table of contents page
The purpose of this lab is to gain experience using hash tables by building an interesting application.
A hash table is a dictionary in which keys are mapped to array positions by a hash function. Having more than one key map to the same position is called a collision. There are many ways to resolve collisions, but they may be divided into two primary strategies: separate chaining and open addressing.
Look over the first 6 sections (through and including ‘exit status’) of the Wikibooks article on Bash Shell Scripting. You will be writing a shell script during the in-lab, but you should probably start reading it before then. The remaining sections of that article will be the tutorial for the next lab, so feel free to read on, if you are interested.
while
loop: while (foo.good())
.-O2
clang++ optimization flagConsider the problem called word search, illustrated in the diagram below.
In this 10x10 grid of letters, the goal is to find words in the puzzle in any of the eight directions. The word circled in red is in the south-east direction. Words can go backward (although none do in this example), but they cannot “wrap around” from one side to the other (or from top to bottom, etc.).
A string of letters from the grid depends on four values:
For this lab you need to implement a solution to the word puzzle problem described above: given a dictionary and word search grid, your program should output all the words in the grid that are in the dictionary.
There are two main parts to this lab: creating a hash table and writing a word search solver. A high-level specification is provided for the word search solver and the hash table it must use; more details are available in the rest of the pre-lab section.
Your word search solver, implemented in wordPuzzle.cpp
,
should:
We are not as interested in how fast your program runs for the pre-lab; leave any optimizations for the post-lab.
As there may be a large number of words in the dictionary, you will have to store those words in a hash table to facilitate efficient lookup.
You must write your own hash table for this lab. We leave the implementation up to you, as long as you do not use a vector of vectors or any STL hash table.
getWordInGrid.cpp (src) provides two useful functions:
readInGrid()
and getWordInGrid()
. The former
reads in a grid file using C++ streams; the latter returns a word in a
2D grid of letters in a given direction.
You should make sure you understand how these functions work, as you will most likely be using them in your final program.
We are interested in timing how long it takes to find all the valid words in a grid, not including the time it took to read in the dictionary and grid files or perform any other sort of pre-processing. You can use the code provided in timer.cpp (src), timer.h (src), and timer_test.cpp (src) to time the relevant portion of your program. Although you don’t need to print out the timing for the pre and post-labs, it is still helpful to time your solver to help measure the effectiveness of any optimizations you may implement.
Input grids: Each input grid file has three lines.
The first line is the number of rows, and the second is the number of
columns, both representable using integers. The third line is the grid
data, represented by strictly alphabetical characters with no spaces
(i.e. it will contain rows * cols
characters). Example
grids are available in the labs/lab06/data/ directory.
Dictionary files: Dictionaries contain one word per line. The longest word in our data files is 22 letters. Words which contain non-alphabetical characters may occur in the dictionary, but would never appear in a valid grid. Your program should be able to handle dictionaries with such words, although you are not required to put them into the hash table.
Valid words: Your program should only report words with three or more letters – there are simply too many hits if one and two letter words are allowed, and it’s difficult to judge correctness, even on very small word searches.
Case sensitivity: Searches are case-sensitive. Thus, if the word in the dictionary is ‘Foo’, it should not register a match to the text ‘foo’ in the grid.
Duplicates: If a word occurs more than once in a grid, then each instance should be treated as a separate word.
Below is a sample execution run to show you the input and output
format we are looking for. The output shown is a result of running
./a.out words.txt 4x7.grid.txt
N (3, 2): text
E (0, 3): sod
N (2, 5): fad
E (1, 0): pax
NW(3, 6): eft
E (2, 0): ace
W (2, 4): tee
NE(2, 4): tat
SW(0, 6): tat
9 words found
The exact spacing and order that the words are found in does not matter, as we can easily (and automatically) ignore that when comparing your output to the desired output. Everything else should all be the same. The in-lab discusses how to directly compare your output to the expected output (while ignoring spaces and word order) - see the “Input and Output” section of the in-lab.
You should submit any files required for your word puzzle solver to
run as well as a Makefile that produces an a.out
executable.
Given the multi-faceted nature of this lab, it is helpful to only focus on one portion at the time. Start out by using the STL hashtable unordered_set to verify that your word search works, and then work on writing your own hash table implementation to replace it.
The last thing you want is to be staring at incorrect output and not knowing whether that’s because of your solver or your hashtable!
Take it one step at a time.
Start off by reading in the dictionary and printing out all of its words. Then store those words in a hashtable (probably the STL one for now). Great, you have a dictionary now.
Next, read in the grid (use the functions in
getWordInGrid.cpp
to help you!). Plan out how to use the
dictionary to find all the valid words in the grid. There are four
variables that control what word is returned – what does that imply in
terms of loops?
At this point, hopefully you have enough to get you going.
Creating a dynamic 2D array in C++ is more difficult than it should be – one solution is to create a vector of vectors, but that is not the most efficient means to do it. For this lab, you can just create a 500x500 static array, and can assume that you will not have your program run on larger input grids. This is not very elegant, but it will work until we go over dynamic array creation in lecture.
Your program must be able to run successfully if started with
./a.out <dictionary_file> <grid_file>
and no
input.
See the slide set on arrays if you need a refresher on command-line parameters; you can also see the cmdlineparams.cpp (src) file.
As discussed in lecture, a hash table needs to be a prime number in size in order to work. You can adapt the code in the primenumber.cpp (src) file to determine the next highest prime number (of course, the next highest prime number is determined after you double the size of your original hash table).
Your program will need to handle input dictionaries of various sizes and create hash tables of the appropriate size. To get the program working the first time, you can just hard code a prime number table size. But at some point, you will have to handle different size hash tables.
rehash()
function that
will double the size of the array, hash all the old values into the new
table, and then remove the old table.-O2
flagThere are many times when we want our executable program to run as fast as possible. An example of this is the current lab – as we are testing our programs for speed (and we want them to run fast), we want to tell the compiler to write an optimized executable. To do this, we specify the ‘-O2’ flag, short for “optimization level 2” (capital O). There are multiple optimization levels that clang++ provides, each with benefits and drawbacks, but we will be using ‘-O2’ for this lab. To compile a program with this optimization level, we would enter:
clang++ -O2 -Wall wordPuzzle.cpp timer.cpp hashTable.cpp
If we wanted to use the -c flag (such as through a Makefile), then the commands would be:
clang++ -O2 -c wordPuzzle.cpp
clang++ -O2 -c timer.cpp
clang++ -O2 -c hashTable.cpp
clang++ -O2 wordPuzzle.o timer.o hashTable.o
Note that on this last one, that the -O2 flag must be used on each of the individual compilations as well as the final link. Thus, when put into a Makefile, you should make your CXXFLAGS variable include ‘-O2’ so that it applies to all compilations.
We could tell the compiler to always use the -O2 flag, but there are some disadvantages to doing so. Debugging the program is much harder with an optimized executable – much of the debugging information is removed to improve the speed. The compiler also takes longer to compile the program when optimizing. Thus, programmers usually don’t use the -O2 flag until they know the program works and want to start focusing on performance.
Try running your program both with and without the -O2 flag. Do you notice a difference in the timed speed of the program? Our test program decreased in execution time from about 15 seconds to 12 seconds with the -O2 flag.
When we are running the programs, we will want to make sure that they are reporting the same results. To do so, we will need to save the output into a file, and then make sure that file matches the sample input file.
To redirect output to a file, we enter the following command (we’ve seen output redirection before – in section 3 of the first Unix tutorial – so this should be familiar material).
./a.out words2.txt 300x300.grid.txt > output.txt
Recall that this command will OVERWRITE output.txt (if it exists), and save all the output from the executable into that file. This program does not ask for input; if it did, we would not see the prompt, but would still have to provide the input.
Once the program is completed, we will need to see if that file is the same as the sample input. To do so, enter the following command. We’ll assume, for this example, that we are using the 300x300 grid, and words2.txt.
# Use `man diff` to learn more about the diff command
diff output.txt 300x300.words2.out.txt
This will examine the two files byte-by-byte, and look for
differences. If there are no differences, then diff
will
not print out ANY output. If there are differences, then they will be
printed to the screen. This means that if diff
prints
anything, then your program is NOT producing the same results as we are
expecting. All hope is not lost, though! If you look at the two files
(you can load them up both in Emacs), it may very well be that the
difference is in formatting – you use two spaces, we use one space, for
example.
Since we don’t care about spacing, you can provide the
-w
flag to diff
to have it ignore all
whitespace.
If the files are not sorted the same – meaning that the words listed
are the same, but in a different order – then diff will report
differences between the files. You can use the Unix sort
command to sort BOTH files before comparison to fix this:
# Pipe the result of your program execution to `sort`, and then redirect that to output.txt
./a.out words2.txt 300x300.grid.txt | sort > output.txt
sort 300x300.words2.out.txt > words2.sorted.out.txt
diff output.txt words2.sorted.out.txt
If there are a lot of differences, then you should run your program on a small grid with the smaller word file (words2.txt). Once you solve the differences with the smaller grid file and word file, most of the differences with the larger grid/word files should disappear.
Through the various programming languages we’ve seen (Java, C++, etc.), we can make a computer do many things. Yet there are still limitations to what we can do in these languages. How could you get a directory listing in Java, for example? Or easily invoke Unix commands from C++? While these are certainly possible, they are often difficult to accomplish.
Programmers often have a need to do these sorts of things. For example, a script that we wrote takes all the submissions made to gradescope, compiles and executes them, saves the execution output to various files and checks the results for correctness. This allows the compilation, execution, and grading of your submissions to be automated.
The shell is the command-line interface that you use in the terminal. There are many shells out there, some of the more popular being bash, zsh, and csh. Bash is the standard shell that the vast majority of shell scripts are written in, and is the shell we’ll be using for this tutorial.
The original Unix shell was written by Stephen Bourne and released in 1978. It was just called ‘sh’, which is short for ‘shell’. In 1987, Brian Fox updated the shell, and decided to name it ‘Bourne-again shell’, a play on the original author’s name. Hence the name of ‘bash’.
A shell script, then, is a series of commands for the shell
to execute. Shell scripts started off as just a way to have a series of
simple commands in one place. Over time, additional functionality was
added to these scripts, mostly the things we are used to in programming
languages, such as variables and control structures (for
and while
loops, if
and case
conditionals, etc.). Today, shell scripts are a very powerful
programming language that one can use to do many things.
Shell scripts are useful when one needs to call a large number of Unix commands (such as the aforementioned compilation/execution example). Significant computation, such as finding words in a word puzzle, or a lot of arithmetic, are not well suited to shell scripts.
Write a shell script averagetime.sh
that will prompt the
user for the dictionary and grid file names used by your word puzzle
executable, which must be called a.out
. It will then run
the program five times using those parameters. It will record the time
of each execution run, and, once the runs are completed, print out the
average run time. Note that you have not yet seen conditions
(if
or case
) or loops (for
or
while
) in shell scripts, so we do not expect your script to
have either of these – you should just have 5 separate commands without
a loop.
Modify your word puzzle program to print out the total time taken as
the last line of output. Floating point arithmetic in bash is quite
complicated, so try to use integers when possible – you may want to
multiply the result from Timer::getTime()
by 1000 to obtain
the number of milliseconds taken.
You can then capture that line by piping it through the tail
-1
command – the following line would run the program, only keep
the last line, and save that output to a variable.
RUNNING_TIME=`./a.out | tail -1`
The back quote `
tells bash to execute whatever is in
between the quotes rather than treating it as a string. It is usually
located right below the Escape key on a keyboard.
Armed with this, the rest of the required concepts for the shell script are in first six sections of the article.
As you are going through the tutorial, if there is a Unix command
that you do not know (or you once knew and have since forgotten), you
can find out more information about that command by entering man
command
at the Unix prompt. This brings up the manual for that
command, including all of the command-line parameters.
Your script should have comments (anything on a line after a ‘#’ is a comment).
Below are a few notes to keep in mind when writing your shell script. Bash is a very powerful language, but it can be rather finicky and unforgiving with syntax.
averagetime.sh
, and
should have #!/bin/bash
as the very first line of the
script
#!/bin/bash
, it will use the wrong
shell, and your program won’t work properly!./averagetime.sh
. If your computer denies you permission,
remind it who’s boss and put it in its place with chmod +x
averagetime.sh
. This tells your Unix system that averagetime.sh
is a program that can be executed (remember chmod?).Shell scripts don’t have debuggers, but we can use old-style printing
instead: you can always print out the value of the variable through the
echo
command, such as echo ${variable}
.
Bash’s arithmetic expansion $(( ... ))
performs integer
division, so as long as you are using integers in the expression itself,
you shouldn’t run into any issues.
You will need to optimize your code for the solution to the word puzzle. The auto-grader for this assignment will be running your code on larger dictionaries / word grids. In addition to this, your code will be given less time to execute.
To standardize timings, you should run all of your tests on words2.txt and 300x300.grid.txt.
Your coding task for the post-lab is to optimize your program. Try to get your solver running as fast as possible – 3 seconds is a good starting point, though you should still try to optimize it further regardless of how fast your program already runs.
First, record how long it takes to run your program with the
-O2
flag but no other optimizations. We will need this
value later.
There are plenty of optimizations you can try to implement, some of which may work better with your code than others. Here are some ideas:
Keep track of what you tried – if you try each of the collision resolution strategies, and find that the original one you used is faster, make note of your findings and try to figure out why that is.
There are two restrictions on what optimizations you may attempt. You may not:
-O2
-O2
provides a baseline for all the solutions, and your
program is not guaranteed to function the same when using higher
optimization levelsAdditionally, although we are trying to get you to think about how to make your program faster, we still want clean, correct, and readable code. Your solutions will be judged on correctness and style.
As you finish up the lab, consider the following questions:
You do not need to submit anything for these questions, but they are a good way to look back on the assignment and test your knowledge of hashtables and runtime analysis.