Lab 2 - Information Theory

This lab is currently under construction. The entire writeup may change before lab on Jan 20.

Labs are intended to be completed during lab time. Please do not begin this lab before your scheduled lab time.

It is expected that you will work with a group of 3 students in lab. Our lab TAs will facilitate introductions and assist in lab. You should not use Generative AI on lab; please work with the other people in your group!

Introduction

On the second day of class, we briefly discussed Claude Shannon and how he introduced the term bit1. As we have used it in class, a bit represents a logical state; i.e., one of two possible values: true/false, on/off, 1/0. We have also just begun a discussion of how to encode other values, such as numbers, characters, strings, and even objects, in a sequence of bits (i.e., multiple wires bundled together in some order, each of which has two states: 1 or 0).

Shannon also founded the field of information theory. A core fact in information theory is that there is a basic unit of information, here also called a “bit” or a “bit of entropy.” In this case, a bit is the “information entropy of a random binary variable that is 0 or 1 with equal probability”2. Roughly speaking, in information theory, a “bit” is an amount of information that is about as surprising as the result of a single coin flip. In the sentence “Hello, my name is Claude” the word “is” has low entropy; most of the time someone says “Hello, my name” the next word they say is “is,” so adding that word provides very little more information. On the other hand, the word “Claude” in that same sentence has many bits of entropy; a huge number of words could reasonably follow “Hello, my name is” and any given one we pick is quite surprising.

In computers, we rarely encode things anywhere near as efficiently as their bits of entropy suggest. For example, most common encodings of strings use 8 bits per character (which we’ll see in more detail in a future lab). In this lab, you’ll replicate an experiment Claude Shannon published in 19503 to see just how inefficient that encoding is.
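To make the idea of “bits of entropy” concrete: Shannon entropy (a standard formula, not stated in this lab) measures the average surprise of a random outcome as \(-\sum_i p_i \log_2 p_i\). A short sketch, with function and variable names of our choosing:

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin flip carries exactly 1 bit of information.
print(entropy_bits([0.5, 0.5]))    # 1.0

# A nearly certain outcome (like "is" after "Hello, my name") carries far less.
print(entropy_bits([0.99, 0.01]))  # about 0.08 bits
```

This is why predictable words like “is” add little information while surprising words like “Claude” add a lot.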

Overview

For this lab, you and your group will write a program in either Python or Java; choose a language that you are all familiar with. Then, you’ll use that program to perform an experiment and reflect on the results.

You should work in teams of 3-4 students in this lab. Everyone in the group should work side-by-side, each creating similar programs while talking together to help one another out. In well-running groups, each member is speaking about equally, describing what they are writing next or how they are testing what they have written.

You may want to look for a team whose members intend to write in the same language. If your team is split on languages, consider forming new teams with other members who will share a programming language.

Program Description

Your program should do the following:

  1. Read a text file into a string in memory. You should be able to specify different file names each time you run the program.
  2. Repeatedly
    1. Pick a random index in the string, at least 51 characters after the beginning of the string (your group can decide the best way to do this).
    2. Display to the user the 50 characters preceding that index (in such a way that they can tell if what you displayed ended in a space character or not) but not the character at that index.
    3. Have the user type a single character.
    4. Record if that typing was correct (i.e., if the character typed was the next character in the string).
  3. After some fixed number of iterations (20 might make sense), display
    • The ratio of correct guesses (e.g., “You got 14 out of 20 guesses correct!”)
    • The estimated bits of entropy per letter of the text, which is calculated as:
      \(\log_2(\frac{g}{r})\) where \(g\) is the total number of guesses made and \(r\) is the number that were correct (e.g., 0.5145731728297582 for 14 of 20 correct).
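As a sanity check on the formula above, the example value can be computed directly in Python:

```python
import math

g, r = 20, 14                 # total guesses and number correct
estimate = math.log2(g / r)   # estimated bits of entropy per letter
print(estimate)               # ~0.5146, matching the example above
```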

Note: Your code should be able to run and complete the steps above. It does not need to account for edge case problems, such as running the code without the file to read or reading a file that is too short.
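The steps above can be sketched as follows, assuming your group chose Python (the Java version follows the same outline). The function names, prompt wording, and bracket-based display of the context are our choices, not requirements:

```python
# A minimal sketch of the experiment. Names and structure are illustrative
# choices; your group should design its own version.
import math
import random
import sys

def run_experiment(text, rounds=20, ask=input):
    """Play the guessing game over `text`; return (correct, rounds)."""
    correct = 0
    for _ in range(rounds):
        index = random.randrange(51, len(text))   # at least 51 chars after the start
        context = text[index - 50:index]
        print(f"...[{context}]")                  # brackets reveal trailing spaces
        if ask("Next character? ") == text[index]:
            correct += 1
    return correct, rounds

def report(correct, rounds):
    print(f"You got {correct} out of {rounds} guesses correct!")
    if correct > 0:
        print(f"Estimated entropy: {math.log2(rounds / correct)} bits per letter")

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        report(*run_experiment(f.read()))
```

Passing the user-input function as a parameter (`ask`) is one way to keep the game logic easy to test; a simpler version that calls `input` directly is equally acceptable.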

Programming on the Portal

You should write (and run) your program on the CS portal. This way, you’ll gain experience programming in the portal environment using the tools we discussed during last week’s lab along with the help of your teammates and lab TAs. In homework 1, we’ll be asking you to program on the portal individually, so this is great practice!

Editing your code

Edit your code with either vim, nano, or your text editor of choice. You should NOT use VSCode. If a TA sees that you’re using VSCode, they’ll ask you to switch so that you gain experience editing on the portal.

For Python, we encourage writing your code in a file named entropy.py; for Java, we encourage writing a class named Entropy in a file named Entropy.java. (For Java, the filename (ending in .java) must match the name of the class you’re writing exactly.)

Running your code

To run your code, you’ll either need to exit your editor or open a second terminal window on your computer. If you use a second terminal window, make sure that both terminals are connected to the portal (using SSH) and that they are in the same directory on the portal.

Python

If you choose to write your code in Python, then you can run it with the python3 interpreter as follows:

python3 entropy.py filename_to_read

Note: pay attention to any errors python3 prints when running your code; you will need to fix them before your code works correctly.

Java

Before compiling and running Java on the CS portal, you’ll need to load the module that supports it. You’ll only need to do this step once each time you log into the portal:

module load java/23

For Java, you’ll need to compile your code before running it. Most IDEs (including IntelliJ) hide this step from you, but you’ll need to do it directly in the portal. If you’ve written your code in the Entropy class defined in the file Entropy.java, you should compile and run it as follows:

javac Entropy.java
java Entropy filename_to_read

The first line will compile your Java file into bytecode as a file named Entropy.class; the second line will then have Java run the main method from the Entropy class you defined.

Note: pay attention to any errors javac prints when compiling your code; you will need to fix them before your code works correctly.

Generative AI and Lab 2

For this lab, you are welcome and encouraged to use Generative AI (such as Microsoft Copilot, provided by UVA) for general advice on how to get started (after discussing strategies with your group members first), for help debugging any error messages you get when running your program, and for pointers when you get stuck (but only after trying to get un-stuck as a group first). That is, Generative AI should be your last resort after discussing with your group. Note any AI use to the TA when checking off.

Generative AI should NOT write your solution for you. You have a team of people to bounce ideas off and discuss strategies for creating this program; please discuss strategies in your group. For this lab, your group is allowed to share code with each other, but we expect everyone to contribute and understand how the code works for full credit.

What is the entropy of…

Once your program seems to be working, try it on a few different texts. We’ve loaded them into the portal already; passing the full paths below will provide your program with access to the file (without needing to download it). For example, you might try

  • /p/cso1/labs/entropy/tarzan.txt (500KB) – the original Tarzan book by Edgar Rice Burroughs.
  • /p/cso1/labs/entropy/pi.txt (1MB) – the first million digits of pi.
  • /p/cso1/labs/entropy/_pydecimal.py (229KB) – a large file from the Python standard library.
  • /p/cso1/labs/entropy/diff_match_patch.java (89KB) – a large file from the open source Java project diff-match-patch.

Group Check-off Acceptance Criteria

In order to receive credit for this lab, each member of your group should show your lab TA:

  1. Your working code in a text editor on the portal
  2. (Compiling and) Running your code on one of the examples above

Additionally, you should be able to answer any questions related to the lab activity posed by the TA to receive full credit.

  1. a portmanteau of “binary digit” 

  2. from Wikipedia, Bit 

  3. Shannon, Claude E. (1950), “Prediction and Entropy of Printed English”, Bell System Technical Journal (3) pp. 50–64. 


Copyright © 2025 John Hott, portions Luther Tychonievich.
Released under the CC-BY-NC-SA 4.0 license.
