A Hash Table using C#

Brian Barto recently put up a post describing how hash tables are implemented. Many modern programming languages have this data structure as a built-in feature, making it a commonly used item in a developer’s toolkit. Because it is so readily available, few programmers ever need to roll their own implementation. So it comes as no surprise that people without a formal education in software engineering often do not understand how it works. Brian’s post explains the most essential parts of this magical data structure very well.

After reading about it, I felt the urge to implement the same thing in C# as a learning experience. His post carries an implementation in the C programming language, which could be ported over with some tweaking to work with the different semantics of C#. This post documents the exercise and what I learned from it.

The Associative Array

A hash table is an implementation of an associative array, a collection of key-value pairs in which each key is unique and can appear only once. The data structure supports the full range of CRUD operations – create, retrieve, update and delete. All operations run in constant time on average and linear time in the worst case, making speed the main advantage of this data structure. In some languages, create and update are rolled into the same assignment operator for simplicity: if the key does not exist, it is added; if the key already exists in the collection, its value is overwritten with the new one. Keys are removed either by setting their value to null or through a separate delete operator.

JavaScript objects are an example of associative arrays.

var o = {};
o["name"] = "Jane Doe"; // Key created
o["occupation"] = "Wizard"; // Another key created
o["occupation"] = "Sorcerer"; // Key updated
console.log(o["name"]); // Value retrieval
delete o["occupation"]; // Jane just retired
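
C# offers the same semantics through its built-in Dictionary<TKey, TValue> class, with the Remove method taking the place of JavaScript’s delete operator.

var o = new Dictionary<string, string>();
o["occupation"] = "Wizard"; // Key created
o["name"] = "Jane Doe"; // Another key created
Console.WriteLine(o["name"]); // Value retrieval
o.Remove("occupation"); // Key deleted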

A hash table uses a linear array internally to store values. The key is converted into an integer by running it through a hash function, which is then used as the index into the backing array where the value is stored. A good hash function has two characteristics – minimal conflicts and uniform distribution.

Minimal Conflicts

As explained above, a hash function converts an input into an integer. A modular hash function is one of the simplest possible implementations: it returns the remainder (i.e. the modulo) of dividing the key by the table length.

int size = 100;
int key = 10;
int index = key % size; // 10

However, this function can end up with a high number of conflicts. The remainders of dividing 10, 110, 210 and so on by 100 are all 10, and each conflict requires additional steps to resolve. This condition is minimised by setting the length of the array to a prime number.

int size = 97;
int key = 10;
int index = key % size; // 10
key = 110;
index = key % size; // 13
key = 210;
index = key % size; // 16

Uniform Distribution

Discrete uniform distribution is a statistical term for a condition where all values in a finite set have an equal probability of being observed. A uniform distribution of hash values over the indexes of an array reduces the probability of key conflicts, and with it the cost of frequent conflict resolution. Achieving uniform distribution over an arbitrary set of keys is a difficult mathematical challenge, and I won’t even pretend to understand what I’m talking about. Let it suffice to say that a lot of really smart folks have already worked on that and we get to build upon their achievements.

If the keys in a hash table belong to a known and finite set (such as a language dictionary), then the developer can write what is known as a perfect hash function, which provides an even distribution of hashes with zero conflicts.
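
As a contrived sketch of my own, consider the fixed key set { "cat", "dog", "fox", "owl" }. Taking the first character modulo 5 happens to land every key on a distinct index in a five-slot table, making it a perfect hash for this particular set – and useless for any other.

// 'c' % 5 = 4, 'd' % 5 = 0, 'f' % 5 = 2, 'o' % 5 = 1 – no conflicts.
static int PerfectHash(string key)
{
    return key[0] % 5;
}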

Resolving Conflicts

Even a good general purpose hash function can result in conflicts, i.e. return the same output for two separate inputs. In this case, the hash table must resolve the conflict. For example, with a table of length 97, the modular hash function returns the index 13 for the key 13 as well as for the key 110. There must be a way to store both these keys in the same collection at the same time. Broadly, there are two techniques to achieve this.

Separate Chaining

In this technique, the values themselves are not stored in the array. They are stored in a separate data structure such as a linked list, and a pointer or reference to that list is stored in the backing array. When a value is to be located, the function first computes its index by hashing the key, then walks through the list associated with that index until it finds the desired key-value pair.
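
A minimal sketch of this layout in C# might look like the following (the names here are illustrative, not from Brian’s post):

// Each bucket stores the head of a chain of entries that hashed
// to the same index.
class Node
{
    internal int key;
    internal int value;
    internal Node next;
}

private Node[] _buckets = new Node[97];

private int? Find(int key)
{
    // Hash to a bucket, then walk its chain to locate the key.
    for (Node node = _buckets[key % 97]; node != null; node = node.next)
    {
        if (node.key == key)
        {
            return node.value;
        }
    }
    return null;
}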

Open Addressing

Open addressing stores all records in the array itself. When a new value is to be inserted, its index position is computed by applying the hash function to the key. If the desired position is occupied, the remaining positions are probed. There are several probing algorithms; linear probing, the simplest of them, walks the array in fixed increments from the desired index until an empty slot is found.

Implementation

Enough theory. Let us move on to implementation details.

The first step is to define a HashTable class. The size of the hash table is restricted to 100 entries for this exercise and a fixed-length array is initialised to store these entries.

public class HashTable
{
    private static readonly int SIZE = 100;
    private object[] _entries = new object[SIZE];
}

This is where the implementation of this data structure veers away from the one shown in Brian Barto’s post. The original post used explicitly named methods for the insert, search and delete operations. However, C# allows the nicer and more intuitive square-bracket indexer syntax to perform these same operations.

public class HashTable
{
    ...

    public int this[int key]
    {
        get { ... }
        set { ... }
    }
}

The client program uses the following syntax to consume this interface, which is much more concise than having separate methods for each operation.

HashTable map = new HashTable();
map[key] = value; // Insert
int? value = map[key]; // Retrieve
map[key] = null; // Delete

The HashTable class internally uses a struct called Entry to store individual key-value pairs.

struct Entry
{
    internal int key;
    internal int? value;
}

The type of the value field in the Entry struct must be set to a nullable int so that the client can set it to null when the value has to be cleared. The declaration of the _entries field is changed as follows.

private Entry?[] _entries = new Entry?[HashTable.SIZE];

Finally, we come to the meat and potatoes of this exercise – the accessor & mutator functions, whose signatures are also modified to support the nullable type.

public int? this[int key]
{
    get
    {
        ...
    }

    set
    {
        ...
    }
}

The code for these methods and the hash function is more or less identical to the one described in Brian’s post, with some tweaks to accommodate the nullable value.

get
{
    int index = _hash(key);

    // Probe from the hashed index until the key is found or an
    // empty slot ends the search.
    while (null != _entries[index])
    {
        if (key == _entries[index]?.key)
        {
            break;
        }
        index = (index + 1) % HashTable.SIZE;
    }
    return _entries[index]?.value;
}

set
{
    if (null == value)
    {
        Delete(key);
    }
    else
    {
        int index = _hash(key);

        // Probe for the key, or for the first empty slot if it is new.
        while (null != _entries[index])
        {
            if (key == _entries[index]?.key)
            {
                break;
            }
            index = (index + 1) % HashTable.SIZE;
        }

        _entries[index] = new Entry { key = key, value = value };
    }
}

One additional member in this class is the private Delete method, which removes an entry from the table. It is called by the mutator when the client passes in a null value.

private void Delete(int key)
{
    int index = _hash(key);

    // Probe until the key is found or an empty slot ends the search.
    while (null != _entries[index])
    {
        if (key == _entries[index]?.key)
        {
            break;
        }
        index = (index + 1) % HashTable.SIZE;
    }

    // The key was never present in the table.
    if (null == _entries[index])
    {
        return;
    }

    _entries[index] = null;
}
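
One member not listed above is the _hash function itself. A minimal modular implementation consistent with the earlier discussion could look like this (my sketch – Brian’s exact function may differ):

private int _hash(int key)
{
    // Modular hashing, as discussed earlier. Math.Abs guards against
    // negative keys producing a negative array index.
    return Math.Abs(key) % HashTable.SIZE;
}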

That’s all that’s needed to implement this data structure. The hash function shown in this example is a very naïve implementation. The probability of collisions is very high and the probing function is very slow. The table also cannot work with any data type other than int for both keys and their values. And deleting an entry by simply nulling its slot can break the probe chain for other keys that collided at the same index, which real implementations solve with tombstone markers.

In spite of these limitations, this still made for a very useful learning exercise.

Reading Time on a Binary Clock

Binary clocks are probably one of the epitomes of geek cred. Everybody can read an analog or digital clock that represents numbers in base 10. But it takes a geek to read the time from a gadget that uses an obscure and cryptic number system. Call it the hipsterdom of technology.

Understanding Number Systems

Modern number systems are remarkably similar to each other conceptually, differing mainly in the scenarios they are applied to. The decimal system is in common everyday use all over the world, and many fundamental concepts that carry forward into other systems were refined using base 10 numerals. The most essential of these are unique digit symbols and positional notation.

Unique Digits

Numbers are a strange beast in that they have no end. The most primitive counting systems used scratches in the dust or pebbles to keep count. It became easier to represent larger values with the advent of separate symbols to identify different numbers. Roman numerals had special symbols for many numbers such as 5 (V), 10 (X) and 50 (L). While this made representation of larger values more compact, it still wasn’t perfect. It took the Indians, and later the Arabs, to finally come up with an extensible, yet concise number system that could represent any imaginable value.

Since it is impossible to have a unique representation for every number when numbers are essentially infinite, the Hindu-Arabic numeral system instead has 10 unique symbols in base 10 to represent the digits from 0 to 9. By applying positional notation, all possible numerals can be represented using just these 10 symbols. Numbers greater than 9 are represented by stringing digits together. Each digit has a greater magnitude than the one to its right, and the value of the numeral is the sum of its digits multiplied by their magnitudes.

Positional Notation

The magnitude itself is a power of the base. In the decimal system, the base is 10. Hence, the magnitude is 10 raised to a power that increases by one with every leftward shift. The rightmost digit is multiplied by the 0th power of 10 and represents the ones position. The position to its immediate left is multiplied by 10 raised to 1, the next by 10 raised to 2, and so on.

Let us take the numeral 256 to illustrate this.

256 = 2 × 10² + 5 × 10¹ + 6 × 10⁰
    = 2 × 100 + 5 × 10 + 6 × 1
    = 200 + 50 + 6
    = 256

Binary uses the same concept to represent values. The only differences are that a digit rolls over after 1, since binary has only two digits – 0 and 1 – and that each position is multiplied by a power of 2 instead of a power of 10. Binary is also more verbose than the decimal number system; even a relatively small value like 25 requires five digits – 11001.

11001 = 1 × 2⁴ + 1 × 2³ + 0 × 2² + 0 × 2¹ + 1 × 2⁰
      = 1 × 16 + 1 × 8 + 0 × 4 + 0 × 2 + 1 × 1
      = 16 + 8 + 0 + 0 + 1
      = 25

In this sense, number systems are identical to odometers. When the rightmost digit reaches its maximum value, it rolls back to zero and the digit to its immediate left increases by one. If the second column is also at its largest digit, then it too resets to zero and the increment is applied to the third column.

Being able to read binary representations is obviously an essential requirement to read the time in a binary clock.

Structure of a Binary Clock Face

There are two possible types of binary clocks – ones that use binary-coded decimal (BCD) notation, and true binary clocks.

A clock face that uses binary-coded decimal notation

A BCD clock face is divided into six columns – two for each component of the time. Each column contains up to four rows, one for each power of two. The leftmost two columns represent the hour, the middle two the minutes, and the last two the seconds. Each column represents a single base 10 digit of the time. For example, if the first column reads 1 and the second reads 0, the clock is showing the 10th hour of the day, or 10 am. Similarly, if the third and fourth columns read 3 and 2 respectively, the clock is in the 32nd minute of the current hour.
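
A rough sketch in C# (my own illustration, not any clock’s actual firmware) shows how the six digits are derived from the current time and rendered as four-bit columns:

var now = DateTime.Now;
int[] digits =
{
    now.Hour / 10, now.Hour % 10, // hour columns
    now.Minute / 10, now.Minute % 10, // minute columns
    now.Second / 10, now.Second % 10, // second columns
};

// Bit i of each digit drives the lamp in the row for 2^i.
foreach (int digit in digits)
{
    Console.WriteLine(Convert.ToString(digit, 2).PadLeft(4, '0'));
}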

A clock face that represents time components in pure binary notation

True binary clocks represent each time component in a single column. Such clocks require only three columns, and up to six rows in each column to adequately cover all required values. Each column represents the absolute value of the component in binary encoding. For example, 0b001010 in the hour column represents 10 (1 × 2³ + 1 × 2¹). 0b100000 in the minutes column represents the 32nd minute of the hour. Together, the two values indicate the time as 10:32 am.
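
The equivalent sketch for a true binary face needs only three columns of six bits each:

var now = DateTime.Now;
foreach (int component in new[] { now.Hour, now.Minute, now.Second })
{
    // Six bits are enough to cover every value up to 59.
    Console.WriteLine(Convert.ToString(component, 2).PadLeft(6, '0'));
}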

Even if you cannot convert between binary and decimal quickly, only a few distinct values are needed to display the time in binary. Most people can easily memorize the light sequences and the values they represent after a few days of practice.

Programming Beyond 9 to 5

In my previous post, I elaborated on the importance of reading for software developers. Keeping up with new advances in technology demands that its users have a healthy appetite for the written word. However, reading is just part of the equation for engineers. Knowledge gleaned through books offers no benefit unless it can be applied efficiently in their daily work. And quality programming, like every other art form, requires continuous practice.

A person who is uninterested in programming beyond its ability to bring in a monthly paycheck would typically not want to spend any more time than absolutely necessary on the activity. I don’t always hold side projects very high up in the evaluation matrix, but they are a pretty good indicator of a candidate’s interest in programming beyond the mundane stuff. After all, most programmers only get paid to write business software. If a stack implementation already ships as part of their framework, their employers will not be pleased to see them rolling their own – not only is it a very inefficient use of time, but a hand-rolled implementation will be nowhere near as performant or bug-free as the libraries that ship with mass-consumed frameworks.

However, side projects do not come under strong timeline or performance pressures, nor are there constraints about the runtime environment or language choices. Who is to object if an individual feels inspired enough to write a stack in Brainfuck?

Educators have long accepted the importance of learning by doing, especially for scientific topics. There is only so much one can absorb about the stack data structure by reading about it. At some point, one must put down the book and let the rubber meet the road, so to speak. Side projects in languages or frameworks other than those used at work are a great way to learn new things, and a natural progression from reading about something new. Writing one’s own implementation of the Towers of Hanoi game cements whatever knowledge the book provides about the data structure. And depending upon the type of project, individuals can play to their strengths. For those who are visually gifted or otherwise have access to a collection of high quality graphics, a finely polished interactive utility or game can make a great impression. For those with a less-than-perfect eye for graphic design, a programming API with accompanying documentation can make an equally compelling case for their skills.

Again, there are many avenues for programmers who want ideas on programming projects. All that is required is the will and tenacity (not to mention the skill itself to begin with) to come up with a project idea and take it through to completion.

On Reading for Programmers

The growing maturity of the software industry over the past 30 years has resulted in an increasing number of success stories. Technology and tools have matured to a point where typical business software projects have fewer and fewer technical reasons to fail. Enterprise frameworks, databases, communication toolkits, platform abstraction layers and drivers for commodity hardware are rock solid. And yet, software projects are often delayed, over budget, poorly built, or all of the above. In the worst cases, projects turn into such massive train wrecks that they are cancelled entirely, canned and brushed under the carpet as if they never happened.

In many instances, a failed project can be squarely blamed on poor management practices. Impossible schedules, inferior equipment, inadequate manpower and half-baked requirements are all major factors in failed projects. But an often overlooked and more insidious problem is poorly trained and unmotivated developers. Hiring people who have minimal interest in the craft makes them liabilities to the rest of the team striving to achieve quality work, and can seriously jeopardize the chances of project success.

For hiring decision makers, if the quality of candidates comes above quantity (as it should, unless you are being pressured to fill some massive, VC-backed head-count void), evaluating their technical prowess is a must-do activity during the interview. While practical tests definitely should be on the list, I find reading habits and side projects to be very important positive signs to look out for.

Passionate developers are often voracious readers, especially on the topics of software, programming techniques, languages and technology. We are living in a world where information is literally at our fingertips. More material is written and published than ever before. Many writers give away their work for free online. Heck, there are even lists of recommended books and essays for new programmers who don’t know where to begin. Other than a lack of motivation, there is no reason at all for a person not to read at least one substantial book every few months.

The quality of the material is important too. A half-baked, ‘syllabus-oriented’ book read two days before final examinations does not count for anything. Passing a platform-specific certification program might indicate a programmer’s skill on that particular platform. But since most certification programs are promoted by vendors with a self-serving interest in increasing the popularity of their platform, they do not say much about a programmer’s ability to adapt to new technologies.

Personally, I find the structure of a published book much more conducive to learning something new, especially since it has been evaluated and revised for structure and content by a team of editors. This usually applies only to in-depth study of large frameworks, however. I don’t think I could be bothered to read much more than a few paragraphs occasionally on novelty topics that don’t relate to my daily work, such as Arduino or low-level system programming. But everyone obviously has their own preferences and styles of learning.

Unfortunately, there are too many developers who simply seek to be part of the software industry for its perceived low barrier to entry and promise of easy success. For these kinds, no amount of coercion can get them to pick up a book and plow through it unless its contents carry a critical mark on their annual appraisal. Tom DeMarco’s statement from so many years ago still rings true –

“The average software developer, for example, doesn’t own a single book on the subject of his or her work, and hasn’t ever read one. That fact is horrifying for anyone concerned about the quality of work in the field; for folks like us who write books, it is positively tragic.”