Components of Entity Framework

This post is part of a series on learning how to use Entity Framework. The rest of the posts in the series are linked below.

Basics of Entity Framework

Code First

Database First


Like most well-designed APIs, Entity Framework is composed of several discrete components which complement each other. There are three high-level components – the Entity Data Model, the Querying API and the Persistence API.

Collectively, these components allow application programmers to operate upon classes which are specific to their business domain rather than getting bogged down in the low-level details of ADO.NET and SQL.

Entity Data Model

Conceptual Model

This layer is built by the application programmer. It contains class representations of the domain model for the business requirements that the application aims to fulfil. The programmer identifies the objects which make up the given domain, then implements them as classes. The classes have public properties that correspond to attributes of the domain object and are mapped to database columns in the storage model. At runtime, these classes are instantiated as CLR objects and their properties are populated with values from the corresponding database columns.
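As an illustration, a conceptual model for a hypothetical order-tracking domain might contain classes like the ones below; the names and properties are invented for this example.

public class Customer
{
    public int CustomerId { get; set; } // Mapped to the primary key by convention
    public string Name { get; set; }
    public string Email { get; set; }

    // An object reference which the mappings translate into a
    // relationship between the Customer and Order tables.
    public ICollection<Order> Orders { get; set; }
}

public class Order
{
    public int OrderId { get; set; }
    public DateTime PlacedOn { get; set; }
    public int CustomerId { get; set; } // Foreign key back to Customer
}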

Storage Model

The database and its various elements – the tables, views, stored procedures, indexes and keys – make up this layer. This is a relational database such as SQL Server or PostgreSQL, although future versions of the Entity Framework are said to support NoSQL databases.

Mappings

Object-oriented programs can be mapped to relational schemas quite closely in most standard scenarios. A class can have properties that correspond to the columns of a table, using equivalent data types to represent them in memory at runtime. Object references are represented as relationships between two tables, with all the accompanying aspects such as defining foreign keys between them.

Data constraints and default values can also be declared in this layer by using the appropriate configuration APIs. Any mismatch between the objects, tables and mappings results in a runtime exception which the programmer can handle.
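As a sketch of how such configuration might look with the Code First Fluent API (the class and table names carry over from the hypothetical example above):

public class ShopContext : DbContext
{
    public DbSet<Customer> Customers { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Map the Customer class to a specific table.
        modelBuilder.Entity<Customer>().ToTable("Customers");

        // Declare constraints upon the Name property.
        modelBuilder.Entity<Customer>()
            .Property(c => c.Name)
            .IsRequired()
            .HasMaxLength(100);
    }
}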

Querying API

There are two APIs that programmers can use to interact with the database from Entity Framework.

LINQ to Entities

This is a query language that retrieves data from the storage model, and with the aid of mappings, converts it into object instances from the conceptual model.
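For instance, a LINQ to Entities query against the hypothetical ShopContext from earlier might look like this; the provider translates the expression into SQL and materialises the results as Customer instances.

using (var context = new ShopContext())
{
    // Translated by the provider into a parameterised SQL query.
    var customers = context.Customers
        .Where(c => c.Name.StartsWith("J"))
        .OrderBy(c => c.Name)
        .ToList();
}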

Entity SQL

This is a dialect of SQL which operates upon conceptual models instead of relational tables. It is independent of the underlying SQL engine, and as a result, can be used unchanged between various database engines.
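A rough sketch of an Entity SQL query, assuming an entity container named ShopEntities (the connection string and entity set names are illustrative):

using (var connection = new EntityConnection("name=ShopEntities"))
{
    connection.Open();

    // The query addresses the conceptual model, not the physical tables.
    var command = new EntityCommand(
        "SELECT VALUE c FROM ShopEntities.Customers AS c WHERE c.CustomerId > @id",
        connection);
    command.Parameters.AddWithValue("id", 10);

    // EntityCommand requires the SequentialAccess behaviour.
    using (var reader = command.ExecuteReader(CommandBehavior.SequentialAccess))
    {
        while (reader.Read())
        {
            Console.WriteLine(reader["Name"]);
        }
    }
}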

Both query languages operate through the Entity Client data provider, which manages connections, translates queries into data source-specific SQL syntax, and returns a data reader whose records are converted into entity instances. Because it is an abstraction over the connection, Entity Client can also be used to bypass the Entity Framework abstraction entirely and interact with the database directly through ADO.NET APIs.

Persistence API

Object Service

The programmer interacts with the object service to access information from the database. It is the primary actor in the process of converting the records retrieved from the data provider into an entity object instance.

Entity Client Data Provider

The Entity Client Data Provider works in the other direction, translating queries written in LINQ to Entities or Entity SQL into SQL queries that the database understands. This layer interacts with the ADO.NET data provider to fetch data from or send data to the database.

ADO.NET Data Provider

This is the standard data access framework from Microsoft. Entity Framework is an entity-level abstraction over the APIs provided by this library.

Introduction to ORM & Entity Framework

This post is part of a series on learning how to use Entity Framework. The rest of the posts in the series are linked below.

Basics of Entity Framework

Code First

Database First


Having application programmers design, implement and maintain databases has often been a challenge. Maintainability and performance both take a hit, resulting in less than stellar applications with poor extensibility.

Object Relational Mapping (ORM) libraries have existed for a long time to paper over gaps in SQL proficiency and database design. However, they too have been cited as a major performance bottleneck, because they abstract over the underlying relational model and often generate bloated, inefficient queries.

Newer generation ORM tools have solved performance and design problems to a large extent, resulting in frameworks which developers can use with confidence even for large-scale and high-availability applications.

What is Entity Framework?

Entity Framework is Microsoft’s offering in this space for the .NET framework. It is an open source ORM library built upon the traditional ADO.NET framework, and can therefore be made to work with almost any relational database that has bindings for .NET.

Entity Framework frees the programmer from having to translate information between .NET object instances and the DataSet objects which ADO.NET uses to retrieve, insert and update records in the data store.
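To make the difference concrete, here is a rough sketch contrasting a raw ADO.NET fetch with its Entity Framework equivalent; the table, context and entity names are invented for this example.

var connectionString = "..."; // Placeholder

// Raw ADO.NET: hand-written SQL and manual mapping from the reader.
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand("SELECT Id, Name FROM Customers", connection))
{
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            var customer = new Customer
            {
                Id = reader.GetInt32(0),
                Name = reader.GetString(1)
            };
        }
    }
}

// Entity Framework: the query and the mapping are handled by the library.
using (var context = new ShopContext())
{
    var customers = context.Customers.ToList();
}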

Practical Design Patterns in C# – State

The purpose of the state design pattern is to allow an object to alter its behaviour when its internal state changes.

The example of a logging function below is a model candidate for conversion into a state implementation.

The requirements of this function are as follows.

  1. Write the log entry to the end of a text file by default.
  2. The file size is unlimited by default, denoted by setting the value of the maximum file size to 0.
  3. If the value of the maximum file size is greater than 0 then the log file is archived when it reaches this size, and the entry is written to a new empty log file.
  4. If the recycle attribute is enabled and the maximum file size is set to a value greater than 0, then the oldest entry in the log file is erased to make space for the new entry instead of archiving the entire log file.

The implementation of the actual mechanism for writing to disk is not relevant in this example. Rather, the focus is on creating a framework that enables selecting the correct mechanism based on the preferences set.

public void Log(string message)
{
    if (LimitSize)
    {
        if (file.Length < MAX_SIZE)
        {
            // Write to the end of the log file
        }
        else
        {
            if (Recycle)
            {
                // Clear room at the top of the log file
            }
            else
            {
                // Create a new log file with current time stamp
            }

            // Write to the end of the log file
        }
    }
    else
    {
        // Write to the end of the log file
    }
}

The implementation shown above is very naïve, tightly coupled, rigid and brittle. However, it is also a fairly common occurrence in most code bases due to its straightforward approach to addressing the problem.

The complicated logical structure is difficult to decipher, debug and extend. The tasks of determining file size limits, recovering disk space, creating a new file and writing to the file are all discrete from each other and should have no dependency between themselves. But they are all interwoven in this approach and add a huge cognitive overhead in order to understand the execution path.

There are multiple conditions being evaluated before the log entry is made. The programmer has to simulate the computer’s own execution path for each statement before even arriving at the important lines related to writing to disk. A couple of blocks are duplicated code. It would be nice if the execution path could be streamlined and duplicate code removed.

Restructured Code

An eager approach to selecting the preferred mechanism breaks the conditional statements up into manageable chunks. It is logically no different from having the same code in the Log method, but moving it into the mutators makes it easier to understand by adding context to the logic rather than handling it all in a single giant function.

public int MaxFileSize
{
    get
    {
        return _maxFileSize;
    }

    set
    {
        _maxFileSize = value;

        if (0 == MaxFileSize)
        {
            // Activate the unlimited logger
            return;
        }

        if (Recycle)
        {
            // Activate the recycling logger
        }
        else
        {
            // Activate the rotating logger
        }
    }
}

public bool Recycle
{
    get
    {
        return _recycle;
    }

    set
    {
        _recycle = value;

        if (0 == MaxFileSize)
        {
            // Recycling is not applicable when file sizes are unlimited
            return;
        }

        if (Recycle)
        {
            // Activate the recycling logger
        }
        else
        {
            // Activate the rotating logger
        }
    }
}

The comments indicate that some code which actually performs the operation is still missing. This is the spot where the behaviour is selected and applied based on the state of the object. C# delegates are used to alter the behaviour of the object at runtime.

public delegate void LogDelegate(Level level, string message);

public LogDelegate Log
{
    get;
    private set;
}

The Logger class contains private methods that match the signature of the LogDelegate type.

private void AppendAndLog(Level level, string message)
{
}

private void RotateAndLog(Level level, string message)
{
}

private void RecycleAndLog(Level level, string message)
{
}

When the condition evaluation is completed and the logging method has to be selected, a new instance of the Log delegate is created which points to the correct logging method.

if (Recycle)
{
    Log = new LogDelegate(RecycleAndLog);
}
else
{
    Log = new LogDelegate(RotateAndLog);
}

This separates the selection of the correct logging technique from the implementation of the technique itself. The logging method implementations can be changed at will to fix bugs or extend their features without risking breakage in the rest of the code.
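Putting the pieces together, a minimal sketch of the resulting class might look like the following. The Level enum and the file operations are stubbed out, since only the state-selection mechanism matters here.

public enum Level { Debug, Info, Warning, Error }

public delegate void LogDelegate(Level level, string message);

public class Logger
{
    private int _maxFileSize;
    private bool _recycle;

    public LogDelegate Log { get; private set; }

    public Logger()
    {
        Log = new LogDelegate(AppendAndLog); // Unlimited file size by default
    }

    public int MaxFileSize
    {
        get { return _maxFileSize; }
        set { _maxFileSize = value; SelectLogger(); }
    }

    public bool Recycle
    {
        get { return _recycle; }
        set { _recycle = value; SelectLogger(); }
    }

    // The behaviour is selected once, when the state changes,
    // instead of being re-evaluated on every call to Log.
    private void SelectLogger()
    {
        if (0 == _maxFileSize)
        {
            Log = new LogDelegate(AppendAndLog);
        }
        else if (_recycle)
        {
            Log = new LogDelegate(RecycleAndLog);
        }
        else
        {
            Log = new LogDelegate(RotateAndLog);
        }
    }

    private void AppendAndLog(Level level, string message) { /* write to the end of the file */ }
    private void RotateAndLog(Level level, string message) { /* archive, then write to a new file */ }
    private void RecycleAndLog(Level level, string message) { /* erase the oldest entry, then write */ }
}

Client code simply calls logger.Log(Level.Info, "message") and remains oblivious to which mechanism is currently active.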

How to Write Unmaintainable Code – ASP.NET Redux

No matter how far technology progresses, it seems that we still remain bound to the past by an innate ability to write poorly structured programs. To me, this points to a rot that runs far deeper than languages and platforms. It is a fundamental failure of people who claim to be professionals to understand their tools and the principles that guide their usage.

It has been eight years since I wrote the previous piece in this series, which demonstrated poorly written PHP code. The language gets a bad rap due to the malpractices that abound among users of the platform. But this was a theme I had hoped to leave behind after graduating to the .NET framework over the past few years.

It turns out that I was wrong. Bad programmers will write bad code irrespective of the language or platform that is offered to them. And the most shocking bit is that so many of the points from the previous article (and the original by Roedy Green) are still applicable that it feels like we learned nothing at all.

Reinvent the wheel again. Poorly.

Maintainable code adheres to standards – industry, platform, semantics, or just simply internal to the company. Standard practices make it easy to build, maintain and extend software. As such, they’re anathema to anybody who aims to exclude newcomers from modifying his program.

Therefore, ignore standards.

Take the case of date and time. It is 2018, and people want to and expect to be able to use any software product irrespective of their personal regional settings.

Be merciless in dashing their expectations. Tailor your product to work exclusively with the regional settings used on your development computer. If you are using the American date format, say you’re paying homage to the original home of the PC. If you’re using British settings, extol the semantic benefits of the dd-mm-yy structure over the unintelligible American format.

Modern programming platforms have a dedicated date and time data type precisely to avoid this problem. Sidestep it by transmitting and storing dates as strings in your preferred formats (there doesn’t have to be just one). That way, you also get to scatter a 200-line snippet of code to parse and extract individual fields from the string.
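A condensed sketch of the genre, to set the right tone (imagine it repeated, with subtle variations, in a dozen places):

// Parse "dd-mm-yy" by hand. DateTime.TryParseExact would do this in one
// line, which is precisely why you must never use it.
public static DateTime ParseDate(string raw)
{
    string[] parts = raw.Split('-');
    int day = int.Parse(parts[0]);
    int month = int.Parse(parts[1]);
    int year = 2000 + int.Parse(parts[2]); // Y2.1K is someone else's problem
    return new DateTime(year, month, day);
}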

For extra points, close all bug reports about the issue from the test engineers with a “Works for me” comment. Your development computer is the ultimate benchmark for your software. Everybody who wishes to run your program should aspire to replicate the immaculate state of existence of your computer. They have no business running or modifying your program otherwise.

Never acknowledge the presence of alternative universal standards.

Ignorance is bliss

Nobody writes raw C# code if they are going to deploy on the web. A standard deployment of ASP.NET contains significant amounts of framework libraries that enable the web pipeline and extensions to work with popular third-party tools. Frameworks in the ecosystem are a programming language unto themselves, and require training before use.

Skip the books and dive into writing code headfirst.

Write your own code from scratch to do everything from form handling to error logging. Only n00bs use frameworks. Real programmers write their own frameworks to work inside of frameworks. This gives rise to brilliant nuggets such as this.

public class FooController : Controller
{
    // ...

    // Hides the framework method instead of overriding it, so the
    // MVC pipeline never actually calls this code.
    public new void OnActionExecuting(ActionExecutingContext filterContext)
    {
    }

    // ...
}

By essentially reinventing the framework, you are the master of your destiny and that of the company that you are working for. Each line of custom-built code that you write to replace the standard library tightens your chokehold on their business, and makes you irreplaceable.

Allow unsanitised input

Protecting against SQL injection is difficult and requires constant vigilance. If everything is open to injection, the maintenance programmer will be bogged down under the sheer volume of things to repair and will hopefully either go away or be denied permission to fix it due to a lack of meaningful effort estimates.

Mask these shortcomings by only writing client-side validation. That way, the bugs remain hidden until the day some script kiddie uses the contact form on the site to send “; DROP TABLE Users” to your server.

Try…catch…swallow

Nobody wants to see those ugly-ass “Server Error” pages in the browser. So do the most obvious thing and wrap your code in a try-catch block. But write only one catch handler for the most general exception possible. Then leave it empty.

This becomes doubly difficult to diagnose if you also return something which looks like a meaningful response, but is actually utterly incorrect. For example, if your method is supposed to return a JSON object for the active record, return a mock object from the error handler which looks like the real thing, but populate it with empty or completely random values. Leave some of the values correct to avoid making it too obvious.

Maintenance programmers have no business touching your code if they do not have an innate ken for creating perfect conditions where errors do not occur.

String up performance

Fundamental data types such as strings and numbers are universal. Especially strings. Therefore, store all your data as strings, including obvious numeric entities such as record identifiers.

This strategy has even more potential when working with complex data types containing multiple data fields. Eschew standard schemes such as CSV. Instead come up with your own custom scheme using uncommonly used text characters. The Unicode standard is very vast. I personally recommend using pips from playing cards. The “♥” character is appropriately labelled “Black Heart Suit”, because it lets the maintenance programmer perceive the hatred you bear towards him for attempting to tarnish the pristine beauty of the code you have so lovingly written.
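A sketch of the technique in action (the field layout is, of course, documented nowhere):

// One record: id, name and balance, lovingly packed into a single string.
string record = "42♥Jane Doe♥1234.56";

string[] fields = record.Split('♥');
int id = int.Parse(fields[0]);              // Mandatory type-cast #1
decimal balance = decimal.Parse(fields[2]); // Mandatory type-cast #2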

This technique also has a lot of potential in the database. Storing numeric data as strings increases the potential for writing custom parsers or making type-casts mandatory before the data can be used.

Use the global scope

Global variables are one of the fundamental weapons in the war against maintainable code. Never pass up an opportunity to use them.

JavaScript is a prime environment for unleashing them upon the unwary maintenance programmer. Every variable that is not explicitly wrapped up inside a function automatically becomes accessible to all other code loaded on that page. This is an increasingly rare trait among modern languages. The closest approximation in C# is to have a global class with several public members which are referenced directly all over the application. While it looks similar, it is still far better insulated. Try these snippets as an example.

JavaScript –

var a = 0; // Variable a declared in global scope
 
function doFoo() {
    a++; // Modifies the variable in global scope
}
 
function doBar() {
    var a = 1;
    a++; // Modifies the variable in local scope
}

C# –

public class AppGlobals
{
    public static int A = 0;
}

public class Foo
{
    public void DoSomething()
    {
        // The scope of A is abundantly clear
        AppGlobals.A++; // Modifies the global
        var A = 0;
        A++; // Modifies the local
    }
}

It is very easy to overlook the scope of the variable in JavaScript if the method is lengthy. Use it to your advantage. Camouflage any attempts to detect the scope correctly by using different conventions across the application. Modify the global variable in some functions. Define and use local variables with the same name in others. Modifying a function must require extensive meditation upon it first. Maintenance programmers must achieve a state of Zen and become one with your code in order to change it.

Use unconventional data types

Libraries often leverage conventions to eliminate the need for custom code. For example, Entity Framework can automatically handle table-per-type inheritance if the primary key column in the base class is an identity column.

You can sideline this feature by using string or UUID columns as primary keys. Columns with these data types cannot be marked as identity. This necessitates writing custom code to operate upon the data entities. As you must be aware by now, every extra line of code is invaluable.
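A sketch of the set-up (the entity is invented for this example); note how the key must now be generated by hand at every insertion site.

public class Invoice
{
    [Key]
    [DatabaseGenerated(DatabaseGeneratedOption.None)] // Not an identity column
    public string InvoiceId { get; set; }
}

// Every place that creates an entity must remember to do this.
var invoice = new Invoice { InvoiceId = Guid.NewGuid().ToString() };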

Database tables without relationships

If you are working at a small organisation, chances are there is no dedicated database administrator role and developers manage their own database. Take advantage of this lack of oversight and build tables without any relationships or meaningful constraints. Extra points if you can pull it off with no primary keys or indexes.

Combine this with the practice of creating and leaving behind several unwanted tables with similar names to give rise to a special kind of monstrosity that nobody has the courage to deal with. For still more marks, perform updates on all the tables, including the dead ones, and fetch data from different tables in different parts of the application. They cannot be called unwanted tables if even one part of your application depends on them. Call it “sharding” if anybody questions your design.

Conclusion

This post is not meant to trigger language wars. Experienced developers have seen bad code written in many languages. Some languages are just more amenable to poor practices than others.

The same principle applies to the .NET framework, which was supposed to be a clean break from the monstrosities of the past. On the web, the ASP.NET framework and its associated libraries are still one of the best environments I have used to build applications.

That people still write badly structured code in spite of all these advances cements my original point – bad programmers write bad code irrespective of the language thrown at them.

A Hash Table using C#

Brian Barto recently put up a post where he described how hash tables are implemented. Many modern programming languages have this data structure as a built-in feature, making it a commonly used item in a developer’s toolkit. Due to its ready availability, few programmers ever have the need to roll their own implementation. So it comes as no surprise that people who do not have a formal education in software engineering often do not understand how it works. Brian’s post explains the most essential parts of this magical data structure very well.

After reading about it, I felt the urge to implement the same thing in C# as a learning exercise. His post carries an implementation in the C programming language, which could be ported over with some tweaking to accommodate the different semantics of C#. This post documents the exercise and the lessons learned from it.

The Associative Array

A hash table is an implementation of an associative array, which is a collection of key-value pairs. Each key is unique and can appear only once in the collection. The data structure supports the full range of CRUD operations – create, retrieve, update and delete. All operations run in constant time on average and linear time in the worst case, making speed the main advantage of this data structure. In some languages, all operations are rolled into the same operator for simplicity: if the key does not exist, it is added; if the key already exists in the collection, its value is overwritten with the new one. Keys are removed by setting their value to null, or through a separate delete operator for associative arrays.

JavaScript objects are an example of associative arrays.

var o = {}; // A JavaScript object is an associative array
o["occupation"] = "Wizard"; // Key created
o["name"] = "Jane Doe"; // Another key created
o["name"] = "Jane D. Doe"; // Key updated
console.log(o["name"]); // Value retrieval
delete o["occupation"]; // Jane just retired

A hash table uses a linear array internally to store values. The key is converted into an integer by running it through a hash function, which is then used as the index into the backing array where the value is stored. A good hash function has two characteristics – minimal conflicts and uniform distribution.

Minimal Conflicts

As explained above, a hash function works by converting an input into an integer. A modular hash function is one of the simplest possible implementations: it returns the remainder (aka the modulo) of dividing the key by the table length.

int size = 100;
int key = 10;
int index = key % size; // 10

However, this function can end up with a high number of conflicts. The remainders of dividing 10, 110, 210 and so on by 100 are all 10, and each such conflict requires additional steps to resolve. This condition is minimised by setting the length of the array to a prime number.

int size = 97;
int key = 10;
int index = key % size; // 10
key = 110;
index = key % size; // 13
key = 210;
index = key % size; // 16

Uniform Distribution

Discrete uniform distribution is a statistical term for a condition where all values in a finite set have an equal probability of being observed. A uniform distribution of hash values over the indexes of the array reduces the likelihood of key conflicts, and with it the cost of frequent conflict resolution. Achieving a uniform distribution over an arbitrary set of keys is a difficult mathematical challenge, and I won’t even pretend to understand the details. Suffice it to say that a lot of really smart folks have already worked on it, and we get to build upon their achievements.

If the keys in a hash table belong to a known and finite set (such as a language dictionary), then the developer can write what is known as a perfect hash function that provides an even distribution of hashes with zero conflicts.

Resolving Conflicts

Finally, general purpose hash functions can often result in conflicts, i.e. return the same output for two separate inputs. In this case, the hash table must resolve the conflict. For example, the modular hash function returns the index 13 for the key 13 as well as for the key 110. There must be a way to store both these keys simultaneously in the same collection. Broadly, there are two techniques to achieve this.

Separate Chaining

In this technique, the values themselves are not stored in the array. They are stored in a separate data structure such as a linked list, and a pointer or reference to that list is stored in the backing array. When a value is to be located, the function first computes its index by hashing the key, then walks through the list associated with that index until it finds the desired key-value pair.
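A minimal sketch of separate chaining in C# (this is not the approach used in the implementation below, which relies on open addressing):

// Each array slot holds a bucket of the key-value pairs whose
// keys hashed to the same index.
private LinkedList<KeyValuePair<int, int>>[] _buckets =
    new LinkedList<KeyValuePair<int, int>>[97];

public void Insert(int key, int value)
{
    int index = key % _buckets.Length;

    if (null == _buckets[index])
    {
        _buckets[index] = new LinkedList<KeyValuePair<int, int>>();
    }

    // Conflicting keys simply join the same list.
    _buckets[index].AddLast(new KeyValuePair<int, int>(key, value));
}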

Open Addressing

Open addressing stores all records in the array itself. When a new value is to be inserted, its index position is computed by applying the hash function to the key. If the desired position is occupied, the remaining positions are probed. There are several probing algorithms. Linear probing, the simplest of them, walks the array in fixed increments from the desired index until an empty spot is found.

Implementation

Enough theory. Let us move on to implementation details.

The first step is to define a HashTable class. The size of the hash table is restricted to 100 entries for this exercise and a fixed-length array is initialised to store these entries.

public class HashTable
{
    private static readonly int SIZE = 100;
    private object[] _entries = new object[SIZE];
}

This is where the implementation of this data structure veers away from the one shown in Brian Barto’s post. The original post used explicitly named methods for the insert, search and delete operations. However, C# allows the nicer and more intuitive square-bracket indexer syntax to perform these same operations.

public class HashTable
{
    ...

    public int this[int key]
    {
        get { ... }
        set { ... }
    }
}

The client program uses the following syntax to consume this interface, which is much more concise than having separate methods for each operation.

HashTable map = new HashTable();
map[key] = value; // Insert
int result = map[key]; // Retrieve
map[key] = null; // Delete

The HashTable class internally uses a struct called Entry to store individual key-value pairs.

struct Entry
{
    internal int key;
    internal int? value;
}

The type of the value field in the Entry struct must be set to a nullable int so that the client can set it to null when the value has to be cleared. The declaration of the _entries field is changed as follows.

private Entry?[] _entries = new Entry?[HashTable.SIZE];

Finally, we come to the meat and potatoes of this exercise – the accessor and mutator functions, whose signatures are also modified to support the nullable type.

public int? this[int key]
{
    get
    {
        ...
    }

    set
    {
        ...
    }
}

The code for these methods and the hash function is more or less identical to that described in Brian’s post, with some tweaks to accommodate the nullable value.
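The _hash function itself does not appear in the listings below; a minimal modular implementation along the lines described earlier would be:

private int _hash(int key)
{
    // Assumes non-negative keys; a real implementation would
    // guard against negative values.
    return key % HashTable.SIZE;
}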

get
{
    int index = _hash(key);
    
    while (null != _entries[index])
    {
        if (key == _entries[index]?.key)
        {
            break;
        }
        index = (index + 1) % HashTable.SIZE;
    } 
    return _entries[index]?.value;
}

set
{
    if (null == value)
    {
        Delete(key);
    }
    else
    {
        int index = _hash(key);
        while (null != _entries[index])
        {
            if (key == _entries[index]?.key)
            {
                break;
            }
            index = (index + 1) % HashTable.SIZE;
        }

        _entries[index] = new Entry { key = key, value = value };
    }
}

The one additional member in this class is the private Delete method which is required to delete the entry. This method is called by the mutator function when the client passes in a null value.

private void Delete(int key)
{
    int index = _hash(key);

    while (null != _entries[index])
    {
        if (key == _entries[index]?.key)
        {
            break;
        }
        else
        {
            index = (index + 1) % HashTable.SIZE;
        }
    }

    if (null == _entries[index])
    {
        return;
    }

    _entries[index] = null;
}

That’s all that’s needed to implement this data structure. The hash function shown in this example is a very naïve implementation. The probability of collisions is very high and the probing function is very slow. It also cannot work with any data type other than int for both keys and their values.

In spite of these limitations, this still made for a very useful learning exercise.