The Usefulness of Data Classes

Dynamic languages support defining nested dictionaries (AKA hashmaps, hashes, hashtables, etc) with different types for the values. This makes it very easy to use dictionaries as function arguments or even as full-fledged domain models.

An alternative to dictionaries is data classes. Data classes are just classes that hold data. The concept is language agnostic and should not be confused with Python’s dataclasses module.

In this post, we’ll discuss:

  • The scenarios in which data classes have an advantage over dictionaries.
  • The advantages of using data classes over dictionaries.
  • Options for working with data classes in different languages.

A Data Class Definition

Before we start comparing data classes to dictionaries, we should probably provide an informal definition of a data class. My definition is that data classes are simple data containers, usually without behavior (methods on the class).

Data classes are sometimes referred to as DTOs (Data Transfer Objects), Value Classes, Data Structures, etc. I often find that these synonyms mean different things to different people, which is why I chose a “new” name (probably overloaded as well).

With this out of the way, let's start comparing data classes to dictionaries.

Representing Multiple Function Arguments

It’s a common technique to combine multiple function arguments into a single data structure. This can prevent function signatures from breaking and avoid code smells (which consequently stops our linters from complaining about warnings such as too-many-arguments). In my personal experience, dictionaries in dynamic languages are often used as a way to pass multiple arguments.

One of the benefits of using a dictionary to combine multiple arguments is that it’s very easy to declare. Consequently, we avoid defining many small disposable (used once) classes.

Instead of dictionaries, we can also use data classes. These have several benefits over dictionaries:

1. Less susceptible to typos — This is mostly due to autocompletion and naming conventions (naming conventions for field names tend to be more maintained than for dictionary keys).

2. Explicit signature — Data classes explicitly show what arguments the function needs in order to operate properly. This is even more true when using type hints. Dictionaries, on the other hand, don’t explicitly expose the required args, forcing us to look in the source code. This gets even more complicated when the input dictionary is passed through several functions.

3. Tooling support — Data classes work very well with tools like linters, type checkers & autocompletion. This is a superset of bullet #1.
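To make the contrast concrete, here is a minimal sketch of the same function written both ways. The names used (send_email, send_email_dict, EmailRequest and their fields) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


def send_email_dict(request: dict) -> str:
    # The required keys are only discoverable by reading the function body.
    return f"To: {request['recipient']} | Subject: {request['subject']}"


@dataclass
class EmailRequest:
    # Hypothetical argument container: the fields document what is needed.
    recipient: str
    subject: str
    body: str = ""


def send_email(request: EmailRequest) -> str:
    return f"To: {request.recipient} | Subject: {request.subject}"


# A typo in a dictionary key surfaces only when the key is read.
try:
    send_email_dict({"recipient": "a@example.com", "subjetc": "Hi"})
except KeyError as error:
    dict_error = error

# A typo in a field name fails at construction time,
# and a type checker or editor can flag it even earlier.
try:
    EmailRequest(recipient="a@example.com", subjetc="Hi")
except TypeError as error:
    dataclass_error = error

message = send_email(EmailRequest(recipient="a@example.com", subject="Hi"))
```

Both typos are eventually caught here, but the dictionary version fails inside the function that reads the key, which may be far from the place where the dictionary was built.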

Overusing Dictionaries as Models

Almost every program we write communicates with external resources like message queues, databases, external APIs, etc. In dynamic languages, we’ll often convert the incoming/outgoing data to a dictionary. In my experience, much (if not most) of that data remains in its raw form (as a raw dictionary).

Next, we will examine several scenarios where I think it’s better to convert these models into dedicated classes.

Models Should Be Unique

When using dictionaries to represent domain models we should note that dictionaries don’t have a unique type for every model. Consequently, two dictionaries with the same keys and values are equal even when they represent different models. Classes, on the other hand, have types and so different classes with the same fields and values are always different.
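A short sketch of this difference, using Python's standard-library dataclasses. The User and Employee models and their fields are made up for the example.

```python
from dataclasses import dataclass

# Two different models that happen to share keys and values.
user_dict = {"name": "Dana", "age": 30}
employee_dict = {"name": "Dana", "age": 30}


@dataclass
class User:
    name: str
    age: int


@dataclass
class Employee:
    name: str
    age: int


# Dictionaries compare by content only, so the two models look identical.
dicts_equal = user_dict == employee_dict

# The generated __eq__ also checks the class, so different models never match.
classes_equal = User("Dana", 30) == Employee("Dana", 30)

# Instances of the same class still compare by field values.
same_class_equal = User("Dana", 30) == User("Dana", 30)
```

Here dicts_equal is True, classes_equal is False and same_class_equal is True.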

Data Classes & Dictionaries Have Different Equality Behavior

Using data classes to represent our models has several advantages:

  1. Prevents accidental equality errors.
  2. Models that are represented by a dedicated class are searchable. If we have a large codebase, it’s much easier to search for a User model or a field that belongs to a User rather than searching through lots of dictionaries which are very hard to distinguish between one another.

Why Is Parsing Often Better Than Validation?

A data class has one interesting and often overlooked advantage over a dictionary: it does not depend on external formats like JSON, YAML, etc. When we receive data in an external format, we need to parse it, which also means converting it into our system’s distilled internal representation.

The key point here is that data classes preserve the parsed data and expose it outwards via the class’s fields.

Dictionaries, on the other hand, are often only validated rather than parsed. This means that a function that validates a dictionary doesn’t encode any of the knowledge it gained about the input. It basically “forgets” about the data obtained by the parser. This is one of the main ideas behind the Parse, Don’t Validate blog post. The examples there are in Haskell, but the ideas translate to other languages just the same.

Consider the following example:
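What follows is a minimal sketch of the two approaches, not a definitive implementation. The function names validate_user and parse_user, the possible_user parameter and the ValueError come from the discussion; the User fields (name, age) and the specific checks are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    # Assumed fields, chosen only for this example.
    name: str
    age: int


def validate_user(possible_user: dict) -> None:
    """Raises ValueError if the input is invalid; returns nothing."""
    if not isinstance(possible_user.get("name"), str):
        raise ValueError("name must be a string")
    if not isinstance(possible_user.get("age"), int):
        raise ValueError("age must be an integer")


def parse_user(possible_user: dict) -> User:
    """Raises ValueError if the input is invalid; otherwise returns a User."""
    name = possible_user.get("name")
    age = possible_user.get("age")
    if not isinstance(name, str):
        raise ValueError("name must be a string")
    if not isinstance(age, int):
        raise ValueError("age must be an integer")
    return User(name=name, age=age)


raw = {"name": "Dana", "age": 30}

# Validation leaves us with the same untyped dictionary we started with.
validate_user(raw)

# Parsing returns a value whose type records that the checks passed.
user = parse_user(raw)
```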

The difference between these two approaches is obvious just by looking at the function signatures. validate_user checks for the validity of the input and has no return value. The mere fact that the function didn't raise a ValueError means that possible_user is valid (what kind of tests would you write for validate_user?). parse_user on the other hand, returns a User that encapsulates the parser's insights and consequently, it's also much easier to test.

Wait! Dictionaries Can Also Be Parsed

It’s true that a new version of validate_user could save the parsed values to an output dictionary instead of discarding them, like this:

def validate_user(possible_user: dict) -> dict:
    """Returns a new parsed dictionary"""

This version is probably better than the initial validate_user version, but in my experience returning a new parsed dictionary is rarely done in practice. When possible, it's just so much easier to use an existing schema validation library.

Benefits of Encoding Parsing Information in a Data Class

  1. Immutability — In most languages, it’s fairly easy to declare a data class immutable.
  2. Reduced likelihood of shotgun parsing — Parsing into data classes reduces the possibility of shotgun parsing. This also pushes many of our internal checks to the system boundaries, which lets us have more control over error-handling. As a result, we also get to remove many redundant internal “if” checks (which is always a good thing).
  3. Avoiding temporal coupling — Functions that operate over the parsed data are explicitly dependent upon the parser’s output (just like Monads). Validation functions, however, have no visible constraints on the processing order. This makes it much easier to accidentally reorder the statements in the wrong way.
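The three points above can be sketched together in a few lines. This is an illustrative example only: User, parse_user and greeting are hypothetical names, and Python's standard-library dataclasses stand in for whichever data class implementation you prefer.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)  # immutability is a single flag
class User:
    name: str
    age: int


def parse_user(raw: dict) -> User:
    # All checks live here, at the system boundary.
    if not isinstance(raw.get("name"), str) or not isinstance(raw.get("age"), int):
        raise ValueError(f"invalid user: {raw!r}")
    return User(name=raw["name"], age=raw["age"])


def greeting(user: User) -> str:
    # No defensive "if" checks are needed here. Requiring a User in the
    # signature also makes the dependency on the parsing step explicit.
    return f"Hello {user.name}, you are {user.age}"


user = parse_user({"name": "Dana", "age": 30})
text = greeting(user)

try:
    user.age = 31  # frozen instances reject attribute assignment
except FrozenInstanceError:
    mutation_rejected = True
else:
    mutation_rejected = False
```

Because greeting accepts a User rather than a dictionary, calling it before parsing is something a type checker can flag, whereas a validate-then-use sequence carries no such ordering information.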

Avoiding Structural Coupling

The only way to access a dictionary’s nested value is by knowing its internal structure. Consider the following example:
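A minimal sketch of such a dictionary, assuming an employee record whose hobbies are themselves dictionaries with a name key (the concrete values are made up):

```python
employee = {
    "name": "Dana",
    "hobbies": [
        {"name": "tennis"},
        {"name": "chess"},
    ],
}

# The caller must know that each hobby is a dictionary with a "name" key.
hobbies_names = [hobby["name"] for hobby in employee["hobbies"]]
```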

When we want to get the names of the employee’s hobbies, we need to know something about a hobby’s internal structure (in our case, a hobby is a dictionary that has a name). This may not seem so bad, but if we suddenly want to represent hobbies as regular strings, every piece of code that accesses the hobbies' names will now break. This is a simple example of a Law of Demeter violation.

Adding a property (or a getter method) is a way to avoid this problem. Dictionaries don’t support properties but luckily, data classes do support them.

Revisiting the Hobbies Example

Let’s arbitrarily choose Attrs as the data class implementation to show how we can avoid structural coupling. The notes below refer to the numbered comments in the code example:
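The sketch below is one way the example could look. As a stated substitution, it uses Python's standard-library dataclasses instead of Attrs so that it runs without third-party packages; properties work the same way on an Attrs class. The Employee class and its values are assumptions, and the numbered comments match the notes that follow.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Employee:
    name: str
    hobbies: List[str]  # 1

    @property
    def hobbies_names(self) -> List[str]:  # 2
        # Only this property knows how hobbies are stored internally.
        return list(self.hobbies)


employee = Employee(name="Dana", hobbies=["tennis", "chess"])
names = employee.hobbies_names  # 3
```

If hobbies later became a list of richer objects, only the body of hobbies_names would need to change.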

  1. hobbies is still represented as a list of strings.
  2. hobbies_names hides the internal structure by only exposing the relevant information.
  3. When we use hobbies_names we are no longer aware of the internal structure.

This may be a simple change (and this is a simple example), but the benefits of doing these kinds of abstractions increase when hobbies starts to accumulate many dependents. This also tends to get worse the more nested the dictionary is. The implications of breaking a class with many dependencies are discussed in Sandi Metz's POODR book (under "Finding the Dependencies That Matter").

Nested dictionaries couple code to their internal structure while data classes (and classes in general) provide us with a way to avoid this coupling. It’s definitely possible that this kind of coupling is not so bad, but when we make it, we should be aware of the implications it may have on our system in the long run.

Data Classes in Other Languages

Data classes require writing quite a bit of boilerplate code (equality, hashing, string representation methods, etc). That’s why Python’s dataclasses, Pydantic & Attrs are so useful for data classes: they act as code generators. Pydantic & Attrs are also useful for parsing & validation, but this is out of scope.
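To give a rough sense of how much boilerplate is saved, the sketch below puts a hand-written class next to an equivalent one built with the standard-library dataclass decorator. ManualPoint and Point are hypothetical names used only here.

```python
from dataclasses import dataclass


class ManualPoint:
    """Hand-written version: every method below is boilerplate."""

    def __init__(self, x: int, y: int) -> None:
        self.x = x
        self.y = y

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ManualPoint):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self) -> int:
        return hash((self.x, self.y))

    def __repr__(self) -> str:
        return f"ManualPoint(x={self.x!r}, y={self.y!r})"


@dataclass(frozen=True)
class Point:
    # __init__, __eq__, __hash__ and __repr__ are generated from the fields.
    x: int
    y: int


manual_repr = repr(ManualPoint(1, 2))
generated_repr = repr(Point(1, 2))
points_equal = Point(1, 2) == Point(1, 2)
```

Adding a field to Point is a one-line change, while ManualPoint needs every method updated by hand.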

Other languages also offer a few useful options (I’ve either used these personally or heard good things about them):

Scala

  • Case classes
  • Refined — this is a really awesome library as it allows us to define constraints at the type-level.

Java

Python

Ruby

Summary

A data class is useful among other things for representing models and encapsulating multiple arguments to a function. In contrast to dictionaries, a data class is less susceptible to typos, has an explicit schema, plays nicely with tools like mypy and pylint, supports properties (and other methods), and has a type that differentiates between different models. Data classes represent an internal and more refined form of our data that is decoupled from its external representation.

Although this post clearly favors data classes over dictionaries for the aforementioned reasons, it’s important to note that, as with everything else in software, there is no silver bullet. Guidelines and rules of thumb are no substitute for good judgment and reason.

Aside from the advantages discussed in this post, data classes can also be useful for representing domain constraints and requirements but this is probably a topic for another post. 🙂

I’d love to hear your thoughts and opinions.

Originally published at https://www.gidware.com on November 30, 2020.
