Type Punning
Daily dispatches from my 12 weeks at the Recurse Center in Summer 2023
Here are 32 bits, or 4 bytes. Seems like a pretty big number. And it is . . . sorta. Depends on how you look at it.
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
One way to look at it is as a 32-bit integer, in which case this number is, indeed, on the large side – 1,078,530,011.
Another way to look at it is as a float, which also takes up 4 bytes of space, but which is represented in memory very differently. As a float, the same sequence of bytes represents 3.141593.
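To see where 3.141593 comes from, here is a minimal sketch that decodes the bit pattern by hand. It assumes the float is an IEEE 754 single-precision value (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits), which is true on most machines but is not something C guarantees. The function name decode_float32 is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a *normal* IEEE 754 single-precision bit pattern by hand.
   Assumed layout: bit 31 = sign, bits 23-30 = exponent, bits 0-22 = fraction. */
double decode_float32(uint32_t bits)
{
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x7FFFFFu;

    /* Zero, subnormals, infinities and NaNs would need separate handling. */
    assert(exponent != 0 && exponent != 255);

    /* Normal numbers carry an implicit leading 1 in front of the fraction. */
    double value = 1.0 + fraction / 8388608.0; /* 8388608 = 2^23 */

    /* Scale by 2^(exponent - 127) without pulling in the math library. */
    int power = (int)exponent - 127;
    for (; power > 0; power--) value *= 2.0;
    for (; power < 0; power++) value /= 2.0;

    return sign ? -value : value;
}
```

For 1078530011 (0x40490FDB) the fields come out as sign 0, exponent 128 and fraction 0x490FDB, so the value is (1 + 0x490FDB / 2^23) × 2^1, which is roughly 3.141593.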
Now for the fun part: let’s say that you want to store a value in memory as one type, but that you sometimes want to interpret that same value as another type entirely. Enter . . .
Type Punning
Before getting into why one would want to do something insane like this, it bears mentioning that doing this is generally frowned upon, since it is non-portable and can lead to unexpected behavior. Some computers and architectures are big-endian, some little-endian; some represent floats one way, others another way. So, while on my machine the underlying bits representing the integer 1,078,530,011 also represent the floating-point number 3.141593, the same may not be true on your machine – it all depends on how types are represented.
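One way to check which byte order a given machine uses is to look at an integer's bytes through an unsigned char pointer, which C permits for any object. The sketch below does that for the same 32-bit pattern; the helper names bytes_in_memory and is_little_endian are hypothetical, and it assumes 8-bit bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: a uint32_t is exactly four 8-bit bytes. */
static_assert(sizeof(uint32_t) == 4, "this sketch assumes 8-bit bytes");

/* Copy the bytes of `value` into `out` in the order they sit in memory. */
void bytes_in_memory(uint32_t value, unsigned char out[4])
{
    const unsigned char *p = (const unsigned char *)&value;
    for (size_t i = 0; i < sizeof value; i++)
        out[i] = p[i];
}

/* Returns 1 if the least significant byte is stored first (little-endian). */
int is_little_endian(void)
{
    unsigned char b[4];
    bytes_in_memory(0x40490FDBu, b);
    return b[0] == 0xDB;
}
```

On a little-endian machine such as x86-64 the bytes come back as DB 0F 49 40; on a big-endian machine they come back as 40 49 0F DB.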
Partly for this reason, C has what are known as strict aliasing rules, which put some guardrails in place to discourage this kind of thing.
The C17 standard lays out the following guidelines in §6.5 Expressions:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
- a type compatible with the effective type of the object,
- a qualified version of a type compatible with the effective type of the object,
- a type that is the signed or unsigned type corresponding to the effective type of the object,
- a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
- an aggregate or union type that includes one of the aforementioned types among its members, or
- a character type.
In other words, a variable typed int can be accessed as a qualified version of that type (e.g., const int) or a signed/unsigned type corresponding to its type (e.g., unsigned int or signed int) or both. Or it can be accessed as a character type. Or as a union type (more on this soon). But a value in memory typed as an int cannot be accessed as a float.
Well, it sort of can, if you try really hard. It’s just that, like I said, doing so breaks the rules above, which makes the behavior undefined, so the results can be unexpected and weird. Here’s one way you can force it:
uint32_t my_int = 1078530011;
float my_float = *(float*)&my_int;
printf("my_int: %u\n", my_int); // => 1078530011
printf("my_float: %f\n", my_float); // => 3.141593
Here we’re basically saying: get the address of my_int, recast that address as a float pointer, and dereference the result.
But if you’re really insistent on type punning, there’s a better and more above-board way of doing it, which also happens to obey the C17 rules above. We can use a union:
union type_pun {
uint32_t i;
float f;
};
union type_pun u;
u.i = 1078530011;
printf("u.i: %u\n", u.i); // => 1078530011
printf("u.f: %f\n", u.f); // => 3.141593
This works because all the members of the union share the same hunk of memory, so the integer u.i and the floating-point number u.f are both represented by the same underlying bytes. It’s just a matter of whether we access those bytes through the integer-typed member or the float-typed member.
That said, there are still good reasons not to do this – and those reasons are the same as before: type punning like this is non-portable and can lead to inconsistent results that are machine- and architecture-dependent. But if you really wanna pun, then, by god, pun!
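A third route, not covered above, is to copy the bytes with memcpy, which sidesteps the aliasing rules because no object is ever read through a pointer of the wrong type. This is a minimal sketch; the function names bits_to_float and float_to_bits are made up here, and the result still depends on how the machine represents floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The trick only makes sense if both types occupy the same number of bytes. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Reinterpret the bytes of a 32-bit integer as a float. */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterpret the bytes of a float as a 32-bit integer. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On a machine with IEEE 754 floats, bits_to_float(1078530011) gives the same 3.141593 as the union version, and float_to_bits takes it back to the original integer.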
So . . . why would anyone ever want to do something like this?
Application 1: NaN-Boxing
One application is NaN-boxing, where you use a double (a 64-bit floating point number) either literally as a double or, in the event that it’s not a number, as a vessel for carrying a payload.
A double consists of three parts:
- a sign bit (bit 63)
- 11 bits representing the exponent (bits 52-62)
- 52 bits representing the mantissa, or fraction (bits 0-51)
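To represent NaN (not a number), all 11 exponent bits are set to 1 and the mantissa is nonzero (an all-ones exponent with an all-zero mantissa means infinity instead). So the idea is that we can use a double encoded as a NaN to carry a payload in its 52 mantissa bits, in practice usually 51 of them, since the top mantissa bit is commonly set to mark the NaN as quiet – it’s just a matter of accessing them.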
And that’s where type punning comes in: To access those mantissa bits and extract the payload, we’d have to access that double as a 64-bit integer so we can do the appropriate bitwise mask.
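Here is a rough sketch of what that could look like, using the union trick from earlier. It assumes an IEEE 754 double and a platform that leaves a quiet NaN's bits alone when the double is passed around, which is typical but not promised by the C standard. The names QNAN_BITS, PAYLOAD_MASK, box_payload, is_boxed and unbox_payload are all made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define QNAN_BITS    0x7FF8000000000000ULL /* exponent all 1s plus quiet bit (bit 51) */
#define PAYLOAD_MASK 0x0007FFFFFFFFFFFFULL /* the remaining mantissa bits, 0-50 */

union boxed {
    double   d;
    uint64_t bits;
};

/* Hide a payload of up to 51 bits inside a quiet NaN. */
double box_payload(uint64_t payload)
{
    assert((payload & ~PAYLOAD_MASK) == 0); /* payload must fit in 51 bits */
    union boxed b;
    b.bits = QNAN_BITS | payload;
    return b.d;
}

/* True if the double has the quiet-NaN pattern, i.e. it may carry a payload. */
int is_boxed(double d)
{
    union boxed b;
    b.d = d;
    return (b.bits & QNAN_BITS) == QNAN_BITS;
}

/* Mask off everything except the payload bits. */
uint64_t unbox_payload(double d)
{
    union boxed b;
    b.d = d;
    return b.bits & PAYLOAD_MASK;
}
```

One caveat with a scheme this simple: a NaN produced by ordinary arithmetic also matches the is_boxed test, so a real implementation needs some convention for telling the two apart.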
Application 2: Type-Agnostic Data Structures
Another application I stumbled upon more recently is creating type-agnostic data structures. For instance, let’s say I want to create a linked list whose nodes can store integers, floats, or strings. Here’s a punny approach involving what’s called tagged unions:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
enum data_type { STRING, INT, FLOAT };
union value {
int i;
float f;
char *s;
};
struct variable_data {
union value val;
enum data_type type;
};
struct Node {
struct variable_data cargo;
struct Node *next;
};
struct Node* create_node(enum data_type type, void *val)
{
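struct Node *n = malloc(sizeof(struct Node));
if (n == NULL) return NULL; // allocation failed
switch (type) {
case STRING:
// copy the string so the node owns its own storage
n->cargo.val.s = malloc(strlen((char*)val) + 1);
if (n->cargo.val.s == NULL) { free(n); return NULL; }
strcpy(n->cargo.val.s, (char*)val);
break;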
case INT:
n->cargo.val.i = *(int*)val;
break;
case FLOAT:
n->cargo.val.f = *(float*)val;
break;
}
n->cargo.type = type;
n->next = NULL;
return n;
}
void print_node(struct Node* n)
{
switch (n->cargo.type) {
case STRING:
printf("%s\n", n->cargo.val.s);
break;
case INT:
printf("%d\n", n->cargo.val.i);
break;
case FLOAT:
printf("%f\n", n->cargo.val.f);
break;
}
}
int main(void)
{
int my_int = 12345;
struct Node *int_node = create_node(INT, &my_int);
print_node(int_node); // => 12345
float my_float = 3.1415f;
struct Node *float_node = create_node(FLOAT, &my_float);
print_node(float_node); // => 3.141500
char *my_str = "hello, world";
struct Node *string_node = create_node(STRING, my_str);
print_node(string_node); // => hello, world
return 0;
}
Now I can link all these nodes together in a linked list, iterate through them, and do whatever else I need.
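As a sketch of that last step, here is one way the linking, iterating and cleaning up might look. It repeats the type definitions so that it stands on its own, and its create_node is a variant of the one above with allocation checks. The helpers push_front, print_list and free_list are hypothetical names, not part of the original program.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum data_type { STRING, INT, FLOAT };
union value { int i; float f; char *s; };
struct variable_data { union value val; enum data_type type; };
struct Node { struct variable_data cargo; struct Node *next; };

/* Build a node; strings are copied so the node owns its storage. */
struct Node *create_node(enum data_type type, const void *val)
{
    struct Node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->cargo.type = type;
    n->next = NULL;
    switch (type) {
    case STRING: {
        const char *src = val;
        n->cargo.val.s = malloc(strlen(src) + 1);
        if (n->cargo.val.s == NULL) {
            free(n);
            return NULL;
        }
        strcpy(n->cargo.val.s, src);
        break;
    }
    case INT:
        n->cargo.val.i = *(const int *)val;
        break;
    case FLOAT:
        n->cargo.val.f = *(const float *)val;
        break;
    }
    return n;
}

/* Link `node` in front of `head` and return the new head. */
struct Node *push_front(struct Node *head, struct Node *node)
{
    assert(node != NULL);
    node->next = head;
    return node;
}

/* Walk the list, using each node's tag to pick the right union member.
   Returns the number of nodes visited. */
size_t print_list(const struct Node *head)
{
    size_t count = 0;
    for (const struct Node *n = head; n != NULL; n = n->next, count++) {
        switch (n->cargo.type) {
        case STRING: printf("%s\n", n->cargo.val.s); break;
        case INT:    printf("%d\n", n->cargo.val.i); break;
        case FLOAT:  printf("%f\n", n->cargo.val.f); break;
        }
    }
    return count;
}

/* Free every node, plus the copied string where there is one. */
void free_list(struct Node *head)
{
    while (head != NULL) {
        struct Node *next = head->next;
        if (head->cargo.type == STRING)
            free(head->cargo.val.s);
        free(head);
        head = next;
    }
}
```

The tag does double duty here: print_list uses it to pick a format, and free_list uses it to know which nodes own a heap-allocated string.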