Implementing base64 encoding from scratch

- 31 December 2023

What is base64?

Base64 is an encoding that represents binary data using printable characters. It consists of 64 characters ranging from A-Z, a-z, 0-9, plus sign and forward slash.
The 64 characters fit neatly into 6 bits.

Why use base64?

Base64 can be useful when you're limited by the set of characters you're allowed to use. This might be in an email, url, forum post, or text message. Any of these places might alter or strip out unwanted bytes - resulting in corrupt data.

When to not use base64?

It's important to know that base64 does not compress any data, it takes up 33-37% more space. So if you're not limited by the set of characters you can use, there is no real reason to use it.
It's also not a cryptographic function and it doesn't add any form of security.

How does base64 work?

Base64 uses 64 characters - each character represents a numeric value as shown in the table below.

When encoding from binary to base64 - every 3 bytes of binary data is represented as 4 base64 characters.

Below you can see how the list of bits from the ASCII characters (top row), is interpreted - 6 bits at a time - as base64 characters (bottom row).

You can interact with table the above here:

Below are three simple examples where we're encoding "A", "B" and "C".
The base64 encoding for all 3 starts with "Q" since the left-most bits are all 010000 (16).

Looking at the character table above - we find that 16 does represent "Q".

Implementing binary to base64 encoding

First, we create a function that takes in a pointer to an input buffer, the number of bytes in that buffer, and an output buffer.
For this demonstration - we assume the caller has cleared and allocated enough memory to contain the base64 encoded string.

        
            void base64_encode(u8* input, u32 input_length, u8* output) {

            }

To keep track of our output buffer we use output_offset.

Since every 3 bytes of binary data results in 4 base64 characters, we make a loop that increments 3 bytes each iteration.

        
        
            void base64_encode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=3) {

                }
            }

The binary data might not be evenly divided into pairs of 3. So to handle cases where the last iteration only has to process 1 or 2 bytes - we calculate the "remaining_bytes" and "bytes_to_process"

"bytes_to_process" is clamped to be a maximum of 3 using a MIN macro.

        
        
        
            #define MIN(a,b) ((a) < (b) ? (a) : (b))

            void base64_encode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=3) {
                    u32 remaining_bytes = input_length - i;
                    u32 bytes_to_process = MIN(remaining_bytes, 3);
                }
            }

We are going to do some bitwise operations to retrieve our 4 pairs of 6 bits.

To make this process easier - we first bit-shift our 3 separate bytes into a single 32-bit integer called "bitfield".

        
        
            void base64_encode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=3) {
                    u32 remaining_bytes = input_length - i;
                    u32 bytes_to_process = MIN(remaining_bytes, 3);

                    u32 bitfield = 0;
                    for (u32 b = 0; b < bytes_to_process; b++) {
                        bitfield += input[i+b] << (32 - 8 * (b+1));
                    }
                }
            }

Below you can go step through the code to see what happens when we do the bit-shifting above.

We now have all 3 bytes bit-shifted into a single 32-bit integer. The next step is to read out 6 bits at a time. Each of these 6 bits corresponds to a base64 value.

The second loop runs for "bytes_to_process + 1" iterations. Since for 1, 2, and 3 bytes processed, we'll get 2, 3, and 4 base64 values respectively.

Besides bit-shifting, we're also using a bitmask. This ensures we only get the least significant 6 bits from our bit-shifting.

        
        
            void base64_encode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=3) {
                    u32 remaining_bytes = input_length - i;
                    u32 bytes_to_process = MIN(remaining_bytes, 3);

                    u32 bitfield = 0;
                    for (u32 b = 0; b < bytes_to_process; b++) {
                        bitfield += input[i+b] << (32 - 8 * (b+1));
                    }

                    for (u32 b = 0; b < bytes_to_process + 1; b++) {
                        u8 base64_value = bitfield >> (32 - 6 * (b+1)) & 0b111111;
                    }
                }
            }

Now we have the four base64 values - but these values have to be represented as characters.
So we need to translate the base64 value 22 to the ASCII value 87 (W).

There are many ways to do this translation, but the simplest is to look them up in an array - where the index is also the base64 value.

Then we write the character to our output buffer and increment output_offset.

        
        
        
            u8* base64_character_table =
                "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

            void base64_encode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=3) {
                    u32 remaining_bytes = input_length - i;
                    u32 bytes_to_process = MIN(remaining_bytes, 3);

                    u32 bitfield = 0;
                    for (u32 b = 0; b < bytes_to_process; b++) {
                        bitfield += input[i+b] << (32 - 8 * (b+1));
                    }

                    for (u32 b = 0; b < bytes_to_process + 1; b++) {
                        u8 base64_value = bitfield >> (32 - 6 * (b+1)) & 0b111111;
                        output[output_offset++] = base64_character_table[base64_value];
                    }
                }
            }

One last thing I've skipped over for now, which is also a very identifying feature of base64 is it's padding characters.

These padding characters are "=" and are placed at the end of the base64 string.

The padding characters ensures that the base64 strings length is evenly divisible by 4. They do not have any affect on the encoding itself.

Encoding "TEST" will result in the base64 string "VEVTVA==" - the two added padding characters makes the length 8.

This completes the encoding of binary to base64! 🎉

        
        
            void base64_encode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=3) {
                    u32 remaining_bytes = input_length - i;
                    u32 bytes_to_process = MIN(remaining_bytes, 3);

                    u32 bitfield = 0;
                    for (u32 b = 0; b < bytes_to_process; b++) {
                        bitfield += input[i+b] << (32 - 8 * (b+1));
                    }

                    for (u32 b = 0; b < bytes_to_process + 1; b++) {
                        u8 base64_value = bitfield >> (32 - 6 * (b+1)) & 0b111111;
                        output[output_offset++] = base64_character_table[base64_value];
                    }
                }

                u32 padding_characters = 2 - (input_length + 2) % 3;
                for (u32 i = 0; i < padding_characters; i++) {
                    output[output_offset++] = '=';
                }
            }

Implementing binary to base64 decoding

I won't go into the same level of detail with decoding - since it's many of the same concepts as with encoding (just in reverse).

One of the bigger differences between encoding and decoding is when encoding any input will always be valid. But when decoding only certain base64 characters are allowed.

This raises the question of what to do when an invalid base64 input is given. For simplicity, we will simply break if we cannot find a suitable character.
A pre-validation step could be used - or the function could return a success flag.

A new helper-function has been added called "get_base64_character_index". This acts much like an "index_of" function.
It returns the index (which is also the base64 value) of a given ASCII character in the "base64_character_table".

The complete source is provided below. As you can see there are many similarities between "base64_encode" and "base64_decode", but lengths and byte sizes have been swapped around.

        
        
        
            #define MIN(a,b) ((a) < (b) ? (a) : (b))

            u8* base64_character_table =
                "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

            u8 get_base64_character_index(u8 value) {
                for (u32 i = 0; base64_character_table[i]; i++) {
                    if (base64_character_table[i] == value) return i;
                }
                return -1;
            }

            void base64_encode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=3) {
                    u32 remaining_bytes = input_length - i;
                    u32 bytes_to_process = MIN(remaining_bytes, 3);

                    u32 bitfield = 0;
                    for (u32 b = 0; b < bytes_to_process; b++) {
                        bitfield += input[i+b] << (32 - 8 * (b+1));
                    }

                    for (u32 b = 0; b < bytes_to_process + 1; b++) {
                        u8 base64_value = bitfield >> (32 - 6 * (b+1)) & 0b111111;
                        output[output_offset++] = base64_character_table[base64_value];
                    }
                }

                int padding_characters = 2 - (input_length + 2) % 3;
                for (u32 i = 0; i < padding_characters; i++) {
                    output[output_offset++] = '=';
                }
            }

            void base64_decode(u8* input, u32 input_length, u8* output) {
                u32 output_offset = 0;

                for (u32 i = 0; i < input_length; i+=4) {
                    u32 remaining_bytes = input_length - i;
                    u32 bytes_to_process = MIN(remaining_bytes, 4);

                    u32 bitfield = 0;
                    for (u32 b = 0; b < bytes_to_process; b++) {
                        u8 base64_value = get_base64_character_index(input[i+b]);
                        if (base64_value == -1) break;
                        bitfield += base64_value << (32 - 6 * (b+1));
                    }

                    for (u32 b = 0; b < bytes_to_process - 1; b++) {
                        u8 byte = bitfield >> (32 - 8 * (b+1)) & 0xFF;
                        output[output_offset++] = byte;
                    }
                }
            }

Thanks!

Thank you for sticking around till the end. I hope it has been informative and perhaps you've learned a thing or two. :)

If you have any questions, thoughts, or comments, feel free to reach me at hello@eibx.com