A Simple Microcontroller Coprocessing Encryption Engine (Part 4)

The VHDL implementation of the XXTEA Algorithm

Exploding the Loops

In the last post I explained how I have implemented the MX function in VHDL. In this post I want to now take a look at the main loops of the algorithm that is indexed by q and p. We will base the pipelining of the process on the assumption that there is a single clock cycle delay between the changing of the MX variables. This is in line with the last post’s implementation example.

The p loop

Within the algorithm the p loop is executed only once as we only ever use the algorithm to encode two 32-bit blocks representing out 64-bit counter. So we can ‘unroll’ this loop and re-write the algorithm as a single loop only indexed by q. The algorithm therefore becomes.

#DEFINE DELTA 0x9E3779B9

#DEFINE MX (((Z>>5^Y<<2) + (Y>>3^Z<<4)) ^ ((SUM^Y) +
(KEY[(P&3)^E] ^ Z)))

VOID BTEA (UINT32_T *V, INT N, UINT32_T CONST KEY[4]) {
UINT32_T Y, Z, SUM;
UNSIGNED P, ROUNDS, E;
IF (N>1) {  /* CODING PART */
ROUNDS = 6 + 52/N;
SUM = 0;
Z = V[N-1];
DO {
SUM += DELTA;
E = (SUM >> 2) & 3;
Y = V[1];
Z = V[0] += MX;
Y = V[0];
Z = V[1] += MX;
} WHILE (–ROUNDS);
As I progress through this new loop I will use different values of MX so the implementation needs to be pipelined to generate these terms. I still want to keep the number of stage to a minimum so that the speed is not adversely effected.

So let’s start with the initial conditions for the algorithm and look at the individual lines as we progress through the new q loop.

The variables we are interested in tracking are sum, z, v[0], v[1] and MX. As previously established I can ignore y as this is equal to z at the various stages of the algorithm.

Initial conditions

The algorithm splits the 64-bit counter into two 32-bit words which are referenced with the index [0] and [1]. Within this implementation I have taken [0] to be the upper 32 bits of the counter and [1] to be the lower bits. So as I enter the loops we have the following preconditions set.

As the y and z variables impact on the value of MX I will assume that MX is also in a stable initial state identified as MX[0]. As variables change as I progress through the algorithm I will use a numerical index to identify the various values.

The q loop

In the first couple of lines of the loop I set the value of sum to be sum + DELTA, which is the value of DELTA itself. The change will impact upon the value of MX as the key index e is changed. So I have the following sequence of events.

The value of sum increments by DELTA, this changes the address to our memory which will then output a new value of key on the flowing clock edge. The changing value of key will, in turn, change MX which will be valid after the next clock edge.

So it will take three clock cycles for the value of MX to change to the new interim value MX[1]. Moving on to the next assignment statements these relate to z and v[0] which are equal to v[0] + MX, this will only take one clock cycle as these assignments occur concurrently. As a result of this assignment I have again changed the value of MX which will only take one clock cycle to resolve as the key address has not changed.

I can ignore the next assignment to y and progress to the final statement within the loop which is the simultaneous assignment of z and v[1] which are made equal to v[1] + MX[2]. These are resolved on the next clock cycle.

I have now completed the first iteration through the loop. At the end of this the value of z/y are changed which will result in a new MX value being created MX[3]. When I return to the start of the loop again the first command will add DELTA to the sum value; changing the key address. This will then ripple through to create another MX update MX[4]. As MX[3] is not used I can safely looped back to 1 in the previous table.

So from this I can see that each loop will take six clock cycles to complete and to encode the full counter over 32 iterations will take 192 clock cycles.

VHDL Encode

So using the information from the previous section I have coded the VHDL. The first thing I have done is to declare the various signals and constants I will use in the process. All of the arithmetic within the XXTEA algorithm is unsigned 32-bit so sum, v0, v1 and z are all declared as unsigned 32-bits signals. I know that I will only loop 32 times so I have defined q to be five bits long. Similarly I have defined the clock count as a range bound integer so that the number of bits used are constrained. This will limit the waste of resources that can occur if the synthesis tool uses 32-bit registers for integer types.

As I will be using 32-bit unsigned arithmetic I have also declared DELTA as a 32-bit constant value.

While I unrolled the p loop in my previous analysis of the algorithm in the VHDL implementation I have included a new loop which is indexed by the clk_count signal. So the proposed implementation is as follows:

I have added a control signal called en_ecrypt that I will use at a later stage of the design to active the encryption process.  The signals q, z, v0, v1, sum and clk_count are assigned the initial values before the main loop begins.

Within the q loop I have used a case statement to identify the different steps within each iteration. Note that while the original table used the step count 1 to 6 the clk_count value indexes from 0 to 5.

The case statement is then followed by the if statement that loops the clk_count through the range 0 to 5 and the q count up to 32, at which point the process completes.

Next posting

In this posting I have explained how to implement the main elements of the algorithm. So in my next post I will combine these elements into a complete VHDL entity that I can simulate.