1. Introduction

Modern systems are designed to improve merit figures such as power, area and/or speed. In hardware terms, the natural language of a computer is a set of integers bounded by a number, called the modulus, which establishes a ring of integers. This introduces the concept of modular arithmetic and its associated number representation, which is used in many areas, including digital signal processing and cryptographic systems [^{5}]. Modular arithmetic, specifically the Residue Number System (RNS) [^{12}], has been tested as an alternative to two's complement or classical weighted binary representation to increase the speed of FIR filters [^{8},^{9}].

This paper focuses on the implementation of a specific modular function, modular exponentiation. Taking advantage of the inherent flexibility of the FPGA architecture, a system was developed that uses the M-ary algorithm [^{10}] to test various characteristics, which are verified against a software equivalent and compared with previous implementations of the method.

Previous systems followed different schemes: a pure hardware approach, a co-design approach (hardware subsystem and software subsystem), or algorithms such as addition chains that reduce the number of multiplication steps [^{7},^{11}]. These systems were based on combining their best characteristics.

Modular exponentiation consists in finding a solution to the following eq. (1):

where *n* is the length of *M* in bits.
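As a hedged reading of eq. (1), using the names X, Y and M that Section 3 gives to the base, exponent and modulus, modular exponentiation can be written as follows. The result name Z is an assumption introduced here and is not taken from the original:

$$Z = X^{Y} \bmod M$$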

The M-ary algorithm can be subdivided into the following subprocesses [^{5}]: segmentation of the exponent into *w* windows of *d* bits; preprocessing and storage of every power of the base whose exponent lies in the range 0 to 2^d − 1; and repeated squaring and multiplication to obtain the result. These steps are illustrated in the pseudocode shown in Fig. 1.
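The three subprocesses can be sketched in software as follows. This is a minimal Python illustration and not the pseudocode of Fig. 1: the function name `m_ary_modexp`, its parameter names and the default window width `d=3` are assumptions, and Python's `%` operator stands in for the Montgomery multiplier used in the hardware.

```python
def m_ary_modexp(x, y, m, d=3):
    """Left-to-right 2^d-ary exponentiation: returns x**y mod m (sketch)."""
    if m == 1:
        return 0
    # Preprocessing: table[k] = x**k mod m for every possible d-bit window k.
    table = [1] * (1 << d)
    for k in range(1, 1 << d):
        table[k] = (table[k - 1] * x) % m
    # Segmentation: split y into w windows of d bits, most significant first.
    w = max(1, (y.bit_length() + d - 1) // d)
    mask = (1 << d) - 1
    windows = [(y >> (d * i)) & mask for i in reversed(range(w))]
    # Squaring and multiplying.
    result = 1
    for win in windows:
        for _ in range(d):
            result = (result * result) % m  # d squarings per window
        if win != 0:
            # A zero window needs no multiplication.
            result = (result * table[win]) % m
    return result
```

A quick check against Python's built-in modular power is `m_ary_modexp(83, 137, 2497) == pow(83, 137, 2497)`.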

Modular exponentiation relies on another operation, modular multiplication, which is implemented here with the Montgomery Method [^{6},^{12}]. The process takes two cycles. The first computes a preliminary result in the so-called M-residue space; this result carries an undesirable factor R^-1 = 2^-n mod M, the multiplicative inverse of R = 2^n modulo M. The second cycle recovers the desired product without that factor. The method avoids integer division, which is an expensive operation in hardware.

The Montgomery Method is described in the pseudo code shown in Fig. 2.

Modular multiplication can be described as shown in Fig. 3.
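The two-cycle scheme can be illustrated with a short Python sketch. It is not the pseudocode of Figs. 2 and 3: the function names are hypothetical, the bit-serial form of the Montgomery product is one common variant, and the correction factor is assumed to be R^2 mod M with R = 2^n. The modulus must be odd and both operands must be smaller than it.

```python
def mont_product(a, b, m, n):
    """Return a*b*2**(-n) mod m for odd m < 2**n, using only adds and shifts."""
    s = 0
    for i in range(n):
        if (a >> i) & 1:  # add b when bit i of a is set
            s += b
        if s & 1:         # adding the odd modulus makes s even
            s += m
        s >>= 1           # exact division by 2, no integer division needed
    return s - m if s >= m else s


def mont_modmul(a, b, m, n):
    """Plain modular product a*b mod m built from two Montgomery cycles."""
    r2 = pow(2, 2 * n, m)             # assumed correction factor R^2 mod m
    t = mont_product(a, b, m, n)      # first cycle: a*b*R^-1 mod m
    return mont_product(t, r2, m, n)  # second cycle cancels the R^-1 factor
```

With 12-bit operands, `mont_modmul(83, 137, 2497, 12)` agrees with `(83 * 137) % 2497`.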

2. Implementation

The complete system was modeled in VHDL using basic libraries such as the IEEE standard library, which makes the design more generic and portable across architectures. Reprogrammable hardware is used because it allows digital architectures to be designed with a degree of flexibility.

At the block level, the modular exponentiation system comprises four subsystems that work together: the Montgomery Multiplier, the Storage Unit, the Exponent Segmentation Unit, and the Control Unit. These blocks are shown in Fig. 4.

The sequential logic uses a finite state machine (FSM) of the Moore type, in which the outputs depend only on the current state. This scheme was chosen because asynchronous behavior should be avoided given the large number of states (approximately 40 for the main control module).

Internally, the Montgomery Multiplier has its own control subunit, which synchronizes the data path and notifies the central unit when a multiplication is finished. This controller also switches the multiplier's input sources when a Montgomery cycle is completed, as illustrated in Fig. 5.

Operands are 12 bits long, which implies the use of 12-bit units with a 1-bit carry.

The segmentation unit uses a 4-to-3 multiplexer to split the exponent into fields, each of which acts as a pointer to one of the preprocessed powers. A counter selects which field is used. There is an offset of 2 addresses because X^0 and X^1 are not stored in the memory register bank. The block diagram in Fig. 6 shows the internal structure of this unit.
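The field extraction and the address offset can be modeled in a few lines of Python. This is a sketch under stated assumptions: the "4-to-3" multiplexer is read as selecting one of four 3-bit fields of the 12-bit exponent, and the offset is read as storing X^k at address k − 2. Both function names are hypothetical.

```python
def exponent_fields(y, width=12, d=3):
    """Split a width-bit exponent into d-bit fields, most significant first."""
    if width % d != 0:
        raise ValueError("width must be a multiple of d")
    mask = (1 << d) - 1
    return [(y >> shift) & mask for shift in range(width - d, -1, -d)]


def register_address(field):
    """Assumed mapping: X^0 and X^1 are not stored, so X^k sits at k - 2."""
    if field < 2:
        return None  # no memory read needed for X^0 or X^1 (assumption)
    return field - 2
```

For example, the exponent 137 (binary 000 010 001 001) yields the fields `[0, 2, 1, 1]`.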

Address lines control a storage unit composed of 16 registers, each 12 bits wide. An input decoder takes the address and enables writing to one register. For reading, a 16-to-12 multiplexer connected to the address lines selects the register data and places it in the output register, whose lines send a Vj signal to the Montgomery Multiplier. The unit allows writing and reading one register at a time; its internal structure is shown in Fig. 7.

3. Experimental results

For testing purposes, a test block is added that holds the input patterns for the M-ary block. Each pattern is a 48-bit word subdivided into four 12-bit numbers that represent the required inputs: a base (X), an exponent (Y), a modulus (M) and a correction factor (R). There are 4 samples in total, stored in a 4-word × 48-bit ROM. The test block starts each exponentiation and monitors a finish flag bit provided by the exponentiation block. When an exponentiation finishes, a counter advances the address pointer that controls the ROM and loads the corresponding data subsets into the output registers. Fig. 8 illustrates the testing structure.

A detailed scheme of the internal structure in the testing block can be seen in Fig. 9.

This system is intended for testing, so its hardware requirements are modest and a low-cost platform such as the DE0 evaluation board is sufficient.

The physical implementation of the previous block was carried out using Quartus II software, SignalTap, and a DE0 development board that has an EP3C16F484C6N FPGA from Altera Corp. Figs. 10(a) and 10(b) show the timing diagrams for some data sets.

An estimate of the merit figures was obtained using the TimeQuest and Quartus II tools as follows:

In addition to these examples, a further one is presented. It addresses the cryptographic application of modular exponentiation, and its aim is to cipher and decipher a character written in an ASCII-like style.

The next steps are used to obtain the public and private keys of the RSA algorithm [^{3}]:

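1. Take two prime numbers p and q and compute their product n:

$$n = p \cdot q = 11 \cdot 227 = 2497$$

2. Calculate Euler’s phi function of n using:

$$\varphi(n) = (p - 1)(q - 1) = 10 \cdot 226 = 2260$$

3. Choose a prime number e that is less than φ(n) and does not divide it; here e = 137.

4. Find the multiplicative inverse d of e modulo φ(n), defined by:

$$d \cdot e \equiv 1 \pmod{\varphi(n)}$$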

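5. Use an iterative algorithm in Matlab to find the multiplicative inverse, based on the following fact:

$$d = \frac{1 + i \cdot \varphi(n)}{e}$$

Iterate over i until d is an integer, which gives, for φ(n) = 2260, e = 137 and i = 2:

$$d = \frac{1 + 2 \cdot 2260}{137} = 33$$

Multiply d and e:

$$d \cdot e = 33 \cdot 137 = 4521 = 2 \cdot 2260 + 1 \equiv 1 \pmod{2260}$$

This proves that *e* is the multiplicative inverse of *d* modulo φ(n).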

6. Create a private key with the set (p, q, d) = (11, 227, 33) and a public key (n, e) = (2497, 137). Using ASCII code, the character “S” is represented by 83, which is taken as the message for the cipher and decipher steps.

7. The cipher step uses modular exponentiation to solve the following equation, where C denotes the ciphertext:

$$C = 83^{137} \bmod 2497$$

Fig. 11 shows the cipher results.

8. Finally, the decipher step uses modular exponentiation to recover the message from the ciphertext C of step 7:

$$C^{33} \bmod 2497 = 83$$

Fig. 12 shows the decipher results.
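The key generation and the cipher and decipher steps above can be reproduced with a short Python sketch. It is a software illustration and not the Matlab script used by the authors: the function name `find_private_exponent` and the variable names are assumptions, and Python's built-in `pow` stands in for the hardware exponentiation block.

```python
def find_private_exponent(e, phi):
    """Search i = 1, 2, ... until (1 + i*phi) is divisible by e."""
    for i in range(1, e + 1):
        if (1 + i * phi) % e == 0:
            return (1 + i * phi) // e
    raise ValueError("e has no inverse modulo phi")


p, q, e = 11, 227, 137
n = p * q                          # modulus of the public key
phi = (p - 1) * (q - 1)            # Euler's phi function of n
d = find_private_exponent(e, phi)  # private exponent

message = 83                       # character code used in the text
cipher = pow(message, e, n)        # cipher step
recovered = pow(cipher, d, n)      # decipher step

print(n, phi, d, recovered)        # prints: 2497 2260 33 83
```

The search stops at i = 2, and the deciphered value equals the original message because d · e ≡ 1 (mod φ(n)).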

4. Conclusions

Practical work showed clearly that synchronization between blocks is affected by data dependencies. Resolving these dependencies simplifies the handling of the signals involved in the design of digital systems.

The design of equivalent procedures in software facilitated error debugging in the hardware implementation.

The modular multiplication control system was optimized using a T flip-flop to reduce the number of steps in the FSM. Additionally, when a zero-valued segment is detected in the exponent, the central control unit skips the multiplication.

Using an EP3C16F484C6N Cyclone III FPGA from Altera Corp., the merit figures reported were as follows:

In future work, the optimization of system controllers is a possible aim to reduce the number of clock cycles per operation. For the data path, the systolic design should be evaluated and considered to reduce complexity in the system controllers and to avoid synchronization between blocks. On the other hand, it would be interesting to create a completely parameterizable block where the amount of data and window bits are adjustable to requirements within cryptographic designs. Another important point to be explored is the use of modular arithmetic to implement more efficient digital signal processing filters in terms of time [^{5}].