In fact the library discussed within this text has already been used to form a polynomial basis library\footnote{See \url{http://poly.libtomcrypt.org} for more details.}.
The norm of a multiple precision integer, for example $\vert \vert x \vert \vert$, will be used to represent the number of digits in the representation
When compiled with GCC for the x86 processor and optimized for speed the entire library is approximately $100$KiB\footnote{The notation ``KiB'' means $2^{10}$ octets, similarly ``MiB'' means $2^{20}$ octets.}
As can be expected this algorithm is very simple. The loop on step one is expected to iterate only once or twice at
the most. For example, this will happen in cases where there is not a carry to fill the last position. Step two fixes the sign for
when all of the digits are zero to ensure that the mp\_int is valid at all times.
EXAM,bn_mp_clamp.c
Note on line @27,while@ how the test for the \textbf{used} count is made on the left of the \&\& operator. In the C programming
language the terms to \&\& are evaluated left to right with a boolean short-circuit if any condition fails. This is
important since if the \textbf{used} count is zero the test on the right would fetch below the array. That is obviously
undesirable. The parentheses on line @28,a->used@ are used to make sure the \textbf{used} count is decremented and not
the pointer ``a''.
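The two steps of the clamp can be sketched on a simplified structure. The names \texttt{toy\_int}, \texttt{toy\_clamp}, \texttt{TOY\_ZPOS} and \texttt{TOY\_NEG} are hypothetical stand-ins chosen for this illustration (fixed storage, no allocation) and are not the library's own definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ZPOS 0
#define TOY_NEG  1

/* Hypothetical stand-in for an mp_int: fixed storage, no allocation. */
typedef struct {
    int      used;    /* number of digits in use            */
    int      sign;    /* TOY_ZPOS or TOY_NEG                 */
    uint32_t dp[8];   /* digits, least significant first     */
} toy_int;

/* Trim leading zero digits, then normalize the sign of zero. */
static void toy_clamp(toy_int *a)
{
    assert(a != NULL && a->used >= 0 && a->used <= 8);

    /* "used > 0" is tested first, so dp[used - 1] is never read
     * when used == 0 (short-circuit evaluation of &&). */
    while (a->used > 0 && a->dp[a->used - 1] == 0) {
        --(a->used);  /* decrement the count, not the pointer */
    }

    /* A value with no digits is zero, and zero is always positive. */
    if (a->used == 0) {
        a->sign = TOY_ZPOS;
    }
}
```

A negative value with two leading zero digits keeps its sign and shrinks to one digit, while an all-zero negative value becomes a positive zero with no used digits.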
\section*{Exercises}
\begin{tabular}{cl}
$\left [ 1 \right ]$ & Discuss the relevance of the \textbf{used} member of the mp\_int structure. \\
& \\
$\left [ 1 \right ]$ & Discuss the consequences of not using padding when performing allocations. \\
& \\
$\left [ 2 \right ]$ & Estimate an ideal value for \textbf{MP\_PREC} when performing 1024-bit RSA \\
& encryption when $\beta = 2^{28}$. \\
& \\
$\left [ 1 \right ]$ & Discuss the relevance of the algorithm mp\_clamp. What does it prevent? \\
& \\
$\left [ 1 \right ]$ & Give an example of when the algorithm mp\_init\_copy might be useful. \\
& \\
\end{tabular}
%%%
% CHAPTER FOUR
%%%
\chapter{Basic Operations}
\section{Introduction}
In the previous chapter a series of low level algorithms were established that dealt with initializing and maintaining
mp\_int structures. This chapter will discuss another set of seemingly non-algebraic algorithms which will form the low
level basis of the entire library. While these algorithms are relatively trivial it is important to understand how they
work before proceeding since these algorithms will be used almost intrinsically in the following chapters.
The algorithms in this chapter deal primarily with more ``programmer'' related tasks such as creating copies of
mp\_int structures, assigning small values to mp\_int structures and comparisons of the values mp\_int structures
represent.
\section{Assigning Values to mp\_int Structures}
\subsection{Copying an mp\_int}
Assigning the value that a given mp\_int structure represents to another mp\_int structure shall be known as making
a copy for the purposes of this text. The copy of the mp\_int will be a separate entity that represents the same
value as the mp\_int it was copied from. The mp\_copy algorithm provides this functionality.
\newpage\begin{figure}[here]
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_copy}. \\
\textbf{Input}. An mp\_int $a$ and $b$. \\
\textbf{Output}. Store a copy of $a$ in $b$. \\
\hline \\
1. If $b.alloc < a.used$ then grow $b$ to $a.used$ digits. (\textit{mp\_grow}) \\
2. for $n$ from 0 to $a.used - 1$ do \\
\hspace{3mm}2.1 $b_{n} \leftarrow a_{n}$ \\
3. for $n$ from $a.used$ to $b.used - 1$ do \\
\hspace{3mm}3.1 $b_{n} \leftarrow 0$ \\
4. $b.used \leftarrow a.used$ \\
5. $b.sign \leftarrow a.sign$ \\
6. return(\textit{MP\_OKAY}) \\
\hline
\end{tabular}
\end{center}
\caption{Algorithm mp\_copy}
\end{figure}
\textbf{Algorithm mp\_copy.}
This algorithm copies the mp\_int $a$ such that upon successful termination of the algorithm the mp\_int $b$ will
represent the same integer as the mp\_int $a$. The mp\_int $b$ shall be a complete and distinct copy of the
mp\_int $a$ meaning that the mp\_int $a$ can be modified and it shall not affect the value of the mp\_int $b$.
If $b$ does not have enough room for the digits of $a$ it must first have its precision augmented via the mp\_grow
algorithm. The digits of $a$ are copied over the digits of $b$ and any excess digits of $b$ are set to zero (step two
and three). The \textbf{used} and \textbf{sign} members of $a$ are finally copied over the respective members of
$b$.
\textbf{Remark.} This algorithm also introduces a new idiosyncrasy that will be used throughout the rest of the
text. The error return codes of other algorithms are not explicitly checked in the pseudo-code presented. For example, in
step one of the mp\_copy algorithm the return of mp\_grow is not explicitly checked to ensure it succeeded. Text space is
limited so it is assumed that if an algorithm fails it will clear all temporarily allocated mp\_ints and return
the error code itself. However, the C code presented will demonstrate all of the error handling logic required to
implement the pseudo-code.
EXAM,bn_mp_copy.c
Occasionally a dependent algorithm may copy an mp\_int effectively into itself such as when the input and output
mp\_int structures passed to a function are one and the same. For this case it is optimal to return immediately without
copying digits (line @24,a == b@).
The mp\_int $b$ must have enough digits to accommodate the used digits of the mp\_int $a$. If $b.alloc$ is less than
$a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines @29,alloc@ to @33,}@). In order to
simplify the inner loop that copies the digits from $a$ to $b$, two aliases $tmpa$ and $tmpb$ point directly at the digits
of the mp\_ints $a$ and $b$ respectively. These aliases (lines @42,tmpa@ and @45,tmpb@) allow the compiler to access the digits without first dereferencing the
mp\_int pointers and then subsequently the pointer to the digits.
After the aliases are established the digits from $a$ are copied into $b$ (lines @48,for@ to @50,}@) and then the excess
digits of $b$ are set to zero (lines @53,for@ to @55,}@). Both ``for'' loops make use of the pointer aliases and in
fact the alias for $b$ is carried through into the second ``for'' loop to clear the excess digits. This optimization
allows the alias to stay in a machine register fairly easily between the two loops.
\textbf{Remarks.} The use of pointer aliases is an implementation methodology first introduced in this function that will
be used considerably in other functions. Technically, a pointer alias is simply a short hand alias used to lower the
number of pointer dereferencing operations required to access data. For example, a for loop may resemble
\begin{alltt}
for (x = 0; x < 100; x++) \{
a->num[4]->dp[x] = 0;
\}
\end{alltt}
This could be re-written using aliases as
\begin{alltt}
mp_digit *tmpa;
tmpa = a->num[4]->dp;
for (x = 0; x < 100; x++) \{
*tmpa++ = 0;
\}
\end{alltt}
In this case an alias is used to access the
array of digits within an mp\_int structure directly. It may seem that a pointer alias is strictly not required
as a compiler may optimize out the redundant pointer operations. However, there are two dominant reasons to use aliases.
The first reason is that most compilers will not effectively optimize pointer arithmetic. For example, some optimizations
may work for the Microsoft Visual C++ compiler (MSVC) and not for the GNU C Compiler (GCC). Also some optimizations may
work for GCC and not MSVC. As such it is ideal to find a common ground for as many compilers as possible. Pointer
aliases optimize the code considerably before the compiler even reads the source code which means the end compiled code
stands a better chance of being faster.
The second reason is that pointer aliases often can make an algorithm simpler to read. Consider the first ``for''
loop of the function mp\_copy() re-written to not use pointer aliases.
\begin{alltt}
/* copy all the digits */
for (n = 0; n < a->used; n++) \{
b->dp[n] = a->dp[n];
\}
\end{alltt}
Whether this code is harder to read depends strongly on the individual. However, it is quantifiably slightly more
complicated as there are four variables within the statement instead of just two.
\subsubsection{Nested Statements}
Another commonly used technique in the source routines is that certain sections of code are nested. This is used in
particular with the pointer aliases to highlight code phases. For example, a Comba multiplier (discussed in chapter six)
will typically have three different phases. First the temporaries are initialized, then the columns calculated and
finally the carries are propagated. In this example the middle column production phase will typically be nested as it
uses temporary variables and aliases the most.
The nesting also simplifies the source code as variables that are nested are only valid for their scope. As a result
the various temporary variables required do not propagate into other sections of code.
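A minimal sketch of the nesting idea follows. The function \texttt{toy\_zero\_tail} is a hypothetical example written for this illustration; it is not taken from the library. The alias and loop counter are declared inside a nested block, so they cannot leak into the rest of the function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: zero the entries digits[keep .. n-1]. */
static void toy_zero_tail(uint32_t *digits, size_t n, size_t keep)
{
    assert(digits != NULL && keep <= n);

    {
        /* tmp and x exist only inside this nested block */
        uint32_t *tmp = digits + keep;
        size_t    x;

        for (x = keep; x < n; x++) {
            *tmp++ = 0;
        }
    }

    /* tmp and x are out of scope here and cannot be misused */
}
```

Any later phase of the function is free to declare its own temporaries with the same names without risk of reusing a stale pointer.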
\subsection{Creating a Clone}
Another common operation is to make a local temporary copy of an mp\_int argument. To initialize an mp\_int
and then copy another existing mp\_int into the newly initialized mp\_int will be known as creating a clone. This is
useful within functions that need to modify an argument but do not wish to actually modify the original copy. The
mp\_init\_copy algorithm has been designed to help perform this task.
\begin{figure}[here]
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_init\_copy}. \\
\textbf{Input}. An mp\_int $a$ and $b$\\
\textbf{Output}. $a$ is initialized to be a copy of $b$. \\
\hline \\
1. Init $a$. (\textit{mp\_init}) \\
2. Copy $b$ to $a$. (\textit{mp\_copy}) \\
3. Return the status of the copy operation. \\
\hline
\end{tabular}
\end{center}
\caption{Algorithm mp\_init\_copy}
\end{figure}
\textbf{Algorithm mp\_init\_copy.}
This algorithm will initialize an mp\_int variable and copy another previously initialized mp\_int variable into it. As
such this algorithm will perform two operations in one step.
EXAM,bn_mp_init_copy.c
This will initialize \textbf{a} and make it a verbatim copy of the contents of \textbf{b}. Note that
\textbf{a} will have its own memory allocated which means that \textbf{b} may be cleared after the call
and \textbf{a} will be left intact.
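The independence of a clone from its source can be illustrated with a simplified structure. The names \texttt{toy\_int}, \texttt{toy\_init\_copy} and \texttt{toy\_clear} are hypothetical and only mimic the behaviour described above; the return values $0$ and $-1$ are likewise assumptions of this sketch rather than the library's error codes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an mp_int with heap storage. */
typedef struct {
    int       used;
    int       alloc;
    int       sign;
    uint32_t *dp;
} toy_int;

/* Give a its own storage and copy b into it.
 * Returns 0 on success, -1 if the allocation fails. */
static int toy_init_copy(toy_int *a, const toy_int *b)
{
    assert(a != NULL && b != NULL && a != b);

    a->alloc = (b->used > 0) ? b->used : 1;
    a->dp    = calloc((size_t)a->alloc, sizeof *a->dp);
    if (a->dp == NULL) {
        return -1;
    }
    if (b->used > 0) {
        memcpy(a->dp, b->dp, (size_t)b->used * sizeof *a->dp);
    }
    a->used = b->used;
    a->sign = b->sign;
    return 0;
}

/* Release the storage owned by a. */
static void toy_clear(toy_int *a)
{
    free(a->dp);
    a->dp   = NULL;
    a->used = a->alloc = 0;
}
```

Because the clone owns a separate digit array, modifying or clearing the source afterwards leaves the clone's digits unchanged.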
\section{Zeroing an Integer}
Resetting an mp\_int to the default state is a common step in many algorithms. The mp\_zero algorithm will be the algorithm used to
perform this task.
\begin{figure}[here]
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_zero}. \\
\textbf{Input}. An mp\_int $a$ \\
\textbf{Output}. Zero the contents of $a$ \\
\hline \\
1. $a.used \leftarrow 0$ \\
2. $a.sign \leftarrow$ MP\_ZPOS \\
3. for $n$ from 0 to $a.alloc - 1$ do \\
\hspace{3mm}3.1 $a_n \leftarrow 0$ \\
\hline
\end{tabular}
\end{center}
\caption{Algorithm mp\_zero}
\end{figure}
\textbf{Algorithm mp\_zero.}
This algorithm simply resets an mp\_int to the default state.
EXAM,bn_mp_zero.c
After the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the
\textbf{sign} variable is set to \textbf{MP\_ZPOS}.
\section{Sign Manipulation}
\subsection{Absolute Value}
With the mp\_int representation of an integer, calculating the absolute value is trivial. The mp\_abs algorithm will compute
3. Clamp excess used digits (\textit{mp\_clamp}) \\
\hline
\end{tabular}
\end{center}
\caption{Algorithm mp\_set\_int}
\end{figure}
\textbf{Algorithm mp\_set\_int.}
The algorithm performs eight iterations of a simple loop where in each iteration four bits from the source are added to the
mp\_int. Step 2.1 will multiply the current result by sixteen making room for four more bits in the less significant positions. In step 2.2 the
next four bits from the source are extracted and are added to the mp\_int. The \textbf{used} digit count is
incremented to reflect the addition. The \textbf{used} digit counter is incremented since if any of the leading digits were zero the mp\_int would have
zero digits used and the newly added four bits would be ignored.
Excess zero digits are trimmed in steps 2.1 and 3 by using higher level algorithms mp\_mul2d and mp\_clamp.
EXAM,bn_mp_set_int.c
This function sets four bits of the number at a time to handle all practical \textbf{DIGIT\_BIT} sizes. The weird
addition on line @38,a->used@ ensures that the newly added in bits are added to the number of digits. While it may not
seem obvious as to why the digit counter does not grow exceedingly large it is because of the shift on line @27,mp_mul_2d@
as well as the call to mp\_clamp() on line @40,mp_clamp@. Both functions will clamp excess leading digits which keeps
the number of used digits low.
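The four-bits-at-a-time loop can be sketched on a toy representation. Everything named \texttt{toy\_} or \texttt{TOY\_} below is hypothetical; in particular the digit size of seven bits is an assumption chosen only to show that the loop works when a digit is much smaller than the 32-bit source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_DIGIT_BIT 7                      /* assumed toy digit size */
#define TOY_MASK      ((1u << TOY_DIGIT_BIT) - 1u)
#define TOY_DIGITS    8

typedef struct {
    int      used;
    uint32_t dp[TOY_DIGITS];                 /* least significant first */
} toy_int;

static void toy_clamp(toy_int *a)
{
    while (a->used > 0 && a->dp[a->used - 1] == 0) {
        --(a->used);
    }
}

/* a <- a * 16, propagating a carry of at most four bits. */
static void toy_shl4(toy_int *a)
{
    uint32_t carry = 0;
    int      n;

    for (n = 0; n < a->used; n++) {
        uint32_t t = (a->dp[n] << 4) | carry;
        a->dp[n] = t & TOY_MASK;
        carry    = t >> TOY_DIGIT_BIT;
    }
    if (carry != 0) {
        assert(a->used < TOY_DIGITS);
        a->dp[a->used++] = carry;
    }
    toy_clamp(a);
}

/* Load a 32-bit value, most significant nibble first. */
static void toy_set_int(toy_int *a, uint32_t b)
{
    int x;

    memset(a, 0, sizeof *a);
    for (x = 0; x < 8; x++) {
        toy_shl4(a);                         /* make room for four bits  */
        a->dp[0] |= (b >> 28) & 15u;         /* add the next four bits   */
        b <<= 4;
        a->used += 1;                        /* make sure they are counted */
    }
    toy_clamp(a);
}
```

The digit count never grows past what the value needs plus one, since both the shift and the final clamp trim leading zero digits.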
\section{Comparisons}
\subsection{Unsigned Comparisons}
Comparing a multiple precision integer is performed with the exact same algorithm used to compare two decimal numbers. For example,
to compare $1,234$ to $1,264$ the digits are extracted by their positions. That is we compare $1 \cdot 10^3 + 2 \cdot 10^2 + 3 \cdot 10^1 + 4 \cdot 10^0$
to $1 \cdot 10^3 + 2 \cdot 10^2 + 6 \cdot 10^1 + 4 \cdot 10^0$ by comparing single digits at a time starting with the highest magnitude
positions. If any leading digit of one integer is greater than a digit in the same position of another integer then obviously it must be greater.
The first comparison routine that will be developed is the unsigned magnitude compare which will perform a comparison based on the digits of two
mp\_int variables alone. It will ignore the sign of the two inputs. Such a function is useful when an absolute comparison is required or if the
signs are known to agree in advance.
To facilitate working with the results of the comparison functions three constants are required.
\begin{figure}[here]
\begin{center}
\begin{tabular}{|r|l|}
\hline \textbf{Constant} & \textbf{Meaning} \\
\hline \textbf{MP\_GT} & Greater Than \\
\hline \textbf{MP\_EQ} & Equal To \\
\hline \textbf{MP\_LT} & Less Than \\
\hline
\end{tabular}
\end{center}
\caption{Comparison Return Codes}
\end{figure}
\begin{figure}[here]
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_cmp\_mag}. \\
\textbf{Input}. Two mp\_ints $a$ and $b$. \\
\textbf{Output}. Unsigned comparison results ($a$ to the left of $b$). \\
\hline \\
1. If $a.used > b.used$ then return(\textit{MP\_GT}) \\
2. If $a.used < b.used$ then return(\textit{MP\_LT}) \\
3. for n from $a.used - 1$ to 0 do \\
\hspace{+3mm}3.1 if $a_n > b_n$ then return(\textit{MP\_GT}) \\
\hspace{+3mm}3.2 if $a_n < b_n$ then return(\textit{MP\_LT}) \\
4. Return(\textit{MP\_EQ}) \\
\hline
\end{tabular}
\end{center}
\caption{Algorithm mp\_cmp\_mag}
\end{figure}
\textbf{Algorithm mp\_cmp\_mag.}
By saying ``$a$ to the left of $b$'' it is meant that the comparison is with respect to $a$, that is if $a$ is greater than $b$ it will return
\textbf{MP\_GT} and similar with respect to when $a = b$ and $a < b$. The first two steps compare the number of digits used in both $a$ and $b$.
Obviously if the digit counts differ there would be an imaginary zero digit in the smaller number where the leading digit of the larger number is.
If both have the same number of digits then the actual digits themselves must be compared starting at the leading digit.
By step three both inputs must have the same number of digits so it's safe to start from either $a.used - 1$ or $b.used - 1$ and count down to
the zero'th digit. If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}.
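The magnitude comparison described above can be sketched on plain digit arrays. The function \texttt{toy\_cmp\_mag} and the constants \texttt{TOY\_LT}, \texttt{TOY\_EQ} and \texttt{TOY\_GT} are hypothetical names for this illustration, and their numeric values are assumptions. The sketch also assumes both inputs have no leading zero digits, otherwise the digit counts could not be compared directly.

```c
#include <assert.h>
#include <stdint.h>

enum { TOY_LT = -1, TOY_EQ = 0, TOY_GT = 1 };

/* Compare |a| to |b|; digits are stored least significant first
 * and neither input may have leading zero digits. */
static int toy_cmp_mag(const uint32_t *a, int a_used,
                       const uint32_t *b, int b_used)
{
    int n;

    assert(a_used >= 0 && b_used >= 0);

    /* more digits means a larger magnitude */
    if (a_used > b_used) return TOY_GT;
    if (a_used < b_used) return TOY_LT;

    /* same length: scan from the leading digit downwards */
    for (n = a_used - 1; n >= 0; n--) {
        if (a[n] > b[n]) return TOY_GT;
        if (a[n] < b[n]) return TOY_LT;
    }
    return TOY_EQ;
}
```

With decimal digits, comparing $1234$ to $1264$ is decided at the tens position, where $3 < 6$.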
be both positive and a forward direction unsigned comparison is performed.
\section*{Exercises}
\begin{tabular}{cl}
$\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\
& \\
$\left [ 3 \right ]$ & Give the probability that algorithm mp\_cmp\_mag will have to compare $k$ digits \\
& of two random digits (of equal magnitude) before a difference is found. \\
& \\
$\left [ 1 \right ]$ & Suggest a simple method to speed up the implementation of mp\_cmp\_mag based \\
& on the observations made in the previous problem. \\
&
\end{tabular}
\chapter{Basic Arithmetic}
\section{Introduction}
At this point algorithms for initialization, clearing, zeroing, copying, comparing and setting small constants have been
established. The next logical set of algorithms to develop are addition, subtraction and digit shifting algorithms. These
algorithms make use of the lower level algorithms and are the crucial building block for the multiplication algorithms. It is very important
that these algorithms are highly optimized. On their own they are simple $O(n)$ algorithms but they can be called from higher level algorithms
which easily places them at $O(n^2)$ or even $O(n^3)$ work levels.
MARK,SHIFTS
All of the algorithms within this chapter make use of the logical bit shift operations denoted by $<<$ and $>>$ for left and right
logical shifts respectively. A logical shift is analogous to sliding the decimal point of radix-10 representations. For example, the real
number $0.9345$ is equivalent to $93.45\%$ which is found by sliding the decimal two places to the right (\textit{multiplying by $\beta^2 = 10^2$}).
Algebraically a binary logical shift is equivalent to a division or multiplication by a power of two.
For example, $a << k = a \cdot 2^k$ while $a >> k = \lfloor a/2^k \rfloor$.
One significant difference between a logical shift and the way decimals are shifted is that digits below the zero'th position are removed
from the number. For example, consider $1101_2 >> 1$. Using decimal notation this would produce $110.1_2$. However, with a logical shift the
result is $110_2$.
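The relationship between logical shifts and powers of two can be checked directly on an unsigned type. The helper names \texttt{toy\_shl} and \texttt{toy\_shr} are hypothetical. Note that in C a shift count equal to or larger than the width of the type is undefined, hence the guard.

```c
#include <assert.h>
#include <stdint.h>

/* a << k, i.e. a * 2^k reduced modulo 2^32 */
static uint32_t toy_shl(uint32_t a, unsigned k)
{
    assert(k < 32);   /* larger counts are undefined in C */
    return a << k;
}

/* a >> k, i.e. floor(a / 2^k); bits below position zero are dropped */
static uint32_t toy_shr(uint32_t a, unsigned k)
{
    assert(k < 32);
    return a >> k;
}
```

For instance $1101_2 >> 1$ gives $110_2$, that is $13$ becomes $6$, and the discarded low bit is the remainder of the division by two.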
\section{Addition and Subtraction}
In common two's complement fixed precision arithmetic negative numbers are easily represented by subtraction from the modulus. For example, with 32-bit integers
$a - b\mbox{ (mod }2^{32}\mbox{)}$ is the same as $a + (2^{32} - b) \mbox{ (mod }2^{32}\mbox{)}$ since $2^{32} \equiv 0 \mbox{ (mod }2^{32}\mbox{)}$.
As a result subtraction can be performed with a trivial series of logical operations and an addition.
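As a small illustration of this identity, the hypothetical helper \texttt{toy\_sub\_via\_add} below forms $2^{32} - b$ by inverting the bits of $b$ and adding one, then adds the result to $a$. It is a sketch for 32-bit unsigned values only.

```c
#include <assert.h>
#include <stdint.h>

/* Compute a - b (mod 2^32) using one inversion and additions only. */
static uint32_t toy_sub_via_add(uint32_t a, uint32_t b)
{
    uint32_t neg_b = ~b + 1u;      /* equals 2^32 - b modulo 2^32 */
    uint32_t r     = a + neg_b;

    /* sanity check against the native wrap-around subtraction */
    assert(r == (uint32_t)(a - b));
    return r;
}
```

When $b > a$ the result wraps around, for example $5 - 7$ yields $2^{32} - 2$, which is the two's complement encoding of $-2$.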
However, in multiple precision arithmetic negative numbers are not represented in the same way. Instead a sign flag is used to keep track of the
sign of the integer. As a result signed addition and subtraction are actually implemented as conditional usage of lower level addition or
subtraction algorithms with the sign fixed up appropriately.
The lower level algorithms will add or subtract integers without regard to the sign flag. That is they will add or subtract the magnitude of
the integers respectively.
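The conditional use of magnitude addition and subtraction can be sketched with a toy sign-magnitude type. The names \texttt{toy\_sm} and \texttt{toy\_add} are hypothetical, the magnitude is a single machine word, and overflow of that word is ignored, so this is only an outline of the sign logic and not of multiple precision arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sign-magnitude value: neg is 0 or 1. */
typedef struct {
    int      neg;
    uint32_t mag;
} toy_sm;

/* Signed addition built from unsigned add and subtract. */
static toy_sm toy_add(toy_sm a, toy_sm b)
{
    toy_sm c;

    assert((a.neg == 0 || a.neg == 1) && (b.neg == 0 || b.neg == 1));

    if (a.neg == b.neg) {
        /* same sign: add magnitudes, keep the common sign */
        c.neg = a.neg;
        c.mag = a.mag + b.mag;
    } else if (a.mag >= b.mag) {
        /* opposite signs: subtract the smaller magnitude */
        c.neg = a.neg;
        c.mag = a.mag - b.mag;
    } else {
        c.neg = b.neg;
        c.mag = b.mag - a.mag;
    }

    /* never produce a negative zero */
    if (c.mag == 0) {
        c.neg = 0;
    }
    return c;
}
```

Signed subtraction follows the same pattern after flipping the sign of the second operand.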
\subsection{Low Level Addition}
An unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers. That is to add the
trailing digits first and propagate the resulting carry upwards. Since this is a lower level algorithm the name will have an ``s\_'' prefix.
Historically that convention stems from the MPI library where ``s\_'' stood for static functions that were hidden from the developer entirely.
\newpage
\begin{figure}[!here]
\begin{center}
\begin{small}
\begin{tabular}{l}
\hline Algorithm \textbf{s\_mp\_add}. \\
\textbf{Input}. Two mp\_ints $a$ and $b$ \\
\textbf{Output}. The unsigned addition $c = \vert a \vert + \vert b \vert$. \\
\hline \\
1. if $a.used > b.used$ then \\
\hspace{+3mm}1.1 $min \leftarrow b.used$ \\
\hspace{+3mm}1.2 $max \leftarrow a.used$ \\
\hspace{+3mm}1.3 $x \leftarrow a$ \\
2. else \\
\hspace{+3mm}2.1 $min \leftarrow a.used$ \\
\hspace{+3mm}2.2 $max \leftarrow b.used$ \\
\hspace{+3mm}2.3 $x \leftarrow b$ \\
3. If $c.alloc < max + 1$ then grow $c$ to hold at least $max + 1$ digits (\textit{mp\_grow}) \\
\hline $+$ & $+$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\
\hline $-$ & $-$ & No & $c = b - a$ & $\mbox{opposite of }a.sign$ \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Subtraction Guide Chart}
\label{fig:SubChart}
\end{figure}
Similar to the case of algorithm mp\_add the \textbf{sign} is set first before the unsigned addition or subtraction. That is to prevent the
algorithm from producing $-a - -a = -0$ as a result.
EXAM,bn_mp_sub.c
Much like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations
and forward it to the end of the function. On line @38, != MP_LT@ the ``not equal to'' \textbf{MP\_LT} expression is used to emulate a
``greater than or equal to'' comparison.
\section{Bit and Digit Shifting}
MARK,POLY
It is quite common to think of a multiple precision integer as a polynomial in $x$, that is $y = f(\beta)$ where $f(x) = \sum_{i=0}^{n-1} a_i x^i$.
This notation arises within discussion of Montgomery and Diminished Radix Reduction as well as Karatsuba multiplication and squaring.
In order to facilitate operations on polynomials in $x$ as above a series of simple ``digit'' algorithms have to be established. That is to shift
the digits left or right as well as to shift individual bits of the digits left and right. It is important to note that not all ``shift'' operations
are on radix-$\beta$ digits.
\subsection{Multiplication by Two}
In a binary system where the radix is a power of two, multiplication by two not only arises often in other algorithms, it is also a fairly efficient
operation to perform. A single precision logical shift left is sufficient to multiply a single digit by two.
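A minimal sketch of doubling a multi-digit value follows, assuming 28-bit digits stored in 32-bit words. The function \texttt{toy\_mul\_2} and the \texttt{TOY\_} constants are hypothetical names for this illustration. Each digit is shifted left by one bit, and the bit shifted out of the top of one digit becomes the carry into the next.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DIGIT_BIT 28                     /* assumed digit size */
#define TOY_MASK      ((1u << TOY_DIGIT_BIT) - 1u)

/* b <- 2a. a has `used` digits, least significant first.
 * b must have room for used + 1 digits and may be the same array as a.
 * Returns the number of digits stored in b. */
static int toy_mul_2(const uint32_t *a, int used, uint32_t *b)
{
    uint32_t carry = 0;
    int      n;

    assert(used >= 0);

    for (n = 0; n < used; n++) {
        /* the top bit of this digit is the carry into the next one */
        uint32_t next = a[n] >> (TOY_DIGIT_BIT - 1);

        b[n]  = ((a[n] << 1) | carry) & TOY_MASK;
        carry = next;
    }
    if (carry != 0) {
        b[used++] = carry;                   /* result grew by one digit */
    }
    return used;
}
```

Only when the leading digit has its top bit set does the result need an extra digit, which is why the destination is grown to one digit more than the source.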
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_mul\_2}. \\
\textbf{Input}. One mp\_int $a$ \\
\textbf{Output}. $b = 2a$. \\
\hline \\
1. If $b.alloc < a.used + 1$ then grow $b$ to hold $a.used + 1$ digits. (\textit{mp\_grow}) \\
\hspace{3mm}7.1 for $n$ from $b.used$ to $oldused - 1$ do \\
\hspace{6mm}7.1.1 $b_n \leftarrow 0$ \\
8. $b.sign \leftarrow a.sign$ \\
9. Clamp excess digits of $b$. (\textit{mp\_clamp}) \\
10. Return(\textit{MP\_OKAY}).\\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_div\_2}
\end{figure}
\textbf{Algorithm mp\_div\_2.}
This algorithm will divide an mp\_int by two using logical shifts to the right. Like mp\_mul\_2 it uses a modified low level addition
core as the basis of the algorithm. Unlike mp\_mul\_2 the shift operations work from the leading digit to the trailing digit. The algorithm
could be written to work from the trailing digit to the leading digit; however, it would have to stop one short of $a.used - 1$ digits to prevent
reading past the end of the array of digits.
Essentially the loop at step 6 is similar to that of mp\_mul\_2 except the logical shifts go in the opposite direction and the carry is at the
least significant bit not the most significant bit.
EXAM,bn_mp_div_2.c
\section{Polynomial Basis Operations}
Recall from ~POLY~ that any integer can be represented as a polynomial in $x$ as $y = f(\beta)$. Such a representation is also known as
the polynomial basis \cite[pp. 48]{ROSE}. Given such a notation a multiplication or division by $x$ amounts to shifting whole digits a single
place. The need for such operations arises in several other higher level algorithms such as Barrett and Montgomery reduction, integer
division and Karatsuba multiplication.
Converting from an array of digits to polynomial basis is very simple. Consider the integer $y \equiv (a_2, a_1, a_0)_{\beta}$ and recall that
$y = \sum_{i=0}^{2} a_i \beta^i$. Simply replace $\beta$ with $x$ and the expression is in polynomial basis. For example, $f(x) = 8x + 9$ is the
polynomial basis representation for $89$ using radix ten. That is, $f(10) = 8(10) + 9 = 89$.
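Evaluating the polynomial at $x = \beta$ recovers the integer, which can be sketched with Horner's rule. The function \texttt{toy\_eval} is a hypothetical helper, and the result is held in a 64-bit word, so the sketch is only meaningful for values that fit in that range.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate f(x) = coef[n-1]*x^(n-1) + ... + coef[1]*x + coef[0]
 * with Horner's rule. Coefficients are least significant first. */
static uint64_t toy_eval(const uint32_t *coef, int n, uint32_t x)
{
    uint64_t y = 0;
    int      i;

    assert(n >= 0);

    for (i = n - 1; i >= 0; i--) {
        y = y * x + coef[i];
    }
    return y;
}
```

With the coefficients $(8, 9)$ and $x = 10$ this returns $89$, matching $f(10) = 8(10) + 9$.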
\subsection{Multiplication by $x$}
Given a polynomial in $x$ such as $f(x) = a_n x^n + a_{n-1} x^{n-1} + ... + a_0$ multiplying by $x$ amounts to shifting the coefficients up one
degree. In this case $f(x) \cdot x = a_n x^{n+1} + a_{n-1} x^n + ... + a_0 x$. From a scalar basis point of view multiplying by $x$ is equivalent to
multiplying by the integer $\beta$.
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_lshd}. \\
\textbf{Input}. One mp\_int $a$ and an integer $b$ \\
\textbf{Output}. $a \leftarrow a \cdot \beta^b$ (equivalent to multiplication by $x^b$). \\
\hline \\
1. If $b \le 0$ then return(\textit{MP\_OKAY}). \\
2. If $a.alloc < a.used + b$ then grow $a$ to at least $a.used + b$ digits. (\textit{mp\_grow}). \\
3. If the reallocation failed return(\textit{MP\_MEM}). \\
4. $a.used \leftarrow a.used + b$ \\
5. $i \leftarrow a.used - 1$ \\
6. $j \leftarrow a.used - 1 - b$ \\
7. for $n$ from $a.used - 1$ to $b$ do \\
\hspace{3mm}7.1 $a_{i} \leftarrow a_{j}$ \\
\hspace{3mm}7.2 $i \leftarrow i - 1$ \\
\hspace{3mm}7.3 $j \leftarrow j - 1$ \\
8. for $n$ from 0 to $b - 1$ do \\
\hspace{3mm}8.1 $a_n \leftarrow 0$ \\
9. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_lshd}
\end{figure}
\textbf{Algorithm mp\_lshd.}
This algorithm multiplies an mp\_int by the $b$'th power of $x$. This is equivalent to multiplying by $\beta^b$. The algorithm differs
from the other algorithms presented so far as it performs the operation in place instead of storing the result in a separate location. The
motivation behind this change is due to the way this function is typically used. Algorithms such as mp\_add store the result in an optionally
different third mp\_int because the original inputs are often still required. Algorithm mp\_lshd (\textit{and similarly algorithm mp\_rshd}) is
typically used on values where the original value is no longer required. The algorithm will return success immediately if
$b \le 0$ since the rest of the algorithm is only valid when $b > 0$.
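The in-place digit shift can be sketched on a plain array. The function \texttt{toy\_lshd} is a hypothetical helper that assumes the caller has already reserved room for the extra digits. Returning early for a zero-length input is a choice made in this sketch so that zero stays zero; it is not one of the steps listed in the figure.

```c
#include <assert.h>
#include <stdint.h>

/* Shift the `used` digits of a up by b whole digits, in place.
 * a must have room for used + b digits.
 * Returns the new digit count. */
static int toy_lshd(uint32_t *a, int used, int b)
{
    int n;

    assert(used >= 0);

    if (b <= 0) {
        return used;              /* nothing to do */
    }
    if (used == 0) {
        return 0;                 /* sketch-specific: zero stays zero */
    }

    /* copy from the top down so no digit is overwritten before it moves */
    for (n = used + b - 1; n >= b; n--) {
        a[n] = a[n - b];
    }

    /* the b lowest digits become zero */
    for (n = 0; n < b; n++) {
        a[n] = 0;
    }
    return used + b;
}
```

Working from the leading digit downwards is what makes the operation safe in place, since each source digit is read before its position is overwritten.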
First the destination $a$ is grown as required to accommodate the result. The counters $i$ and $j$ are used to form a \textit{sliding window} over
the digits of $a$ of length $b$. The head of the sliding window is at $i$ (\textit{the leading digit}) and the tail at $j$ (\textit{the trailing digit}).
The loop on step 7 copies the digit from the tail to the head. In each iteration the window is moved down one digit. The last loop on
$\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\
& in $O(n)$ time. \\
&\\
$\left [ 3 \right ] $ & Devise an efficient algorithm to multiply by small low Hamming \\
                      & weight values such as $3$, $5$ and $9$.  Extend it to handle all values \\
                      & up to $64$ with a Hamming weight less than three. \\
&\\
$\left [ 2 \right ] $ & Modify the preceding algorithm to handle values of the form \\
& $2^k - 1$ as well. \\
&\\
$\left [ 3 \right ] $ & Using only algorithms mp\_mul\_2, mp\_div\_2 and mp\_add create an \\
& algorithm to multiply two integers in roughly $O(2n^2)$ time for \\
& any $n$-bit input. Note that the time of addition is ignored in the \\
& calculation. \\
& \\
$\left [ 5 \right ] $ & Improve the previous algorithm to have a working time of at most \\
& $O \left (2^{(k-1)}n + \left ({2n^2 \over k} \right ) \right )$ for an appropriate choice of $k$. Again ignore \\
& the cost of addition. \\
& \\
$\left [ 2 \right ] $ & Devise a chart to find optimal values of $k$ for the previous problem \\
& for $n = 64 \ldots 1024$ in steps of $64$. \\
& \\
$\left [ 2 \right ] $ & Using only algorithms mp\_abs and mp\_sub devise another method for \\
& calculating the result of a signed comparison. \\
&
\end{tabular}
\chapter{Multiplication and Squaring}
\section{The Multipliers}
For most number theoretic problems including certain public key cryptographic algorithms, the ``multipliers'' form the most important subset of
algorithms of any multiple precision integer package. The set of multiplier algorithms include integer multiplication, squaring and modular reduction
where in each of the algorithms single precision multiplication is the dominant operation performed. This chapter will discuss integer multiplication
and squaring, leaving modular reductions for the subsequent chapter.
The importance of the multiplier algorithms is for the most part driven by the fact that certain popular public key algorithms are based on modular
exponentiation, that is computing $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for some arbitrary choice of $a$, $b$, $c$ and $d$. During a modular
exponentiation the majority\footnote{Roughly speaking a modular exponentiation will spend about 40\% of the time performing modular reductions,
35\% of the time performing squaring and 25\% of the time performing multiplications.} of the processor time is spent performing single precision
multiplications.
For centuries general purpose multiplication has required a lengthy $O(n^2)$ process, whereby each digit of one multiplicand has to be multiplied
against every digit of the other multiplicand. Traditional long-hand multiplication is based on this process; while the techniques can differ the
overall algorithm used is essentially the same. Only ``recently'' have faster algorithms been studied. First Karatsuba multiplication was discovered in
1962. This algorithm can multiply two numbers with considerably fewer single precision multiplications when compared to the long-hand approach.
This technique led to the discovery of polynomial basis algorithms (\textit{good reference?}) and subsequently Fourier Transform based solutions.
Using the observation that $ac$ and $bd$ could be re-used only three half sized multiplications would be required to produce the product. Applying
this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique. It turns
out what Karatsuba did not know or at least did not publish was that this is simply polynomial basis multiplication with the points
By adding the first and last equation to the equation in the middle the term $w_1$ can be isolated and all three coefficients solved for. The simplicity
of this system of equations has made Karatsuba fairly popular. In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.}
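
One level of the three-multiplication idea can be sketched on machine words. The function \texttt{toy\_karatsuba32} is a hypothetical helper that splits each 32-bit operand into two 16-bit halves, so that $x = a \cdot 2^{16} + b$ and $y = c \cdot 2^{16} + d$. The products $ac$ and $bd$ are re-used to recover the middle term from a single extra multiplication.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two 32-bit values using three half-sized multiplications. */
static uint64_t toy_karatsuba32(uint32_t x, uint32_t y)
{
    uint64_t a = x >> 16, b = x & 0xFFFFu;   /* x = a*2^16 + b */
    uint64_t c = y >> 16, d = y & 0xFFFFu;   /* y = c*2^16 + d */

    uint64_t ac  = a * c;                    /* high product   */
    uint64_t bd  = b * d;                    /* low product    */
    /* (a + b)(c + d) - ac - bd = ad + bc, the middle coefficient */
    uint64_t mid = (a + b) * (c + d) - ac - bd;

    uint64_t r = (ac << 32) + (mid << 16) + bd;

    /* sanity check against the native 64-bit product */
    assert(r == (uint64_t)x * y);
    return r;
}
```

A long-hand product of the same halves would need the four multiplications $ac$, $ad$, $bc$ and $bd$; here the additions and subtractions replace one of them.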
Starting from zero and numbering the columns from right to left a very simple pattern becomes obvious. For the purposes of this discussion let $x$
represent the number being squared. The first observation is that in row $k$ the $2k$'th column of the product has a $\left (x_k \right)^2$ term in it.
The second observation is that every column $j$ in row $k$ where $j \ne 2k$ is part of a double product. Every non-square term of a column will
appear twice hence the name ``double product''. Every odd column is made up entirely of double products. In fact every column is made up of double
products and at most one square (\textit{see the exercise section}).
The third and final observation is that for row $k$ the first unique non-square term, that is, one that hasn't already appeared in an earlier row,
occurs at column $2k + 1$. For example, on row $1$ of the previous squaring, column one is part of the double product with column one from row zero.
Column two of row one is a square and column three is the first unique column.
\subsection{The Baseline Squaring Algorithm}
The baseline squaring algorithm is meant to be a catch-all squaring algorithm. It will handle any of the input sizes that the faster routines
\hspace{6mm}4.5.4 $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
5. Clamp excess digits of $t$. (\textit{mp\_clamp}) \\
6. Exchange $b$ and $t$. \\
7. Clear $t$ (\textit{mp\_clear}) \\
8. Return(\textit{MP\_OKAY}) \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm s\_mp\_sqr}
\end{figure}
\textbf{Algorithm s\_mp\_sqr.}
This algorithm computes the square of an input using the three observations on squaring. It is based fairly faithfully on algorithm 14.16 of HAC
\cite[pp.596-597]{HAC}. Similar to algorithm s\_mp\_mul\_digs, a temporary mp\_int is allocated to hold the result of the squaring. This allows the
destination mp\_int to be the same as the source mp\_int.
The outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while
the inner loop computes the columns of the partial result. Steps 4.1 and 4.2 compute the square term for each row, and steps 4.3 and 4.4 propagate
the carry and compute the double products.
The requirement that a mp\_word be able to represent the range $0 \le x < 2 \beta^2$ arises from this
very algorithm. The product $a_{ix}a_{iy}$ will lie in the range $0 \le x \le \beta^2 - 2\beta + 1$ which is obviously less than $\beta^2$ meaning that
when it is multiplied by two, it can be properly represented by a mp\_word.
Similar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial
results calculated so far. This involves expensive carry propagation which will be eliminated in the next algorithm.
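The row and column structure described above can be sketched in C. This is a toy illustration rather than the library source: it assumes 8-bit digits stored in \texttt{uint32\_t} cells (so $\beta = 2^8$), a hypothetical function name \texttt{sqr\_baseline} and a caller supplied output array of $2n$ digits.

```c
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BITS 8u                        /* toy radix, beta = 2^8 (assumption) */
#define DIGIT_MASK ((1u << DIGIT_BITS) - 1u)

/* t[0..2n-1] <- a[0..n-1]^2, digits stored least significant first. */
static void sqr_baseline(const uint32_t *a, size_t n, uint32_t *t)
{
    for (size_t i = 0; i < 2 * n; i++) {
        t[i] = 0;
    }
    for (size_t ix = 0; ix < n; ix++) {
        /* Row ix: the square term lands in column 2*ix. */
        uint32_t r = t[2 * ix] + a[ix] * a[ix];
        uint32_t u = r >> DIGIT_BITS;        /* carry into the next column */
        size_t col = 2 * ix + 1;             /* first unique non-square column */
        t[2 * ix] = r & DIGIT_MASK;

        /* Double products a[ix]*a[iy] for iy > ix, each counted twice. */
        for (size_t iy = ix + 1; iy < n; iy++, col++) {
            r = t[col] + 2u * a[ix] * a[iy] + u;
            t[col] = r & DIGIT_MASK;
            u = r >> DIGIT_BITS;
        }
        /* Propagate whatever carry is left over into the higher columns. */
        for (; u != 0; col++) {
            r = t[col] + u;
            t[col] = r & DIGIT_MASK;
            u = r >> DIGIT_BITS;
        }
    }
}
```

Because the carry is propagated after every row, the array \texttt{t} always holds the exact sum of the rows processed so far, which mirrors the behaviour described for the baseline algorithm.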
By expanding $\left (x1 + x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $(x1 + x0)^2 - (x1^2 + x0^2) = 2 \cdot x0 \cdot x1$.
Modular reduction is an operation that arises quite often within public key cryptography algorithms and various number theoretic algorithms,
such as factoring. Modular reduction algorithms are the third class of algorithms of the ``multipliers'' set. A number $a$ is said to be \textit{reduced}
modulo another number $b$ by finding the remainder of the division $a/b$. Full integer division with remainder is a topic to be covered
in~\ref{sec:division}.
Modular reduction is equivalent to solving for $r$ in the equation $a = bq + r$ where $q = \lfloor a/b \rfloor$. The result
$r$ is said to be ``congruent to $a$ modulo $b$'' which is also written as $r \equiv a \mbox{ (mod }b\mbox{)}$. In other vernacular $r$ is known as the
``modular residue'' which leads to ``quadratic residue''\footnote{That's fancy talk for $b \equiv a^2 \mbox{ (mod }p\mbox{)}$.} and
other forms of residues.
Modular reductions are normally used to create either finite groups, rings or fields. The most common usage for performance driven modular reductions
is in modular exponentiation algorithms. That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible. This operation is used in the
RSA and Diffie-Hellman public key algorithms, for example. Modular multiplication and squaring also appear as fundamental operations in fast
exponentiation algorithms, which avoid having to perform (\textit{in this example}) $b - 1$ multiplications. These algorithms will produce partial results in the
range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms. Modular reductions have also been used to create redundancy check
algorithms known as CRCs, error correction codes such as Reed-Solomon, and to solve a variety of number theoretic problems.
\section{The Barrett Reduction}
The Barrett reduction algorithm \cite{BARRETT} was inspired by fast division algorithms which multiply by the reciprocal to emulate
division. Barrett's observation was that the residue $c$ of $a$ modulo $b$ is equal to
\begin{equation}
c = a - b \cdot \lfloor a/b \rfloor
\end{equation}
Since algorithms such as modular exponentiation would be using the same modulus extensively, typical DSP\footnote{It is worth noting that Barrett's paper
targeted the DSP56K processor.} intuition would indicate the next step would be to replace $a/b$ by a multiplication by the reciprocal. However,
DSP intuition on its own will not work as these numbers are considerably larger than the precision of common DSP floating point data types.
It would take another common optimization to optimize the algorithm.
\subsection{Fixed Point Arithmetic}
The trick used to optimize the above equation is based on a technique of emulating floating point data types with fixed precision integers. Fixed
point arithmetic became very popular as it greatly optimized the ``3d-shooter'' genre of games in the mid 1990s when floating point units were
fairly slow if not unavailable. The idea behind fixed point arithmetic is to take a normal $k$-bit integer data type and break it into a $p$-bit
integer part and a $q$-bit fraction part (\textit{where $p+q = k$}).
In this system a $k$-bit integer $n$ would actually represent $n/2^q$. For example, with $q = 4$ the integer $n = 37$ would actually represent the
value $2.3125$. To multiply two fixed point numbers the integers are multiplied using traditional arithmetic and subsequently normalized by
moving the implied decimal point back to where it should be. For example, with $q = 4$ to multiply the integers $9$ and $5$ they must be converted
to fixed point first by multiplying by $2^q$. Let $a = 9(2^q)$ represent the fixed point representation of $9$ and $b = 5(2^q)$ represent the
fixed point representation of $5$. The product $ab$ is equal to $45(2^{2q})$ which when normalized by dividing by $2^q$ produces $45(2^q)$.
This technique became popular since a normal integer multiplication and logical shift right are the only required operations to perform a multiplication
of two fixed point numbers. Using fixed point arithmetic, division can be easily approximated by multiplying by the reciprocal. If $2^q$ is
equivalent to one then $2^q/b$ is equivalent to the fixed point approximation of $1/b$ using real arithmetic. Using this fact, dividing an integer
$a$ by another integer $b$ can be achieved with the following expression.
\begin{equation}
\lfloor a / b \rfloor \mbox{ }\approx\mbox{ } \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor
\end{equation}
The precision of the division is proportional to the value of $q$. If the divisor $b$ is used frequently, as is the case with
modular exponentiation, pre-computing $2^q/b$ will allow a division to be performed with a multiplication and a right shift. Both operations
are considerably faster than division on most processors.
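The fixed point multiplication and the emulated division can be sketched in a few lines of C. The function names are hypothetical, and the sketch assumes every intermediate product fits in a 64-bit word.

```c
#include <stdint.h>

/* Multiply two fixed point values that each carry q fraction bits. */
static uint64_t fx_mul(uint64_t a, uint64_t b, unsigned q)
{
    return (a * b) >> q;        /* the product has 2q fraction bits; drop q */
}

/* Fixed point reciprocal of b, that is floor(2^q / b). */
static uint64_t fx_recip(uint64_t b, unsigned q)
{
    return (UINT64_C(1) << q) / b;
}

/* Approximate floor(a / b) as floor((a * floor(2^q / b)) / 2^q). */
static uint64_t emulated_div(uint64_t a, uint64_t b, unsigned q)
{
    return (a * fx_recip(b, q)) >> q;
}
```

In a real application the value returned by \texttt{fx\_recip} would be computed once and reused for every division by the same $b$, leaving only the multiplication and the shift on the critical path.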
Consider dividing $19$ by $5$. The correct result is $\lfloor 19/5 \rfloor = 3$. With $q = 3$ the reciprocal is $\lfloor 2^q/5 \rfloor = 1$ which
leads to a product of $19$ which when divided by $2^q$ produces $2$. However, with $q = 4$ the reciprocal is $\lfloor 2^q/5 \rfloor = 3$ and
the result of the emulated division is $\lfloor 3 \cdot 19 / 2^q \rfloor = 3$ which is correct. The value of $2^q$ must be close to or ideally
larger than the dividend. In effect if $a$ is the dividend then $q$ should allow $0 \le \lfloor a/2^q \rfloor \le 1$ in order for this approach
to work correctly. Plugging this form of division into the original equation the following modular residue equation arises.
\begin{equation}
c = a - b \cdot \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor
\end{equation}
Using the notation from \cite{BARRETT} the value of $\lfloor 2^q / b \rfloor$ will be represented by the $\mu$ symbol. Using the $\mu$
variable also helps re-inforce the idea that it is meant to be computed once and re-used.
\begin{equation}
c = a - b \cdot \lfloor (a \cdot \mu)/2^q \rfloor
\end{equation}
Provided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one. In the context of Barrett
reduction the value of $a$ is bound by $0 \le a \le (b - 1)^2$ meaning that $2^q \ge b^2$ is sufficient to ensure the reciprocal will have enough
precision.
Let $n$ represent the number of digits in $b$. This algorithm requires approximately $2n^2$ single precision multiplications to produce the quotient and
another $n^2$ single precision multiplications to find the residue. In total $3n^2$ single precision multiplications are required to
reduce the number.
For example, if $b = 1179677$ and $q = 41$ ($2^q > b^2$), then the reciprocal $\mu$ is equal to $\lfloor 2^q / b \rfloor = 1864089$. Consider reducing
$a = 180388626447$ modulo $b$ using the above reduction equation. The quotient using the new formula is $\lfloor (a \cdot \mu) / 2^q \rfloor = 152913$.
By subtracting $152913b$ from $a$ the correct residue $a \equiv 677346 \mbox{ (mod }b\mbox{)}$ is found.
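The reduction equation with a precomputed $\mu$ can be sketched as follows. The function name is hypothetical, and the sketch assumes $a < 2^q$ and that $a \cdot \mu$ fits in a 64-bit word, which holds for the numbers in the example above.

```c
#include <stdint.h>

/* Reduce a modulo b using mu = floor(2^q / b), computed once per modulus. */
static uint64_t reduce_fixed_point(uint64_t a, uint64_t b, uint64_t mu, unsigned q)
{
    uint64_t quot = (a * mu) >> q;   /* estimate of floor(a / b), never too large */
    uint64_t c = a - b * quot;

    while (c >= b) {                 /* correct an under-estimated quotient */
        c -= b;
    }
    return c;
}
```

Since $\mu \le 2^q/b$ the estimated quotient can only be too small, so the correction is always a subtraction and never an addition.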
\subsection{Choosing a Radix Point}
Using the fixed point representation a modular reduction can be performed with $3n^2$ single precision multiplications. If that were the best
that could be achieved a full division\footnote{A division requires approximately $O(2cn^2)$ single precision multiplications for a small value of $c$.
See~\ref{sec:division} for further details.} might as well be used in its place. The key to optimizing the reduction is to reduce the precision of
the initial multiplication that finds the quotient.
Let $a$ represent the number of which the residue is sought. Let $b$ represent the modulus used to find the residue. Let $m$ represent
the number of digits in $b$. For the purposes of this discussion we will assume that the number of digits in $a$ is $2m$, which is generally true if
two $m$-digit numbers have been multiplied. Dividing $a$ by $b$ is the same as dividing a $2m$ digit integer by a $m$ digit integer. Digits below the
$m - 1$'th digit of $a$ will contribute at most a value of $1$ to the quotient because $\beta^k < b$ for any $0 \le k \le m - 1$. Another way to
express this is by re-writing $a$ as two parts. If $a' = a \mbox{ mod }\beta^{m-1}$ and $a'' = a - a'$ then
${a \over b} = {{a' + a''} \over b}$ which is equivalent to ${a' \over b} + {a'' \over b}$. Since $a'$ is bound to be less than $b$ the quotient
is bound by $0 \le {a' \over b} < 1$.
Since the digits of $a'$ do not contribute much to the quotient the observation is that they might as well be zero. However, if the digits
``might as well be zero'' they might as well not be there in the first place. Let $q_0 = \lfloor a/\beta^{m-1} \rfloor$ represent the input
with the irrelevant digits trimmed. Now the modular reduction is trimmed to the almost equivalent equation
\begin{equation}
c = a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor
\end{equation}
Note that the original divisor $2^q$ has been replaced with $\beta^{m+1}$ where in this case $q$ is a multiple of $lg(\beta)$. Also note that the
exponent on the divisor, when added to the number of digits by which $q_0$ was shifted, equals $2m$. If the optimization had not been performed the divisor
would have the exponent $2m$ so in the end the exponents do ``add up''. Using the above equation the quotient
$\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ can be off from the true quotient by at most two. The original fixed point quotient can be off
by as much as one (\textit{provided the radix point is chosen suitably}) and now that the lower irrelevant digits have been trimmed the quotient
can be off by an additional value of one for a total of at most two. This implies that
$0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. By first subtracting $b$ times the quotient and then conditionally subtracting
$b$ once or twice the residue is found.
The quotient is now found using $(m + 1)(m) = m^2 + m$ single precision multiplications and the residue with an additional $m^2$ single
precision multiplications, ignoring the subtractions required. In total $2m^2 + m$ single precision multiplications are required to find the residue.
This is considerably faster than the original attempt.
For example, let $\beta = 10$ represent the radix of the digits. Let $b = 9999$ represent the modulus which implies $m = 4$. Let $a = 99929878$
represent the value of which the residue is desired. In this case $q = 8$ since $10^7 < 9999^2$ meaning that $\mu = \lfloor \beta^{q}/b \rfloor = 10001$.
With the new observation the multiplicand for the quotient is equal to $q_0 = \lfloor a / \beta^{m - 1} \rfloor = 99929$. The quotient is then
$\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor = 9993$. Subtracting $9993b$ from $a$, the correct residue $a \equiv 9871 \mbox{ (mod }b\mbox{)}$
is found.
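The trimmed reduction can be tried on word sized numbers with an arbitrary radix. The sketch below is an illustration only: the names \texttt{ipow} and \texttt{barrett\_trimmed} are hypothetical, $\mu$ is recomputed on every call for brevity, and all intermediate values are assumed to fit in 64 bits.

```c
#include <stdint.h>

/* base^e by repeated multiplication; a small helper for this sketch. */
static uint64_t ipow(uint64_t base, unsigned e)
{
    uint64_t r = 1;
    while (e-- > 0) {
        r *= base;
    }
    return r;
}

/* a mod b, where b has m digits in radix beta and 0 <= a < b^2. */
static uint64_t barrett_trimmed(uint64_t a, uint64_t b, unsigned m, uint64_t beta)
{
    uint64_t mu = ipow(beta, 2 * m) / b;             /* floor(beta^(2m) / b) */
    uint64_t q0 = a / ipow(beta, m - 1);             /* drop the low m-1 digits */
    uint64_t quot = (q0 * mu) / ipow(beta, m + 1);   /* quotient estimate */
    uint64_t c = a - b * quot;

    while (c >= b) {                                 /* at most two corrections */
        c -= b;
    }
    return c;
}
```

With $\beta = 10$, $b = 9999$ and $a = 99929878$ no correction is needed, whereas $a = 9998^2 = 99960004$ produces the estimate $9996$ and needs one conditional subtraction to reach the residue $1$.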
\subsection{Trimming the Quotient}
So far the reduction algorithm has been optimized from $3m^2$ single precision multiplications down to $2m^2 + m$ single precision multiplications. As
it stands now the algorithm is already fairly fast compared to a full integer division algorithm. However, there is still room for
optimization.
After the first multiplication inside the quotient ($q_0 \cdot \mu$) the value is shifted right by $m + 1$ places effectively nullifying the lower
half of the product. It would be nice to be able to remove those digits from the product to effectively cut down the number of single precision
multiplications. If the number of digits in the modulus $m$ is far less than $\beta$ a full product is not required for the algorithm to work properly.
In fact the lower $m - 2$ digits will not affect the upper half of the product at all and do not need to be computed.
The value of $\mu$ is a $m$-digit number and $q_0$ is a $m + 1$ digit number. Using a full multiplier $(m + 1)(m) = m^2 + m$ single precision
multiplications would be required. Using a multiplier that will only produce digits at and above the $m - 1$'th digit reduces the number
of single precision multiplications to ${m^2 + m} \over 2$.
\subsection{Trimming the Residue}
After the quotient has been calculated it is used to reduce the input. As previously noted the algorithm is not exact and it can be off by a small
multiple of the modulus, that is $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$. If $b$ is $m$ digits then the
result of the reduction equation is a value of at most $m + 1$ digits (\textit{provided $3 < \beta$}) implying that the upper $m - 1$ digits are
implicitly zero.
The next optimization arises from this very fact. Instead of computing $b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ using a full
$O(m^2)$ multiplication algorithm only the lower $m+1$ digits of the product have to be computed. Similarly the value of $a$ can
be reduced modulo $\beta^{m+1}$ before the multiple of $b$ is subtracted which simplifies the subtraction as well. A multiplication that produces
only the lower $m+1$ digits requires ${m^2 + 3m - 2} \over 2$ single precision multiplications.
With both optimizations in place the algorithm is the algorithm Barrett proposed. It requires $m^2 + 2m - 1$ single precision multiplications which
is considerably faster than the straightforward $3m^2$ method.
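The total follows from adding the two trimmed multiplication counts. The value $m = 4$ used afterwards is chosen purely for illustration.
\begin{equation}
{{m^2 + m} \over 2} + {{m^2 + 3m - 2} \over 2} = {{2m^2 + 4m - 2} \over 2} = m^2 + 2m - 1
\end{equation}
For $m = 4$ this gives $10 + 13 = 23$ single precision multiplications compared with $3m^2 = 48$ for the straightforward method.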
\subsection{The Barrett Algorithm}
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_reduce}. \\
\textbf{Input}. mp\_int $a$, mp\_int $b$ and $\mu = \lfloor \beta^{2m}/b \rfloor, m = \lceil lg_{\beta}(b) \rceil, (0 \le a < b^2, b > 1)$ \\
\textbf{Output}. $a \mbox{ (mod }b\mbox{)}$ \\
\hline \\
Let $m$ represent the number of digits in $b$. \\
1. Make a copy of $a$ and store it in $q$. (\textit{mp\_init\_copy}) \\
In each iteration of the loop on step 1 a new value of $\mu$ must be calculated. The value of $-1/n_0 \mbox{ (mod }\beta\mbox{)}$ is used
extensively in this algorithm and should be precomputed. Let $\rho$ represent the negative of the modular inverse of $n_0$ modulo $\beta$.
For example, let $\beta = 10$ represent the radix. Let $n = 17$ represent the modulus which implies $k = 2$ and $\rho \equiv 7$. Let $x = 33$
represent the value to reduce.
\newpage\begin{figure}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline \textbf{Step ($t$)} & \textbf{Value of $x$} & \textbf{Value of $\mu$} \\
\hline -- & $33$ & --\\
\hline $0$ & $33 + \mu n = 50$ & $1$ \\
\hline $1$ & $50 + \mu n \beta = 900$ & $5$ \\
\hline
\end{tabular}
\end{center}
\caption{Example of Montgomery Reduction}
\end{figure}
The final result $900$ is then divided by $\beta^k$ to produce the final result $9$. The first observation is that $9 \nequiv x \mbox{ (mod }n\mbox{)}$
which implies the result is not the modular residue of $x$ modulo $n$. However, recall that the residue is actually multiplied by $\beta^{-k}$ in
the algorithm. To get the true residue the value must be multiplied by $\beta^k$. In this case $\beta^k \equiv 15 \mbox{ (mod }n\mbox{)}$ and
the correct residue is $9 \cdot 15 \equiv 16 \mbox{ (mod }n\mbox{)}$.
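The digit by digit reduction in the example can be reproduced with a short C sketch. It is a toy model on word sized integers with an arbitrary radix, not the multiple precision routine; the function name is hypothetical and $\rho$ is assumed to have been precomputed by the caller.

```c
#include <stdint.h>

/* Returns x * beta^(-k) mod n, where rho = -1/n_0 (mod beta) and n has k digits. */
static uint64_t montgomery_reduce_toy(uint64_t x, uint64_t n, uint64_t beta,
                                      unsigned k, uint64_t rho)
{
    uint64_t weight = 1;                        /* beta^t */

    for (unsigned t = 0; t < k; t++) {
        uint64_t digit = (x / weight) % beta;   /* t'th digit of x */
        uint64_t mu = (digit * rho) % beta;
        x += mu * n * weight;                   /* zeroes the t'th digit of x */
        weight *= beta;
    }
    x /= weight;                                /* exact, the low k digits are zero */
    if (x >= n) {
        x -= n;
    }
    return x;
}
```

Calling the function with $x = 33$, $n = 17$, $\beta = 10$, $k = 2$ and $\rho = 7$ walks through the same intermediate values $50$ and $900$ as the table and returns $9$.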
\subsection{Baseline Montgomery Reduction}
The baseline Montgomery reduction algorithm will produce the residue for any size input. It is designed to be a catch-all algorithm for
\hspace{3mm}7.1 for $ix$ from $n.used + 1$ to $x.used - 1$ do \\
\hspace{6mm}7.1.1 $x_{ix} \leftarrow 0$ \\
8. $x.used \leftarrow n.used + 1$ \\
9. Clamp excessive digits of $x$. \\
10. If $x \ge n$ then \\
\hspace{3mm}10.1 $x \leftarrow x - n$ \\
11. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm fast\_mp\_montgomery\_reduce}
\end{figure}
\textbf{Algorithm fast\_mp\_montgomery\_reduce.}
This algorithm will compute the Montgomery reduction of $x$ modulo $n$ using the Comba technique. It is on most computer platforms significantly
faster than algorithm mp\_montgomery\_reduce and algorithm mp\_reduce (\textit{Barrett reduction}). The algorithm has the same restrictions
on the input as the baseline reduction algorithm. An additional two restrictions are imposed on this algorithm. The number of digits $k$ in
the modulus $n$ must not violate $MP\_WARRAY > 2k +1$ and $n < \delta$. When $\beta = 2^{28}$ this algorithm can be used to reduce modulo
a modulus of at most $3{,}556$ bits in length.
As in the other Comba reduction algorithms there is a $\hat W$ array which stores the columns of the product. It is initially filled with the
contents of $x$ with the excess digits zeroed. The reduction loop is very similar to the baseline loop at heart. The multiplication on step
4.1 can be single precision only since $ab \mbox{ (mod }\beta\mbox{)} \equiv (a \mbox{ mod }\beta)(b \mbox{ mod }\beta)$. Some multipliers such
as those on the ARM processors take a variable length time to complete depending on the number of bytes of result they must produce. By performing
a single precision multiplication instead half the amount of time is spent.
Also note that digit $\hat W_{ix}$ must have the carry from the $ix - 1$'th digit propagated upwards in order for this to work. That is what step
4.3 will do. In effect over the $n.used$ iterations of the outer loop the lower $n.used$ columns all have their carries propagated forwards. Note
how the upper bits of those same words are not reduced modulo $\beta$. This is because those values will be discarded shortly and there is no
point.
Step 5 will propagate the remainder of the carries upwards. On step 6 the columns are reduced modulo $\beta$ and shifted simultaneously as they are
stored in the destination $x$.
EXAM,bn_fast_mp_montgomery_reduce.c
The $\hat W$ array is first filled with digits of $x$ on line @49,for@ then the rest of the digits are zeroed on line @54,for@. Both loops share
the same alias variables to make the code easier to read.
The value of $\mu$ is calculated in an interesting fashion. First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to a mp\_digit. This
forces the compiler to use a single precision multiplication and prevents any concerns about loss of precision. Line @101,>>@ fixes the carry
for the next iteration of the loop by propagating the carry from $\hat W_{ix}$ to $\hat W_{ix+1}$.
The for loop on line @113,for@ propagates the rest of the carries upwards through the columns. The for loop on line @126,for@ reduces the columns
modulo $\beta$ and shifts them $k$ places at the same time. The alias $\_ \hat W$ actually refers to the array $\hat W$ starting at the $n.used$'th
digit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$.
\subsection{Montgomery Setup}
To calculate the variable $\rho$ a relatively simple algorithm will be required.
This algorithm will reduce $x$ modulo $n - k$ and return the residue. If $0 \le x < (n - k)^2$ then the algorithm will almost always loop
once or twice and occasionally three times. For simplicity's sake the value of $x$ is bounded by the following simple polynomial.
\begin{equation}
0 \le x < n^2 + k^2 - 2nk
\end{equation}
The true bound is $0 \le x < (n - k - 1)^2$ but this has quite a few more terms. The value of $q$ after step 1 is bounded by the following.
\begin{equation}
q < n - 2k - k^2/n
\end{equation}
Since $k^2$ is going to be considerably smaller than $n$ that term will always be zero. The value of $x$ after step 3 is bounded trivially as
$0 \le x < n$. By step four the sum $x + q$ is bounded by
\begin{equation}
0 \le q + x < (k + 1)n - 2k^2 - 1
\end{equation}
With a second pass $q$ will be loosely bounded by $0 \le q < k^2$ after step 2 while $x$ will still be loosely bounded by $0 \le x < n$ after step 3. After the second pass it is highly unlikely that the
sum in step 4 will exceed $n - k$. In practice fewer than three passes of the algorithm are required to reduce virtually every input in the
Figure~\ref{fig:EXDR} demonstrates the reduction of $x = 123456789$ modulo $n - k = 253$ when $n = 256$ and $k = 3$. Note that even while $x$
is considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast. In this case only
three passes were required to find the residue $x \equiv 126$.
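The passes of the reduction can be replayed with a small C sketch on word sized integers. The step numbering in the comments is reconstructed from the discussion above (divide, multiply by $k$, reduce, add, compare) and should be read as an assumption; the function name is hypothetical.

```c
#include <stdint.h>

/* x mod (n - k), computed with divisions and reductions by n only. */
static uint64_t dr_reduce_generic(uint64_t x, uint64_t n, uint64_t k)
{
    for (;;) {
        uint64_t q = x / n;      /* step 1: quotient by n            */
        q *= k;                  /* step 2: multiply by k            */
        x %= n;                  /* step 3: reduce x modulo n        */
        x += q;                  /* step 4: add the two parts        */
        if (x < n - k) {
            return x;            /* fully reduced                    */
        }
        x -= n - k;              /* step 5: subtract and go again    */
    }
}
```

For $x = 123456789$, $n = 256$ and $k = 3$ the value of $x$ after each subtraction is $1446527$, $16824$ and finally $126$.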
\subsection{Choice of Moduli}
On the surface this algorithm looks like a very expensive algorithm. It requires a couple of subtractions followed by multiplication and other
modular reductions. The usefulness of this algorithm becomes exceedingly clear when an appropriate modulus is chosen.
Division in general is a very expensive operation to perform. The one exception is when the division is by a power of the radix of representation used.
Division by ten, for example, is simple for pencil and paper mathematics since it amounts to shifting the digits one place to the right. Similarly division
by two (\textit{or powers of two}) is very simple for binary computers to perform. It would therefore seem logical to choose $n$ of the form $2^p$
which would imply that $\lfloor x / n \rfloor$ is a simple shift of $x$ right $p$ bits.
However, there is one operation related to division by powers of two that is even faster than this. If $n = \beta^p$ then the division may be
performed by moving whole digits to the right $p$ places. In practice division by $\beta^p$ is much faster than division by $2^p$ for any $p$.
Also with the choice of $n = \beta^p$ reducing $x$ modulo $n$ merely requires zeroing the digits above the $p-1$'th digit of $x$.
Throughout the next section the term ``restricted modulus'' will refer to a modulus of the form $\beta^p - k$ whereas the term ``unrestricted
modulus'' will refer to a modulus of the form $2^p - k$. The word ``restricted'' in this case refers to the fact that it is based on the
$2^p$ logic except $p$ must be a multiple of $lg(\beta)$.
\subsection{Choice of $k$}
Now that division and reduction (\textit{steps 1 and 3 of figure~\ref{fig:DR}}) have been optimized to simple digit operations the multiplication by $k$
in step 2 is the most expensive operation. Fortunately the choice of $k$ is not terribly limited. For all intents and purposes it might
as well be a single digit. The smaller the value of $k$ is the faster the algorithm will be.
The restricted Diminished Radix algorithm can quickly reduce an input modulo a modulus of the form $n = \beta^p - k$. This algorithm can reduce
an input $x$ within the range $0 \le x < n^2$ using only a couple passes of the algorithm demonstrated in figure~\ref{fig:DR}. The implementation
of this algorithm has been optimized to avoid additional overhead associated with a division by $\beta^p$, the multiplication by $k$ or the addition
of $x$ and $q$. The resulting algorithm is very efficient and can lead to substantial improvements over Barrett and Montgomery reduction when modular
exponentiations are performed.
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_dr\_reduce}. \\
\textbf{Input}. mp\_int $x$, $n$ and a mp\_digit $k = \beta - n_0$ \\
\hspace{11.5mm}($0 \le x < n^2$, $n > 1$, $0 < k < \beta$) \\
\textbf{Output}. $x \mbox{ mod } n$ \\
\hline \\
1. $m \leftarrow n.used$ \\
2. If $x.alloc < 2m$ then grow $x$ to $2m$ digits. \\
3. $\mu \leftarrow 0$ \\
4. for $i$ from $0$ to $m - 1$ do \\
\hspace{3mm}4.1 $\hat r \leftarrow k \cdot x_{m+i} + x_{i} + \mu$ \\
\hspace{3mm}4.2 $x_{i} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
\hspace{3mm}4.3 $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\
5. $x_{m} \leftarrow \mu$ \\
6. for $i$ from $m + 1$ to $x.used - 1$ do \\
\hspace{3mm}6.1 $x_{i} \leftarrow 0$ \\
7. Clamp excess digits of $x$. \\
8. If $x \ge n$ then \\
\hspace{3mm}8.1 $x \leftarrow x - n$ \\
\hspace{3mm}8.2 Goto step 3. \\
9. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_dr\_reduce}
\end{figure}
\textbf{Algorithm mp\_dr\_reduce.}
This algorithm will perform the Diminished Radix reduction of $x$ modulo $n$. It has similar restrictions to that of the Barrett reduction
with the addition that $n$ must be of the form $n = \beta^m - k$ where $0 < k <\beta$.
This algorithm essentially implements the pseudo-code in figure~\ref{fig:DR} except with a slight optimization. The division by $\beta^m$, multiplication by $k$
and addition of $x \mbox{ mod }\beta^m$ are all performed simultaneously inside the loop on step 4. The division by $\beta^m$ is emulated by accessing
the term at the $m+i$'th position which is subsequently multiplied by $k$ and added to the term at the $i$'th position. After the loop the $m$'th
digit is set to the carry and the upper digits are zeroed. Steps 5 and 6 emulate the reduction modulo $\beta^m$ that should have happened to
$x$ before the addition of the multiple of the upper half.
At step 8 if $x$ is still larger than $n$ another pass of the algorithm is required. First $n$ is subtracted from $x$ and then the algorithm resumes
at step 3.
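The steps of the algorithm can be sketched on a plain digit array. This is a simplified illustration and not the library routine: it assumes 8-bit digits stored in \texttt{uint32\_t} cells, an array \texttt{x} that already holds $2m$ digits with unused upper digits zeroed, and a hypothetical function name.

```c
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BITS 8u                        /* toy radix, beta = 2^8 (assumption) */
#define DIGIT_MASK ((1u << DIGIT_BITS) - 1u)

/* Reduce x in place modulo n = beta^m - k. Digits are least significant first;
 * x has room for 2m digits and n has m digits. */
static void dr_reduce_digits(uint32_t *x, const uint32_t *n, size_t m, uint32_t k)
{
    for (;;) {
        uint32_t mu = 0;

        /* Step 4: x_i <- k * x_{m+i} + x_i + mu, one digit at a time. */
        for (size_t i = 0; i < m; i++) {
            uint32_t r = k * x[m + i] + x[i] + mu;
            x[i] = r & DIGIT_MASK;
            mu = r >> DIGIT_BITS;
        }
        /* Steps 5 and 6: store the final carry and zero the digits above it. */
        x[m] = mu;
        for (size_t i = m + 1; i < 2 * m; i++) {
            x[i] = 0;
        }

        /* Step 8: the reduction is finished once x < n. */
        int less = 0;
        if (x[m] == 0) {
            size_t i = m;
            while (i > 0 && x[i - 1] == n[i - 1]) {
                i--;
            }
            less = (i > 0 && x[i - 1] < n[i - 1]);
        }
        if (less) {
            return;
        }

        /* Step 8.1: x <- x - n, then run another pass from step 3. */
        uint32_t borrow = 0;
        for (size_t i = 0; i < m; i++) {
            uint32_t t = x[i] - n[i] - borrow;   /* wraps around on underflow */
            x[i] = t & DIGIT_MASK;
            borrow = t >> 31;
        }
        x[m] -= borrow;
    }
}
```

With $n = \beta^2 - 3 = 65533$ the input $123456789$ is reduced to $58150$ in a single pass, while $n^2 - 1$ needs a subtraction and a second pass before reaching $65532$.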
EXAM,bn_mp_dr_reduce.c
The first step is to grow $x$ as required to $2m$ digits since the reduction is performed in place on $x$. The label on line @49,top:@ is where
the algorithm will resume if further reduction passes are required. In theory it could be placed at the top of the function; however, the size of
the modulus and the question of whether $x$ is large enough are invariant after the first pass, meaning that it would be a waste of time.
The aliases $tmpx1$ and $tmpx2$ refer to the digits of $x$ where the latter is offset by $m$ digits. By reading digits from $x$ offset by $m$ digits
a division by $\beta^m$ can be simulated virtually for free. The loop on line @61,for@ performs the bulk of the work (\textit{corresponds to step 4 of algorithm 7.11})
in this algorithm.
By line @68,mu@ the pointer $tmpx1$ points to the $m$'th digit of $x$ which is where the final carry will be placed. Similarly by line @71,for@ the
same pointer will point to the $m+1$'th digit where the zeroes will be placed.
Since the algorithm is only valid if both $x$ and $n$ are greater than zero an unsigned comparison suffices to determine if another pass is required.
With the same logic at line @82,sub@ the value of $x$ is known to be greater than or equal to $n$ meaning that an unsigned subtraction can be used
as well. Since the destination of the subtraction is the larger of the inputs the call to algorithm s\_mp\_sub cannot fail and the return code
does not need to be checked.
\subsubsection{Setup}
To set up the restricted Diminished Radix algorithm the value $k = \beta - n_0$ is required. This algorithm is not really complicated but is provided for
completeness.
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_dr\_setup}. \\
\textbf{Input}. mp\_int $n$ \\
\textbf{Output}. $k = \beta - n_0$ \\
\hline \\
1. $k \leftarrow \beta - n_0$ \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_dr\_setup}
\end{figure}
EXAM,bn_mp_dr_setup.c
\subsubsection{Modulus Detection}
Another useful algorithm is one that detects a restricted Diminished Radix modulus. An integer is said to be
of restricted Diminished Radix form if all of the digits are equal to $\beta - 1$ except the trailing digit which may be any value.
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_dr\_is\_modulus}. \\
\textbf{Input}. mp\_int $n$ \\
\textbf{Output}. $1$ if $n$ is in D.R form, $0$ otherwise \\
\hline
1. If $n.used < 2$ then return($0$). \\
2. for $ix$ from $1$ to $n.used - 1$ do \\
\hspace{3mm}2.1 If $n_{ix} \ne \beta - 1$ return($0$). \\
3. Return($1$). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_dr\_is\_modulus}
\end{figure}
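Both helper algorithms, the setup and the modulus detection, are short enough to sketch together. As before this is a toy model with 8-bit digits in \texttt{uint32\_t} cells and hypothetical function names, not the library source.

```c
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BITS 8u                        /* toy radix, beta = 2^8 (assumption) */
#define DIGIT_MASK ((1u << DIGIT_BITS) - 1u)

/* Setup: k = beta - n_0, where n_0 is the least significant digit of n. */
static uint32_t dr_setup(const uint32_t *n)
{
    return (DIGIT_MASK + 1u) - n[0];
}

/* Detection: 1 if every digit except the lowest equals beta - 1, else 0. */
static int dr_is_modulus(const uint32_t *n, size_t used)
{
    if (used < 2) {
        return 0;                            /* fewer than two digits */
    }
    for (size_t ix = 1; ix < used; ix++) {
        if (n[ix] != DIGIT_MASK) {
            return 0;
        }
    }
    return 1;
}
```

For the two digit modulus $65533 = \beta^2 - 3$ the detection succeeds and the setup returns $k = 3$.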
\textbf{Algorithm mp\_dr\_is\_modulus.}
This algorithm determines if a value is in Diminished Radix form. Step 1 rejects obvious cases where fewer than two digits are
in the mp\_int. Step 2 tests all but the first digit to see if they are equal to $\beta - 1$. If the algorithm manages to get to
In theory Montgomery and Barrett reductions would require roughly the same amount of time to complete. However, in practice since Montgomery
reduction can be written as a single function with the Comba technique it is much faster. Barrett reduction suffers from the overhead of
calling the half precision multipliers, addition and division by $\beta$ algorithms.
For almost every cryptographic algorithm Montgomery reduction is the algorithm of choice. The one set of algorithms where Diminished Radix reduction truly
shines is the set based on the discrete logarithm problem such as Diffie-Hellman \cite{DH} and ElGamal \cite{ELGAMAL}. In these algorithms
primes of the form $\beta^m - k$ can be found and shared amongst users. These primes will allow the Diminished Radix algorithm to be used in
modular exponentiation to greatly speed up the operation.
\section*{Exercises}
\begin{tabular}{cl}
$\left [ 3 \right ]$ & Prove that the ``trick'' in algorithm mp\_montgomery\_setup actually \\
& calculates the correct value of $\rho$. \\
& \\
$\left [ 2 \right ]$ & Devise an algorithm to reduce modulo $n + k$ for small $k$ quickly. \\
& \\
$\left [ 4 \right ]$ & Prove that the pseudo-code algorithm ``Diminished Radix Reduction'' \\
& (\textit{figure~\ref{fig:DR}}) terminates. Also prove the probability that it will \\
& terminate within $1 \le k \le 10$ iterations. \\
& \\
\end{tabular}
\chapter{Exponentiation}
Exponentiation is the operation of raising one variable to the power of another, for example, $a^b$. A variant of exponentiation, computed
in a finite field or ring, is called modular exponentiation. This latter style of operation is typically used in public key
cryptosystems such as RSA and Diffie-Hellman. The ability to quickly compute modular exponentiations is of great benefit to any
such cryptosystem and many methods have been sought to speed it up.
\section{Exponentiation Basics}
A trivial algorithm would simply multiply $a$ against itself $b - 1$ times to compute the exponentiation desired. However, as $b$ grows in size
the number of multiplications becomes prohibitive. Imagine what would happen if $b \approx 2^{1024}$ as is the case when computing an RSA signature
with a $1024$-bit key. Such a calculation could never be completed as it would simply take far too long.
Fortunately there is a very simple algorithm based on the laws of exponents. Recall that $lg_a(a^b) = b$ and that $lg_a(a^ba^c) = b + c$ which
are two trivial relationships between the base and the exponent. Let $b_i$ represent the $i$'th bit of $b$ starting from the least
significant bit. If $b$ is a $k$-bit integer then the following equation is true.
\begin{equation}
a^b = \prod_{i=0}^{k-1} a^{2^i \cdot b_i}
\end{equation}
By taking the base $a$ logarithm of both sides of the equation the following equation is the result.
\begin{equation}
b = \sum_{i=0}^{k-1}2^i \cdot b_i
\end{equation}
The term $a^{2^i}$ can be found from the $i - 1$'th term by squaring the term since $\left ( a^{2^i} \right )^2$ is equal to
$a^{2^{i+1}}$. This observation forms the basis of essentially all fast exponentiation algorithms. It requires $k$ squarings and on average
$k \over 2$ multiplications to compute the result. This is indeed quite an improvement over simply multiplying by $a$ a total of $b-1$ times.
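
The product formula above maps directly onto a loop that scans the exponent from the least significant bit upward. The following is a minimal sketch only, assuming a fixed-width \texttt{uint64\_t} stands in for an mp\_int (so results wrap modulo $2^{64}$); the name \texttt{pow\_rtl} is hypothetical and is not part of the library.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Right-to-left binary exponentiation: a^b = prod a^(2^i * b_i).
 * Hypothetical helper; arithmetic wraps modulo 2^64. */
uint64_t pow_rtl(uint64_t a, uint64_t b)
{
    uint64_t result = 1;  /* running product                */
    uint64_t term   = a;  /* holds a^(2^i) for the i'th bit */

    while (b != 0) {
        if (b & 1) {
            result *= term;   /* bit b_i is set: multiply in a^(2^i) */
        }
        term *= term;         /* (a^(2^i))^2 = a^(2^(i+1))           */
        b >>= 1;              /* move on to bit i + 1                */
    }
    return result;
}
\end{verbatim}
\end{small}

For instance, since $13 = 1101_2$, a call such as \texttt{pow\_rtl(3, 13)} multiplies together $3^1$, $3^4$ and $3^8$.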
While this current method is a considerable speed up there are further improvements to be made. For example, the $a^{2^i}$ term does not need to
be computed in an auxiliary variable. Consider the following equivalent algorithm.
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{Left to Right Exponentiation}. \\
\textbf{Input}. Integer $a$, $b$ and $k$ \\
\textbf{Output}. $c = a^b$ \\
\hline \\
1. $c \leftarrow 1$ \\
2. for $i$ from $k - 1$ to $0$ do \\
\hspace{3mm}2.1 $c \leftarrow c^2$ \\
\hspace{3mm}2.2 $c \leftarrow c \cdot a^{b_i}$ \\
3. Return $c$. \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Left to Right Exponentiation}
\label{fig:LTOR}
\end{figure}
This algorithm starts from the most significant bit and works towards the least significant bit. When the $i$'th bit of $b$ is set $a$ is
multiplied against the current product. In each iteration the product is squared which doubles the exponent of the individual terms of the
product.
For example, let $b = 101100_2 \equiv 44_{10}$. The following chart demonstrates the actions of the algorithm.
\newpage\begin{figure}
\begin{center}
\begin{tabular}{|c|c|}
\hline \textbf{Value of $i$} & \textbf{Value of $c$} \\
\hline - & $1$ \\
\hline $5$ & $a$ \\
\hline $4$ & $a^2$ \\
\hline $3$ & $a^4 \cdot a$ \\
\hline $2$ & $a^8 \cdot a^2 \cdot a$ \\
\hline $1$ & $a^{16} \cdot a^4 \cdot a^2$ \\
\hline $0$ & $a^{32} \cdot a^8 \cdot a^4$ \\
\hline
\end{tabular}
\end{center}
\caption{Example of Left to Right Exponentiation}
\end{figure}
When the product $a^{32} \cdot a^8 \cdot a^4$ is simplified it is equal to $a^{44}$ which is the desired exponentiation. This particular algorithm is
called ``Left to Right'' because it reads the exponent in that order. All of the exponentiation algorithms that will be presented are of this nature.
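
The steps of figure~\ref{fig:LTOR} can be sketched in C as follows. This is an illustration under stated assumptions rather than library code: \texttt{pow\_ltr} is a hypothetical name, a \texttt{uint64\_t} replaces the mp\_int (results wrap modulo $2^{64}$) and $k$ must not exceed $64$.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Left to right exponentiation; k is the number of bits of b to scan.
 * Hypothetical helper; requires k <= 64, arithmetic wraps modulo 2^64. */
uint64_t pow_ltr(uint64_t a, uint64_t b, int k)
{
    uint64_t c = 1;
    int i;

    for (i = k - 1; i >= 0; i--) {
        c *= c;                  /* step 2.1: square                 */
        if ((b >> i) & 1) {
            c *= a;              /* step 2.2: multiply by a^(b_i)    */
        }
    }
    return c;
}
\end{verbatim}
\end{small}

With $a = 2$, $b = 44$ and $k = 6$ the loop follows the chart above row by row and finishes with $2^{44}$.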
\subsection{Single Digit Exponentiation}
The first algorithm in the series of exponentiation algorithms will be an unbounded algorithm where the exponent is a single digit. It is intended
to be used when a small power of an input is required (\textit{e.g. $a^5$}). It is faster than simply multiplying $b - 1$ times for all values of
As children we are taught this very simple algorithm for the case of $\beta = 10$. Almost instinctively several optimizations are taught whose
reasons for existing are never explained. For this example let $y = 5471$ represent the dividend and $x = 23$ represent the divisor.
To find the first digit of the quotient the value of $k$ must be maximized such that $kx\beta^t$ is less than or equal to $y$ and
simultaneously $(k + 1)x\beta^t$ is greater than $y$. Implicitly $k$ is the maximum value the $t$'th digit of the quotient may have. The habitual method
used to find the maximum is to ``eyeball'' the two numbers, typically only the leading digits and quickly estimate a quotient. By only using leading
digits a much simpler division may be used to form an educated guess at what the value must be. In this case $k = \lfloor 54/23\rfloor = 2$ quickly
arises as a possible solution. Indeed $2x\beta^2 = 4600$ is less than $y = 5471$ and simultaneously $(k + 1)x\beta^2 = 6900$ is larger than $y$.
As a result $k\beta^2$ is added to the quotient which now equals $q = 200$ and $4600$ is subtracted from $y$ to give a remainder of $y = 871$.
Again this process is repeated to produce the quotient digit $k = 3$ which makes the quotient $q = 200 + 3\beta = 230$ and the remainder
$y = 871 - 3x\beta = 181$. Finally the last iteration of the loop produces $k = 7$ which leads to the quotient $q = 230 + 7 = 237$ and the
remainder $y = 181 - 7x = 20$. The final quotient and remainder found are $q = 237$ and $r = y = 20$ which are indeed correct since
$237 \cdot 23 + 20 = 5471$ is true.
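
The worked example can be reproduced with a short routine that produces one quotient digit at a time. This is a sketch under stated assumptions: \texttt{long\_divide} is a hypothetical name, machine words replace mp\_ints, and each digit is found by repeated subtraction rather than by estimation.

\begin{small}
\begin{verbatim}
/* Schoolbook division of y by x in radix beta, one quotient digit at a
 * time.  Hypothetical helper; assumes x > 0, beta >= 2 and that the
 * shifted divisor x * beta^t does not overflow an unsigned long. */
void long_divide(unsigned long y, unsigned long x, unsigned long beta,
                 unsigned long *q, unsigned long *r)
{
    unsigned long shifted = x;   /* x * beta^t */
    unsigned long weight  = 1;   /* beta^t     */
    unsigned long k;

    /* find the largest t such that x * beta^t <= y */
    while (shifted <= y / beta) {
        shifted *= beta;
        weight  *= beta;
    }

    *q = 0;
    for (;;) {
        /* largest k such that k * x * beta^t <= y */
        k = 0;
        while (y >= shifted) {
            y -= shifted;
            ++k;
        }
        *q += k * weight;
        if (weight == 1) {
            break;
        }
        shifted /= beta;
        weight  /= beta;
    }
    *r = y;
}
\end{verbatim}
\end{small}

Called with $y = 5471$, $x = 23$ and $\beta = 10$ the routine visits the quotient digits $2$, $3$ and $7$ in that order.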
\subsection{Quotient Estimation}
\label{sec:divest}
As alluded to earlier the quotient digit $k$ can be estimated from only the leading digits of both the divisor and dividend. When $p$ leading
digits are used from both the divisor and dividend to form an estimation the accuracy of the estimation rises as $p$ grows. Technically
speaking the estimation is based on assuming the lower $\vert \vert y \vert \vert - p$ and $\vert \vert x \vert \vert - p$ digits of the
dividend and divisor are zero.
The value of the estimation may be off by a few values in either direction and in general is fairly correct. A simplification \cite[pp. 271]{TAOCPV2}
of the estimation technique is to use $t + 1$ digits of the dividend and $t$ digits of the divisor, in particular when $t = 1$. The estimate
using this technique is never too small. For the following proof let $t = \vert \vert y \vert \vert - 1$ and $s = \vert \vert x \vert \vert - 1$
represent the indices of the most significant digits of the dividend and divisor respectively.
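
The simplified estimate with two dividend digits and one divisor digit can be sketched as below. The name \texttt{estimate\_digit} is hypothetical, and the cap at $\beta - 1$ is included because a single quotient digit can never be larger.

\begin{small}
\begin{verbatim}
/* Estimate one quotient digit from the two leading digits of the
 * dividend (y_t, y_t1) and the leading digit of the divisor (x_s).
 * Hypothetical helper; assumes x_s > 0. */
unsigned long estimate_digit(unsigned long y_t, unsigned long y_t1,
                             unsigned long x_s, unsigned long beta)
{
    unsigned long k_hat = (y_t * beta + y_t1) / x_s;

    /* a single digit can never exceed beta - 1 */
    return (k_hat > beta - 1) ? beta - 1 : k_hat;
}
\end{verbatim}
\end{small}

For the division of $5471$ by $23$, with the running remainder padded by a leading zero where it is no longer than the shifted divisor, the leading digit pairs are $(0, 5)$, $(0, 8)$ and $(1, 8)$. The estimates are $2$, $4$ and $9$ against true digits of $2$, $3$ and $7$, so none of them is too small.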
\textbf{Proof.}\textit{ The quotient $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ is greater than or equal to
$k = \lfloor y / (x \cdot \beta^{\vert \vert y \vert \vert - \vert \vert x \vert \vert - 1}) \rfloor$. }
The first obvious case is when $\hat k = \beta - 1$ in which case the proof is concluded since the real quotient cannot be larger. For all other
cases $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ and $\hat k x_s \ge y_t\beta + y_{t-1} - x_s + 1$. The latter portion of the inequality
$-x_s + 1$ arises from the fact that a truncated integer division will give the same quotient for at most $x_s - 1$ values. Next a series of
inequalities will prove the hypothesis.
\begin{equation}
y - \hat k x \le y - \hat k x_s\beta^s
\end{equation}
This is trivially true since $x \ge x_s\beta^s$. Next we replace $\hat kx_s\beta^s$ by the previous inequality for $\hat kx_s$.
\begin{equation}
y - \hat k x \le y_t\beta^t + \ldots + y_0 - (y_t\beta^t + y_{t-1}\beta^{t-1} - x_s\beta^s + \beta^s)
\end{equation}
By simplifying the previous inequality the following inequality is formed.
\begin{equation}
y - \hat k x \le y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s
\end{equation}
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_read\_radix}. \\
\textbf{Input}. A string $str$ of length $sn$ and radix $r$. \\
\textbf{Output}. The radix-$\beta$ equivalent mp\_int. \\
\hline \\
1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\
2. $ix \leftarrow 0$ \\
3. If $str_0 =$ ``-'' then do \\
\hspace{3mm}3.1 $ix \leftarrow ix + 1$ \\
\hspace{3mm}3.2 $sign \leftarrow MP\_NEG$ \\
4. else \\
\hspace{3mm}4.1 $sign \leftarrow MP\_ZPOS$ \\
5. $a \leftarrow 0$ \\
6. for $iy$ from $ix$ to $sn - 1$ do \\
\hspace{3mm}6.1 Let $y$ denote the position in the map of $str_{iy}$. \\
\hspace{3mm}6.2 If $str_{iy}$ is not in the map or $y \ge r$ then goto step 7. \\
\hspace{3mm}6.3 $a \leftarrow a \cdot r$ \\
\hspace{3mm}6.4 $a \leftarrow a + y$ \\
7. If $a \ne 0$ then $a.sign \leftarrow sign$ \\
8. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_read\_radix}
\end{figure}
\textbf{Algorithm mp\_read\_radix.}
This algorithm will read an ASCII string and produce the radix-$\beta$ mp\_int representation of the same integer. A minus symbol ``-'' may precede the
string to indicate the value is negative, otherwise it is assumed to be positive. The algorithm will read up to $sn$ characters from the input
and will stop early as soon as it reads a character it cannot map. This allows numbers to be embedded
as part of larger input without any significant problem.
EXAM,bn_mp_read_radix.c
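
The steps of algorithm mp\_read\_radix can also be illustrated on a machine word. The sketch below is not the library routine: \texttt{read\_radix} is a hypothetical name, a \texttt{long} replaces the mp\_int so overflow is not handled, and the $64$ character digit map is an assumption made for the example.

\begin{small}
\begin{verbatim}
#include <string.h>

/* Digit map assumed for this sketch: a character's value is its index. */
static const char radix_map[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

/* Reads a signed integer in radix r (2..64) from str into *a.
 * Returns 0 on success and -1 for an invalid radix.  Hypothetical
 * helper; overflow of the long is not detected. */
int read_radix(const char *str, int r, long *a)
{
    int negative = 0;
    const char *pos;

    if (r < 2 || r > 64) {
        return -1;                      /* step 1 */
    }
    if (*str == '-') {                  /* step 3 */
        negative = 1;
        ++str;
    }
    *a = 0;                             /* step 5 */
    for (; *str != '\0'; ++str) {       /* step 6 */
        pos = strchr(radix_map, *str);
        if (pos == NULL || (pos - radix_map) >= r) {
            break;                      /* unmapped character: stop */
        }
        *a = *a * r + (long)(pos - radix_map);
    }
    if (negative) {
        *a = -*a;                       /* step 7: zero stays zero */
    }
    return 0;
}
\end{verbatim}
\end{small}

Because reading stops at the first unmapped character, an input such as ``12 apples'' in radix $10$ yields the value $12$.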
\subsection{Generating Radix-$n$ Output}
Generating radix-$n$ output is fairly trivial with a division and remainder algorithm.
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_toradix}. \\
\textbf{Input}. An mp\_int $a$ and an integer $r$\\
\textbf{Output}. The radix-$r$ representation of $a$ \\
\hline \\
1. If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\
2. If $a = 0$ then $str = $ ``$0$'' and return(\textit{MP\_OKAY}). \\
3. $t \leftarrow a$ \\
4. $str \leftarrow$ ``'' \\
5. if $t.sign = MP\_NEG$ then \\
\hspace{3mm}5.1 $str \leftarrow str + $ ``-'' \\
\hspace{3mm}5.2 $t.sign \leftarrow MP\_ZPOS$ \\
6. While ($t \ne 0$) do \\
\hspace{3mm}6.1 $d \leftarrow t \mbox{ (mod }r\mbox{)}$ \\
\hspace{3mm}6.2 $t \leftarrow \lfloor t / r \rfloor$ \\
\hspace{3mm}6.3 Look up $d$ in the map and store the equivalent character in $y$. \\
\hspace{3mm}6.4 $str \leftarrow str + y$ \\
7. If $str_0 = $``$-$'' then \\
\hspace{3mm}7.1 Reverse the digits $str_1, str_2, \ldots str_n$. \\
8. Otherwise \\
\hspace{3mm}8.1 Reverse the digits $str_0, str_1, \ldots str_n$. \\
9. Return(\textit{MP\_OKAY}).\\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_toradix}
\end{figure}
\textbf{Algorithm mp\_toradix.}
This algorithm computes the radix-$r$ representation of an mp\_int $a$. The ``digits'' of the representation are extracted by reducing
the successive quotients $\lfloor a / r^k \rfloor$ modulo $r$ until $r^k > a$. Note that instead of actually dividing by $r^k$ in
each iteration the quotient $\lfloor a / r \rfloor$ is saved for the next iteration. As a result a series of trivial $n \times 1$ divisions
are required instead of a series of $n \times k$ divisions. One design flaw of this approach is that the digits are produced in the reverse order
(see figure~\ref{fig:mpradix}). To remedy this flaw the digits must be swapped or simply ``reversed''.
\begin{figure}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline \textbf{Value of $a$} & \textbf{Value of $d$} & \textbf{Value of $str$} \\
\hline $1234$ & -- & -- \\
\hline $123$ & $4$ & ``4'' \\
\hline $12$ & $3$ & ``43'' \\
\hline $1$ & $2$ & ``432'' \\
\hline $0$ & $1$ & ``4321'' \\
\hline
\end{tabular}
\end{center}
\caption{Example of Algorithm mp\_toradix.}
\label{fig:mpradix}
\end{figure}
EXAM,bn_mp_toradix.c
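
The divide, collect and reverse pattern of algorithm mp\_toradix can be sketched on a machine word as follows. The name \texttt{to\_radix} and the digit map are assumptions made for this example, and a \texttt{long} replaces the mp\_int.

\begin{small}
\begin{verbatim}
#include <stddef.h>

/* Digit map assumed for this sketch: index d gives the character. */
static const char radix_digits[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

/* Writes the radix-r (2..64) representation of a into str.
 * Returns 0 on success and -1 for an invalid radix.  Hypothetical
 * helper; str must hold at least 66 characters for a 64-bit long. */
int to_radix(long a, int r, char *str)
{
    unsigned long t;
    size_t first = 0, n = 0, i, j;
    char tmp;

    if (r < 2 || r > 64) {
        return -1;                          /* step 1 */
    }
    if (a == 0) {                           /* step 2 */
        str[0] = '0';
        str[1] = '\0';
        return 0;
    }
    if (a < 0) {                            /* step 5 */
        str[n++] = '-';
        first = 1;
        t = 0UL - (unsigned long)a;         /* magnitude of a */
    } else {
        t = (unsigned long)a;
    }
    while (t != 0) {                        /* step 6 */
        str[n++] = radix_digits[t % (unsigned long)r];
        t /= (unsigned long)r;
    }
    /* steps 7 and 8: digits came out least significant first */
    for (i = first, j = n - 1; i < j; ++i, --j) {
        tmp = str[i];
        str[i] = str[j];
        str[j] = tmp;
    }
    str[n] = '\0';
    return 0;
}
\end{verbatim}
\end{small}

The sign character is written before the digits and excluded from the reversal, which mirrors the two cases of steps seven and eight.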
\chapter{Number Theoretic Algorithms}
This chapter discusses several fundamental number theoretic algorithms such as the greatest common divisor, least common multiple and Jacobi
symbol computation. These algorithms arise as essential components in several key cryptographic algorithms such as the RSA public key algorithm and
various Sieve based factoring algorithms.
\section{Greatest Common Divisor}
The greatest common divisor of two integers $a$ and $b$, often denoted as $(a, b)$, is the largest integer $k$ that is a divisor of
both $a$ and $b$. That is, $k$ is the largest integer such that $0 \equiv a \mbox{ (mod }k\mbox{)}$ and $0 \equiv b \mbox{ (mod }k\mbox{)}$ occur
simultaneously.
The most common approach is to reduce one input modulo another. That is if $a$ and $b$ are divisible by some integer $k$ and if $qa + r = b$ then
$r$ is also divisible by $k$. The reduction pattern follows $\left < a , b \right > \rightarrow \left < b, a \mbox{ mod } b \right >$.
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{Greatest Common Divisor (I)}. \\
\textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\
\textbf{Output}. The greatest common divisor $(a, b)$. \\
\hline \\
1. While ($b > 0$) do \\
\hspace{3mm}1.1 $r \leftarrow a \mbox{ (mod }b\mbox{)}$ \\
\hspace{3mm}1.2 $a \leftarrow b$ \\
\hspace{3mm}1.3 $b \leftarrow r$ \\
2. Return($a$). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm Greatest Common Divisor (I)}
\label{fig:gcd1}
\end{figure}
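
The algorithm of figure~\ref{fig:gcd1} translates almost line for line into C. The sketch below uses machine words in place of mp\_ints and the hypothetical name \texttt{gcd\_euclid}.

\begin{small}
\begin{verbatim}
/* Greatest common divisor by the reduction <a, b> -> <b, a mod b>.
 * Hypothetical helper for machine words. */
unsigned long gcd_euclid(unsigned long a, unsigned long b)
{
    unsigned long r;

    while (b > 0) {
        r = a % b;   /* step 1.1 */
        a = b;       /* step 1.2 */
        b = r;       /* step 1.3 */
    }
    return a;
}
\end{verbatim}
\end{small}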
This algorithm will quickly converge on the greatest common divisor since the residue $r$ tends to diminish rapidly. However, divisions are
relatively expensive operations to perform and should ideally be avoided. There is another approach based on a similar relationship of
greatest common divisors. The faster approach is based on the observation that if $k$ divides both $a$ and $b$ it will also divide $a - b$.
In particular, we would like $a - b$ to decrease in magnitude which implies that $b \ge a$.
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{Greatest Common Divisor (II)}. \\
\textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\
\textbf{Output}. The greatest common divisor $(a, b)$. \\
\hline \\
1. While ($b > 0$) do \\
\hspace{3mm}1.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
\hspace{3mm}1.2 $b \leftarrow b - a$ \\
2. Return($a$). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm Greatest Common Divisor (II)}
\label{fig:gcd2}
\end{figure}
\textbf{Proof} \textit{Algorithm~\ref{fig:gcd2} will return the greatest common divisor of $a$ and $b$.}
The algorithm in figure~\ref{fig:gcd2} will eventually terminate: since $b \ge a$ after step 1.1, the subtraction in step 1.2 yields a value less than $b$. In other
words in every iteration the tuple $\left < a, b \right >$ decreases in magnitude until eventually $a = b$. Both $a$ and $b$ remain
divisible by the greatest common divisor since any common divisor of $a$ and $b$ also divides $b - a$. In the last iteration of the algorithm $b = 0$, therefore, at the end of the
second to last iteration of the algorithm $b = a$ and clearly $(a, a) = a$ which concludes the proof. \textbf{QED}.
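
A sketch of the subtraction based algorithm of figure~\ref{fig:gcd2} is given below. The name \texttt{gcd\_subtract} and the optional iteration counter are assumptions added for illustration; the counter makes it easy to observe how many passes the loop needs.

\begin{small}
\begin{verbatim}
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor by repeated subtraction.  Hypothetical
 * helper; requires a > 0 and b > 0.  If iterations is not NULL it
 * receives the number of passes through the loop. */
unsigned long gcd_subtract(unsigned long a, unsigned long b,
                           unsigned long *iterations)
{
    unsigned long n = 0, t;

    assert(a > 0 && b > 0);

    while (b > 0) {
        if (a > b) {          /* step 1.1: make a the smaller one */
            t = a;
            a = b;
            b = t;
        }
        b -= a;               /* step 1.2 */
        ++n;
    }
    if (iterations != NULL) {
        *iterations = n;
    }
    return a;
}
\end{verbatim}
\end{small}

For the inputs $12$ and $126$ the loop runs twelve times before $b$ reaches zero.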
As a matter of practicality algorithm \ref{fig:gcd2} decreases far too slowly to be useful, especially if $b$ is much larger than $a$ such that
$b - a$ is still very much larger than $a$. A simple addition to the algorithm is to divide $b - a$ by a power of some integer $p$ which does
not divide the greatest common divisor but will divide $b - a$. In this case ${{b - a} \over p}$ is also an integer and still divisible by
the greatest common divisor.
However, instead of factoring $b - a$ to find a suitable value of $p$ the powers of $p$ can be removed from $a$ and $b$ that are in common first.
Then inside the loop whenever $b - a$ is divisible by some power of $p$ it can be safely removed.
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{Greatest Common Divisor (III)}. \\
\textbf{Input}. Two positive integers $a$ and $b$ greater than zero. \\
\textbf{Output}. The greatest common divisor $(a, b)$. \\
\hline \\
1. $k \leftarrow 0$ \\
2. While $a$ and $b$ are both divisible by $p$ do \\
\hspace{3mm}2.1 $a \leftarrow \lfloor a / p \rfloor$ \\
\hspace{3mm}2.2 $b \leftarrow \lfloor b / p \rfloor$ \\
\hspace{3mm}2.3 $k \leftarrow k + 1$ \\
3. While $a$ is divisible by $p$ do \\
\hspace{3mm}3.1 $a \leftarrow \lfloor a / p \rfloor$ \\
4. While $b$ is divisible by $p$ do \\
\hspace{3mm}4.1 $b \leftarrow \lfloor b / p \rfloor$ \\
5. While ($b > 0$) do \\
\hspace{3mm}5.1 Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
\hspace{3mm}5.2 $b \leftarrow b - a$ \\
\hspace{3mm}5.3 While $b$ is divisible by $p$ do \\
\hspace{6mm}5.3.1 $b \leftarrow \lfloor b / p \rfloor$ \\
6. Return($a \cdot p^k$). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm Greatest Common Divisor (III)}
\label{fig:gcd3}
\end{figure}
This algorithm is based on the previous one except it removes powers of $p$ first and inside the main loop to ensure the tuple $\left < a, b \right >$
decreases more rapidly. The first loop on step two removes powers of $p$ that are in common. A count, $k$, is kept which will represent a common
divisor of $p^k$. After step two the remaining common divisor of $a$ and $b$ cannot be divisible by $p$. This means that $p$ can be safely
divided out of the difference $b - a$ so long as the division leaves no remainder.
In particular the value of $p$ should be chosen such that the division on step 5.3.1 occurs often. It also helps that division by $p$ be easy
to compute. The ideal choice of $p$ is two since division by two amounts to a right logical shift. Another important observation is that by
step five both $a$ and $b$ are odd. Therefore, the difference $b - a$ must be even which means that each iteration removes at least one bit from the
largest of the pair.
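
The algorithm of figure~\ref{fig:gcd3} can be sketched for an arbitrary $p$ as shown below. The name \texttt{gcd\_remove\_p} is hypothetical, and two assumptions are made that the pseudo-code leaves implicit: $p$ is taken to be prime, and the inner loop is guarded so that it stops once $b$ reaches zero (zero is divisible by every $p$).

\begin{small}
\begin{verbatim}
#include <assert.h>

/* Greatest Common Divisor (III) for machine words.  Hypothetical
 * helper; requires a > 0, b > 0 and assumes p is prime. */
unsigned long gcd_remove_p(unsigned long a, unsigned long b,
                           unsigned long p)
{
    unsigned long common = 1;   /* accumulates p^k */
    unsigned long t;

    assert(a > 0 && b > 0 && p >= 2);

    while (a % p == 0 && b % p == 0) {   /* step 2 */
        a /= p;
        b /= p;
        common *= p;
    }
    while (a % p == 0) {                 /* step 3 */
        a /= p;
    }
    while (b % p == 0) {                 /* step 4 */
        b /= p;
    }
    while (b > 0) {                      /* step 5 */
        if (a > b) {
            t = a;
            a = b;
            b = t;
        }
        b -= a;
        while (b > 0 && b % p == 0) {    /* step 5.3, guarded */
            b /= p;
        }
    }
    return a * common;                   /* step 6 */
}
\end{verbatim}
\end{small}

Any prime $p$ gives the same result; the choice only affects how quickly the pair shrinks.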
\subsection{Complete Greatest Common Divisor}
The algorithms presented so far cannot handle inputs which are zero or negative. The following algorithm can handle all input cases properly
and will produce the greatest common divisor.
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_gcd}. \\
\textbf{Input}. mp\_int $a$ and $b$ \\
\textbf{Output}. The greatest common divisor $c = (a, b)$. \\
\hline \\
1. If $a = 0$ and $b \ne 0$ then \\
\hspace{3mm}1.1 $c \leftarrow b$ \\
\hspace{3mm}1.2 Return(\textit{MP\_OKAY}). \\
2. If $a \ne 0$ and $b = 0$ then \\
\hspace{3mm}2.1 $c \leftarrow a$ \\
\hspace{3mm}2.2 Return(\textit{MP\_OKAY}). \\
3. If $a = b = 0$ then \\
\hspace{3mm}3.1 $c \leftarrow 0$ \\
\hspace{3mm}3.2 Return(\textit{MP\_OKAY}). \\
4. $u \leftarrow \vert a \vert, v \leftarrow \vert b \vert$ \\
5. $k \leftarrow 0$ \\
6. While $u.used > 0$ and $v.used > 0$ and $u_0 \equiv v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
\hspace{3mm}6.1 $k \leftarrow k + 1$ \\
\hspace{3mm}6.2 $u \leftarrow \lfloor u / 2 \rfloor$ \\
\hspace{3mm}6.3 $v \leftarrow \lfloor v / 2 \rfloor$ \\
7. While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
\hspace{3mm}7.1 $u \leftarrow \lfloor u / 2 \rfloor$ \\
8. While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
\hspace{3mm}8.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\
9. While $v.used > 0$ \\
\hspace{3mm}9.1 If $\vert u \vert > \vert v \vert$ then \\
\hspace{6mm}9.1.1 Swap $u$ and $v$. \\
\hspace{3mm}9.2 $v \leftarrow \vert v \vert - \vert u \vert$ \\
\hspace{3mm}9.3 While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
\hspace{6mm}9.3.1 $v \leftarrow \lfloor v / 2 \rfloor$ \\
10. $c \leftarrow u \cdot 2^k$ \\
11. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_gcd}
\end{figure}
\textbf{Algorithm mp\_gcd.}
This algorithm will produce the greatest common divisor of two mp\_ints $a$ and $b$. The algorithm was originally based on Algorithm B of
Knuth \cite[pp. 338]{TAOCPV2} but has been modified to be simpler to explain. In theory it achieves the same asymptotic working time as
Algorithm B and in practice this appears to be true.
The first three steps handle the cases where either one of or both inputs are zero. If either input is zero the greatest common divisor is the
largest input or zero if they are both zero. If the inputs are not trivial then $u$ and $v$ are assigned the absolute values of
$a$ and $b$ respectively and the algorithm will proceed to reduce the pair.
Step six will divide out any common factors of two and keep track of the count in the variable $k$. After this step two is no longer a
factor of the remaining greatest common divisor between $u$ and $v$ and can be safely evenly divided out of either whenever they are even. Steps
seven and eight ensure that $u$ and $v$ respectively have no more factors of two. At most only one of the while loops will iterate since
they cannot both be even.
By step nine both of $u$ and $v$ are odd which is required for the inner logic. First the pair are swapped such that $v$ is equal to
or greater than $u$. This ensures that the subtraction on step 9.2 will always produce a non-negative and even result. Step 9.3 removes any
factors of two from the difference $v$ to ensure that in the next iteration of the loop both are once again odd.
After $v = 0$ occurs the variable $u$ has the greatest common divisor of the pair $\left < u, v \right >$ just after step six. The result
must be adjusted by multiplying by the common factors of two ($2^k$) removed earlier.
EXAM,bn_mp_gcd.c
This function makes use of the macros mp\_iszero and mp\_iseven. The former evaluates to $1$ if the input mp\_int is equivalent to the
integer zero otherwise it evaluates to $0$. The latter evaluates to $1$ if the input mp\_int represents a non-zero even integer otherwise
it evaluates to $0$. Note that just because mp\_iseven may evaluate to $0$ does not mean the input is odd, it could also be zero. The three
trivial cases of inputs are handled on lines @25,zero@ through @34,}@. After those lines the inputs are assumed to be non-zero.
Lines @36,if@ and @40,if@ make local copies $u$ and $v$ of the inputs $a$ and $b$ respectively. At this point the common factors of two
must be divided out of the two inputs. The while loop on line @49,while@ iterates so long as both are even. The local integer $k$ is used to
keep track of how many factors of $2$ are pulled out of both values. It is assumed that the number of factors will not exceed the maximum
value of a C ``int'' data type\footnote{Strictly speaking no array in C may have more entries than are accessible by an ``int'' so this is not
a limitation.}.
At this point there are no more common factors of two in the two values. The while loops on lines @60,while@ and @65,while@ remove any independent
factors of two such that both $u$ and $v$ are guaranteed to be odd integers before hitting the main body of the algorithm. The while loop
on line @71, while@ performs the reduction of the pair until $v$ is equal to zero. The unsigned comparison and subtraction algorithms are used in
place of the full signed routines since both values are guaranteed to be positive and the result of the subtraction is guaranteed to be non-negative.
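
The complete algorithm, including the zero and negative input cases, can be sketched on $64$-bit words as follows. The name \texttt{gcd\_binary} is hypothetical. The sketch returns zero when both inputs are zero, as the description above states, and assumes neither input equals \texttt{INT64\_MIN}.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Binary greatest common divisor following the steps of mp_gcd.
 * Hypothetical helper; inputs must be greater than INT64_MIN. */
uint64_t gcd_binary(int64_t a, int64_t b)
{
    uint64_t u = (uint64_t)(a < 0 ? -a : a);   /* step 4: u = |a| */
    uint64_t v = (uint64_t)(b < 0 ? -b : b);   /*         v = |b| */
    uint64_t t;
    int k = 0;

    if (u == 0) {
        return v;          /* a is zero: result is |b|, possibly zero */
    }
    if (v == 0) {
        return u;          /* b is zero: result is |a|                */
    }

    while (((u | v) & 1) == 0) {   /* step 6: common factors of two */
        u >>= 1;
        v >>= 1;
        ++k;
    }
    while ((u & 1) == 0) {         /* step 7 */
        u >>= 1;
    }
    while ((v & 1) == 0) {         /* step 8 */
        v >>= 1;
    }
    while (v != 0) {               /* step 9 */
        if (u > v) {
            t = u;
            u = v;
            v = t;
        }
        v -= u;                    /* odd minus odd: even or zero */
        while (v != 0 && (v & 1) == 0) {
            v >>= 1;
        }
    }
    return u << k;                 /* step 10 */
}
\end{verbatim}
\end{small}

As in the library routine, only unsigned comparisons and subtractions are needed once the absolute values have been taken.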
\section{Least Common Multiple}
The least common multiple of a pair of integers is their product divided by their greatest common divisor. For two integers $a$ and $b$ the
least common multiple is normally denoted as $[ a, b ]$ and numerically equivalent to ${ab} \over {(a, b)}$. For example, if $a = 2 \cdot 2 \cdot 3 = 12$
and $b = 2 \cdot 3 \cdot 3 \cdot 7 = 126$ the least common multiple is ${{12 \cdot 126} \over {(12, 126)}} = {1512 \over 6} = 252$.
The least common multiple arises often in coding theory as well as number theory. If two functions have periods of $a$ and $b$ respectively they will
collide, that is be in synchronous states, after only $[ a, b ]$ iterations. This is why, for example, random number generators based on
Linear Feedback Shift Registers (LFSR) tend to use registers with periods which are co-prime (\textit{i.e. the greatest common divisor is one.}).
Similarly in number theory if a composite $n$ has two prime factors $p$ and $q$ then the maximal order of any unit of $\Z/n\Z$ will be $[ p - 1, q - 1] $.
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_lcm}. \\
\textbf{Input}. mp\_int $a$ and $b$ \\
\textbf{Output}. The least common multiple $c = [a, b]$. \\
\hline \\
1. $c \leftarrow (a, b)$ \\
2. $t \leftarrow a \cdot b$ \\
3. $c \leftarrow \lfloor t / c \rfloor$ \\
4. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_lcm}
\end{figure}
\textbf{Algorithm mp\_lcm.}
This algorithm computes the least common multiple of two mp\_int inputs $a$ and $b$. It computes the least common multiple directly by
dividing the product of the two inputs by their greatest common divisor.
EXAM,bn_mp_lcm.c
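
A machine word sketch of the same computation is shown below. The name \texttt{lcm\_ul} is hypothetical. Unlike the pseudo-code, the sketch divides by the greatest common divisor before multiplying; this is a deliberate deviation that keeps the intermediate value small on fixed-width words and does not change the result.

\begin{small}
\begin{verbatim}
/* Least common multiple [a, b] = ab / (a, b) for machine words.
 * Hypothetical helper; returns 0 if either input is zero. */
unsigned long lcm_ul(unsigned long a, unsigned long b)
{
    unsigned long g = a, t = b, r;

    if (a == 0 || b == 0) {
        return 0;
    }
    while (t > 0) {          /* g <- (a, b) by Euclid's algorithm */
        r = g % t;
        g = t;
        t = r;
    }
    return (a / g) * b;      /* divide first, then multiply */
}
\end{verbatim}
\end{small}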
\section{Jacobi Symbol Computation}
To explain the Jacobi symbol we shall first discuss the Legendre symbol on which the Jacobi symbol is
defined. The Legendre symbol indicates whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$. Numerically it is
\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_jacobi}. \\
\hline \\
8. If $p_0 \equiv a'_0 \equiv 3 \mbox{ (mod }4\mbox{)}$ then \\
\hspace{3mm}8.1 $s \leftarrow -s$ \\
9. If $a' \ne 1$ then \\
\hspace{3mm}9.1 $p' \leftarrow p \mbox{ (mod }a'\mbox{)}$ \\
\hspace{3mm}9.2 $s \leftarrow s \cdot \mbox{mp\_jacobi}(p', a')$ \\
10. $c \leftarrow s$ \\
11. Return(\textit{MP\_OKAY}). \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_jacobi}
\end{figure}
\textbf{Algorithm mp\_jacobi.}
This algorithm computes the Jacobi symbol for an arbitrary positive integer $a$ with respect to an odd integer $p$ greater than three. The algorithm
is based on algorithm 2.149 of HAC \cite[pp. 73]{HAC}.
Step numbers one and two handle the trivial cases of $a = 0$ and $a = 1$ respectively. Step five determines the number of two factors in the
input $a$. If $k$ is even then the term $\left ( { 2 \over p } \right )^k$ must always evaluate to one. If $k$ is odd then the term evaluates to one
if $p_0$ is congruent to one or seven modulo eight, otherwise it evaluates to $-1$. After the $\left ( { 2 \over p } \right )^k$ term is handled
the $(-1)^{(p-1)(a'-1)/4}$ is computed and multiplied against the current product $s$. The latter term evaluates to negative one if both $p$ and $a'$
are congruent to three modulo four, otherwise it evaluates to one.
By step nine if $a'$ does not equal one a recursion is required. Step 9.1 computes $p' \equiv p \mbox{ (mod }a'\mbox{)}$ and will recurse to compute
$\left ( {p' \over a'} \right )$ which is multiplied against the current Jacobi product.
EXAM,bn_mp_jacobi.c
As a matter of practicality the variable $a'$ as per the pseudo-code is represented by the variable $a1$ since the $'$ symbol is not valid for a C
variable name character.
The two simple cases of $a = 0$ and $a = 1$ are handled at the very beginning to simplify the algorithm. If the input is non-trivial the algorithm
has to proceed to compute the Jacobi symbol. The variable $s$ is used to hold the current Jacobi product. Note that $s$ is merely a C ``int'' data type since
the values it may obtain are merely $-1$, $0$ and $1$.
After a local copy of $a$ is made all of the factors of two are divided out and the total stored in $k$. Technically only the least significant
bit of $k$ is required, however, it makes the algorithm simpler to follow to perform an addition. In practice an exclusive-or and addition have the same
processor requirements and neither is faster than the other.
Lines @59, if@ through @70, }@ determine the value of $\left ( { 2 \over p } \right )^k$. If the least significant bit of $k$ is zero then
$k$ is even and the value is one. Otherwise, the value of $s$ depends on which residue class $p$ belongs to modulo eight. The value of
$(-1)^{(p-1)(a'-1)/4}$ is computed and multiplied against $s$ on lines @73, if@ through @75, }@.
Finally, if $a1$ does not equal one the algorithm must recurse and compute $\left ( {p' \over a'} \right )$.
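
The recursion can be illustrated on machine words as below. This is a sketch after algorithm 2.149 of HAC and the description above, not the library source; the name \texttt{jacobi\_ul} is hypothetical and $p$ is required to be odd and at least three.

\begin{small}
\begin{verbatim}
#include <assert.h>

/* Jacobi symbol (a/p) for machine words.  Hypothetical helper;
 * p must be odd and at least three. */
int jacobi_ul(unsigned long a, unsigned long p)
{
    unsigned long a1 = a;   /* a with all factors of two removed */
    unsigned long res;      /* residue of p modulo eight         */
    int k = 0;              /* number of factors of two in a     */
    int s = 1;              /* current Jacobi product            */

    assert(p >= 3 && (p & 1) == 1);

    if (a == 0) {
        return 0;
    }
    if (a == 1) {
        return 1;
    }
    while ((a1 & 1) == 0) {
        a1 >>= 1;
        ++k;
    }
    /* (2/p)^k: only matters when k is odd */
    if (k & 1) {
        res = p & 7;
        if (res == 3 || res == 5) {
            s = -1;
        }
    }
    /* (-1)^((p-1)(a1-1)/4): flips the sign when both are 3 mod 4 */
    if ((p & 3) == 3 && (a1 & 3) == 3) {
        s = -s;
    }
    if (a1 == 1) {
        return s;
    }
    return s * jacobi_ul(p % a1, a1);
}
\end{verbatim}
\end{small}

For example, the quadratic residues modulo $7$ are $1$, $2$ and $4$, so the symbol of $2$ with respect to $7$ is $1$ while that of $3$ is $-1$.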
\section{Modular Inverse}
\label{sec:modinv}
The modular inverse of a number actually refers to the modular multiplicative inverse. Essentially for any integer $a$ such that $(a, p) = 1$ there
exists another integer $b$ such that $ab \equiv 1 \mbox{ (mod }p\mbox{)}$. The integer $b$ is called the multiplicative inverse of $a$ which is
denoted as $b = a^{-1}$. Technically speaking modular inversion is a well defined operation for any finite ring or field not just for rings and
fields of integers. However, only the integer case will be the matter of discussion.
The simplest approach is to compute the algebraic inverse of the input. That is to compute $b \equiv a^{\Phi(p) - 1} \mbox{ (mod }p\mbox{)}$. If $\Phi(p)$ is the
order of the multiplicative subgroup modulo $p$ then $b$ must be the multiplicative inverse of $a$. The proof of which is trivial.
\begin{equation}
ab \equiv a \left (a^{\Phi(p) - 1} \right ) \equiv a^{\Phi(p)} \equiv a^0 \equiv 1 \mbox{ (mod }p\mbox{)}
\end{equation}
However, as simple as this approach may be it has two serious flaws. It requires that the value of $\Phi(p)$ be known which if $p$ is composite
requires all of the prime factors. This approach also is very slow as the size of $p$ grows.
A simpler approach is based on the observation that solving for the multiplicative inverse is equivalent to solving the linear
Diophantine\footnote{See LeVeque \cite[pp. 40-43]{LeVeque} for more information.} equation.
\begin{equation}
ab + pq = 1
\end{equation}
Where $a$, $b$, $p$ and $q$ are all integers. If such a pair of integers $ \left < b, q \right >$ exists then $b$ is the multiplicative inverse of
$a$ modulo $p$. The extended Euclidean algorithm (Knuth \cite[pp. 342]{TAOCPV2}) can be used to solve such equations provided $(a, p) = 1$.
However, instead of using that algorithm directly a variant known as the binary Extended Euclidean algorithm will be used in its place. The
binary approach is very similar to the binary greatest common divisor algorithm except it will produce a full solution to the Diophantine
equation.
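
To make the Diophantine formulation concrete, the sketch below solves $ab + pq = 1$ for $b$ on machine words. It uses the classic extended Euclidean algorithm rather than the binary variant, purely because it is shorter to show; the name \texttt{invmod\_euclid} is hypothetical.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Solves a*b + p*q = 1 for b with the classic extended Euclidean
 * algorithm.  Returns the inverse in the range [0, p) or -1 when
 * (a, p) != 1.  Hypothetical helper; assumes 0 < a < p. */
int64_t invmod_euclid(int64_t a, int64_t p)
{
    int64_t r0 = p, r1 = a;   /* remainder sequence               */
    int64_t s0 = 0, s1 = 1;   /* invariant: s_i * a == r_i (mod p) */
    int64_t quot, tmp;

    while (r1 != 0) {
        quot = r0 / r1;

        tmp = r0 - quot * r1;
        r0  = r1;
        r1  = tmp;

        tmp = s0 - quot * s1;
        s0  = s1;
        s1  = tmp;
    }
    if (r0 != 1) {
        return -1;            /* (a, p) != 1: no inverse exists */
    }
    return (s0 < 0) ? s0 + p : s0;
}
\end{verbatim}
\end{small}

For $a = 3$ and $p = 7$ the routine returns $5$, and indeed $3 \cdot 5 = 15 \equiv 1 \mbox{ (mod }7\mbox{)}$.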
\subsection{General Case}
\newpage\begin{figure}[!here]
\begin{small}
\begin{center}
\begin{tabular}{l}
\hline Algorithm \textbf{mp\_invmod}. \\
\textbf{Input}. mp\_int $a$ and $b$, $(a, b) = 1$, $b \ge 2$, $0 < a < b$. \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm mp\_invmod}
\end{figure}
This algorithm performs one trial round of the Miller-Rabin algorithm to the base $b$. It will set $c = 1$ if the algorithm cannot determine
if $a$ is composite or $c = 0$ if $a$ is provably composite. The values of $s$ and $r$ are computed such that $a' = a - 1 = 2^sr$.
If the value $y \equiv b^r$ is congruent to $\pm 1$ then the algorithm cannot prove if $a$ is composite or not. Otherwise, the algorithm will
square $y$ up to $s - 1$ times stopping only when $y \equiv -1$. If $y^2 \equiv 1$ and $y \nequiv \pm 1$ then the algorithm can report that $a$
is provably composite. If the algorithm performs $s - 1$ squarings and $y \nequiv -1$ then $a$ is provably composite. If $a$ is not provably
composite then it is \textit{probably} prime.
EXAM,bn_mp_prime_miller_rabin.c
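
A single round can be sketched on $32$-bit words as follows. The names \texttt{mulmod}, \texttt{powmod} and \texttt{miller\_rabin\_round} are hypothetical, and the inputs are limited to $32$ bits so that every product fits in a \texttt{uint64\_t}.

\begin{small}
\begin{verbatim}
#include <assert.h>
#include <stdint.h>

/* x * y mod n without overflow for 32-bit operands. */
static uint32_t mulmod(uint32_t x, uint32_t y, uint32_t n)
{
    return (uint32_t)(((uint64_t)x * y) % n);
}

/* b^e mod n by right-to-left binary exponentiation. */
static uint32_t powmod(uint32_t b, uint32_t e, uint32_t n)
{
    uint32_t result = 1 % n;

    b %= n;
    while (e != 0) {
        if (e & 1) {
            result = mulmod(result, b, n);
        }
        b = mulmod(b, b, n);
        e >>= 1;
    }
    return result;
}

/* One Miller-Rabin round for a to the base b.  Returns 1 if a is
 * probably prime and 0 if a is provably composite.  Hypothetical
 * helper; requires a odd, a > 3 and 1 < b < a - 1. */
int miller_rabin_round(uint32_t a, uint32_t b)
{
    uint32_t r = a - 1, y;
    int s = 0, j;

    assert(a > 3 && (a & 1) == 1 && b > 1 && b < a - 1);

    while ((r & 1) == 0) {      /* a - 1 = 2^s * r with r odd */
        r >>= 1;
        ++s;
    }
    y = powmod(b, r, a);
    if (y == 1 || y == a - 1) {
        return 1;               /* cannot prove a composite */
    }
    for (j = 1; j < s; j++) {   /* at most s - 1 squarings */
        y = mulmod(y, y, a);
        if (y == 1) {
            return 0;           /* non-trivial square root of one */
        }
        if (y == a - 1) {
            return 1;
        }
    }
    return 0;                   /* never reached -1: composite */
}
\end{verbatim}
\end{small}

The composite $2047 = 23 \cdot 89$ passes a round to the base $2$ but fails to the base $3$, which is why several bases are normally tried.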
\backmatter
\appendix
\begin{thebibliography}{ABCDEF}
\bibitem[1]{TAOCPV2}
Donald Knuth, \textit{The Art of Computer Programming}, Third Edition, Volume Two, Seminumerical Algorithms, Addison-Wesley, 1998
\bibitem[2]{HAC}
A. Menezes, P. van Oorschot, S. Vanstone, \textit{Handbook of Applied Cryptography}, CRC Press, 1996
\bibitem[3]{ROSE}
Michael Rosing, \textit{Implementing Elliptic Curve Cryptography}, Manning Publications, 1999
\bibitem[4]{COMBA}
Paul G. Comba, \textit{Exponentiation Cryptosystems on the IBM PC}. IBM Systems Journal 29(4): 526-538 (1990)
\bibitem[5]{KARA}
A. Karatsuba, Doklady Akad. Nauk SSSR 145 (1962), pp. 293-294
\bibitem[6]{KARAP}
Andre Weimerskirch and Christof Paar, \textit{Generalizations of the Karatsuba Algorithm for Polynomial Multiplication}, Submitted to Design, Codes and Cryptography, March 2002
\bibitem[7]{BARRETT}
Paul Barrett, \textit{Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor}, Advances in Cryptology, Crypto '86, Springer-Verlag.
\bibitem[8]{MONT}
P.L.Montgomery. \textit{Modular multiplication without trial division}. Mathematics of Computation, 44(170):519-521, April 1985.
\bibitem[9]{DRMET}
Chae Hoon Lim and Pil Joong Lee, \textit{Generating Efficient Primes for Discrete Log Cryptosystems}, POSTECH Information Research Laboratories
\bibitem[10]{MMB}
J. Daemen and R. Govaerts and J. Vandewalle, \textit{Block ciphers based on Modular Arithmetic}, State and {P}rogress in the {R}esearch of {C}ryptography, 1993, pp. 80-89
\bibitem[11]{RSAREF}
R.L. Rivest, A. Shamir, L. Adleman, \textit{A Method for Obtaining Digital Signatures and Public-Key Cryptosystems}
\bibitem[12]{DHREF}
Whitfield Diffie, Martin E. Hellman, \textit{New Directions in Cryptography}, IEEE Transactions on Information Theory, 1976
\bibitem[13]{IEEE}
IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)
\bibitem[14]{GMP}
GNU Multiple Precision (GMP), \url{http://www.swox.com/gmp/}
\bibitem[15]{MPI}
Multiple Precision Integer Library (MPI), Michael Fromberger, \url{http://thayer.dartmouth.edu/~sting/mpi/}