This is the second part of the square root algorithm. It was developed in one month during the final stages of finishing this web site and reviewing this document for publication.

While Part I dealt with an empirical discovery for a sequential algorithm to find the square root, and its subsequent mathematical proof, Part II deals with a binary search variation that greatly improves the root's calculation speed (even though the sequential search algorithm is the smallest you'll find, if size is what matters to you). Both algorithms have the same accuracy.

The entire document consists of:

Part I

Part II- Binary Search
- Speed Comparison
- Hardware Version
- Floating-Point
- Root of any Index
- Further Work
- Conclusion

NOTES: The "official" address for this document is http://www.pedrofreire.com/sqrt. Please do not link directly to the address that shows up in your browser — this may be changed without notice. Also note that at the time I write this (August 2000) I am not aware of these algorithms being in use (at least the binary search version), but I cannot confirm this.

A binary search works pretty much like searching for a word in a dictionary. When looking for "Green", you don't open the dictionary in the first page and start looking for your word throughout. Instead you take advantage in the knowledge that the words are ordered alphabetically.

Here's how you would describe the search to someone else: First open the dictionary in half, and look for the first word you find. Compare it to the word you're looking for. If it's a match, you found it. Otherwise divide one of the two halves of the dictionary in half again (the right half if the word you're looking for is after the one you found, the left half if it's before, alphabetically), and repeat the process.

Remember the game "Simon Says" from childhood? "Simon Says: Guess the Number". "I'm thinking of a number between 1 and 100 — guess what it is in under 10 attempts". "I can only tell you if your guess is above or bellow the number I'm thinking of".

The best strategy to win at this game is exactly the same. You start
by guessing 50, and then go on to guess half of that, then half
and so on, based on the answers you get. If you do this you will
*always* get the number in under 8 attempts! This is called
a binary search.

For the square root algorithm, we can do a similar job (because both
the square root and raising to the square are continuous growing
operations, and are therefore "ordered"): first start at half the
maximum root that can be calculated. For **n**-bit numbers, this is

`sqrt((2^n)-1)/2 ~~ sqrt(2^n)/2 = 2^(n/2-1)`

For example, for **n**=16, half of the maximum root is 128. Depending
on how the original value to root, **x**, compares to the square of 128,
you cut one of the two halves in half again. One-quarter of the maximum
root (or half of 128) is 64, so if you should select the lower half,
you compare **x** with the square of 64, if you select the upper half,
you compare **x** with the square of 192.

Since 64 is precisely in the middle of the lower half, it can be obtained by either 0+64 or 128-64, and since 192 is precisely in the middle of the upper half, it can be obtained by either 128+64 or 256-64 (i.e., you get these values by adding one-fourth of the maximum root to the bottom edge of the half, or subtracting it from the top edge of the half, as can be seen in the following graphic).

When you want to split the next quarter in half, you'll need to use one-eighth
of the maximum root (32). On the next step, one-sixteenth (16) and so on.
The easiest way to get the value (half of the section you want to move
to) from these maximum root partial values, is to add them to the current
value (if you want the upper section), or subtract them from it (if you
want the lower section). For instance, if you had moved to 192, and **x** was
now lower than the square of 192, then the next value you want to compare
**x** with is the square of 192-32, or 160.

All this is very well, and the square of the first value is relatively easy to calculate — after all,

`(2^(n/2-1))^2 = 2^(n-2)`

and for each of the next halves (for instance if the section selected is always the lower section),

`(2^(n/2-2))^2 = 2^(n-4)`

`(2^(n/2-3))^2 = 2^(n-6)`

`(2^(n/2-4))^2 = 2^(n-8)`

…

`(2^(n/2-k))^2 = 2^(n-2k)`

which means that the square of these powers of 2 can be calculated consecutively by shifting the first square twice to the right, just as their square root (the halves of the maximum root) can be calculated by shifting the first half to the right by one. But what if we need to select an upper section? We'll get to that.

So far we only have integer shifts, additions, subtractions and comparisons, which is good for our purposes. But the entire method is not yet done. We have a way to approximate the roots, but not to approximate the squares. For instance, if you change 128 to 192 by adding 64, what do you do with their respective squares? Do you just calculate a square by doing a multiplication?

Well, not necessarily (which is good since that would be slow!).
You see, if **a** is the current root search value (192 for
example), and **b** the maximum root half you want to add or
subtract from it (32 in this example), what you want to find is

`(a+-b)^2`

and you have

`(a+-b)^2 = a^2 +- 2ab + b^2`

which means that the offset

`+-2ab + b^2`

should be added to the current square, where the sign of 2**ab**
is identical to the operation performed on the roots (addition /
subtraction).

Here we have 3 multiplications, but let's
take a look at them: the first is a multiplication by 2, which can
be replaced by a shift operation. Then you multiply by **b** which
is itself a power of 2, so this multiplication can also be replaced
by a shift operation. Finally, **b**² is easy to calculate
exactly because **b** is a power of 2, as we have just seen!

In other words, the offset is actually

`a * 2^(n/2-k+1) + 2^(n-2k)`

which, for an even **n** and integer **k**>0, always
returns positive integer exponents of 2, which means all
operations are fast shifts and additions, not multiplications!
This also proves mathematically the steps used in the method.

This allows us to build the first C code version of this loop:

#include <math.h> #include <limits.h> /* … */ register unsigned long int x, r, h, sq, x2; register int i; /* no need to make it as big as x, etc */ /* … */ sq = (ULONG_MAX >> 2) + 1; /* (2^(n/2-1))^2 = 2^(n-2) */ r = sqrt(sq); /* 2^(n/2-1) */ i = log(r) / log(2); /* n/2-1 */ /* note: re-code the above 2 lines so that their calculations are done either by you or by the compiler at compile time. smart compilers *may* already do this */ h = r; x2 = sq; do { sq >>= 2; /* sq = b^2 */ h >>= 1; /* h = b */ if( x >= x2 ) /* x == x2 can go anywhere */ { x2 = x2 + (r << i) + sq; /* (r << i) == 2ab */ r += h; } else { x2 = x2 - (r << i) + sq; /* (r << i) == 2ab */ r -= h; } } while( --i != 0 ); /* note: x is between x2 and a previously tested x2. since on the last iteration h=1, either r is the root we want, or one of its adjacent values is */ if( x <= (x2-r) ) r--; else if( x > (x2+r) ) r++;

Here, the variable `i`

has a similar function to the
above formulas' **k**, though they don't match in terms
of values. The last 4 lines of code are a product of the
analysis made in Part I of
this document. Upon entry to the code, `x`

should have
the value to root, and upon exit `r`

will have the
integer result rounded to the nearest integer.

What you can see right away is that instead of the old sequential algorithms that took

`sqrt(x)-1` | to | `sqrt(x)` |

iterations to finish, which means that the number of *iterations*
ranged from

`0 <= sqrt(x)-1 <= 2^(n/2)-1` | to | `0 <= sqrt(x) <= 2^(n/2)` |

the binary search has a fixed number of iterations of

`(n/2)-1`

This is a huge improvement over the prior algorithms (though they
are still faster for a very small range of roots). For instance,
even for a low value of **n** (say **n**=16), the other
algorithms took anywhere between 0 and 256 iterations to find the
root, whereas this one takes exactly 7 iterations to find that
same root!

We don't use `x==x2`

as a loop termination condition
for the same reason since it only improves the root calculation
speed of

`2^(n/2-1) - 1`

values of `x`

and makes it worse for all others!

The next step in optimisation asks for your focus on shifts.
See how the only use of `r`

in the loop is shifted left
by an ever-decreasing number of digits, and it is updated by
a value (`h`

) that is shifted right by 1 on each
iteration? If you were to start the loop with `r`

fully
shifted left and shift it right by 1 on each iteration, `h`

would have to be shifted right by 2 (the same direction and
amount of `sq`

's shift). See how `h`

and `i`

are
not used anywhere else in the code?

This allows us to build the optimised C code version of this loop,
where `h`

no longer has the meaning of "root half", but rather
of "helper" value that merges the old code's `h`

and `sq`

.

#include <limits.h> /* … */ register unsigned long int x, r, h, x2; /* … */ r = (ULONG_MAX >> 2) + 1; /* (2^(n/2-1)) << (n/2-1) */ h = (ULONG_MAX >> 3) + 1; /* (2^(n/2-1)) << (n/2-2) */ x2 = (ULONG_MAX >> 2) + 1; /* (2^(n/2-1))^2 = 2^(n-2) */ do { if( x >= x2 ) /* x == x2 can go anywhere */ { x2 = x2 + r + (h >> 1); r = (r+h) >> 1; } else { x2 = x2 - r + (h >> 1); r = (r-h) >> 1; } } while( (h>>=2) != 0 ); if( x <= (x2-r) ) r--; else if( x > (x2+r) ) r++;

Let's take a look at an Intel APX/IA-32 optimised assembly version of the code
(assuming `EAX`

=`x`

and `EBX`

=`r`

,
both unsigned integers):

; … ; i486 cycles: mov EBX, 40000000h ; EBX = (2^(n/2-1)) << (n/2-1) 1 mov ECX, 20000000h ; ECX = (2^(n/2-1)) << (n/2-2) 1 mov EDX, 40000000h ; EDX = 2^(n-2) 1 loop1: cmp EAX, EDX ; 1 jb bellow ; 3/1 ; add EDX, EBX ; 1 add EBX, ECX ; 1 shr EBX, 1 ; 2 shr ECX, 2 ; 2 lea EDX, [EDX+2*ECX] ; 1 jnz loop1 ; 3/1 jmp done ; 3/1 bellow: sub EDX, EBX ; 1 sub EBX, ECX ; 1 shr EBX, 1 ; 2 shr ECX, 2 ; 2 lea EDX, [EDX+2*ECX] ; 1 jnz loop1 ; 3/1 done: add EDX, 2 ; 1 for loop, 1 to work w/ carry 1 sub EDX, EBX ; EDX = lower bound of EBX^2 1 lea ECX, [EDX+2*EBX] ; ECX = upper bound of EBX^2, +1 1 cmp EAX, EDX ; 1 sbb EBX, 0 ; 1 cmp EAX, ECX ; 1 sbb EBX, -1 ; 1

This is 32-bit code, as can be seen. To convert this code to 16-bit, you should:

- Replace all 32-bit registers by 16-bit registers (i.e., remove the "E");
- Just remove the right-most four zeros (0) from the first three operations;
- Replace the two
`shr ECX, 2`

/`lea EDX, [EDX+2*ECX]`

pairs with`shr CX, 1`

/`add DX, CX`

/`shr CX, 1`

since shift by 2 and that type of`lea`

operations are not supported; - Replace
`add EDX, 2`

just after the label "done" with`inc DX`

; - Replace
`sub EDX, EBX`

/`lea ECX, [EDX+2*EBX]`

just after that with`mov CX, DX`

/`sub DX, BX`

/`add CX, BX`

, since that type of`lea`

operation is not supported.

The clock count for this assembly code is

- Best case:
- 3+12(
**n**/2-1)-2+3+7 = 6**n**-1 - Worst case:
- 3+14(
**n**/2-2)+12-2+3+7 = 7**n**-5

which means that this code can calculate a 32-bit square root
(**n**=32) in 191~219 clock cycles, and a 16-bit square root
(**n**=16) in 95~107 clock cycles, which is 14% to 30%
*faster* than hard-sqrt!

If implemented in hardware, this algorithm could be parallelised, which means even more performance. Both halves of the loop would be calculated in parallel, and then the proper result selected in a multiplexer from the result of the initial comparison. Shifts don't take any time because they can be implemented by connecting the output bits of an operation circuit to the input bits of the register, already shifted.

As long as there is enough time within one clock cycle to propagate the electrical signals from one addition / subtraction circuit, one multiplexer and another addition circuit, the loop can be implemented in 1 clock cycle. If not, using clock-synchronised multiplexers and doing some changes, should allow you to do the loop in 2 clock cycles, at least.

Let's see a diagram of the former circuit:

This diagram only shows the part of the square-root circuit that
runs the loop — it doesn't show prologue or epilogue code.
Inputs that have a note of **>>1** or **>>2**
have the bits of their register word connected in such a way that
they implement the corresponding shift operation before entering
that circuit. For example if you connect output bit **u** to
input bit **u**-1 for all bits of that word, and connect the
most-significant input bit to 0, you're implementing a shift right
by 1 between the two circuits.

All clocks pins of all registers (**clk**)
should be connected to the same common clock. The loop is over
(and **EBX** holds the almost-ready root), when the output
pin **zero** is set, indicating **ECX**=0. The two
multiplexers return value **a** if **sel**=0 (false), and
value **b** if **sel**=1 (true). Finally, the adder
circuit that implements **EBX+ECX** and the right-most adder
circuit can both be replaced by **n**-bit **OR** circuits.

Do note that even though the algorithm is highly optimised, the
circuit that implements the algorithm is far from it. For instance,
circuits that do both an addition or subtraction based on a **sel**
signal could eliminate the two **ADD**/**SUB**/**MUX** pairs
and replace each for a single **ADD+SUB** circuit. In fact, the
**ADD**/**SUB**/**MUX** pair that implements
**EBX**±**ECX** can be replaced by an **OR** with
**ECX** and a specialised **XOR** with the **sel** signal.
An optimised comparison circuit will give the result **EAX<EDX?**
much quicker than a **SUB** circuit, and so on. The diagram shown
here is an illustration of one possibility, not an optimisation.

This easily extends to floating-point (real numbers). Floating-point IEEE numbers are stored in the format

`+-"mantissa" * 2^"xpon",` | `0 <= "mantissa" < 2,` |

`|"xpon"| in NN_0` |

where **xpon** is the exponent and **mantissa**... the mantissa!
If the exponent has a special value named the "denormal" value,
the mantissa is between 0 (inclusive) and 1 (exclusive). If the
exponent has any other value (a "normal" value), the mantissa is
between 1 (inclusive) and 2 (exclusive). This means the left-most
(most significant) bit can be determined by the exponent, which
is why most floating-point formats do not code this bit.

The mantissa can be made into an integer by multiplying it by a power
of 2, which is the same as subtracting the exponent of that power from
**xpon**. This reduces the problem to calculating the square
root of an integer (we already know how to do this), and dividing
**xpon** by 2 (trivial).

`sqrt("mantissa" * 2^"xpon") = sqrt("mantissa") * 2^("xpon"/2)`

In order to return **n** bits of significant digits, you need
2**n** digits in the mantissa to start with and its
most-significant-bit set (1), so you will need to multiply it by yet
another power of 2. This and other little things around this (returning
the mantissa into a value between 0 and 2, handling an odd **xpon**
before its division by 2, handling an odd number of bits in the
mantissa — some registers are set-up to

`2^(n/2)`

and similar values which are not integer if **n** is odd! —
and so on), are what complicates this process.

Since it is beyond the scope of this document, I have written a test program in ANSI C to describe and test two slightly different algorithms that calculate a floating-point square root (these algorithms are best implemented in hardware, and the test program has been designed to ease the understanding of how to lay out the algorithms in that medium). You can download this program (C source only) at the end of this section.

The first of these algorithms is slightly faster and takes less circuitry to implement (which means it is cheaper), but returns values that are not precise, but rather have a small measurable, quantifiable error. This should be more than adequate for graphical applications, for instance. Such algorithm can be implemented in hardware using the following diagram:

The same comments that explained the integer algorithm's hardware
implementation also apply to this diagram. Here, circuit names that
end in ***2** refer to circuits with (or that operate on) 2**n**
bits instead of just **n** bits. Notes of **hi** mean the top
**n** bits of such circuits, whereas **low** mean the bottom
**n** bits. For instance, one input to the right-most adder circuit
are the top **n** bits of **ECX*2** shifted right by 1. The
loop ends when the top **n** bits of **EBX*2** (signal
**hi_zero**) are 0. **low(EBX*2)** will then hold the result's mantissa
and **xpon** its exponent (which was properly modified prior to
loop start). To understand the need for **xpon**'s counter, read
the code's comments. The output of the comparison **SUB** circuit
need not be complemented if you swap the **a** and **b** entries
of the multiplexers, but the diagram was left this way so that the
only structural change to the integer diagram is in this comparison.

The second algorithm will take some more circuitry, but it is
*not* more complex. It will return exact values for the square roots.

As you can see, now nearly all circuits are 2**n** bits in size, which
makes it nearly twice as expensive as the integer algorithm, and about
30% more expensive than the first floating-point algorithm already
described. It is, however, the preferred circuit to implement in FPUs
(floating-point units), due to its precision. This loop ends when
**ECX*2** is 0 or 1 (signal **zero_one**), i.e., when all of its
bits except for the least-significant-bit are 0.

The test program can be compiled to test any of these two algorithms
in just about any ANSI C compliant machine, with just about any type
of floating-point format (please read the section on restrictions at
the head of the code). It will be able to be compiled in many forms,
to test for a variety of floating-point formats and algorithm
variations. It can test single values (displaying the algorithms'
progress) or entire ranges of values for empirical proof of
accuracy. You should take a look at the code of function
**gotest_real()** to see both algorithms' implementations and
comments.

Please read the comments at the head of the code for more details and conclusions (for instance, for 32-bit IEEE floating-point numbers, the hardware algorithm would be nearly 30% faster than a latest-generation (in August/2000) Intel Itanium).

Version 1.00 ftest.zip (21kb) |

When using the binary search algorithm for a root of index **i**, the
root values taken at each iteration (in the first example, 128 followed by 64
or 192, followed by 32 or ..., and so on) are not the same. The first
value to test for the root (we'll call it **h**) is half of the
maximum root we can generate using **n**-bit numbers, or

`h = root(i) (2^n-1) / 2`

which is very close to

`h = 1/2 * root(i) (2^n)`

`h = 2^(n/i-1)`

If this value is not integer, you need to use a **double h**
instead of **int**. **r** and **x2** in the C code
described bellow, follow the type of **h**. For **n**=16
and **i**=2 (the example we've been using), **h**=128,
as expected.

This value is still divided by 2 in each iteration of the loop, but
unlike what happened with **i**=2, where **h** was always
evenly divisible by 2, we cannot guarantee that now, so we need to
use **h**/=2 and **double h** instead of **h**>>=1
in those cases.

Updating the value of **x2** within the C loop is no longer
the same, as you may expect. But the problem is not complex, though.
If you call **XA(r,h)** to

`(r+h)^i - r^i`

and **XS(r,h)** to

`(r-h)^i - r^i`

then you can add these to **x2** when adding or subtracting **h**
from **r**, respectively. For instance, for **i**=2,
**XA(r,h)**=**h***(2***r**+**h**), and for **i**=3,
**XA(r,h)**=**h***(3***r***(**r**+**h**)+**h*****h**).

Finally, we need a loop termination condition. We could use
**h**<0.001 or some other low value, but we would need to
fine-tune it to each index. We could also use **XA(r,h)**<0.5
but that will force too many iterations. The solution found was to
terminate the loop when a) the loop "side" (addition/subtraction)
changed, *and* b) **r** rounded to the nearest integer did not
change as the side changed. This means the root is between the
two previously tested squares, and equals **r** rounded to the
nearest integer. This allows us to dismiss the epilogue code (which
is not good in terms of speed, but does simplify the code), but does
have a drawback — the minimum and maximum values for the **n**-bit
range will cause an infinite loop because the loop will continuously
approach them, but never leap over them. They should be treated as
special cases.

With this we can now build the following C code:

#include <math.h> #include <limits.h> /* … */ #define n ((int) floor( log(ULONG_MAX) / log(2) )) #define i /* define i as the root's index */ /* … */ register unsigned long int x, prev_r; register int prev_side; register double r, h, x2; /* … */ r = pow( 2, (double) n/i-1 ); /* 2^(n/i-1) */ x2 = (ULONG_MAX >> i) + 1; /* r^i = 2^(n-i) */ /* note: re-code the above "r=" line so that its calculations are done either by you or by the compiler at compile time. smart compilers *may* already do this */ h = r; prev_side = 0; /* none; prev_r needn't be set-up because of this */ for(;;) /* forever */ { h /= 2; if( x >= x2 ) /* x == x2 can go anywhere */ { if( prev_side == -1 ) { if( prev_r == ((int) r+.5) ) break; prev_side = +1; } prev_r = ((int) r+.5); x2 +=XA(r,h); r += h; } else { if( prev_side == +1 ) { if( prev_r == ((int) r+.5) ) break; prev_side = -1; } prev_r = ((int) r+.5); x2 +=XS(r,h); r -= h; } } r = (int) r+.5;

This code will only work on 0<**x**<`ULONG_MAX`

(i.e.,
positive numbers except the special cases of 0 and `ULONG_MAX`

).
How efficient is the code? It depends on your ability to optimise code.
Please take a look at the sequential search algorithm's
Root of any Index
section in Part I for more details and example. I will not expand
this issue.

The square root is often used in formulas (especially 3D formulas) scaled (multiplied) by a constant, or another value. Because the binary-search algorithm iterates through each bit of the result and therefore is similar to the operation of a dumb multiplication circuit, one could alter the first (or second) floating-point diagram shown so that a multiplication is done in parallel with the square root calculation, and an integer result is returned (or floating-point, if you model the circuit to it) with very little loss of accuracy (if any).

This is quite feasible if you think about it: as **EBX** is being
built on any variation of the algorithm, the bit represented by
**ECX** will always become set on **EBX**, regardless of whether
you add or subtract **ECX** from it, but the bit prior to that
(and *only* that bit) may be changed (if a subtraction was
used) from 1 to 0. The parallel multiplication may then follow that
prior bit in each cycle, instead of the bit in **ECX**. I will
not get further into that, however — the process is straightforward
and can lead to much simplified and accelerated 3D graphics circuits.

Another direction of work is in the development of a "Guess Search". This search came about due to the first two paragraphs that describe the binary search.

When you look through a dictionary, you don't open it at the middle at first, bit rather take an educated guess to where do you think the word will be. And then, from the comparison with the word you find there, you don't blindly cut one of the two split sections in half, but rather move into it some distance — again an educated guess.

But how well can you guess square roots? Well, as we have already seen, "eating away" half of the least significant digits in a number (the integer part of that half rounded down if the number of significant digits is odd) yields a pretty good approximation. This comes from

`sqrt(2^n) = 2^(n/2)`

So if you find an **n** such that

`2^(n-1) <= x < 2^n`

`n = "int"(log_2 x)+1`

where the logarithm is rounded down, then

`x / 2^(n/2)`

would be a good (and fast) approximation. The error is given by

`sqrt(x) - x / 2^(n/2)`

which has its minimum value at nearly

`sqrt(2^n) - 2^n / 2^(n/2) = 0`

as expected, and its maximum value at

`sqrt(2^(n-1)) - 2^(n-1) / 2^(n/2) = 2^((n-1)/2) - 2^((n-2)/2)`

which for 16-bit words that have the maximum **n** at 16 (this
is a slightly different meaning of **n** than what we've
been giving it so far), the maximum error is 54 (>21%); for
32-bit words, the maximum **n** is 32, and the maximum error
is 13573 (<21%).

The following Intel APX/IA-32 (x86) assembly code sample returns the
approximation in `EBX`

, leaving `EAX`

unchanged but
destroying `ECX`

.

; … ; i486 cycles: bsr ECX, EAX ; ECX = n-1 6-103 inc ECX ; ECX = n 1 shr ECX, 1 ; 2 mov EBX, EAX ; 2^(n-1) <= EAX,EBX < 2^n 1 shr EBX, CL ; EBX = EBX / 2^(n>>1) 2 ; …

Unfortunately, you can't just plug it in at the head of a
sequential search algorithm just like this. The approximation
prevented the sequential search algorithm from reducing `EAX`

as it normally would. Adding it to the binary-search algorithm
is also silly because its maximum error means that only one
iteration of the algorithm would be saved, and too much time is
taken preparing the approximation.
Plus the only way to find this approximation's square (so as to
use it in either of these algorithms) is to do a multiplication
which is not all that fast. Further work on this issue is
therefore left to the reader.

In terms of speed, the binary search is one of the best (or *the*
best) algorithm to use in software on CPUs that only have integer
operations, or to implement quite easily by hardware. In terms of size,
the first sequential search described in Part I has no parallel.

These are not perhaps the best algorithms around, and am I am not (possibly) the first to develop them. I am unaware if their use anywhere, though, or of papers or books describing such algorithms as I describe in this document, but I have not done extensive search for this.

As for my contribution: I hope you enjoy it.

Photography by NASA