R&D: Square Root in Assembly | Part II

This is the second part of the square root algorithm. It was developed in one month during the final stages of finishing this web site and reviewing this document for publication.

While Part I dealt with an empirical discovery for a sequential algorithm to find the square root, and its subsequent mathematical proof, Part II deals with a binary search variation that greatly improves the root's calculation speed (even though the sequential search algorithm is the smallest you'll find, if size is what matters to you). Both algorithms have the same accuracy.

The entire document consists of:

Part I

Part II

NOTES: The "official" address for this document is http://www.pedrofreire.com/sqrt. Please do not link directly to the address that shows up in your browser — this may be changed without notice. Also note that at the time I write this (August 2000) I am not aware of these algorithms being in use (at least the binary search version), but I cannot confirm this.

Binary Search

A binary search works pretty much like searching for a word in a dictionary. When looking for "Green", you don't open the dictionary in the first page and start looking for your word throughout. Instead you take advantage in the knowledge that the words are ordered alphabetically.

Here's how you would describe the search to someone else: First open the dictionary in half, and look for the first word you find. Compare it to the word you're looking for. If it's a match, you found it. Otherwise divide one of the two halves of the dictionary in half again (the right half if the word you're looking for is after the one you found, the left half if it's before, alphabetically), and repeat the process.

Remember the game "Simon Says" from childhood? "Simon Says: Guess the Number". "I'm thinking of a number between 1 and 100 — guess what it is in under 10 attempts". "I can only tell you if your guess is above or bellow the number I'm thinking of".

The best strategy to win at this game is exactly the same. You start by guessing 50, and then go on to guess half of that, then half and so on, based on the answers you get. If you do this you will always get the number in under 8 attempts! This is called a binary search.

For the square root algorithm, we can do a similar job (because both the square root and raising to the square are continuous growing operations, and are therefore "ordered"): first start at half the maximum root that can be calculated. For n-bit numbers, this is

`sqrt((2^n)-1)/2 ~~ sqrt(2^n)/2 = 2^(n/2-1)`

For example, for n=16, half of the maximum root is 128. Depending on how the original value to root, x, compares to the square of 128, you cut one of the two halves in half again. One-quarter of the maximum root (or half of 128) is 64, so if you should select the lower half, you compare x with the square of 64, if you select the upper half, you compare x with the square of 192.

Since 64 is precisely in the middle of the lower half, it can be obtained by either 0+64 or 128-64, and since 192 is precisely in the middle of the upper half, it can be obtained by either 128+64 or 256-64 (i.e., you get these values by adding one-fourth of the maximum root to the bottom edge of the half, or subtracting it from the top edge of the half, as can be seen in the following graphic).

[0------64-----128----192----256]; 192->256 = 128->192 = etc. = 64

When you want to split the next quarter in half, you'll need to use one-eighth of the maximum root (32). On the next step, one-sixteenth (16) and so on. The easiest way to get the value (half of the section you want to move to) from these maximum root partial values, is to add them to the current value (if you want the upper section), or subtract them from it (if you want the lower section). For instance, if you had moved to 192, and x was now lower than the square of 192, then the next value you want to compare x with is the square of 192-32, or 160.

All this is very well, and the square of the first value is relatively easy to calculate — after all,

`(2^(n/2-1))^2 = 2^(n-2)`

and for each of the next halves (for instance if the section selected is always the lower section),

`(2^(n/2-2))^2 = 2^(n-4)`

`(2^(n/2-3))^2 = 2^(n-6)`

`(2^(n/2-4))^2 = 2^(n-8)`

`(2^(n/2-k))^2 = 2^(n-2k)`

which means that the square of these powers of 2 can be calculated consecutively by shifting the first square twice to the right, just as their square root (the halves of the maximum root) can be calculated by shifting the first half to the right by one. But what if we need to select an upper section? We'll get to that.

So far we only have integer shifts, additions, subtractions and comparisons, which is good for our purposes. But the entire method is not yet done. We have a way to approximate the roots, but not to approximate the squares. For instance, if you change 128 to 192 by adding 64, what do you do with their respective squares? Do you just calculate a square by doing a multiplication?

Well, not necessarily (which is good since that would be slow!). You see, if a is the current root search value (192 for example), and b the maximum root half you want to add or subtract from it (32 in this example), what you want to find is


and you have

`(a+-b)^2 = a^2 +- 2ab + b^2`

which means that the offset

`+-2ab + b^2`

should be added to the current square, where the sign of 2ab is identical to the operation performed on the roots (addition / subtraction).

Here we have 3 multiplications, but let's take a look at them: the first is a multiplication by 2, which can be replaced by a shift operation. Then you multiply by b which is itself a power of 2, so this multiplication can also be replaced by a shift operation. Finally, b² is easy to calculate exactly because b is a power of 2, as we have just seen!

In other words, the offset is actually

`a * 2^(n/2-k+1) + 2^(n-2k)`

which, for an even n and integer k>0, always returns positive integer exponents of 2, which means all operations are fast shifts and additions, not multiplications! This also proves mathematically the steps used in the method.

This allows us to build the first C code version of this loop:

#include <math.h>
#include <limits.h>
/* … */
register unsigned long int x, r, h, sq, x2;
register int i;    /* no need to make it as big as x, etc */

/* … */
sq = (ULONG_MAX >> 2) + 1;    /* (2^(n/2-1))^2 = 2^(n-2) */
r  = sqrt(sq);                /* 2^(n/2-1) */
i  = log(r) / log(2);         /* n/2-1 */
/* note: re-code the above 2 lines so that their calculations
   are done either by you or by the compiler at compile time.
   smart compilers *may* already do this */
h  = r;
x2 = sq;
do	{
	sq >>= 2;        /* sq = b^2 */
	h >>= 1;         /* h = b */
	if( x >= x2 )    /* x == x2 can go anywhere */
		x2 = x2 + (r << i) + sq;    /* (r << i) == 2ab */
		r += h;
		x2 = x2 - (r << i) + sq;    /* (r << i) == 2ab */
		r -= h;
	while( --i != 0 );
/* note: x is between x2 and a previously tested x2. since
   on the last iteration h=1, either r is the root we want,
   or one of its adjacent values is */
if( x <= (x2-r) )
else if( x > (x2+r) )

Here, the variable i has a similar function to the above formulas' k, though they don't match in terms of values. The last 4 lines of code are a product of the analysis made in Part I of this document. Upon entry to the code, x should have the value to root, and upon exit r will have the integer result rounded to the nearest integer.

What you can see right away is that instead of the old sequential algorithms that took

`sqrt(x)-1` to `sqrt(x)`

iterations to finish, which means that the number of iterations ranged from

`0 <= sqrt(x)-1 <= 2^(n/2)-1` to `0 <= sqrt(x) <= 2^(n/2)`

the binary search has a fixed number of iterations of


This is a huge improvement over the prior algorithms (though they are still faster for a very small range of roots). For instance, even for a low value of n (say n=16), the other algorithms took anywhere between 0 and 256 iterations to find the root, whereas this one takes exactly 7 iterations to find that same root!

We don't use x==x2 as a loop termination condition for the same reason since it only improves the root calculation speed of

`2^(n/2-1) - 1`

values of x and makes it worse for all others!

The next step in optimisation asks for your focus on shifts. See how the only use of r in the loop is shifted left by an ever-decreasing number of digits, and it is updated by a value (h) that is shifted right by 1 on each iteration? If you were to start the loop with r fully shifted left and shift it right by 1 on each iteration, h would have to be shifted right by 2 (the same direction and amount of sq's shift). See how h and i are not used anywhere else in the code?

This allows us to build the optimised C code version of this loop, where h no longer has the meaning of "root half", but rather of "helper" value that merges the old code's h and sq.

#include <limits.h>
/* … */
register unsigned long int x, r, h, x2;

/* … */
r  = (ULONG_MAX >> 2) + 1;    /* (2^(n/2-1)) << (n/2-1) */
h  = (ULONG_MAX >> 3) + 1;    /* (2^(n/2-1)) << (n/2-2) */
x2 = (ULONG_MAX >> 2) + 1;    /* (2^(n/2-1))^2 = 2^(n-2) */
do	{
	if( x >= x2 )    /* x == x2 can go anywhere */
		x2 = x2 + r + (h >> 1);
		r = (r+h) >> 1;
		x2 = x2 - r + (h >> 1);
		r = (r-h) >> 1;
	while( (h>>=2) != 0 );
if( x <= (x2-r) )
else if( x > (x2+r) )

Speed Comparison

Let's take a look at an Intel APX/IA-32 optimised assembly version of the code (assuming EAX=x and EBX=r, both unsigned integers):

	; …
	;                                                   i486 cycles:
	mov   EBX, 40000000h      ;  EBX = (2^(n/2-1)) << (n/2-1)      1
	mov   ECX, 20000000h      ;  ECX = (2^(n/2-1)) << (n/2-2)      1
	mov   EDX, 40000000h      ;  EDX = 2^(n-2)                     1
	cmp   EAX, EDX            ;                                    1
	jb    bellow              ;                                  3/1
	add   EDX, EBX            ;                                    1
	add   EBX, ECX            ;                                    1
	shr   EBX, 1              ;                                    2
	shr   ECX, 2              ;                                    2
	lea   EDX, [EDX+2*ECX]    ;                                    1
	jnz   loop1               ;                                  3/1
	jmp   done                ;                                  3/1
	sub   EDX, EBX            ;                                    1
	sub   EBX, ECX            ;                                    1
	shr   EBX, 1              ;                                    2
	shr   ECX, 2              ;                                    2
	lea   EDX, [EDX+2*ECX]    ;                                    1
	jnz   loop1               ;                                  3/1
	add   EDX, 2              ;  1 for loop, 1 to work w/ carry    1
	sub   EDX, EBX            ;  EDX = lower bound of EBX^2        1
	lea   ECX, [EDX+2*EBX]    ;  ECX = upper bound of EBX^2, +1    1
	cmp   EAX, EDX            ;                                    1
	sbb   EBX, 0              ;                                    1
	cmp   EAX, ECX            ;                                    1
	sbb   EBX, -1             ;                                    1

This is 32-bit code, as can be seen. To convert this code to 16-bit, you should:

  1. Replace all 32-bit registers by 16-bit registers (i.e., remove the "E");
  2. Just remove the right-most four zeros (0) from the first three operations;
  3. Replace the two shr ECX, 2 / lea EDX, [EDX+2*ECX] pairs with shr CX, 1 / add DX, CX / shr CX, 1 since shift by 2 and that type of lea operations are not supported;
  4. Replace add EDX, 2 just after the label "done" with inc DX;
  5. Replace sub EDX, EBX / lea ECX, [EDX+2*EBX] just after that with mov CX, DX / sub DX, BX / add CX, BX, since that type of lea operation is not supported.

The clock count for this assembly code is

Best case:
3+12(n/2-1)-2+3+7 = 6n-1
Worst case:
3+14(n/2-2)+12-2+3+7 = 7n-5

which means that this code can calculate a 32-bit square root (n=32) in 191~219 clock cycles, and a 16-bit square root (n=16) in 95~107 clock cycles, which is 14% to 30% faster than hard-sqrt!

Hardware Version

If implemented in hardware, this algorithm could be parallelised, which means even more performance. Both halves of the loop would be calculated in parallel, and then the proper result selected in a multiplexer from the result of the initial comparison. Shifts don't take any time because they can be implemented by connecting the output bits of an operation circuit to the input bits of the register, already shifted.

As long as there is enough time within one clock cycle to propagate the electrical signals from one addition / subtraction circuit, one multiplexer and another addition circuit, the loop can be implemented in 1 clock cycle. If not, using clock-synchronised multiplexers and doing some changes, should allow you to do the loop in 2 clock cycles, at least.

Let's see a diagram of the former circuit:

[circuit diagram]

This diagram only shows the part of the square-root circuit that runs the loop — it doesn't show prologue or epilogue code. Inputs that have a note of >>1 or >>2 have the bits of their register word connected in such a way that they implement the corresponding shift operation before entering that circuit. For example if you connect output bit u to input bit u-1 for all bits of that word, and connect the most-significant input bit to 0, you're implementing a shift right by 1 between the two circuits.

All clocks pins of all registers (clk) should be connected to the same common clock. The loop is over (and EBX holds the almost-ready root), when the output pin zero is set, indicating ECX=0. The two multiplexers return value a if sel=0 (false), and value b if sel=1 (true). Finally, the adder circuit that implements EBX+ECX and the right-most adder circuit can both be replaced by n-bit OR circuits.

Do note that even though the algorithm is highly optimised, the circuit that implements the algorithm is far from it. For instance, circuits that do both an addition or subtraction based on a sel signal could eliminate the two ADD/SUB/MUX pairs and replace each for a single ADD+SUB circuit. In fact, the ADD/SUB/MUX pair that implements EBX±ECX can be replaced by an OR with ECX and a specialised XOR with the sel signal. An optimised comparison circuit will give the result EAX<EDX? much quicker than a SUB circuit, and so on. The diagram shown here is an illustration of one possibility, not an optimisation.


This easily extends to floating-point (real numbers). Floating-point IEEE numbers are stored in the format

`+-"mantissa" * 2^"xpon",` `0 <= "mantissa" < 2,`
`|"xpon"| in NN_0`

where xpon is the exponent and mantissa... the mantissa! If the exponent has a special value named the "denormal" value, the mantissa is between 0 (inclusive) and 1 (exclusive). If the exponent has any other value (a "normal" value), the mantissa is between 1 (inclusive) and 2 (exclusive). This means the left-most (most significant) bit can be determined by the exponent, which is why most floating-point formats do not code this bit.

The mantissa can be made into an integer by multiplying it by a power of 2, which is the same as subtracting the exponent of that power from xpon. This reduces the problem to calculating the square root of an integer (we already know how to do this), and dividing xpon by 2 (trivial).

`sqrt("mantissa" * 2^"xpon") = sqrt("mantissa") * 2^("xpon"/2)`

In order to return n bits of significant digits, you need 2n digits in the mantissa to start with and its most-significant-bit set (1), so you will need to multiply it by yet another power of 2. This and other little things around this (returning the mantissa into a value between 0 and 2, handling an odd xpon before its division by 2, handling an odd number of bits in the mantissa — some registers are set-up to


and similar values which are not integer if n is odd! — and so on), are what complicates this process.

Since it is beyond the scope of this document, I have written a test program in ANSI C to describe and test two slightly different algorithms that calculate a floating-point square root (these algorithms are best implemented in hardware, and the test program has been designed to ease the understanding of how to lay out the algorithms in that medium). You can download this program (C source only) at the end of this section.

The first of these algorithms is slightly faster and takes less circuitry to implement (which means it is cheaper), but returns values that are not precise, but rather have a small measurable, quantifiable error. This should be more than adequate for graphical applications, for instance. Such algorithm can be implemented in hardware using the following diagram:

[circuit diagram]

The same comments that explained the integer algorithm's hardware implementation also apply to this diagram. Here, circuit names that end in *2 refer to circuits with (or that operate on) 2n bits instead of just n bits. Notes of hi mean the top n bits of such circuits, whereas low mean the bottom n bits. For instance, one input to the right-most adder circuit are the top n bits of ECX*2 shifted right by 1. The loop ends when the top n bits of EBX*2 (signal hi_zero) are 0. low(EBX*2) will then hold the result's mantissa and xpon its exponent (which was properly modified prior to loop start). To understand the need for xpon's counter, read the code's comments. The output of the comparison SUB circuit need not be complemented if you swap the a and b entries of the multiplexers, but the diagram was left this way so that the only structural change to the integer diagram is in this comparison.

The second algorithm will take some more circuitry, but it is not more complex. It will return exact values for the square roots.

[circuit diagram]

As you can see, now nearly all circuits are 2n bits in size, which makes it nearly twice as expensive as the integer algorithm, and about 30% more expensive than the first floating-point algorithm already described. It is, however, the preferred circuit to implement in FPUs (floating-point units), due to its precision. This loop ends when ECX*2 is 0 or 1 (signal zero_one), i.e., when all of its bits except for the least-significant-bit are 0.

The test program can be compiled to test any of these two algorithms in just about any ANSI C compliant machine, with just about any type of floating-point format (please read the section on restrictions at the head of the code). It will be able to be compiled in many forms, to test for a variety of floating-point formats and algorithm variations. It can test single values (displaying the algorithms' progress) or entire ranges of values for empirical proof of accuracy. You should take a look at the code of function gotest_real() to see both algorithms' implementations and comments.

Please read the comments at the head of the code for more details and conclusions (for instance, for 32-bit IEEE floating-point numbers, the hardware algorithm would be nearly 30% faster than a latest-generation (in August/2000) Intel Itanium).

  Version 1.00
ftest.zip (21kb)

Root of any Index

When using the binary search algorithm for a root of index i, the root values taken at each iteration (in the first example, 128 followed by 64 or 192, followed by 32 or ..., and so on) are not the same. The first value to test for the root (we'll call it h) is half of the maximum root we can generate using n-bit numbers, or

`h = root(i) (2^n-1) / 2`

which is very close to

`h = 1/2 * root(i) (2^n)`

`h = 2^(n/i-1)`

If this value is not integer, you need to use a double h instead of int. r and x2 in the C code described bellow, follow the type of h. For n=16 and i=2 (the example we've been using), h=128, as expected.

This value is still divided by 2 in each iteration of the loop, but unlike what happened with i=2, where h was always evenly divisible by 2, we cannot guarantee that now, so we need to use h/=2 and double h instead of h>>=1 in those cases.

Updating the value of x2 within the C loop is no longer the same, as you may expect. But the problem is not complex, though. If you call XA(r,h) to

`(r+h)^i - r^i`

and XS(r,h) to

`(r-h)^i - r^i`

then you can add these to x2 when adding or subtracting h from r, respectively. For instance, for i=2, XA(r,h)=h*(2*r+h), and for i=3, XA(r,h)=h*(3*r*(r+h)+h*h).

Finally, we need a loop termination condition. We could use h<0.001 or some other low value, but we would need to fine-tune it to each index. We could also use XA(r,h)<0.5 but that will force too many iterations. The solution found was to terminate the loop when a) the loop "side" (addition/subtraction) changed, and b) r rounded to the nearest integer did not change as the side changed. This means the root is between the two previously tested squares, and equals r rounded to the nearest integer. This allows us to dismiss the epilogue code (which is not good in terms of speed, but does simplify the code), but does have a drawback — the minimum and maximum values for the n-bit range will cause an infinite loop because the loop will continuously approach them, but never leap over them. They should be treated as special cases.

With this we can now build the following C code:

#include <math.h>
#include <limits.h>
/* … */
#define  n  ((int) floor( log(ULONG_MAX) / log(2) ))
#define  i  /* define i as the root's index */
/* … */
register unsigned long int x, prev_r;
register int prev_side;
register double r, h, x2;

/* … */
r  = pow( 2, (double) n/i-1 );    /* 2^(n/i-1) */
x2 = (ULONG_MAX >> i) + 1;        /* r^i = 2^(n-i) */
/* note: re-code the above "r=" line so that its calculations
   are done either by you or by the compiler at compile time.
   smart compilers *may* already do this */
h = r;
prev_side = 0;    /* none; prev_r needn't be set-up because of this */
for(;;)    /* forever */
	h /= 2;
	if( x >= x2 )    /* x == x2 can go anywhere */
		if( prev_side == -1 )
			if( prev_r == ((int) r+.5) )
			prev_side = +1;
		prev_r = ((int) r+.5);
		x2 += XA(r,h);
		r += h;
		if( prev_side == +1 )
			if( prev_r == ((int) r+.5) )
			prev_side = -1;
		prev_r = ((int) r+.5);
		x2 += XS(r,h);
		r -= h;
r = (int) r+.5;

This code will only work on 0<x<ULONG_MAX (i.e., positive numbers except the special cases of 0 and ULONG_MAX). How efficient is the code? It depends on your ability to optimise code. Please take a look at the sequential search algorithm's Root of any Index section in Part I for more details and example. I will not expand this issue.

Further Work

The square root is often used in formulas (especially 3D formulas) scaled (multiplied) by a constant, or another value. Because the binary-search algorithm iterates through each bit of the result and therefore is similar to the operation of a dumb multiplication circuit, one could alter the first (or second) floating-point diagram shown so that a multiplication is done in parallel with the square root calculation, and an integer result is returned (or floating-point, if you model the circuit to it) with very little loss of accuracy (if any).

This is quite feasible if you think about it: as EBX is being built on any variation of the algorithm, the bit represented by ECX will always become set on EBX, regardless of whether you add or subtract ECX from it, but the bit prior to that (and only that bit) may be changed (if a subtraction was used) from 1 to 0. The parallel multiplication may then follow that prior bit in each cycle, instead of the bit in ECX. I will not get further into that, however — the process is straightforward and can lead to much simplified and accelerated 3D graphics circuits.

Another direction of work is in the development of a "Guess Search". This search came about due to the first two paragraphs that describe the binary search.

When you look through a dictionary, you don't open it at the middle at first, bit rather take an educated guess to where do you think the word will be. And then, from the comparison with the word you find there, you don't blindly cut one of the two split sections in half, but rather move into it some distance — again an educated guess.

But how well can you guess square roots? Well, as we have already seen, "eating away" half of the least significant digits in a number (the integer part of that half rounded down if the number of significant digits is odd) yields a pretty good approximation. This comes from

`sqrt(2^n) = 2^(n/2)`

So if you find an n such that

`2^(n-1) <= x < 2^n`

`n = "int"(log_2 x)+1`

where the logarithm is rounded down, then

`x / 2^(n/2)`

would be a good (and fast) approximation. The error is given by

`sqrt(x) - x / 2^(n/2)`

which has its minimum value at nearly

`sqrt(2^n) - 2^n / 2^(n/2) = 0`

as expected, and its maximum value at

`sqrt(2^(n-1)) - 2^(n-1) / 2^(n/2) = 2^((n-1)/2) - 2^((n-2)/2)`

which for 16-bit words that have the maximum n at 16 (this is a slightly different meaning of n than what we've been giving it so far), the maximum error is 54 (>21%); for 32-bit words, the maximum n is 32, and the maximum error is 13573 (<21%).

The following Intel APX/IA-32 (x86) assembly code sample returns the approximation in EBX, leaving EAX unchanged but destroying ECX.

	; …
	;                                       i486 cycles:
	bsr   ECX, EAX    ;  ECX = n-1                 6-103
	inc   ECX         ;  ECX = n                       1
	shr   ECX, 1      ;                                2
	mov   EBX, EAX    ;  2^(n-1) <= EAX,EBX < 2^n      1
	shr   EBX, CL     ;  EBX = EBX / 2^(n>>1)          2
	; …

Unfortunately, you can't just plug it in at the head of a sequential search algorithm just like this. The approximation prevented the sequential search algorithm from reducing EAX as it normally would. Adding it to the binary-search algorithm is also silly because its maximum error means that only one iteration of the algorithm would be saved, and too much time is taken preparing the approximation. Plus the only way to find this approximation's square (so as to use it in either of these algorithms) is to do a multiplication which is not all that fast. Further work on this issue is therefore left to the reader.


In terms of speed, the binary search is one of the best (or the best) algorithm to use in software on CPUs that only have integer operations, or to implement quite easily by hardware. In terms of size, the first sequential search described in Part I has no parallel.

These are not perhaps the best algorithms around, and am I am not (possibly) the first to develop them. I am unaware if their use anywhere, though, or of papers or books describing such algorithms as I describe in this document, but I have not done extensive search for this.

As for my contribution: I hope you enjoy it.

Photography by NASA