R&D: Square Root in Assembly | Part I

The purpose of this exercise was to create an algorithm to find integer square roots using only integer operations. This method was initially devised by me in 1990 for the ZX Spectrum +3 which had a Zilog Z80 (Intel APX 8080 compatible) processor which doesn't even have multiply or divide assembly instructions!

This document was started in 1994, before I founded Cynergi, but wasn't finished until July 2000. During the course of finishing the document, I found an improvement over the algorithm (the binary search method) that seems to be novel and if implemented in hardware, could perhaps generate 32-bit floating-point square roots in 25 clock cycles or less, which is even faster than the latest generation of Intel processors (the Itanium, on August 2000)! FPU (floating-point unit) emulators built into operating systems can also use this to implement a quick square root just about twice as slow as the hardware version (on an i486).

Given their simplicity and speed, both algorithms can also be used on

  • Embedded systems,
  • Windows CE machines,
  • 3rd-generation UMTS phones,
  • VRML browsers,
  • 3D graphics accelerators,
  • Wrist watch calculators,

and any other systems where memory is short, an FPU is not available and/or 3D graphics (which most often require square root calculations) are required. However, since the binary search algorithm is faster and cheaper than current square root implementations, it could replace them in every implementation.

I have split the document into two parts, but left it pretty much in the same item order as the ideas were developed. This helps to show the path of discovery. Part I deals with an empirical discovery of a sequential algorithm to find the square root, and its subsequent mathematical proof, and Part II will deal with a binary search variation that greatly improves the root's calculation speed (even though this sequential search algorithm is the smallest you'll find, if size is what matters to you). Both algorithms have the same accuracy.

The entire document consists of:

Part I

Part II

NOTES: The "official" address for this document is http://www.pedrofreire.com/sqrt. Please do not link directly to the address that shows up in your browser — this may be changed without notice. Also note that at the time I write this (August 2000) I am not aware of these algorithms being in use (at least the binary search version), but I cannot confirm this.

Empirical Discovery

Starting with the most used root (the square root or sqrt(x)), we can build a table of square roots (where r is the integer result):

x sqrt(x) r     x sqrt(x) r
0 0.00 0     20 4.47 4
1 1.00 1 21 4.58 5
2 1.41 1
3 1.73 2 30 5.48 5
4 2.00 2 31 5.57 6
5 2.24 2
6 2.45 2 42 6.48 6
7 2.65 3 43 6.56 7
12 3.46 3 56 7.48 7
13 3.61 4 57 7.55 8

From this table, you may notice that for successive numbers, the results repeat themselves with a pattern, namely

r no. of similar results     r no. of similar results
0 1     5 30-21+1 = 10
1 2 6 42-31+1 = 12
2 4 7 56-43+1 = 14
3 12-7+1 = 6 8 16
4 20-13+1 = 8 2r + (1 if r=0)

With this in mind, we can now easily build the following code in C

register unsigned int x, r;

/* … */
x = (x+1) >> 1;    /* init x */
for( r=0; x>r; x-=r++ );

which assumes x unsigned. Upon exit, r holds the integer root, and x holds some sort of remainder (this or any other remainder will not be explored in this document).

Let's try some examples to see if it works:

x     init x   for loop   r
0     0   0   0
1 1 1-0=1 1
2 1 1-0=1 1
3 2 2-0=2, 2-1=1 2
4 2 2-0=2, 2-1=1 2
5 3 3-0=3, 3-1=2 2
6 3 3-0=3, 3-1=2 2
7 4 4-0=4, 4-1=3, 3-2=1 3
12 6 6-0=6, 6-1=5, 5-2=3 3
13 7 7-0=7, 7-1=6, 6-2=4, 4-3=1 4
15 8 8-0=8, 8-1=7, 7-2=5, 5-3=2 4
144 72 72-0, 72-1, 71-2, 69-3, 66-4,
62-5, 57-6, 51-7, 44-8, 36-9,
27-10, 17-11=6

The maximum number of iterations of the for loop for n-bit words is the root of the largest number that can be coded in that many bits, or

`sqrt(2^n - 1) ~~ sqrt(2^n) = 2^(n/2)`

which, for some values of n, might prove the algorithm to be more efficient than integer-to-real conversion, followed by a floating-point square root and by real-to-integer conversion again. Let's find out if that is so.

Speed Comparison

An Intel APX/IA assembly optimisation of the loop could be (assuming AX=x and BX=r, both unsigned integers):

	; …
	inc   AX        ;  i486 cycles:    1
	shr   AX, 1     ;                  2
	mov   BX, -1    ;                  1
	inc   BX        ;                  1
	sub   AX, BX    ;                  1
	ja    loop1     ;                3/1

Upon exit, BX holds the result r, and AX some sort of remainder. The semicolons were simply used to make the comments valid.

The corresponding code using floating-point arithmetic would be something like

ctrlword    DW ?

	; Set rounding to nearest integer
	fstcw  ctrlword
	and    ctrlword, 1111001111111111b
	fldcw  ctrlword

	; …
	fild   WORD PTR [BX]    ;  i486 cycles:    13-16
	fsqrt                   ;                  83-87
	fistp  WORD PTR [BX]    ;                  29-34

assuming WORD PTR [BX] holds x. The result is returned on WORD PTR [BX] (we don't account for remainders in floating-point). 32-bit registers can be used in turn of their 16-bit counterparts in both pieces of code, without slowing them down.

There can be a mismatch between the result of this code and the algorithm's. If a result has a decimal part of exactly .5, the algorithm will always round the result up, whereas the i486's FPU (floating-point unit) will round it to the nearest even integer number. Both results are correct, however, and no integer number has a square root with a decimal part of .5.

To compare the speed of the two methods, we first show that for the algorithm (let's call it soft-sqrt), the clock count is

1+2+1+(1+1+3)(r+1)-2 = 7+5r

where r is again the calculated root. For the method using the FPU (let's call it hard-sqrt), you have

Best case:
13+83+29 = 125
Worst case:
16+87+34 = 137

Therefore, the highest root range that can be calculated faster by soft-sqrt is

`7+5r_min = 125`

`r_min = 23.6`

`x_min = r_min^2 = 556.96`


`7+5r_max = 137`

`r_max = 26`

`x_max = r_max^2 = 676`

This means that soft-sqrt is faster than hard-sqrt for values of x between 0 and 557~676. Similar calculations for other APX/IA (x86) processors yield similar results (the best ratio being of the i286 processor).

If you often need to calculate an integer root in the above range (which has n>9, i.e. approximately 9-bit numbers), then this algorithm is of greatest use for you — remember, only its high range values are as slow as the hardware floating-point solution! Better yet, the other previous processors don't have a built-in FPU so the job is probably being done by some sluggish software algorithm (sluggish since it has to take floating-point into account).

The major drawback of the method is that it lacks the accuracy you have with floating-point (it returns the result rounded to the nearest integer, but further operations with this result may yield a greater rounding error).

Mathematical Proof

The square of a number r is of course


and the square of its successor (next integer number) is

`(r+1)^2 = r^2 + 2r + 1`

that is, it is offset 2r+1 from the original r². This tells us how far apart (in terms of x, which = r²) will a new r+1 result be.

However, the r values you feed to this formula should be chosen with care. The formula works with real numbers, and my algorithm with integer numbers. My algorithm was devised to return the square root result rounded to the nearest integer, which means the switch from r = 0 to r = 1 for instance, will occur in real numbers when r = 0.5. This means we should apply 0.5 to the formula to find out how many r = 1 results exist, since all r = 0.5 to r = 1.49999… results will be rounded to the integer 1, and 0.5 is the lowest value in this range. You want to find out how far apart are (what is the offset between) r = 0.5 and r = 0.5+1.

So instead of applying

r = 0, 1, 2, 3, 4, 5, …

you should actually apply

r = 0, 0.5, 1.5, 2.5, 3.5, 4.5, …

since 0 is the lowest number (greater than or equal to 0) eligible to be rounded to 0, 0.5 to 1, 1.5 to 2 and so on. The above formula will thus yield

r     2r+1   x   sqrt(x)
0     1   0   0
0.5 2 1..2 1
1.5 4 3..6 2
2.5 6 7..12 3
3.5 8 13..20 4
4.5 10 21..30 5

which is obviously the same integer succession as shown in the initial examples. However, if you still doubt, then you should realise that

2×0+1 = 1   for the oddity, 0
(which gives us how many x values return r = 0)
2×0.5+1 = 2 for the first regular offset
(which gives us how many x values return r = 1)
(2(r+1)+1)-(2r+1) =
2r+2+1-2r-1 = 2
for the changing of the offset
(which gives us how many x values return r+1,
relative to how many x values return r)

as we assumed.


Root of any Index

Following the same reasoning as above, we could still resolve for roots of any positive integer index (say i). For that, let's make use of Newton's binomial theorem

`(r+1)^i = sum_(k=0)^i (i!r^(i-k)) / (k!(i-k)!)`

which reduces to

`(r+1)^2 = (2!r^2)/(0!*2!) + (2!r^1)/(1!*1!) + (2!r^0)/(2!*0!) = r^2 + 2r + 1`

for the special case of the square (i=2) root (and is equal to the formula we got above, verifying both). Since the summation index k=0 is

`(i!r^(i-0)) / (0!(i-0)!) = r^i`

the offset can then be generally described as

`(r+1)^i - r^i = sum_(k=1)^i (i!r^(i-k)) / (k!(i-k)!)`

and the changing of the offset, as

`(((r+1)+1)^i - (r+1)^i) - ((r+1)^i - r^i) = (r+2)^i - 2(r+1)^i + r^i`


`sum_(k=1)^i (i!(r+1)^(i-k)) / (k!(i-k)!) - sum_(k=1)^i (i!r^(i-k)) / (k!(i-k)!) = sum_(k=1)^(i-1) (i!((r+1)^(i-k)-r^(i-k))) / (k!(i-k)!)`

where the top summation index is now i-1 because when k=i,

`(i!((r+1)^(i-i)-r^(i-i))) / (i!(i-i)!) = 0`

Applying i=2 to either of these two formulas will return you the constant 2, verifying them and the results previously calculated.

Now you understand why did I start using Newton's binomial: even though the first of these two formulas doesn't show it, the second explicitly shows that the changing of the offset is never of a degree higher than i-2. Why i-2 and not i-1? Well, because if you take

`(r+1)^(i-k) - r^(i-k)`

of the formula with the summation, you'll see it resembles the offset formula, and its resolution shows that you'll lower the result by one more degree.

Since the algorithm explained above needs the changing of the offset to work, high exponents (degrees) means a slow algorithm on high values of i. But we can build generic C code for any root index:

register int x, s;    /* use double if needed */
register int r;

/* … */
r = 0;
if( x > 0 )
	s = FO(0.5);
	for( r=1; (x-=s)>0; r++ )
		s += FCO(R-0.5);

This code only works with positive numbers: if you want it to work with negative numbers just reverse the sign of r if x was negative upon entry, and work with abs(x). The operation x-=s prior to the comparison with 0 is the reason why x isn't unsigned. s and r aren't unsigned to avoid unnecessary conversions.

You should replace FO(0.5) with the result of the offset formula when you make r=0.5, and FCO(R-0.5) by the formula of the changing of the offset, replacing r in that formula by r-0.5. You should also make the variables x, and s double (real) instead of int (integer) if any of these replacements return non-integer numbers. This will slow down the code, but you'll have to optimise it manually, anyway. x should have the value to calculate the root for, upon reaching the code, and r will have the calculated root upon exit.

For example, for the square root, you would make FO(0.5) = 2 and FCO(R-0.5) = 2 since it is constant. x and s remain int.

How fast is this, in comparison with a hardware operation? Well, it depends on four factors:

  1. How fast can your CPU calculate that root in floating-point;
  2. Do you need doubles instead of ints;
  3. How fast is the integer multiply operation on your CPU;
  4. How well can you optimise the code.

The first point tells you if all this work is really worth-wile. On most CPUs, floating-point calculations take tens (if not hundreds) of clock cycles, and the only root they do is the square root (i=2). Other roots need to be calculated using logarithms, which makes that operation even slower and this algorithm even faster (relatively). Also, on high i values, this algorithm's code loop spans more numbers (in terms of x — this means that s grows faster), which means the code quickly gets to a solution. But if your CPU can achieve a fast root operation, all the work described here is useless — just use that operation!

The second point (using double instead of int) can be avoided if you know enough math and know how to optimise the code. For instance an FO(0.5) value of 3.25 can be transformed into 3, and a new counter be set-up to subtract 1 from x on every 4 iterations. Depending on how fast are floating-point (real) additions (and subtractions) on your CPU, this may or may not be worth it.

The third point determines how far up the ladder of i can you climb before the algorithm is no longer fast enough. Some CPUs execute an integer multiply operation in one clock cycle. For those, it really won't matter much what the degree of FCO(R-0.5) is, since you just need to make a sequence of multiplications, and these are cheap enough (in terms of speed) on such CPUs.

Finally, the last point is entirely up to you. For i=2, for instance, the above code turns to the code I showed at the beginning of this document — with no ifs and a very tight loop. Only experience can help you with this stage — it is not the point of this document to give you a series of optimised algorithms for several values of i, specially since optimisation should be done in assembly, and variations of this language abound. I will however focus on the cube root to prepare you for such analysis.

Cube Root

As an exercise and example, let's create a cube root algorithm. For the cube root, we have i=3. Let us apply this value to the offset formula:

`sum_(k=1)^3 (3!r^(3-k)) / (k!(3-k)!) = (3!r^2)/(1!*2!) + (3!r^1)/(2!*1!) + (3!r^0)/(3!*0!) = 3r^2 + 3r + 1`

When you make r=0.5, you get the value 3.25 from this formula. This means that FO(0.5) = 3.25. Now for the changing of the offset formula:

`sum_(k=1)^2 (3!((r+1)^(3-k)-r^(3-k))) / (k!(3-k)!) = (3!((r+1)^2-r^2))/(1!*2!) + (3!((r+1)^1-r^1))/(2!*1!) = 6r + 6`

Here, when you make r=r-0.5, you get 6(r-0.5)+6 = 6r+3. This means FCO(R-0.5) = 6r+3. This gives us the generic algorithm

register double x, s;    /* double *is* needed */
register int r;

/* … */
r = 0;
if( x > 0 )
	s = 3.25;
	for( r=1; (x-=s)>0; r++ )
		s += 6*r+3;

Again, this algorithm only works with positive numbers: if you want it to work with negative numbers just reverse the sign of r if x was negative upon entry, and work with abs(x). We had to use x and s as doubles because even though FCO(R-0.5) doesn't ever return a non-integer number, the start value FO(0.5) is not an integer. Therefore, our fist step in optimising this code is to get rid of the double values, making the entire loop integer-based.

Do note that the following steps only apply to this particular root, and may not apply to others. For those not familiar with C, also note that x-=s means x=x-s, and is a valid expression returning the result of the subtraction.

The .25 decimal part of s never changes: s is incremented by 6r+3, which always returns integers if fed with integer values of r (which it is). The decimal part present in the start value of s does have an impact on each iteration of x, however.

Since .25 = ¼, it turns 1 on every 4 iterations of the loop. This means that 1 should be subtracted from x when its decimal part is .00, prior to the subtraction of s. Why prior to the subtraction? Well, because we would need to check whether this decrement would make x<0, and that never occurs at that point, saving us another comparison.

The decimal part of x is .00 at the start of the first iteration of the loop and at the start of every 4 iterations after that. Let's take a look at an example that starts at x=20 and s=3.25.

iteration     after x-=s int(x-=s) int(x)-int(x-=s)
1     16.75 16 4
2 13.50 13 3
3 10.25 10 3
4 7.00 7 3
5 3.75 3 4
6 0.5 0 3

So you decrement x prior to the first iteration of the loop, and do that again prior to x-=s when the lower two bits of r equal a certain value. This eliminates the need for a counter — after all, r behaves much like a counter, and the lower two bits of a counter loop on every 4 counts, automatically. But which value to choose?

Well, the value for iteration in the table above is the same as r inside the loop (r is incremented at the end of the loop, after everything else). Here you can see that you need to decrement x prior to iteration 1 and prior to iteration 5. This means iterations 0 and 4, or when the lower 2 bits of r=00b.

That helps keep synchronised the values of s and x. Comparing both (in the subtraction x-=s) is now trivial except when they are equal in value. Unlike the code with double, we now cannot dismiss x-=s being equal to 0 as a sign for the loop to end, because even if their integer parts are equal, their decimal parts might not be, so we need a way to figure out what decimal part x had just after the subtraction. For that, let's take a better look at the loop iteration at that point:

lower 2 bits of r     x-int(x) >0?
01b     .75 yes
10b .50 yes
11b .25 yes
00b .00 no

This is valid for the first 4 iterations of the loop, and for every 4 after that. Now, as this table shows, only when the lower 2 bits of r are 00b would the loop actually end in that iteration, otherwise one more iteration would occur. So all the code has to do in such case (one more iteration) is increment r before returning.

Let's take a look at a different version of the code that takes all this into account:

register int x, s;
register int r;

/* … */
r = 0;
if( x > 0 )
	s = 3;
	for( r=1; (x-=s)>=0; r++ )
		if( x==0 )
			if( (r & 3) != 0 )
		s += 6*r + 3;
		if( (r & 3) == 0 )    /* 3 == 11b */

Now all operations are of type int, but the code still only works with positive numbers. There is a double comparison of x with 0, but that can be easily fixed in assembly code. What we could still improve is the multiplication: this can be removed, making the code faster. This can be done by setting up a counter that starts at 6 and increments by 6 on every iteration. It will therefore hold 6r since r is a counter that starts at 1 and increments by 1 on every iteration. This replaces a multiplication with an addition, a very welcome change.

One could also argument that the counter for r would no longer be necessary since we can determine r by dividing the multiplication counter by 6 when the loop ends. This is true, and you could still test bits 1 and 2 of this counter for 00b instead of testing bits 0 and 1 of r for the same value (I will not get into showing this). But the divide operation is still too expensive (24 to 40 clock cycles) versus what you save (1 clock cycle per loop iteration), if you want to use this "soft-cube-root" against a "hard-cube-root". On some CPUs (and generally, if you don't have a hardware FPU available) this may be worth it, however.

Since the C code for that last change is trivial, let's move straight to an assembly version of this algorithm (assuming AX=x and BX=r, both unsigned integers):

	; …
	mov   BX, 0      ;  BX = r         i486 cycles:     1
	sub   AX, 1      ;                                  1
	jb    done       ;                                3/1
	mov   CX, 3      ;  CX = s                          1
	mov   DX, 0      ;  DX = multiplication counter     1
	inc   BX         ;                                  1
	sub   AX, CX     ;                                  1
	jb    done       ;                                3/1
	je    equal      ;                                3/1
	add   DX, 6      ;                                  1
	add   CX, DX     ;                                  1
	add   CX, 3      ;                                  1
	test  BX, 11b    ;                                  1
	jnz   loop1      ;                                3/1
	dec   AX         ;                                  1
	jmp   loop1      ;                                  3
	test  BX, 11b    ;                                  1
	jz    done       ;                                3/1
	inc   BX         ;                                  1

Upon exit, BX holds the result, r. This code is optimised for speed on the i486, and not size. Also, the test operations are not available on processors older than the i386, which would mean you would have to find an alternative using a temporary copy of BX, and using and.

Now let's write the floating-point code for the same processor. The Intel APX/IA-16/32 family of processors (i.e., the x86 up to and including the Pentium-III) does not have a hardware operation for the root of any index. We could then, use

`root(i) x = x^(1/i)`

but they don't even have an operation for raising any base to any power! They do have select logarithms and select bases raised to powers, so you can use the rule

`x = b^(log_(b) x)`

to build the formula

`x^(1/i) = b^(log_(b) x^(1/i))`

`x^(1/i) = b^(1/i log_(b) x)`

which returns the root of index i of x, given any real base, b.

Computers work in number base 2, so it makes sense that the logarithms and powers they have are of base 2. This is true for the i486 (and therefore we make b=2). The i486 has the following relevant operations:

`x*2^y,` `|y| in NN_(0)`
`2^x - 1,` `-1 <= x <= 1`
`y * log_(2) x,` `x in RR_(0)^(+)`
`y * log_(2) (x+1),` `0 <= |x| < (2-sqrt(2))/2`

Each line is a CPU operation: this operation performs more than one mathematical operation at once, which may or may not help us. Some are limited in range (the variables that are not listed in the limitations, are not limited). One thing steps out in plain view — there is no operation to raise 2 (or any other base, for that matter) to an arbitrary power, and we need this! Fortunately you can use

`b^m * b^n = b^(m+n)`

to solve that problem using the first two operations shown above. For that you make m=int(z) (being z the power you want to raise 2 to), and n=z-int(z). You give these values to the powers of each of the first two operations shown above, respectively, and that's it! Since int(z) is always an integer and z-int(z) is always within -1 and 1 (exclusively), regardless of the rounding method used, this is the final link to translate our root to:

`z = 1/i * log_(2) x`

`r = ((2^(z-"int"(z))-1)+1) * 2^("int"(z))`

We will be using rounding to the nearest integer in the code, because we need that type of rounding for the final result, and don't want to waste time changing the rounding type in the middle of a calculation. This means z-int(z) will actually be within -0.5 and 0.5 (inclusively). The final code then looks like:

ctrlword    DW ?
inv_i       REAL8 1/i    ; replace i when assembling
one         REAL8 1

	; Set rounding to nearest integer
	fstcw  ctrlword
	and    ctrlword, 1111001111111111b
	fldcw  ctrlword

	; …
	fld    inv_i            ;                   i486 cycles:       3
	fild   WORD PTR [BX]    ;                                 13- 16
	fyl2x                   ;  ST(1) = (1/i) * log2 [BX]     196-329
	fld    ST               ;  dup ST: ST(1) now = ST              4
	frndint                 ;  ST = int(ST)                   21- 30
	fxch                    ;  ST <-> ST(1)                        4
	fsub   ST, ST(1)        ;  ST = ST - ST(1)                 8- 20 (*)
	f2xm1                   ;  ST = 2^ST - 1                 140-279
	fadd   one              ;  ST++                            8- 20
	fscale                  ;  ST = ST * 2^ST(1)              30- 32
	fistp  WORD PTR [BX]    ;                                 29- 34
	ffree  ST               ;  (see next line)                     3
	fincstp                 ;  pop the old ST(1)                   3
	;                 (*) at this point (after the fsub):
	;                     if z = (1/i) * log2 [BX], then
	;                        ST(1) = int(z)
	;                        ST    = z - int(z)
	;                        ST is within [-0.5 .. 0.5]

The operation f2xm1 will only work on the range [0 .. 0.5] on an 8087 FPU. In this processor you would need to verify if ST was negative. If so, you would feed its absolute value to f2xm1 and find the inverse of the result. The last two operations are required to remove from the stack a temporary calculation value that is not removed by the last operation, fscale. If you ignore the warnings the FPU can give you about the stack being full, then you can ignore the operations and save a little time.

The "fun" part about this code is that it actually raises the integer value stored on WORD PTR [BX] to the real power stored in inv_i. If you stored in the latter location the value of 8.25, for instance, this would still be the optimum code to raise an integer in WORD PTR [BX] to that power. To calculate a root of index i, you must store the inverse of i (as shown) in that location, prior to the first execution of the code. This can be done by the assembler itself. This also means that this same code applies to all roots other than the square root.

Let us now compare the speed of the algorithm versus this hardware-based code.

We first show that for the soft-cube-root, the clock count is 5 if r=0, or

5×1+(8×1+3)(r-1)+(-2+1+3)(¼)(r-1)+5 = 11.5r - 1.5

if r>0, where r is again the calculated root. The clock count of the last 3 operations (between labels "equal" and "done"), occurring on ¼ (25%) of the possible final values for r, or every

`(1/4 * sqrt(2^n))/(2^n) = 2^(-n/2-2)`

random calculations of cube root of n-bit numbers, is negligible and therefore is not added to this formula (e.g.: for n=16 you get those operations being run less than 0.1% of the time, and the value to be added to this formula would be less than 0.004). For the hard-cube-root, you have

Best case:
3+13+196+…+29+3+3 = 462
Worst case:
3+16+329+…+34+3+3 = 777

Therefore, the highest root range that can be calculated faster by soft-cube-root is

`11.5*r_min - 1.5 = 462`

`r_min ~~ 40.3`

`x_min = r_min^2 = 1624.09`


`11.5*r_max - 1.5 = 777`

`r_max ~~ 67.7`

`x_max = r_max^2 = 4583.29`

This means that soft-cube-root is faster than hard-cube-root for values of x between 0 and 1624~4583. Similar calculations for other APX/IA (x86) processors yield similar results (the best ratio being of the i286 processor).

If you often need to calculate an integer root in the above range (which has n>10, i.e. approximately 10 to 13-bit numbers), then you need this algorithm. It is a much more appealing option than even the soft-sqrt algorithm was for square roots!

Further Work

This work can still be extended in at least two areas: precision and speed.

In terms of precision, you can figure out the math to somehow make use of the returned values (other than r) as remainders, so you can do more precise calculations. The application of this is limited, though.

You can also fine-tune the algorithm so that when the decimal part of the real result r were .5, it rounds the result to the nearest even number, and not always up as it does now. Like previously stated, the square roots of integer numbers never have decimal parts of precisely .5, so this would be nothing more than an exercise for soft-sqrt. For other roots, such accuracy is in my opinion pointless since you're dealing with integer numbers after all and precision has gone out the window, but it would make the algorithm perhaps a bit more precise.

In terms of speed, you can always try and use a binary search (the type of search you use when you're looking for a word in a dictionary) if it proves quick enough to find the roots, or simply reverse the search direction for the root (starting at the maximum possible values for x and r and working your way backwards). This would mean that the bigger roots are searched for first, so a greater number of values of x are covered in the first few iterations, and on average a root is found sooner. The worst case is still a full search, though.

For instance, an Intel APX/IA assembly optimisation of this variation could be (assuming AX=x and BX=r, both unsigned integers, and n=16, where n stands for the number of bits being calculated):

	; …
	not   AX         ;  AX = (2^n)-1-AX       i486 cycles:   1
	mov   BX, 256    ;  BX = 2^(n/2)                         1
	sub   AX, 254    ;  AX = AX - ( (2^(n/2))-2 )            1
	jbe   done       ;                                     3/1
	inc   AX         ;                                       1
	shr   AX, 1      ;                                       2
	dec   BX         ;                                       1
	sub   AX, BX     ;                                       1
	ja    loop1      ;                                     3/1

For n=16, this algorithm has a clock count of 6 if r=256, or

7+5×(256-r)-2 = 1285-5r

if r<256, which means it can calculate square roots resulting 230~232 to 256 during the same time it takes to calculate hard-sqrt. This is a span of 12636 values of x (19.3%), compared to only 677 (1%) for the previously shown algorithm. The worst case is identical in both, however (or slightly slower in this one, but the difference is fixed and not proportional to r).

We will also take a look at the binary search variation in Part II of this document, but other work is left for the reader to do.

Photography by NASA on Unsplash