**Comparing basic coding**

*Note that the code here does not necessarily write properly to the screen - the code is intended for speed tests only*

FPU v CPU speeds @ 64 bits

These test compare the speed of various actions using the floating point instructions and also compares the same actions using the ordinary registers. In the early years when the floating point (maths co-processor) was first introduced, it was rather slow but it seems to be speeding up with each new processor.

Dividing a 64 bit number

Converting 64 bits to ascii

Converting 64 bits to ascii using the CPU and DIV

Convert 64 bits to ascii using the CPU bit shifting and bcd addition

Convert 64 bits to ascii using
the FPU instruction FPREM

Convert 64 bits to ascii using
the FPU instruction FRNDINT

**Dividing a 64 bit number**

The first two tests compare the speed of dividing a 64 bit number by 10 using the ordinary registers against the same action using the floating point registers. Here is the coding of these two tests starting with the CPU coding and then the FPU coding:-

Divide 64 bits in edx:eax by 10 (CPU method using DIV)

MOV EDI,500 ;do this 500 times L1: MOV ESI,EAX ;keep low order part in esi MOV EAX,EDX ;ready to divide high order part XOR EDX,EDX MOV ECX,10D DIV ECX ;divide-remainder now in edx XCHG EAX,ESI ;get low order part in eax, quotient in esi DIV ECX MOV EBX,EDX ;remainder now in ebx MOV EDX,ESI ;high order quotient in edx, low in eax DEC EDI JNZ L1

Divide 64 bits by 10 using FIDIV

FINIT ;initialise fpu FLDCW W[INIT] ;0F3Fh initialise precision=64 bits rounding=0 MOV ECX,500 ;do this 500 times L1: FLD T[MEM_64] ;load 64 bit number (in 80 bit real format) FIDIV W[TEN] ;divide by 10 FISTP Q[RESULT] ;get 64 bit long integer result DEC ECX JNZ L1

**
Converting 64 bits to ascii**

The next eight tests compare the speed of various methods of converting a 64 bit number to ascii so that it can be displayed to the user as a decimal number. In these tests the actual screen-write is not carried out, and only the basic coding is compared to avoid obscuring the differences in speed. One way to achieve the conversion to ascii using the CPU is to divide the number by 10 and write the remainder to the screen, then continue this until all 64 bits have been dealt with. Another method described below is to use bcd (binary coded decimal) numbers. The speed tests are carried out both with small and with large numbers. In these tests the small number is 128 (in floating point terms this is a mantissa of +8000000000000000 and an exponent of 8); and the large number is 1234567890123456789 when using the CPU (for the FPU a real number of 1234567890123456789012345.0E-6 is used which gives a mantissa of 891087A3EF4C0860 and an exponent of 61). The CPU DIV
test also drops out after 20 digits have been written. The actual coding of the main part of the CPU DIV test is as follows (where the number to be dealt with is in edx:eax, where ECX is 19):-

Convert 64 bits to ASCII using the CPU and DIV

MOV EDI,10 ;ready to divide by 10 L4: MOV ESI,EAX ;keep low order part in esi MOV EAX,EDX ;ready to divide high order part XOR EDX,EDX DIV EDI ;rem now in edx ready for 2nd div XCHG EAX,ESI ;restore low order part in eax DIV EDI MOV [MEM+ECX],DL ;write remainder to memory MOV EDX,ESI ;high order now in edx, low in eax DEC ECX ;see if 20 digits written already JS >L5 ;yes OR EAX,EAX ;see if number is zero (low) JNZ L4 ;no OR EDX,EDX ;see if number is zero (high) JNZ L4 ;no L5:

**Convert 64 bits to ASCII using the CPU bit shifting and bcd addition.**

The conversion from a 64 bit binary number to ascii is done here in an entirely different way from the divide by 10 method. This method looks at each bit in turn and if the bit is "set" (on) the value represented by that bit is added to a "result" buffer in memory. So for example we know that bit 0 has a value of 1 (decimal), and bit 2 has a value of 4 decimal. If bits 0 to 2 are looked at and only bits 0 and 2 are set these two values would be added together in the result buffer making 5 so far. After all 64 bits have been dealt with in this way the actual value in the result buffer will be a representation of the 64 bit number we started with. The tricky part is to arrange for the numbers in the result buffer to be a string of decimal digits which can then be written to the screen.

This is where the binary coded decimal (bcd) numbers come in. Continuing with my example of addition, bit 3 has a value of 8 decimal and if this is also set, you will have to add 8 and 5 making 13 decimal. What we need to do here is to ensure that this value of 13 decimal is drawn out as 1 then 3 decimal, rather than 13. Bcd numbers are digits with a value of zero to 9 only. There are a series of assembler instructions which use bcd numbers and these ensure that the limit of 9 is maintained in the register being used. For example, if you use ADD AL,AH to add 8 (in ah) to 5 (in al) this will result in a value of 13 decimal in al. However, following the instruction with the special bcd instruction AAA will reduce the value in al to 3 and set the carry flag to signify that there is a carry. Using the bcd instruction DAA allows you to use two bcd digits per byte (hence these are called "packed" bcd numbers). There is one in the first 4 bits (the first nibble) and one in the second 4 bits.

In the coding which follows, two buffers are set up. On is the result buffer, which is held in the first 512 bytes of the memory area. The other is the "2* buffer" which holds the value of the bit currently being looked at. This buffer starts off with the value of 1. This is when bit 0 is being considered. The value in this buffer has to be doubled ready for the next bit to be considered, so it increases as follows: 1 then 2 then 4 then 8 then 1,6 then 3,2 then 6,4 then 1,2,8 and so on. So at any one time the 2* buffer contains a representation of the value of the bit in the 64-bit number which is currently being considered. If a bit in the 64-bit number is set then each "active" nibble in the 2* buffer needs to be added to the corresponding nibble in the results buffer (using bcd instructions), and the carry if present passed to the next digit on the left. At the end of the procedure the result buffer contains a representation of the actual value in the original 64-bit number. All that remains is to take each nibble in turn to write the result to the screen.

Although this method is usually slower than the others it is capable of very accurate results since no rounding occurs whatsover (unlike when using divide), and it can be extended well beyond 64 bits.

The coding is in two parts, firstly to set things up (where the 64 bit number to convert is in edx:eax):-

Setting up the buffers and feeding blocks of 32 bits to the shift and bcd addition procedure

PUSH EDX,EAX ;save 64-bit value MOV EDI,[MEMAREA] ;get memory area to use XOR EAX,EAX MOV ECX,128 REP STOSD ;fill result buffer with zeroes MOV ECX,128 REP STOSD ;fill 2* buffer with zeroes DEC EDI ;back to end of 2* buffer MOV B[EDI],1 ;insert starter there MOV [BUFF],EDI ;keep end of active 2* buffer DEC EDI MOV EBP,EDI ;keep buffer stopper in ebp POP EDX ;get first 32 bits to look at MOV ECX,32 ;ready to look at 32 bits L6: CALL DOABIT ;do one bit only DEC ECX ;see if any more to do JNZ L6 ;yes POP EDX ;get next 32 bits to look at OR EDX,EDX ;see if any are set JZ >L10 ;no, so may finish MOV ECX,32 ;ready to look at 32 bits L8: CALL DOABIT ;do one bit only DEC ECX ;see if any more to do JNZ L8 ;yes L10:This is now the DOABIT procedure. Look at each bit. If set, add value in 2* buffer to result buffer. Then in any case double the value of the 2* buffer.

DOABIT: SHR EDX,1 ;see if bit is set and get to next JNC >L26 ;no, so just double the 2* buffer ;***************** add 2* buffer to result buffer MOV ESI,[BUFF] ;get active end of 2* buffer MOV EDI,ESI SUB EDI,512 ;get to corresp active end of result buffer XOR AH,AH L21: MOV AL,[EDI] ;get current value in result buffer ADD AL,AH ;add previous carry if any ADD AL,[ESI] ;add the 2* value DAA ;adjust to correct to BCD JZ >L23 ;al is zero - so may be finished L22: MOV AH,0 ;no effect on carry flag MOV [EDI],AL ;put value back into result buffer DEC EDI,ESI ;no effect on carry flag JNC L21 INC AH ;give effect to carry flag JMP L21 L23: ;may be finished JC L22 ;if c not all done (yes! 50+50 would do this) CMP ESI,EBP ;see if stopper in 2* buffer reached yet JA L22 ;no, return to loop nc MOV EBX,EDI ;keep earliest current place in result buffer ;******************* double the 2* buffer L26: MOV EDI,[BUFF] ;get active end of 2* buffer XOR AH,AH L28: MOV AL,[EDI] ;get current value ADD AH,AL ;add previous carry if any ADD AL,AH ;double the value, add carry too DAA ;adjust to correct to BCD JZ >L32 ;zero reached - possibly all done L230: MOV AH,0 ;no effect on carry flag MOV [EDI],AL ;put value back into buffer DEC EDI ;no effect on carry flag JNC L28 INC AH ;give effect to carry JMP L28 L32: ;possibly all done JC L30 ;if c not all done (yes! 50+50 would do this) CMP EDI,EBP ;see if reached stopper in 2* buffer yet JA L30 ;no, return to loop nc JZ >L34 ;if same, stopper need not move DEC EBP ;move back stopper by one MOV EAX,[BUFF] ;get end of active buffer so far SUB EAX,EBP ;get distance between them now CMP EAX,42 ;see if 42 done so far (83/4 bcd digits) JNA >L34 ;no, max required precision not reached DEC D[BUFF] ;bring active end of buffer back one L34: RET

**
Convert 64 bits to ASCII using the FPU instruction FPREM**

Since FPREM computes the remainder after a pretended
ST0/ST1 (that is, contents of floating point stack register 0 divided by stack
register 1), it would appear to be ideal to produce an ascii string from
a large number. The idea would be to use FPREM to pretend to divide the number by 10 and get the remainder, put that in the ascii string, then
actually carry out the division by 10 and continue carrying out these
actions until the number goes to less than one. Unfortunately in practice
some complicated code is required to make this work properly and this also
slows the whole process down.

One complication is that because of the way FPREM works, it is not
necessarily a one-shot instruction. FPREM computes the remainder by
carrying out successive subtractions of the divisor. If the difference between the exponents of the dividend and the divisor is less than 64 FPREM needs only one shot, but if the difference is higher than this, it needs to be called more than once. FPREM indicates
that it has not finished by setting the C2 condition code. In the example
below, lines 7 to 11 get the C2 status word into AX after each FPREM
operation and if the C2 condition code is set execution loops back to do
another FPREM.

After the remainder has been calculated it is necessary to get it
out of the floating point register and into memory. This is done in
line 13 using the instruction FISTP with a dword destination. Now comes
the second complication. FPREM did not perform as expected, and it was
found to produce a negative result sometimes. So for example, if the
remainder should be 7, the result might be minus 3 (0FFFFFFFDh), rather than 7. This is dealt with in lines 15 to 18 by testing for the negative and adding 10
if found.

Before the FISTP operation the precision
flag is changed to *round down* (line 12). This is to ensure that another of FPREM's strange results, a *negative zero* becomes -1 after the FISTP and then is changed to 9 by the test carried out in lines 15 to 18. Rounding is changed back to *round near* immediately afterwards. The precision and rounding values are words in data initialised to the correct value.

The next step is to divide the number at the top of the stack
using FDIV (line 19). This is an actual divide instead of a pretended
one as before. Note FDIV simply throws away the remainder, which is why
we are trying FPREM in the first place!

Then lines 20 to 25 check whether the number has yet gone below 1. If
not, the process must be continued until it has, or until 20 digits have been
done.

This procedure is accurate at least to 20 decimal digits, which
represents the accuracy of the 64 bit floating point register.
Beyond that the FDIV starts to lose precision. *Round near* reduces these
inaccuracies. FDIV sets the precision flag if precision has been lost.

In this example I used the instruction FPREM1 which superceded
FPREM after the IEEE finalised the floating point standard. Before any of
this code is executed (as is required in all cases) the instruction FINIT
should be executed and the precision and rounding flags should be set (in
this case) to 64-bit precision and round near. Note also that the
data "TEN" is a word initialised to the value of 10. This is pushed on
the first floating point register ready for each divide by 10 (line 5).

1. MOV ECX,20 ;do 20 digits only (max resolution anyway) 2. MOV ESI,ADDR DWORDNO ;data in memory declared as dword 3. FLD Q[QWORDNO] ;load 64 bit format number to convert 4. L30: ;(number is always at the top of the stack) 5. FILD W[TEN] ;make ST0=10 and PUSH: number now pushed to ST1 6. FLD 1 ;PUSH copy of ST1 at ST0: now ST1=10, ST0=ST2 7. L34: 8. FPREM1 ;get remainder following pretended ST0/ST1 9. FSTSW AX ;get status word into ax 10. TEST AH,4h ;check C2 condition code-see if FPREM1 done 11. JNZ L34 ;no, so need to continue 12. FLDCW W[PREC64_DOWN] ;073Fh=precision=64 bits, rounding=down 13. FISTP D[ESI] ;convert remainder to integer and POP 14. FLDCW W[PREC64_NEAR] ;033Fh=precision=64 bits, rounding=nearest 15. TEST D[ESI],80000000h ;see if negative number 16. JZ >L38 ;no 17. ADD D[ESI],10 18. L38: 19. FDIV ;ST1=ST1/ST0 then POP leaving result at top of stack 20. DEC ECX 21. JZ >L39 22. FICOM W[ONE] ;compare stack top with 1 23. FSTSW AX ;get status word 24. TEST AH,45h ;check C0,C2,C3 condition code - see if number less than 1 25. JZ L30 ;no, need to continue 26. L39: 27. FFREE 0 ;free last stack element 28. FINCSTP ;and increment stack pointer

**
Convert 64 bits to ASCII using the FPU instruction FRNDINT**

This conversion method is to divide the number to be converted by 10,
convert the result of this division to an integer using FRNDINT (ie.
remove the fractional part of the result) then multiply that integer
by 10. Then subtract that from the original number. This produces the
equivalent of the "remainder" of a division of the original number by 10.
This is quite a precise method although not as precise as the method using
FPREM, but it is about 20% quicker. Of course (as before) the
processor must be initialised using FINIT and the precision and rounding
flags must be set.

MOV EDI,ADDR BUFFER MOV ECX,20 ;20 digits only MOV ESI,ADDR DWORDNO FLD Q[QWORDNO] ;load 64 bit format number to convert L30: ;(number is always at the top of the stack) FLD 0 ;PUSH copy of number at top of stack now ST0=ST1 FILD W[TEN] ;ST0=10 and PUSH: number now pushed to ST1 & ST2 FDIV 2,0 ;ST2=ST2/ST0 divided value now at ST2, orig no. at ST1 FLD 2 ;make copy at stack top of divided value FRNDINT ;round stack top up to an integer FMUL ;ST0*ST1 (div value*10) leaving result on stack top FSUB ;subtract mul result from div res result in stack top FISTP D[ESI] ;convert remainder to integer and POP DEC ECX ;see if max digits reached JZ >L36 ;yes - finish FICOM W[ONE] ;compare stack top with 1 FSTSW AX ;get status word TEST AH,45h ;check C0,C2,C3 condition code - see if number less than 1 JZ L30 ;no, need to continue L36: FFREE 0 ;free last stack element FINCSTP ;and increment stack pointer