FUJITSU

2. How to improve code efficiency

2.1 Using the 20bit address mode

2.2 Option specification at using division instruction (div step)

2.3 Adjust number of local variable in order to not exceed 512 bytes for number of stack use of function

2.4 Avoid to use a lot of signed 1 byte/2 byte data

2.5 Control of loop-unrolling optimization

2.6 Review of necessity for inline expansion

2.7 Control of standard library expansion

2.8 Others


2.1 Using the 20bit address mode

In general, the FR performs an operation in the following three steps.

  1. Set the memory address in a register
  2. Load the data into a register
  3. Perform the operation

In particular, a program that uses many external variables can become large, because it needs many instructions that load a 32-bit address.

[C source]    [In case of FR]
a=b+c;        LDI:32  #_b, R12
              LD      @R12, R0
              LDI:32  #_c, R12
              LD      @R12, R1
              ADD     R1, R0
              LDI:32  #_a, R12
              ST      R0, @R12

Therefore, when the code or data can be located in RAM/ROM within the 20-bit address space (0x0 to 0xFFFFF), using the 20-bit address mode (-Kshortaddress option) is recommended. If that location is not possible, replace external variables with local variables where possible.

[C source]    [default]               [-Kshortaddress specified]
a=b+c;        LDI:32  #_b, R12        LDI:20  #_b, R12
              LD      @R12, R0        LD      @R12, R0
              LDI:32  #_c, R12        LDI:20  #_c, R12
              LD      @R12, R1        LD      @R12, R1
              ADD     R1, R0          ADD     R1, R0
              LDI:32  #_a, R12        LDI:20  #_a, R12
              ST      R0, @R12        ST      R0, @R12
              ----------------        ----------------
              26 bytes                20 bytes
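
Where the 20-bit address mode is not available, the fallback suggested above (replacing external variables with local variables) might look like the following sketch. The names samples, total, sum_direct and sum_local are hypothetical. Whether the compiler already keeps total in a register depends on the optimization level, so this is an illustration and not a measured result.

```c
#define N_SAMPLES 8

int samples[N_SAMPLES];   /* external data read by both versions */
int total;                /* external variable updated by both versions */

/* Every statement names the external variable 'total' directly, so the
   compiler may have to load its address and access memory on each pass. */
void sum_direct(void)
{
    int i;
    total = 0;
    for (i = 0; i < N_SAMPLES; i++) {
        total += samples[i];
    }
}

/* The running sum is kept in a local variable and the external variable
   is written only once, at the end. */
void sum_local(void)
{
    int i;
    int t = 0;
    for (i = 0; i < N_SAMPLES; i++) {
        t += samples[i];
    }
    total = t;
}
```

Both functions leave the same value in total; only the number of references to the external variable differs.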

2.2 Option specification at using division instruction (div step)

The FR has division step (div step) instructions for division. When these instructions are used, however, each division takes 36 instructions, so every division generates more than 72 bytes of code.
By default, the compiler generates code that calls a library routine to perform division. Therefore, when a program contains several divisions, the default setting produces smaller code.
However, when speed-priority optimization (-Kspeed) is specified, each division is expanded directly into div step instructions. If the resulting increase in code size is a problem, it is recommended not to specify speed-priority optimization.

[C source]    [In case of speed priority]   [default]
a=b/c;        LDI:20  #_b, R12              LDI:20  #_b, R12
              LD      @R12, R0              LD      @R12, R4
              LDI:20  #_c, R12              LDI:20  #_c, R12
              LD      @R12, R1              LD      @R12, R5
              MOV     R0, MDL               CALL20  _divi, R12
              DIV0S   R1                    LDI:20  #_a, R12
              DIV1    R1                    ST      R4, @R12
              DIV1    R1
              DIV1    R1
              DIV1    R1
              DIV1    R1
              -----------------------       -----------------------
              74 bytes                      20 bytes*

*: The divi function (78 bytes) is generated separately.
(Because the divi function is shared as a library routine, code size can be expected to decrease when a program executes several divisions.)
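
The figures in the listing above allow a rough break-even estimate. The sketch below assumes that every inline division costs the 74 bytes shown, that every library call site costs the 20 bytes shown, and that the 78-byte divi routine is linked once; real sizes depend on the surrounding code. The function names are hypothetical.

```c
/* Estimated bytes for n divisions expanded inline (-Kspeed),
   assuming 74 bytes per division as in the listing. */
unsigned long inline_division_bytes(unsigned long n)
{
    return 74UL * n;
}

/* Estimated bytes for n divisions that call the shared divi routine
   (default), assuming 20 bytes per call site plus 78 bytes for divi. */
unsigned long library_division_bytes(unsigned long n)
{
    return (n == 0UL) ? 0UL : 20UL * n + 78UL;
}
```

Under these assumptions a single division costs 74 bytes inline against 98 bytes with the library, while two divisions cost 148 bytes against 118 bytes, so the library form is smaller from the second division onward.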


2.3 Adjust number of local variable in order to not exceed 512 bytes for number of stack use of function

LD/ST instructions can use FP-relative addressing. However, because of the 16-bit instruction length, the offset that can be specified is limited to -512 to +508 (for 4-byte types). Therefore, when a function uses a local variable area larger than 512 bytes, extra operations are needed to calculate stack addresses, which increases code size and lowers access efficiency.
Adjusting the number of local variables so that the stack use of a function does not exceed 512 bytes therefore reduces code size and improves access efficiency.

The stack use of each function can be checked with SOFTUNE C/C++ Analyzer.

(Note) When a local variable is a 2-byte or 1-byte type, the offset that can be specified is -256 to +254 or -128 to +127, respectively. The size of the local variable area for which efficient code can be generated therefore differs by type.

[C source]    [In case of -520 for offset]                [In case of -4 for offset]
              (when using a larger size than the above)   (when using a size within the above)
a=10;         LDI     #10, R0                             LDI     #10, R0
              LDI     #-520, R13                          ST      R0, @(FP,-4)
              ST      R0, @(R13,FP)
              -------------------------                   -------------------------
              8 bytes                                     4 bytes
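
The three offset ranges quoted in this section are consistent with a signed 8-bit displacement scaled by the access size. That reading is an assumption made for this sketch, not a statement from the text. The hypothetical helper fits_fp_offset checks whether an FP-relative offset falls inside the quoted range for a 1-, 2- or 4-byte access.

```c
/* Returns 1 if 'offset' is within the FP-relative range quoted in the text
   for an access of 'size' bytes (1, 2 or 4), otherwise 0.
   Assumption: the range is -128*size .. +127*size in steps of 'size'. */
int fits_fp_offset(long offset, int size)
{
    if (size != 1 && size != 2 && size != 4) {
        return 0;
    }
    if (offset % size != 0) {
        return 0;               /* offset must be a multiple of the access size */
    }
    return offset >= -128L * size && offset <= 127L * size;
}
```

For the two cases in the listing, fits_fp_offset(-4, 4) holds and fits_fp_offset(-520, 4) does not, which matches the extra LDI instruction in the -520 case.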

2.4 Avoid to use a lot of signed 1 byte/2 byte data

The FR architecture has no load instruction that sign-extends data. Therefore, when signed 1-byte or 2-byte data is loaded, sign extension is needed after the load. When a lot of signed 1-byte/2-byte data is used, code size is larger than with unsigned data.
Using unsigned types wherever possible therefore reduces code size and improves access efficiency.

(Note) The Softune compiler treats the char type as unsigned char, so the char type can be used as it is.

[C source]    [In case of signed char type]   [In case of char type]
a=b+c;        LDI:20  #_b, R12                LDI:20  #_b, R12
              LDUB    @R12, R0                LDUB    @R12, R0
              EXTSB   R0                      LDI:20  #_c, R12
              LDI:20  #_c, R12                LDUB    @R12, R1
              LDUB    @R12, R1                ADD     R1, R0
              EXTSB   R1                      LDI:20  #_a, R12
              ADD     R1, R0                  STB     R0, @R12
              LDI:20  #_a, R12
              STB     R0, @R12
              ----------------------          ----------------------
              24 bytes                        20 bytes
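
In source code, this advice amounts to declaring small non-negative quantities with unsigned types. The sketch below uses hypothetical names (u8, add_u8, add_s8, plain_char_is_unsigned). The last helper reports how the compiler in use treats plain char, since standard C leaves that choice to the implementation.

```c
#include <limits.h>

typedef unsigned char u8;   /* hypothetical shorthand for a 1-byte unsigned type */

/* Unsigned operands: on the FR, per the text, no sign extension is needed
   after loading them. */
u8 add_u8(u8 b, u8 c)
{
    return (u8)(b + c);     /* result wraps modulo 256 */
}

/* Signed operands: on the FR, per the text, each one is sign-extended
   after it is loaded. */
signed char add_s8(signed char b, signed char c)
{
    return (signed char)(b + c);
}

/* Returns 1 when plain char is unsigned on this compiler
   (the text states that this is the case for Softune). */
int plain_char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}
```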

2.5 Control of loop-unrolling optimization

Loop-unrolling optimization improves execution speed by reducing the number of loop iterations, but it increases object size.
Review how the code is written according to whether speed or code size has priority.

[Before unrolling]
for(i=0;i<6;i++){ a[i]=0;}
[After unrolling]
for(i=0;i<6;i+=3){
    a[i]=0;
    a[i+1]=0;
    a[i+2]=0;
}

Even when the source is written as in [Before unrolling] above, code size becomes larger if unrolling control is not specified, because the compiler may unroll the loop itself. To have the compiler generate code suited to code size, specify size-priority optimization (-Ksize) or loop-unrolling control (-Knounroll).

[C source]
for(i=0;i<6;i++){a[i]=0;}
[Loop unrolling optimization]         [Unrolling suppressed]
        LDI:20  #_a, R6                       LDI     #0, R4
        LDI     #0, R4                L_26:   LDI     #0, R0
        LDI     #2, R5                        LDI:20  #_a, R13
L_32:   LDI     #0, R0                        STB     R0, @(R13,R4)
        MOV     R4, R13                       ADD     #1, R4
        STB     R0, @(R13,R6)                 CMP     #6, R4
        MOV     R6, R0                        BLT20   L_26, R12
        ADD     R4, R0                        LDI     #0, R4
        LDI     #0, R1
        LDI     #1, R13
        STB     R1, @(R13,R0)
        MOV     R6, R0
        ADD     R4, R0
        LDI     #0, R1
        LDI     #2, R13
        STB     R1, @(R13,R0)
        ADD     #3, R4
        ADD     #-1, R5
        CMP     #1, R5
        BGE20   L_32, R12
        ---------------------                 ---------------------
        42 bytes                              18 bytes
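
When speed has priority and a loop is unrolled by hand, the rolled and unrolled forms must stay equivalent. The sketch below is a minimal illustration with hypothetical function names. It relies on the element count being a multiple of the unrolling factor 3, which is an assumption of this example.

```c
#define A_LEN 6             /* must be a multiple of 3 for clear_unrolled */

unsigned char a[A_LEN];

/* Size-oriented form: one store per iteration, six iterations. */
void clear_rolled(void)
{
    int i;
    for (i = 0; i < A_LEN; i++) {
        a[i] = 0;
    }
}

/* Speed-oriented form: three stores per iteration, two iterations. */
void clear_unrolled(void)
{
    int i;
    for (i = 0; i < A_LEN; i += 3) {
        a[i] = 0;
        a[i + 1] = 0;
        a[i + 2] = 0;
    }
}
```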

2.6 Review of necessity for inline expansion

Inline expansion optimization expands the body of a function defined in the C source at the call site, instead of generating a function call. When the expanded function is very small, code size after inline expansion may be smaller, but in general object size increases.

When object size has priority, this optimization is not recommended.
(Do not use the -xauto option, the -x option, #pragma inline, or the inline qualifier (C++ only).)

[C source]
unsigned short ADD_sat16(unsigned short a, unsigned short b){
    int tmp;
    if((tmp=a+b)>0xffff) return 0xffff;
    return (unsigned short)tmp;
}
unsigned short a,b,c,d,e,f;
void func(void){
    a=ADD_sat16(b,c);
    d=ADD_sat16(e,f);
}
[In-line expansion optimization]      [In-line optimization control]
_func:  LDI:20  #_b, R12              _func:  ST      RP, @-SP
        LDUH    @R12, R4                      LDI:20  #_b, R12
        LDI:20  #_c, R12                      LDUH    @R12, R4
        LDUH    @R12, R5                      LDI:20  #_c, R12
        ADD     R5, R4                        LDUH    @R12, R5
        LDI     #65535, R0                    CALL20  _ADD_sat16, R12
        CMP     R0, R4                        LDI:20  #_a, R12
        BLE20   L_32, R12                     STH     R4, @R12
        LDI     #65535, R4                    LDI:20  #_e, R12
        BRA20   L_28, R12                     LDUH    @R12, R4
L_32:   EXTUH   R4                            LDI:20  #_f, R12
L_28:   LDI:20  #_a, R12                      LDUH    @R12, R5
        STH     R4, @R12                      CALL20  _ADD_sat16, R12
        LDI:20  #_e, R12                      LDI:20  #_d, R12
        LDUH    @R12, R4                      STH     R4, @R12
        LDI:20  #_f, R12                      LD      @SP+, RP
        LDUH    @R12, R5                      RET
        ADD     R5, R4
        LDI     #65535, R0
        CMP     R0, R4
        BLE20   L_36, R12
        LDI     #65535, R4
        BRA20   L_34, R12
L_36:   EXTUH   R4
L_34:   LDI:20  #_d, R12
        STH     R4, @R12
        RET
        -----------------------               -----------------------
        74 bytes                              46 bytes

However, for a function with small code size, declaring it as a static function or specifying #pragma inline can reduce code size, depending on its arguments. Using the "inline candidate selection function" of Softune C Analyzer is recommended.
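
As a sketch of the static-function approach, the saturating add from the listing above could be declared static so that the compiler knows it is not referenced from other files. The lower-case name add_sat16 and the caller update are hypothetical, and whether the code actually shrinks depends on the compiler and its options. #pragma inline is left out because its exact syntax is not given in this text.

```c
/* File-local saturating 16-bit add; 'static' limits it to this file. */
static unsigned short add_sat16(unsigned short a, unsigned short b)
{
    /* unsigned long is wide enough to hold 0xffff + 0xffff */
    unsigned long tmp = (unsigned long)a + b;
    if (tmp > 0xffffUL) {
        return 0xffff;
    }
    return (unsigned short)tmp;
}

unsigned short a, b, c, d, e, f;

/* Hypothetical caller with two call sites, as in the listing above. */
void update(void)
{
    a = add_sat16(b, c);
    d = add_sat16(e, f);
}
```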


2.7 Control of standard library expansion

Standard library expansion recognizes the behavior of standard functions and either expands them inline or replaces them with faster standard functions that perform the same operation. When object size has priority, do not use this optimization; specify standard library inline expansion control (-Knolib).


2.8 Others

Locate frequently referenced structure members at the head.

The actual address of a structure member is determined by calculating head address + offset.
The head member needs no such calculation because its offset is 0. When a member is referenced at many places in the source (high static access frequency), review whether it can be located at the head.
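
A minimal sketch of this layout rule, using a hypothetical structure motor_state in which status is assumed to be the most frequently referenced member. offsetof from <stddef.h> confirms which member sits at offset 0.

```c
#include <stddef.h>

/* Hypothetical structure: 'status' is assumed to be referenced most often,
   so it is placed first and is reachable at head address + 0. */
struct motor_state {
    unsigned long status;       /* frequently referenced: offset 0 */
    unsigned long speed;
    unsigned long position;
    unsigned char name[16];     /* assumed to be rarely referenced */
};

/* Returns the byte offset of the frequently referenced member. */
size_t hot_member_offset(void)
{
    return offsetof(struct motor_state, status);
}
```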

Make functions that return a structure void.

A function that returns a structure causes the structure to be transferred through a work area. By passing the address of the destination structure as an argument and assigning to it directly, the function can be made a void function.
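
The two forms can be sketched as follows. The structure point and both function names are hypothetical and serve only to contrast a structure-returning function with its void counterpart.

```c
struct point {
    long x;
    long y;
    long z;
};

/* Returns the structure by value: the result may pass through a work area
   before it reaches the destination. */
struct point make_point_by_value(long x, long y, long z)
{
    struct point p;
    p.x = x;
    p.y = y;
    p.z = z;
    return p;
}

/* void version: the caller passes the destination address and the
   members are written directly into it. */
void make_point(struct point *dst, long x, long y, long z)
{
    dst->x = x;
    dst->y = y;
    dst->z = z;
}
```

A call such as p = make_point_by_value(1, 2, 3); then becomes make_point(&p, 1, 2, 3);.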

Keep the number of arguments within 4.

When a function has 4 or fewer arguments, they are passed in registers, so no code is needed to access them and execution speed improves. When a function has arguments that are passed needlessly, review whether they can be removed.
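
As a hedged illustration, the sketch below drops an argument that the function never uses, bringing the call from five arguments to four. The names scale5, scale4 and unused_flags are hypothetical.

```c
/* Five arguments: the last one is never used, yet every caller
   still has to pass it. */
long scale5(long value, long mul, long div, long offset, long unused_flags)
{
    (void)unused_flags;     /* passed needlessly */
    return value * mul / div + offset;
}

/* Four arguments after removing the unused one; per the text, this
   many arguments can be passed in registers. */
long scale4(long value, long mul, long div, long offset)
{
    return value * mul / div + offset;
}
```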