Optimising 6502 Machine Code
----------------------------
by Steven Flintham
------------------
Optimising machine code is one of those skills that doesn't seem
immediately useful, but you never know when you might need it. For
instance, it's surprising just how many short pieces of code turn out
to be just a few bytes over 256 bytes (one page).
The JSR/RTS method
------------------
One of the best known methods of cutting a few bytes out of a piece of
code involves replacing a JSR followed by an RTS with a JMP. For
example:
JSR &FFEE
RTS
can be replaced with:
JMP &FFEE
This saves one byte (three instead of four). It works because the RTS
at the end of the called subroutine returns straight to our own
caller, doing the job of the RTS we removed.
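The same applies to your own routines whenever the last thing they do
is call another subroutine. As a minimal sketch (print_ok is a
hypothetical label; &FFEE is the OSWRCH character output call used
above):
.print_ok
LDA #ASC"O"
JSR &FFEE
LDA #ASC"K"
JMP &FFEE \ was JSR &FFEE followed by RTS
Only the final call can be treated this way. The earlier JSR must
stay, since execution has to come back to load the second character.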
The devious branch
------------------
The 65C02 and its immediate relatives contain the additional
instruction BRA, which provides an unconditional branch to nearby
code, taking only two bytes compared to three for an equivalent JMP.
The standard 6502, however, contains no such instruction, but in some
situations a different branch instruction can be used to provide the
same effect. For example:
LDA flag
CMP #10
BEQ set_A_to_0
LDA #1
JMP skip
.set_A_to_0
LDA #0
.skip
(which sets A to 0 if flag contains 10, and 1 otherwise) can be
replaced with:
LDA flag
CMP #10
BEQ set_A_to_0
LDA #1
BNE skip
.set_A_to_0
LDA #0
.skip
This saves one byte by replacing the three-byte JMP with a two-byte
BNE. This can be done because LDA #1 loads a non-zero value, so the Z
flag will definitely be cleared.
However, don't fall into the trap of contriving such a condition. For
example, replacing:
JMP code
with
SEC:BCS code
provides no advantage as it occupies the same number of bytes (three)
AND corrupts the C flag. In the first example, the instruction which
had to be carried out anyway (LDA #1) cleared the Z flag, enabling us
to take advantage. If in doubt, don't use this method as it can lead
to problems - there are suggestions that the reported bug in some
versions of ViewSpell when used with the ADFS is due to some sort of
assumption being incorrectly made about a particular flag being set or
cleared.
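As a further sketch of a flag that comes for free, the carry flag is
often known after a comparison. The labels value, small and done
below are hypothetical, and the code assumes decimal mode is off:
LDA value
CMP #10
BCC small \ taken when value is less than 10
SBC #10 \ C is known to be set here, so no SEC is needed
BCS done \ no borrow occurred, so C is still set and this always branches
.small
ADC #5 \ C is known to be clear here, so no CLC is needed
.done
Here BCS stands in for JMP, and the known carry state also saves the
SEC and CLC that would normally precede SBC and ADC. The warning
above still applies: only rely on a flag when you can see exactly
which instruction set it.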
Don't write more code than you need
-----------------------------------
If you're writing an interrupt driven routine or a ROM service call
handler, you'll have to save the processor registers and status flags
at some point. However, while you're still deciding whether you want
to accept the call, or whether it's the right interrupt, you often
only need to save one or two registers. For instance, if you're
writing a ROM service call handler:
.service_call_handler
PHP:PHA:STA temp:TXA:PHA:TYA:PHA:LDA temp
CMP #4
BEQ command
CMP #9
BEQ help
PLA:TAY:PLA:TAX:PLA:PLP
RTS
you don't need to save the X and Y registers unless you actually have
to service a call. This not only avoids the need for the
TXA:PHA:TYA:PHA, but also avoids having to store the accumulator
temporarily. The rewritten code is:
.service_call_handler
PHP:PHA
CMP #4
BEQ command
CMP #9
BEQ help
PLA:PLP
RTS
This is a total saving of 14 bytes (10 on entry and 4 on exit),
assuming that temp is not in zero page. If temp is in zero page, the
saving is 12 bytes.
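The registers then only need saving in the handlers that actually do
some work. A minimal sketch of one such handler, assuming it needs X
and Y and that print_help is a hypothetical routine which produces
the help text:
.help
TXA:PHA:TYA:PHA \ save X and Y only now that the call is ours
JSR print_help
PLA:TAY:PLA:TAX
PLA:PLP
RTS
Any handler that needs this puts some of the bytes back, but calls
which are not recognised, and handlers which never touch X or Y, no
longer pay for the extra saving and restoring.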
Make the most of post-indexed indirect
--------------------------------------
On the 6502, the instructions:
LDY #0
LDA (zero_page),Y
frequently occur. If this happens, try to make the zero that is now
in Y do some other useful work as well. I won't give an example
because this is of most use in complex situations, which can't be
demonstrated easily. As an illustration, if you are using this
instruction while zeroing an area of memory, you could use TYA to
zero the accumulator instead of LDA #0, saving one byte, but the
example would become convoluted because it would be easier to LDA #0
at the beginning and use the Y register to scan through the area of
memory.
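That said, the TYA idea on its own can be sketched briefly. This
assumes ptr is a hypothetical zero page pointer to a 256 byte area
which is to be cleared:
LDY #0
TYA \ A = 0 in one byte rather than the two of LDA #0
.clear
STA (ptr),Y
INY
BNE clear \ loops until Y wraps round to 0 after 256 bytes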
Use your processor
------------------
If you're writing software which will definitely be used on a Master
or a 6502 second processor, both of which contain a 65C02 compatible
chip, use the extra instructions.
[ Editor's note :
From my limited knowledge of BBC machine code, it seems that most of the
following ways of optimising code are only compatible with Master machines,
or BBC B with 6502 second processor. If you are considering using any of the
following methods, consider very carefully the meaning of the word
"definitely". There is very little point in making a program incompatible
with much of the 6502 world simply for the sake of a few bytes (or because
you only use a Master and have forgotten that certain instructions are
unavailable on the BBC B).
Steven himself has in the past laid down a lack of 65C02-specific
instructions as a requirement for code submitted to him for inclusion in his
ADFSUtils ROM, and this seems, in my opinion, a sensible step. You may feel
that the particular program you are writing is so specific as to be of use
only for yourself, but it is more than likely that other users will also
find your software useful, and I would very much like to feature a wide
selection of fully-compatible software in 8-Bit Software.
My views on compatibility are based largely upon the fact that it is very
easy to ensure and at the same time advantageous to everyone. I spent around
twelve months loading PD software onto Econet (or rather, getting other
people to do it for me) just to find that many otherwise excellent pieces of
software did not follow the Reference Manual's recommendations that
"programs which might possibly be run in a network environment" should not
use certain (small) areas of memory.
All PD software will probably find its way to a network eventually because
site licences for commercial software are so expensive! More to the point,
networks may seem few enough in number to be virtually irrelevant (in the
view of many programmers!) when compared with the large number of individual
users, but the particular network in question had upwards of 250 users,
which makes 8-Bit Software seem fairly small potatoes by comparison.
Anyway, enough of my digression about compatibility - the rest of it is in
the "Submission Requirements" section. If your program contains some other
fundamental incompatibility with the BBC B, then obviously you might as well
use as many 65C02 instructions as you wish. - D.G.S. ]
The stack operations
--------------------
One very obvious example, which you might miss if you're not used to
the extra instructions, is replacing lines like:
PHP:PHA:TXA:PHA:TYA:PHA
with
PHP:PHA:PHX:PHY
This saves two bytes and avoids corrupting the accumulator.
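The matching restore shrinks in the same way. As a sketch, the 6502
sequence PLA:TAY:PLA:TAX:PLA:PLP used earlier becomes:
PLY:PLX:PLA:PLP
which again saves two bytes, since PLX and PLY pull straight into the
index registers.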
Post-indexed indirect optimisation
----------------------------------
It's surprising how often code of the form:
LDY #0
LDA (&70),Y
appears in 6502 machine code. The 65C02 and family can provide
indirect addressing without post indexing, allowing you to write:
LDA (&70)
This saves two bytes and avoids having to corrupt the Y register.
However, if you haven't got a 65C02 you can try to modify the code so
that setting Y to 0 performs some other useful function, as described
under "Make the most of post-indexed indirect" above.
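The 65C02's non-indexed indirect mode is also available for stores.
As a sketch, assuming &70 and &72 hold hypothetical source and
destination pointers, a single byte can be copied without touching Y:
LDA (&70)
STA (&72)
If more than one byte has to be copied, the indexed form with Y as a
counter is still likely to be the better choice.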
Use BRA
-------
If your code uses JMP to skip over short sections of code, use BRA
instead, which saves one byte. Like the other branch instructions,
BRA can only reach targets within about 128 bytes of itself.
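BRA is most useful where no flag can be relied upon, so the
conditional branch trick described earlier is not available. A
sketch, with hypothetical subroutines handle_zero and handle_nonzero:
LDA flag
BEQ is_zero
JSR handle_nonzero \ flags are unknown on return
BRA done \ two bytes where JMP done would take three
.is_zero
JSR handle_zero
.done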
Use STZ
-------
Don't forget that if you have to zero an area of memory, the 65C02 and
family support an STZ instruction. For example:
LDA #0
STA address
can be replaced with:
STZ address
This is two bytes shorter, and also leaves the accumulator and the
processor flags untouched, whereas LDA #0 corrupts the accumulator
and sets the Z flag.
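STZ also has an X-indexed form, so a whole page can be cleared
without involving the accumulator at all. A sketch, assuming buffer
is a hypothetical label for a 256 byte area:
LDX #0
.clear
STZ buffer,X
INX
BNE clear \ loops until X wraps round to 0
STZ only offers direct and X-indexed addressing, so clearing memory
through a zero page pointer still needs LDA #0 and STA.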
Postscript
----------
I'm sure there are many more techniques than those listed here, but
these are the ones that I find most useful. A final word of warning,
however - if you're optimising a program you've already written, keep
a copy of the original source code!