This article is very interesting. The one thing I would love to see in it is an example of how using the registers according to their original design leads to more compression than using them freely.
There's a handy appendix in the NASM manual that's good for answering questions like this. It doesn't tell you anything that isn't in the data sheet, but it's more concise (the x86 data "sheet" is, like, 5 volumes, and must be 2,500 pages).
From this you can see that many arithmetic and logic instructions have special encodings that use AL/AX/EAX as the destination. (You can also use the more general encoding to produce exactly the same instruction, but in more bytes.)
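To make that concrete, here is a small sketch, assuming 32-bit mode (the `bits 32` directive is my addition). The byte values in the comments are the encodings I would expect NASM to pick; check them against a listing file before relying on them.

```nasm
bits 32

; Accumulator short forms: opcode + immediate, no ModRM byte
add eax, 0x12345678   ; 05 78 56 34 12       5 bytes
cmp al, 0x20          ; 3C 20                2 bytes

; Same operations on other registers need the general encoding
add ebx, 0x12345678   ; 81 C3 78 56 34 12    6 bytes
cmp bl, 0x20          ; 80 FB 20             3 bytes
```

The saving only shows up when the immediate is full-width. Something like `add eax, 4` already fits the sign-extended 8-bit immediate form, which is 3 bytes for any register.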
If you're working with AL/AX/EAX, then if you arrange for ESI and EDI to point into your source and dest buffers respectively, you can use LODS and STOS to fetch and store bytes. If your processing works in a streaming-like fashion (and this isn't a bad idea, if you can arrange it) then you can fetch or store a byte or dword, and bump the pointer, using one instruction that's 1 byte. Doing this with a MOV followed by an INC would be 2 instructions and 3 bytes.
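A sketch of the byte-at-a-time case, again assuming 32-bit mode and that the direction flag is clear so the pointers move forward. The encodings in the comments are what I'd expect from NASM.

```nasm
bits 32

cld                   ; FC       make string instructions step forward

; One-byte string forms
lodsb                 ; AC       al = [esi], esi += 1
stosb                 ; AA       [edi] = al, edi += 1

; The same work spelled out with MOV and INC
mov al, [esi]         ; 8A 06
inc esi               ; 46
mov [edi], al         ; 88 07
inc edi               ; 47
```

One difference to keep in mind: INC modifies the arithmetic flags, while LODS and STOS leave the flags alone, so the two versions are not always drop-in replacements for each other.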
There is no short form for the [EBP] addressing mode, nor [EBP+reg], because you virtually always use fixed offsets with a base pointer. I think both of these end up as [EBP+0] or [EBP+reg+0], so you'll always waste an extra byte for the fixed displacement if you try to use EBP for something more general.
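Here is how that looks for the plain [EBP] case, as a sketch in 32-bit mode (I've left out [EBP+reg], since an assembler may reorder the registers there). EBX stands in for any ordinary pointer register.

```nasm
bits 32

mov eax, [ebx]        ; 8B 03       2 bytes
mov eax, [ebp]        ; 8B 45 00    3 bytes, assembled as [ebp+0]

mov eax, [ebx+8]      ; 8B 43 08    3 bytes
mov eax, [ebp+8]      ; 8B 45 08    3 bytes, no penalty once there is an offset
```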
The rotate instructions (and the shifts) have special encodings that take the count in CL, since C is for count after all.
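A short sketch of what that means in practice, assuming 32-bit mode. CL is the only register that can hold a variable count, so a count living anywhere else has to be copied over first.

```nasm
bits 32

rol eax, cl           ; D3 C0       variable count, taken from CL
rol eax, 5            ; C1 C0 05    fixed count, as an immediate

; Count held in BL: move it into CL, then rotate
mov cl, bl            ; 2 bytes
ror edx, cl           ; D3 CA
```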
There are probably a few more bits and pieces like the above; these are just what stuck in my mind from when I wrote code to generate x86 machine code directly. I just skimmed through the NASM doc to double-check my memory.
I couldn't say whether using the shorter forms makes any realistic difference on real code these days, aside from the general 'fewer bytes = probably quicker' rule. But if you're trying to make EXEs that are 4K or less, as this guy seems to, I bet the little savings would add up.
Your other replies illustrated how preferring registers according to design will shorten the code; but that doesn't really address compression. If you use the instructions for their design purposes you're likely to end up with common sequences of instructions and arguments, which will compress better than if you didn't use the registers consistently. If you can compress your code so much that you have room to make it self-extracting, you get to cheat your size limit a bit.
The author gives an example; you just have to fill in some blanks. To follow along, on Linux you can use the NASM assembler (sudo apt-get install nasm on Debian-like or Ubuntu-like systems).
Then you can copy-paste the assembly language from the article (I've changed the spacing for readability and added dummy definitions for the names so it will assemble):
;demo1.asm
source_address equ 0x100
destination_address equ 0x200
loop_count equ 0x10
mov esi, source_address
mov edi, destination_address
mov ecx, loop_count
my_loop:
lodsd
;Do some calculations with eax here.
stosd
loop my_loop
To assemble it, type this in your favorite shell:
nasm -l demo1.lst demo1.asm
Then here's my alternative implementation that uses different registers. You can no longer use the LODSD, STOSD or LOOP instructions, since they only work with the registers demo1 uses (ESI, EDI, ECX and EAX).
;demo2.asm
source_address equ 0x100
destination_address equ 0x200
loop_count equ 0x10
mov ebx, source_address
mov edx, destination_address
mov esi, loop_count
my_loop:
mov eax,[ebx] ; these two instructions instead of lodsd
add ebx,4
;Do some calculations with eax here.
mov [edx],eax ; these two instructions instead of stosd
add edx,4
dec esi ; these two instructions instead of loop
jnz my_loop
I get a demo1 of 24 bytes and a demo2 of 38 bytes. The demo1.lst and demo2.lst files produced show how many bytes are taken up by each instruction. (And if you get addresses in a crash dump, they can be used to track down the corresponding source code line.)
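Those totals come from NASM's default mode for raw binary output, which is 16-bit: every 32-bit operand picks up a 0x66 prefix, and the [ebx] and [edx] accesses in demo2 pick up a 0x67 prefix as well. As a sketch of what the same two loops cost in 32-bit mode, here they are in one file with `bits 32` added and my own per-instruction byte counts in the comments (the labels loop1 and loop2 are just names I made up so both fit in one file). By my count that gives 19 bytes for the demo1 version and 28 for the demo2 version; the listing file will confirm or correct that.

```nasm
bits 32
source_address      equ 0x100
destination_address equ 0x200
loop_count          equ 0x10

; demo1 style: 15 + 1 + 1 + 2 = 19 bytes
mov esi, source_address        ; 5
mov edi, destination_address   ; 5
mov ecx, loop_count            ; 5
loop1:
lodsd                          ; 1
stosd                          ; 1
loop loop1                     ; 2

; demo2 style: 15 + 2 + 3 + 2 + 3 + 1 + 2 = 28 bytes
mov ebx, source_address        ; 5
mov edx, destination_address   ; 5
mov esi, loop_count            ; 5
loop2:
mov eax, [ebx]                 ; 2
add ebx, 4                     ; 3
mov [edx], eax                 ; 2
add edx, 4                     ; 3
dec esi                        ; 1
jnz loop2                      ; 2
```

Assemble it the same way as before (nasm -l with a listing file name of your choice) to see the actual bytes next to each line.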
If you want to actually run these programs, nasm's default output (raw machine language instructions) cannot be used by most OS's. (In DOS, you can -- just rename to .COM. But a DOS target needs 'bits 16' mode, which is NASM's default for raw binary output, to have it emit the proper prefixes for those new-fangled 32-bit instructions, and will crash without the DOS exit syscall, INT 0x20.) The magic incantations for standalone Linux assembly language programs are here:
The .o file produced by an intermediate step of the instructions at the above link can be linked with C code (unless you use a Microsoft toolchain, in which case you have to instruct nasm to output .obj instead). Then you can call assembly language functions from C and vice versa. (Figuring out how to retrieve your function's arguments in assembly language is very interesting and will enlighten you about the implementation of high-level languages.) Most "real-world" assembly code does this: Most of the program is written in C, and only the functions that need the particular advantages of assembly language are written in it.