Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I still don’t understand how you can use basic data structure in assembly. I know strings are just allocating a blob of bytes and usually reserving the last for null terminator, but what about arrays? Is it just like in C where you allocate a blob of bytes but now you manually calculate where the index is based on item size? What about hash tables, that seems even more confusing... are they just not needed due to the nature of assembly?


If you know how to construct data structures in C, it's pretty straightforward to construct them in assembly language. In asm you tend to lose the benefit of arrays being a type. In x86 you have powerful addressing modes (the different ways you can calculate a memory address) which are normally encoded with modrm and sib bytes after the opcode. An example where eax, rbx, and rcx are registers:

  ; eax = rbx[rcx] where rbx points to the base of an array of dwords
  mov eax, [rcx*4 + rbx]
  ; => 8B 04 8B
  ; where the first 8B is the opcode for this version of mov, 04 is the modrm byte saying
  ; to mov the value into eax as well as that there will be a sib (scale/index/base)
  ; byte following specifying the data source, and then 8B specifying everything in the []'s.
  ; If we were moving into ecx, the modrm byte would be 0C because a the 3-bit 
  ; register specification portion of modrm would change to specify ecx.
If you are optimizing for size (which can be a fun exercise) there are other ways of iterating through "arrays" of bytes such as the various string instructions [1] (though these tend to be slower on modern processors because they are microcoded). This gets into designing your code to use registers which are used implicitly by small instructions reducing the need for modrm/sib bytes in the instruction bytes [2].

Another example might be:

  lodsd      ; eax = *rsi++  : "AD" (one byte opcode using implicit src/dst operands)
  shl eax, 2 ; eax *= 4      : "C1 E0 02" where C1 specifying shl instruction
             ;                 E0 specifying eax and immediate constant 
             ;                 02 the constant to shift eax by (this is a mul by 4)
  stosd      ; *rdi++ = eax  : "AB" store into destination (one byte opcode)
It's not hard to imagine that being in a loop to iterate over the contents of an array. Hash tables are just the way they are in C really.

[1] http://www.felixcloutier.com/x86/LODS:LODSB:LODSW:LODSD:LODS... (there are others)

[2] https://www.swansontec.com/sregisters.html


Everything is just a blob of bytes when you get right down to it. Strings and arrays are just as you say. As another commenter notes, hash tables are big arrays and a function that tells you how to index into it. Linked lists are a bunch of different blobs, each of which has a sub-blob that tells you the address of the next blob. And so on.

They have this representation whether you're using assembly or not... it's just that if you're writing assembly, then you have to pay attention to their blobby nature. If you're writing in a higher level language, then that language will do the work for you of translating, say, my_object["my_key"] = my_value into the assembly instructions that deal with indexing into your giant array.


Yeah, seems like you have the right idea, for arrays. So if you think about a hash table, (a very simple one) that is just a big array, paired with a function that turns keys into indices. So you would define your function. It would spit out an index. And you would calculate the location of that index based on the start point of the array, and the size of the values.


Besides all the other comments.

At least on PC/Amiga/Atari world, macro assemblers were quite powerful, so you could kind of invent your own "C" using their macros.


>Is it just like in C where you allocate a blob of bytes but now you manually calculate where the index is based on item size?

You iterate by address, so on a 32 bit system 4 bytes at a time. Then just reference whatever that address is pointing to.


>I know strings are just allocating a blob of bytes and usually reserving the last for null terminator, but what about arrays? Is it just like in C where you allocate a blob of bytes but now you manually calculate where the index is based on item size?

Yes, that is roughly it, although there is more to the topic, related to arrays and pointers, etc. In fact C's way of doing it is pretty low-level too and similar to assembly - it just adds the address of the origin of the array (i.e. the array name. which is a pointer to the start of the array) and the index times the size of the array element, to get to the memory location of the indexed item in the array.

A surprising point I read a while ago is that due to this, in C, the expression a[i] is equivalent to i[a], where a is the array name and i is the index (int variable). This is because C resolves a[i] to:

  *(a + i),
meaning, dereference the pointer calculated by adding the offset i to the pointer a, and that expression:

  *(a + i),
by the commutativity law of arithmetic, is the same as:

  *(i + a),
which can be converted to i[a] by the same C resolution rule in reverse. I actually tried this out earlier when I first read it, on either Microsoft C compilers on Windows or gcc on Linux, or both, and it worked. Not sure, it might have changed for some reason now, maybe due to later versions of the C standard. Should try it out again.

The Kernighan and Ritchie book "The C Programming Language" is be a good introduction to how arrays and pointers work (and their inter-relationship). (Seeing your profile, I guess you may have read it, but just saying.) I had read the original version and the 2nd (ANSI C) version. Not sure if there is an updated version after that.

In the versions I read, they explain arrays and pointers in C pretty well (although you have to do some work yourself, it is not dumbed-down teaching like some you see nowadays).

>What about hash tables, that seems even more confusing...

The K&R book had a nice simple implementation of a hash table in C too. The hash function (from memory) was something like just adding up all the characters's int values (in the string used as the hash key) and doing a division modulo some prime number (like 31). There was other stuff for handling collisions and buckets and so on. And that was of course only a demo version (though it did work), there are more advanced versions.

(Sorry for a few multiple edits, I was not familiar with how asterisks are used in HN as markup, so had to edit it a few times.)

Here are a few links for others who might not know about how they are used:

Google search for:

how to disable meaning of asterisk in hacker news

https://www.google.com/search?q=how+to+disable+meaning+of+as...

and from it:

https://news.ycombinator.com/item?id=12521497


> I know strings are just allocating a blob of bytes and usually reserving the last for null terminator

I know this is what string literals in C translate to, but is it actually how anyone does it these days? I thought there was general agreement that a size field is mostly superior to an elephant in Cairo.


Most languages now have a length field, yes. Assembly could be done either way, of course.


The comparison with C is not too bad. Explore what your C compiler does out of simple C code (Array access, allocation, string access, linked lists), for instance online in https://godbolt.org/


if you write recursive macros and/or use computed offsets from some template structure someplace...it ends up looking pretty much like C. except you have to assign your own registers.


if you write recursive macros and/or use computed offsets from some template structure someplace...it ends up looking pretty much like C


Well, Hash tables are much more complicated. I think the closest analog is going to be a jump table, just with fancy math.

I feel like the hardest part of Assembly is all of the hex math required...


Just use a programmer's calculator or a software calculator (e.g. bc on Unix, RealCalc on Android) for the hex math (and for math in other number bases and conversions between bases too). Or even write one of your own as an exercise and then use it.


Do you even do any of the hex math when writing actual code in assembly?

I understand when you use something along the lines of a disassembler that requires you to actually see the addresses of the values. But even these are sophisticated enough to give you symbolic values instead.


A lot of my hobby programs are emulators or parsers for old file formats (think DOS-era games). Bitfields become important, and they're more convenient to express in hex than in decimal.

There are times where it feels convenient to express things in decimal and times where it seems better to use hex.


>Do you even do any of the hex math when writing actual code in assembly?

There are multiple reasons why it is useful or convenient to deal with data (whether memory addresses, data values, colors, etc.) in hex (including doing math on such hex values).

As I said in another comment in this thread, I have not done a lot of work in assembly language; but have done enough and read enough about it to know that it is useful, and not just a cool or retro thing that developers do.

The fact that two hex digits compactly represent a byte, the fact that one hex digit can represent a nibble/nybble [1], the fact that machine registers on devices attached to computers (such as printers, scanners, any other electro-mechanical equipment with a computer interface) are used to programmatically read and write data to and from those devices, and thereby get their status and manipulate and control them, are some of the reasons why hex and hex math and bit-twiddling in hex and binary is useful and still done.

Depends on what kind of applications you write in assembly (or even C), I guess.

[1] https://en.wikipedia.org/wiki/Nibble

Also see comment by khedoros1.


In those days it wasn’t uncommon to have a calculator like the HP 16C on your desk




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: