STM32 USB and Rust - Packet Memory Area
In this, our next exciting installment of STM32 and Rust for USB device drivers, we're going to look at what the STM32 calls the 'packet memory area'. If you've been reading along with the course, including reading up on the datasheet content, then you'll be aware that as well as the STM32's normal SRAM, there's a 512 byte SRAM dedicated to the USB peripheral. This SRAM is called the 'packet memory area' and is shared between the main bus and the USB peripheral core. Its purpose is, simply, to store packets in transit: both those going IN to the host (stored, queued for transmission) and those coming OUT from the host (stored, queued for the application to extract and consume).
It's time to actually put hand to keyboard on some Rust code, and the PMA is
the perfect starting point, since it involves two basic structures. Packets
are the obvious first structure, and they are contiguous sets of bytes which
for the purpose of our work we shall assume are one to sixty-four bytes long.
The second is what the STM32 datasheet refers to as the BTABLE or Buffer
Descriptor Table. Let's consider the BTABLE first.
The Buffer Descriptor Table
The BTABLE is arranged in quads of 16bit words. For "normal" endpoints this
is a pair of descriptors, each consisting of two words, one for transmission,
and one for reception. The STM32 also has a concept of double buffered
endpoints, but we're not going to consider those in our proof-of-concept work.
The STM32 allows for up to eight endpoints (EP0 through EP7) in internal
register naming, though they support endpoints numbered from zero to fifteen in
the sense of the endpoint address numbering. As such there're eight
descriptors each four 16bit words long (eight bytes) making for a buffer
descriptor table which is 64 bytes in size at most.
| Byte offset in PMA | Field name | Description |
|---|---|---|
| (EPn * 8) + 0 | USB_ADDRn_TX | The address (inside the PMA) of the TX buffer for EPn |
| (EPn * 8) + 2 | USB_COUNTn_TX | The number of bytes present in the TX buffer for EPn |
| (EPn * 8) + 4 | USB_ADDRn_RX | The address (inside the PMA) of the RX buffer for EPn |
| (EPn * 8) + 6 | USB_COUNTn_RX | The number of bytes of space available for the RX buffer for EPn (and once received, the number of bytes received) |
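To make the table concrete, here is a minimal sketch of the offset arithmetic. The names `BtableField` and `btable_offset` are made up for illustration and aren't part of any crate; the offsets are relative to wherever the BTABLE starts in the PMA.

```rust
/// The four 16-bit fields that make up one endpoint's BTABLE entry.
/// (Hypothetical names, chosen to mirror the table above.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BtableField {
    AddrTx,
    CountTx,
    AddrRx,
    CountRx,
}

/// Byte offset, relative to the start of the BTABLE, of `field`
/// for endpoint register `ep` (EP0 through EP7).
pub fn btable_offset(ep: usize, field: BtableField) -> usize {
    // Only eight endpoint registers exist, so anything else is a bug.
    assert!(ep < 8);
    let field_offset = match field {
        BtableField::AddrTx => 0,
        BtableField::CountTx => 2,
        BtableField::AddrRx => 4,
        BtableField::CountRx => 6,
    };
    ep * 8 + field_offset
}
```

With this, USB_COUNT1_RX comes out at byte offset 14, and the last entry (USB_COUNT7_RX) at 62, which fits the 64 byte maximum table size.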
The TX entries are trivial to comprehend. To transmit a packet, part of the
process involves writing the packet into the PMA, putting the address into the
appropriate USB_ADDRn_TX entry, and the length into the corresponding
USB_COUNTn_TX entry, before marking the endpoint as ready to transmit.
To receive a packet though is slightly more complex. The application must
allocate some space in the PMA, setting the address into the USB_ADDRn_RX
entry of the BTABLE before filling out the top half of the USB_COUNTn_RX
entry. For ease of bit sizing, the STM32 only supports space allocations of
two to sixty-two bytes in steps of two bytes at a time, or thirty-two to
five-hundred-twelve bytes in steps of thirty-two bytes at a time. Once the
packet is received, the USB peripheral will fill out the lower bits of the
USB_COUNTn_RX entry with the actual number of bytes filled out in the buffer.
Packets themselves
Since packets are, typically, a maximum of 64 bytes long (for USB 2.0) and are
simply sequences of bytes with no useful structure to them (as far as the USB
peripheral itself is concerned) the PMA simply requires that they be present
and contiguous in PMA memory space. Addresses of packets are relative to the
base of the PMA and are byte-addressed, however they cannot start on an odd
byte, so essentially they are 16bit addressed. Since the BTABLE can be
anywhere within the PMA, as can the packets, the application will have to do
some memory management (either statically, or dynamically) to manage the
packets in the PMA.
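As a rough idea of what the simplest memory management could look like, here is a sketch of a bump allocator which hands out PMA byte offsets and never frees them. The type `PmaAllocator` and its methods are hypothetical, and it assumes the 512 byte PMA described above with some bytes reserved at the bottom (for example a BTABLE at offset zero).

```rust
/// Total size of the packet memory area in bytes.
const PMA_SIZE: usize = 512;

/// Hypothetical bump allocator which hands out PMA byte offsets.
pub struct PmaAllocator {
    next: usize,
}

impl PmaAllocator {
    /// Start allocating after `reserved` bytes, rounded up to an even
    /// offset because buffers cannot start on an odd byte.
    pub fn new(reserved: usize) -> Self {
        PmaAllocator {
            next: (reserved + 1) & !1,
        }
    }

    /// Reserve `len` bytes (rounded up to a whole 16-bit word) and
    /// return the PMA byte offset of the buffer, or None if the PMA
    /// does not have enough space left.
    pub fn alloc(&mut self, len: usize) -> Option<usize> {
        let len = len.checked_add(1)? & !1;
        let start = self.next;
        let end = start.checked_add(len)?;
        if end > PMA_SIZE {
            return None;
        }
        self.next = end;
        Some(start)
    }
}
```

Creating it with `PmaAllocator::new(64)` and asking for a 64 byte buffer would give offset 64, the first even offset past a full-sized BTABLE. A real driver may well prefer a fixed static layout instead.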
Accessing the PMA
The PMA is accessed in 16bit word sections. It's not possible to access single bytes of the PMA, nor is it conveniently structured as far as the CPU is concerned. Instead the PMA's 16bit words are spread on 32bit word boundaries as far as the CPU knows. This is done for convenience and simplicity of hardware, but it means that we need to ensure our library code knows how to deal with this.
First up, to convert an address in the PMA into something which the CPU can use
we need to know where in the CPU's address space the PMA is. Fortunately this
is fixed at 0x4000_6000. Secondly we need to know what address in the PMA we
wish to access, so we can determine which 16bit word that is, and thus what the
address is as far as the CPU is concerned. If we assume we only ever want to
access 16bit entries, we can just multiply the PMA offset by two before adding
it to the PMA base address. So, to access the 16bit word at byte-offset 8 in
the PMA, we'd look for the 16bit word at 0x4000_6000 + (0x08 * 2) => 0x4000_6010.
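That translation is small enough to sketch as a standalone function. The name `pma_to_cpu_addr` is just for illustration, and the bounds check assumes the 512 byte PMA discussed earlier.

```rust
/// Base address of the PMA in the CPU's address space.
const PMA_BASE: usize = 0x4000_6000;

/// Convert a byte offset inside the PMA into the address at which the
/// CPU sees that 16-bit word. (Hypothetical helper for illustration.)
pub fn pma_to_cpu_addr(pma_offset: usize) -> usize {
    // Only whole 16-bit words can be accessed.
    assert!(pma_offset & 1 == 0);
    // Assumes the 512 byte PMA described in the text.
    assert!(pma_offset < 512);
    // Each 16-bit PMA word occupies a 32-bit slot on the CPU side,
    // so the byte offset simply doubles.
    PMA_BASE + pma_offset * 2
}
```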
Bundling the PMA into something we can use
I said we'd do some Rust, and so we shall…
// Deref lives in core, so it is available even under no_std
use core::ops::Deref;

// Thanks to the work by Jorge Aparicio, we have a convenient wrapper
// for peripherals which means we can declare a PMA peripheral:
pub const PMA: Peripheral<PMA> = unsafe { Peripheral::new(0x4000_6000) };

// The PMA struct type which the peripheral will return a ref to
pub struct PMA {
    pma_area: PMA_Area,
}

// And the way we turn that ref into something we can put a useful impl on
impl Deref for PMA {
    type Target = PMA_Area;
    fn deref(&self) -> &PMA_Area {
        &self.pma_area
    }
}

// This is the actual representation of the peripheral, we use the C repr
// in order to ensure it ends up packed nicely together
#[repr(C)]
pub struct PMA_Area {
    // The PMA consists of 256 u16 words separated by u16 gaps, so let's
    // represent that as 512 u16 words of which we'll only use every other one.
    words: [VolatileCell<u16>; 512],
}
That block of code gives us three important things. Firstly a peripheral object which we will be able to (later) manage nicely as part of the set of peripherals which RTFM will look after for us. Secondly we get a convenient packed array of u16s which will be considered volatile (the compiler won't optimise around the ordering of writes etc). Finally we get a struct on which we can hang an implementation to give our PMA more complex functionality.
A useful first pair of functions would be to simply let us get and put u16s in and out of that word array, since we're only using every other word…
impl PMA_Area {
    pub fn get_u16(&self, offset: usize) -> u16 {
        assert!((offset & 0x01) == 0);
        self.words[offset].get()
    }

    pub fn set_u16(&self, offset: usize, val: u16) {
        assert!((offset & 0x01) == 0);
        self.words[offset].set(val);
    }
}
These two functions take a byte offset in the PMA and, respectively, return or set the u16 word at that offset. They only work on u16 boundaries and as such they assert that the bottom bit of the offset is unset. Note that assert! stays active in release builds (only debug_assert! is compiled out), though when the offset is a compile-time constant the optimiser can remove the check entirely; during debugging this check might be essential. Since we're only using 16bit boundaries, this means that the first word in the PMA will be at offset zero, and the second at offset two, then four, then six, etc. Since we allocated our words array to expect to use every other entry, this automatically converts into the addresses we desire.
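If you'd like to convince yourself of the indexing on a desktop machine, here is a host-side stand-in. `MockPma` is a hypothetical type which swaps the VolatileCell array for plain `Cell`s in a `Vec`, so it says nothing about real hardware behaviour; it only demonstrates that indexing a u16 array by the PMA byte offset lands on every other slot.

```rust
use std::cell::Cell;

/// Host-side stand-in for PMA_Area: 512 u16 slots, of which only the
/// even-numbered ones correspond to real PMA words.
pub struct MockPma {
    words: Vec<Cell<u16>>,
}

impl MockPma {
    pub fn new() -> Self {
        MockPma {
            words: (0..512).map(|_| Cell::new(0u16)).collect(),
        }
    }

    /// Same shape as get_u16 above: `offset` is a PMA byte offset.
    pub fn get_u16(&self, offset: usize) -> u16 {
        assert!((offset & 0x01) == 0);
        self.words[offset].get()
    }

    /// Same shape as set_u16 above: `offset` is a PMA byte offset.
    pub fn set_u16(&self, offset: usize, val: u16) {
        assert!((offset & 0x01) == 0);
        self.words[offset].set(val);
    }

    /// Distance in bytes from the start of the array to slot `offset`,
    /// which is what gets added to the PMA base address on the CPU side.
    pub fn cpu_byte_offset(offset: usize) -> usize {
        offset * std::mem::size_of::<u16>()
    }
}
```

Because each slot is two bytes wide, slot 4 sits eight bytes from the start of the array, matching the doubling rule from earlier.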
If we pop (and please don't worry about the unsafe{} stuff for now):
unsafe { (&*usb::pma::PMA.get()).set_u16(4, 64); }
into our main function somewhere, and then build and objdump our test binary we can see the following set of instructions added:
80001e4: f246 0008 movw r0, #24584 ; 0x6008
80001e8: 2140 movs r1, #64 ; 0x40
80001ea: f2c4 0000 movt r0, #16384 ; 0x4000
80001ee: 8001 strh r1, [r0, #0]
This boils down to a u16 write of 0x0040 (64) to the address 0x40006008
which is the third 32 bit word in the CPU's view of the PMA memory space (where
offset 4 is the third 16bit word) which is exactly what we'd expect to see.
We can, from here, build up some functions for manipulating a BTABLE, though
the most useful ones for us to take a look at are the RX counter functions:
pub fn get_rxcount(&self, ep: usize) -> u16 {
    self.get_u16(BTABLE + (ep * 8) + 6) & 0x3ff
}

pub fn set_rxcount(&self, ep: usize, val: u16) {
    assert!(val <= 1024);
    let rval: u16 = {
        if val > 62 {
            assert!((val & 0x1f) == 0);
            (((val >> 5) - 1) << 10) | 0x8000
        } else {
            assert!((val & 1) == 0);
            (val >> 1) << 10
        }
    };
    self.set_u16(BTABLE + (ep * 8) + 6, rval)
}
The getter is fairly clean and clear: we need the BTABLE base in the PMA,
add the offset of the USB_COUNTn_RX entry to that, retrieve the u16 and
then mask it down to the bottom ten bits since that's the size of the relevant field.
The setter is a little more complex, since it has to deal with the two possible
cases. This isn't pretty and we might be able to write some better peripheral
structs in the future, but for now, if the length we're setting is 62 or less,
and is divisible by two, then we put a zero in the top bit, and the number of
2-byte lumps in at bits 14:10. If it's 64 or more, we mask off the bottom bits
to check it's divisible by 32, and then put the count (minus one) of those
blocks in, instead, and set the top bit to mark it as such.
Fortunately, when we set constants, Rust's compiler manages to optimise all
this away very nicely. For a BTABLE at the bottom of the PMA, and an
initialisation statement of:
unsafe { (&*usb::pma::PMA.get()).set_rxcount(1, 64); }
then we end up with the simple instruction sequence:
80001e4: f246 001c movw r0, #24604 ; 0x601c
80001e8: f44f 4104 mov.w r1, #33792 ; 0x8400
80001ec: f2c4 0000 movt r0, #16384 ; 0x4000
80001f0: 8001 strh r1, [r0, #0]
We can decompose that into a C like *((u16*)0x4000601c) = 0x8400 and from
there we can see that it's writing to the u16 at 0x1c bytes into the CPU's view
of the PMA, which is 14 bytes into the PMA itself. Since we know we set the
BTABLE at the start of the PMA, it's 14 bytes into the BTABLE which is firmly
in the EP1 entries. Specifically it's USB_COUNT1_RX which is what we were
hoping for. To confirm this, check out page 651 of the datasheet. The
value set was 0x8400 which we can decompose into 0x8000 and 0x0400. The
first is the top bit and tells us that BL_SIZE is one, and thus the blocks
are 32 bytes long. Next, if we shift the 0x0400 right ten places, we get
the value 1 for the field NUM_BLOCK. With BL_SIZE set, that field holds the
block count minus one, so (1 + 1) multiplied by 32 gives us the 64
bytes we asked it to set as the size of the RX buffer. It has done exactly
what we hoped it would, but the compiler managed to optimise it into a single
16 bit store of a constant value to a constant location. Nice and efficient.
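The decomposition we just did by hand can be sketched as the inverse of the setter. The function name `decode_rx_count` is hypothetical; it assumes the field layout used by set_rxcount and get_rxcount above (top bit for BL_SIZE, bits 14:10 for NUM_BLOCK, bottom ten bits for the received count).

```rust
/// Split a raw USB_COUNTn_RX value into (allocated bytes, received bytes).
/// (Hypothetical helper mirroring the setter and getter shown earlier.)
pub fn decode_rx_count(raw: u16) -> (usize, usize) {
    // Top bit: block size selector (0 => 2-byte blocks, 1 => 32-byte blocks).
    let bl_size = (raw >> 15) & 1;
    // Bits 14:10: number of blocks.
    let num_block = ((raw >> 10) & 0x1f) as usize;
    let allocated = if bl_size == 1 {
        // With 32-byte blocks the field stores the count minus one.
        (num_block + 1) * 32
    } else {
        num_block * 2
    };
    // Bottom ten bits: bytes actually received, filled in by the peripheral.
    let received = (raw & 0x3ff) as usize;
    (allocated, received)
}
```

Feeding it the 0x8400 from the disassembly gives an allocation of 64 bytes with nothing received yet.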
Finally, let's look at what happens if we want to write a packet into the PMA. For now, let's assume packets come as slices of u16s because that'll make our life a little simpler:
pub fn write_buffer(&self, base: usize, buf: &[u16]) {
    for (ofs, v) in buf.iter().enumerate() {
        self.set_u16(base + (ofs * 2), *v);
    }
}
Yes, even though we're deep in no_std territory, we can still get an iterator
over the slice, and enumerate it, getting a nice iterator of (index, value)
though in this case, the value is a ref to the content of the slice, so we end
up with *v to deref it. I am sure I could get that automatically happening
but for now it's there.
Amazingly, despite using iterators, enumerators, high level for loops, function calls, etc, if we pop:
unsafe { (&*usb::pma::PMA.get()).write_buffer(0, &[0x1000, 0x2000, 0x3000]); }
into our main function and compile it, we end up with the instruction sequence:
80001e4: f246 0000 movw r0, #24576 ; 0x6000
80001e8: f44f 5180 mov.w r1, #4096 ; 0x1000
80001ec: f2c4 0000 movt r0, #16384 ; 0x4000
80001f0: 8001 strh r1, [r0, #0]
80001f2: f44f 5100 mov.w r1, #8192 ; 0x2000
80001f6: 8081 strh r1, [r0, #4]
80001f8: f44f 5140 mov.w r1, #12288 ; 0x3000
80001fc: 8101 strh r1, [r0, #8]
which, as you can see, ends up being three sequential halfword stores directly to the right locations in the CPU's view of the PMA. You have to love seriously aggressive compile-time optimisation :-)
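Real packets tend to arrive as bytes rather than u16s, so at some point something has to pair them up. Here is a host-side sketch of that step. The name `pack_words` is made up, it uses `Vec` for brevity (so it is not no_std as written), and it assumes the first byte of each pair belongs in the low half of the word, which is worth checking against the datasheet before relying on it.

```rust
/// Pack a byte slice into 16-bit words, first byte in the low half.
/// An odd trailing byte is padded with zero in the high half.
/// (Hypothetical helper; the byte order is an assumption.)
pub fn pack_words(bytes: &[u8]) -> Vec<u16> {
    bytes
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] as u16;
            let hi = if pair.len() > 1 { pair[1] as u16 } else { 0 };
            lo | (hi << 8)
        })
        .collect()
}
```

On the target itself you would more likely compute each word inside the loop and hand it straight to set_u16 rather than building an intermediate buffer.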
Hopefully, by next time, we'll have layered some more pleasant routines on our PMA code, and begun a foray into the setup necessary before we can begin handling interrupts and start turning up on a USB port.