Practical Guide to Bare Metal C++

Overview

Once in a while I encounter a question whether C++ is suitable for embedded development and bare metal development in particular. There are multiple articles of how C++ is superior to C, that everything you can do in C you can do in C++ with a lot of extras, and that it should be used even with bare metal development. However, I haven’t found many practical guides or tutorials of how to use C++ superiority and boost development process compared to conventional approach of using “C” programming language. With this book I hope to explain and show examples of how to implement soft real time systems without prioritising interrupts and without any need for complex real time task scheduling. Hopefully it will help someone to get started with using C++ in embedded bare metal development.

This work is licensed under a Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Audience

The primary intended audience of this document is professional C++ developers who want to understand bare metal development a little bit better, get to know how to use their favourite programming language in an embedded environment, and probably bring their C++ skills to an “expert” level. Why professional? Because bare metal platform has lots of limitations. In most cases no exceptions and no runtime type information (RTTI) support will be available. In many cases the dynamic memory allocation will also be excluded. In order to be able to use C++ effectively you will have to have deep knowledge of existing C++ idioms, constructs and STL contents. You must know how your favourite data structures are implemented and whether it is possible to reuse them in your environment. If it is not possible to use the STL (or any other library) code “as is”, you will have to implement a reduced version of it, and it is better to know how the library developers implemented the feature and how to make it work with the constrains of your environment.

The professional embedded developers with intermediate knowledge of C++ may also find this document useful. They will probably benefit from lots of C++ insights and will have several “eureka” moments with “I didn’t know I could do that!!!” kind of thoughts.

If your C++ knowledge doesn’t go much beyond polymorphism and virtual functions, if template meta-programming doesn’t mean anything to you, probably you are not ready to use C++ in the embedded environment and this document will probably be too complex to understand.

I’d like to emphasise the fact that this is NOT a C++ tutorial. There are lots of resources on the web that teach conventional C++ with OS services, exceptions and RTTI. My personal opinion is that you have to master C++ in regular environment before using it effectively in the bare metal world.

C++ Popularity

C++ is quite popular in the embedded world of Linux-based embedded systems. However, it is not that popular in bare metal development. Why? Probably because of its complexity. Knowing C++ syntax is not enough. To use it effectively the developer must know what Standard Template Library (STL) provides, what can and what cannot be used when developing for specific platform. STL mastery is also not enough, the developer should have some level of proficiency in template meta-programming. Although there is an opinion that templates are dangerous because of executable code bloating, I think that templates are developer’s friends, but the one must know the dangers and know how to use templates effectively. But again, it requires time and effort to get to know how to do it right.

Another reason why C++ is not used in bare metal development is that software in significant number (if not majority) of projects gets written by hardware developers, at least in its first stages just to make sure the hardware works as expected. The “C” programming language is a natural choice for them. And of course majority of hardware developers lack proficiency in software development. They may have some difficulties writing code of good quality in “C”, not to mention “C++”. After software reaches certain level of complexity it is handed over to software engineers who are not allowed to re-implement it from scratch. They are told something like: “This code almost works, just fix a couple of bugs, implement this short set of features and we’re good to go. Throwing away the existing code is a waste, we do not have time to re-implement it.”

The last reason, I think, is psychological one. People prefer to be wrong in a group than right by themselves. When majority of bare metal products being developed using “C”, it feels risky and unnatural to choose “C++”, even though the latter is better choice from the technological perspective.

Benefits of C++

The primary reason to prefer C++ over C is code reuse. Thanks to templates, it is much easier to implement generic piece of code that can be reused between projects in C++ than in C. When implementing everything from scratch, then probably using C++ instead of C won’t give any significant advantage in terms of development effort, maybe even extend it. However, once generic components have been developed, the whole development process for next projects will be much easier and faster, thanks to reuse of the former.

Contents of This Book

This document introduces several concepts that can be used in bare-metal development as well as shows how they can be implemented using features of latest (at the time of writing) C++11 standard.

The code of generic components is implemented as part of “Embedded C++ Library” project called “embxx” and can be found at https://github.com/arobenko/embxx. It has GPLv3 licence.

There is also a project that implements multiple simple bare metal applications using embxx which can run on RaspberryPi platform. The source code can be found at https://github.com/arobenko/embxx_on_rpi. It also has GPLv3 licence.

Both projects require gcc version 4.7 or higher, because of C++11 support requirement. They also use CMake as their build system. The code has been tested with following free toolchains:

GNU Tools for ARM Embedded Processors on Launchpad
Sourcery CodeBench Lite Edition

The whole document is ARM platform centric. At this moment I do not try to cover anything else.

To compile Raspberry Pi example applications in Linux environment use the following steps:

Checkout embxx_on_rpi project

> git clone https://github.com/arobenko/embxx_on_rpi.git
> cd embxx_on_rpi

Create separate build directory and cd to it

> mkdir build
> cd build

Generate makefiles

> cmake ..

Note that last parameter to cmake is relative or absolute path to the root of the source tree. Also note that embxx library will be checked out as external git submodule during this process.

Build the applications

> make

Take the generated image from <build_dir>/image/<app_name>/kernel.img

The CMake provides the following build types, which I believe are self-explanatory:

None (default)
Debug
Release
MinSizeRel
RelWithDebInfo

To specify the required build type use -DCMAKE_BUILD_TYPE=<value> option of cmake utility:

> cmake -DCMAKE_BUILD_TYPE=Release ..

If no build type is specified, the default one is None, which is similar to Debug, but without -g compilation option, i.e. no optimisations and no debugging information is generated.

It is possible to specify the cross-compilation toolchain prefix. By default arm-none-eabi- is expected, i.e. arm-none-eabi-gcc, arm-none-eabi-g++ and arm-none-eabi-as are used to compile the sources. If these utilities cannot be found in environment search paths, then you should specify the prefix passing -DCROSS_COMPILE=<prefix> option to cmake:

> cmake -DCROSS_COMPILE=/opt/arm-none-eabi-2013.05/bin/arm-none-eabi- ..

To see the commands used to compile the sources, prefix make with VERBOSE=1:

> VERBOSE=1 make

The embxx library has doxygen generated documentation. It can be found at release artifacts.

Contribution

If you have any suggestions, requests, bug fixes, spelling mistakes fixes, or maybe you feel that some things are not explained properly, please feel free to e-mail me to arobenko@gmail.com.

Reading Offline

The source code of this book is hosted on github and both PDF and HTML versions of this book can be downloaded from the release_artifacts.

Know Your Compiler Output

To successfully use C++ language and its libraries in bare metal development it is important to know what binary code compiler generates from the C++ source code. This section will lead you through the process of building simple testing applications and analysis of their binary code.

Test Applications

The embxx_on_rpi project contains several simple test application, which are intended to be used for binary code analysis only and not to be executed on the target platform. This applications reside in src/test_cpp directory. In order to properly analyse the code that compiler produces for production environment, let’s compile all the applications in Release mode:

> git clone https://github.com/arobenko/embxx_on_rpi.git
> mkdir -p <build_dir_somewhere>
> cd <build_dir_somewhere>
> cmake -DCMAKE_BUILD_TYPE=Release <path/to/embxx_on_rpi>
> VERBOSE=1 make

The listing file of every application will be <build_dir_somewhere>/src/test_cpp/<app_name>/kernel.list.

Get Simple Application Compiled

Let’s try to compile simple application of infinite loop, called test_cpp_simple.

A linker script is required to get all the generated objects successfully linked. It states what code/data sections need to be loaded at what addresses as well as defines several symbols that may be required by the sources. Here is a good manual of linker script syntax and here is the linker script I use to get applications linked for Raspberry Pi platform.

Depending on your compiler, the link may fail because some symbols are missing. For example __exidx_start and __exidx_end are needed when the application is compiled with exceptions support, or __bss_start__ and __bss_end__ may be required by standard library if it contains the code for zeroing .bss section.

Every application must have a startup code usually written in Assembler. This startup code must perform the following steps:

Write the interrupt vector table at appropriate location (usually at address 0x0000).
Set the stack pointers for every runtime mode.
Zero the .bss section
Call constructors of global (static) objects (applicable only to C++)
Call the main function.

It may happen that compiler generates some startup code for you, especially if you haven’t excluded standard library (stdlib) from compilation. To check whether this is the case, we need to analyse assembler listing of the successfully compiled and linked image binary. All the generated files for a test application will reside in <build_dir>/src/test_cpp/<app_name>. The assembler listing file will have kernel.list name.

Side note: the assembler listing can be generated using the following command:

> arm-none-eabi-objdump -D -S app_binary > app.list

Open the listing file and look for function with CRT string in it. CRT stands for “C Run-Time”. When using this compiler, the function that compiler has generated, is called _mainCRTStartup. Let’s take closer look what this function does.

00008198 <_mainCRTStartup>:

Load the address of the end of the RAM and assign its value to stack pointer (sp).

    8198:	e59f30f0 	ldr	r3, [pc, #240]	; 8290 <_mainCRTStartup+0xf8>
    819c:	e3530000 	cmp	r3, #0
    81a0:	059f30e4 	ldreq	r3, [pc, #228]	; 828c <_mainCRTStartup+0xf4>
    81a4:	e1a0d003 	mov	sp, r3

Set the value of sp for various modes, the sizes of the stacks are determined by the compiler itself.

    81a8:	e10f2000 	mrs	r2, CPSR
    81ac:	e312000f 	tst	r2, #15
    81b0:	0a000015 	beq	820c <_mainCRTStartup+0x74>
    81b4:	e321f0d1 	msr	CPSR_c, #209	; 0xd1
    81b8:	e1a0d003 	mov	sp, r3
    81bc:	e24daa01 	sub	sl, sp, #4096	; 0x1000
    81c0:	e1a0300a 	mov	r3, sl
    81c4:	e321f0d7 	msr	CPSR_c, #215	; 0xd7
    81c8:	e1a0d003 	mov	sp, r3
    81cc:	e2433a01 	sub	r3, r3, #4096	; 0x1000
    81d0:	e321f0db 	msr	CPSR_c, #219	; 0xdb
    81d4:	e1a0d003 	mov	sp, r3
    81d8:	e2433a01 	sub	r3, r3, #4096	; 0x1000
    81dc:	e321f0d2 	msr	CPSR_c, #210	; 0xd2
    81e0:	e1a0d003 	mov	sp, r3
    81e4:	e2433a02 	sub	r3, r3, #8192	; 0x2000
    81e8:	e321f0d3 	msr	CPSR_c, #211	; 0xd3
    81ec:	e1a0d003 	mov	sp, r3
    81f0:	e2433902 	sub	r3, r3, #32768	; 0x8000
    81f4:	e3c330ff 	bic	r3, r3, #255	; 0xff
    81f8:	e3c33cff 	bic	r3, r3, #65280	; 0xff00
    81fc:	e5033004 	str	r3, [r3, #-4]
    8200:	e9532000 	ldmdb	r3, {sp}^
    8204:	e38220c0 	orr	r2, r2, #192	; 0xc0
    8208:	e121f002 	msr	CPSR_c, r2
    820c:	e243a801 	sub	sl, r3, #65536	; 0x10000
    8210:	e3b01000 	movs	r1, #0
    8214:	e1a0b001 	mov	fp, r1
    8218:	e1a07001 	mov	r7, r1

Load the addresses of __bss_start__ and __bss_end__ symbols and zero all the area in between.

    821c:	e59f0078 	ldr	r0, [pc, #120]	; 829c <_mainCRTStartup+0x104>
    8220:	e59f2078 	ldr	r2, [pc, #120]	; 82a0 <_mainCRTStartup+0x108>
    8224:	e0522000 	subs	r2, r2, r0
    8228:	eb00004a 	bl	8358 <memset>

   ... Then comes some code, purpose of which is not clear

Call the __libc_init_array function provided by standard library which will initialise all the global objects. It will treat the area between __init_array_start and __init_array_end as list of pointers to initialisation functions and call them one by one.

    8278:	eb000014 	bl	82d0 <__libc_init_array>

Call the main function.

    8284:	eb000010 	bl	82cc <main>

If main function returns for some reason, call the exit function, which probably must be implemented as infinite loop or jumping back to the beginning of the startup code.

    8288:	eb000008 	bl	82b0 <exit>

Here comes local data

    828c:	00080000 	andeq	r0, r8, r0
    8290:	04008000 	streq	r8, [r0], #-0
	...
    829c:	00008458 	andeq	r8, r0, r8, asr r4
    82a0:	00008474 	andeq	r8, r0, r4, ror r4

The only missing stage in the startup process is updating the interrupt vector table. After the latter is updated properly, it is possible to call the provided _mainCRTStartup function. However, if your compiler doesn’t provide such function you have no other choice but to write the whole startup code yourself. Here is an example of such code.

Please note, that .bss section by definition contains uninitialised data that must be zeroed at startup. Even if you don’t have uninitialised variables in your code, zeroing .bss is a must have operation. This is because compiler might put variables that are explicitly initialised to 0 into the .bss for performance reasons and count on this section being zeroed at startup.

Also note, that pointers to initialisation functions of global variables reside in .init.array section. To initialise your global objects you just iterate over all entries in this section and call them one by one.

To implement the missing stage for use the following assembler instructions:

_entry:
    ldr pc,reset_handler_ptr        ;@  Processor Reset handler
    ldr pc,undefined_handler_ptr    ;@  Undefined instruction handler
    ldr pc,swi_handler_ptr          ;@  Software interrupt
    ldr pc,prefetch_handler_ptr     ;@  Prefetch/abort handler.
    ldr pc,data_handler_ptr         ;@  Data abort handler/
    ldr pc,unused_handler_ptr       ;@
    ldr pc,irq_handler_ptr          ;@  IRQ handler
    ldr pc,fiq_handler_ptr          ;@  Fast interrupt handler.

    ;@ Set the branch addresses
    reset_handler_ptr:      .word reset
    undefined_handler_ptr:  .word hang
    swi_handler_ptr:        .word hang
    prefetch_handler_ptr:   .word hang
    data_handler_ptr:       .word hang
    unused_handler_ptr:     .word hang
    irq_handler_ptr:        .word irq_handler
    fiq_handler_ptr:        .word hang

reset:
    ;@ Disable interrupts
    cpsid if

    ;@ Copy interrupt vector to its place
    ldr r0,=_entry
    mov r1,#0x0000

    ;@  Here we copy the branching instructions
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}

    ;@  Here we copy the branching addresses
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}

Please note that at interrupt vector table that resides at address 0x0000 contains branch instructions to the appropriate handlers, not just addresses of the handlers. Let’s take a closer look how these branching instructions look in our assembler listing file:

_entry:
    800c:	e59ff018 	ldr	pc, [pc, #24]	; 802c <reset_handler_ptr>
    8010:	e59ff018 	ldr	pc, [pc, #24]	; 8030 <undefined_handler_ptr>
    8014:	e59ff018 	ldr	pc, [pc, #24]	; 8034 <swi_handler_ptr>
    8018:	e59ff018 	ldr	pc, [pc, #24]	; 8038 <prefetch_handler_ptr>
    801c:	e59ff018 	ldr	pc, [pc, #24]	; 803c <data_handler_ptr>
    8020:	e59ff018 	ldr	pc, [pc, #24]	; 8040 <unused_handler_ptr>
    8024:	e59ff018 	ldr	pc, [pc, #24]	; 8044 <irq_handler_ptr>
    8028:	e59ff018 	ldr	pc, [pc, #24]	; 8048 <fiq_handler_ptr>

0000802c <reset_handler_ptr>:
    802c:	0000804c 	andeq	r8, r0, ip, asr #32

00008030 <undefined_handler_ptr>:
    8030:	000082b4 			; <UNDEFINED> instruction: 0x000082b4

00008034 <swi_handler_ptr>:
    8034:	000082b4 			; <UNDEFINED> instruction: 0x000082b4

00008038 <prefetch_handler_ptr>:
    8038:	000082b4 			; <UNDEFINED> instruction: 0x000082b4

0000803c <data_handler_ptr>:
    803c:	000082b4 			; <UNDEFINED> instruction: 0x000082b4

00008040 <unused_handler_ptr>:
    8040:	000082b4 			; <UNDEFINED> instruction: 0x000082b4

00008044 <irq_handler_ptr>:
    8044:	000082b8 			; <UNDEFINED> instruction: 0x000082b8

00008048 <fiq_handler_ptr>:
    8048:	000082b4 			; <UNDEFINED> instruction: 0x000082b4

The branching instructions load address of the interrupt function to “pc” register. However, the address of the function is stored somewhere and compiler generates access to this storage using relative offset to current “pc” register. This is the reason why we have to copy not just the branching instructions, but also the storage area where addresses of interrupt routines are stored:

    ;@ Copy interrupt vector to its place
    ldr r0,=_entry
    mov r1,#0x0000

    ;@  Here we copy the branching instructions
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}

    ;@  Here we copy the branching addresses
    ldmia r0!,{r2,r3,r4,r5,r6,r7,r8,r9}
    stmia r1!,{r2,r3,r4,r5,r6,r7,r8,r9}

Dynamic Memory Allocation

Let’s try to compile simple application that uses dynamic memory allocation. The test_cpp_vector application contains the following code:

std::vector<int> v;
static const int MaxVecSize = 256;
for (int i = 0; i < MaxVecSize; ++i) {
    v.push_back(i);
}

It may happen that linking operation will fail with multiple referenced symbols being undefined:

unwind-arm.c:(.text+0x224): undefined reference to `__exidx_end'
unwind-arm.c:(.text+0x228): undefined reference to `__exidx_start'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-abort.o): In function `abort':
abort.c:(.text.abort+0x10): undefined reference to `_exit'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-sbrkr.o): In function `_sbrk_r':
sbrkr.c:(.text._sbrk_r+0x18): undefined reference to `_sbrk'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-signalr.o): In function `_kill_r':
signalr.c:(.text._kill_r+0x1c): undefined reference to `_kill'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-signalr.o): In function `_getpid_r':
signalr.c:(.text._getpid_r+0x4): undefined reference to `_getpid'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-writer.o): In function `_write_r':
writer.c:(.text._write_r+0x20): undefined reference to `_write'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-closer.o): In function `_close_r':
closer.c:(.text._close_r+0x18): undefined reference to `_close'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-fstatr.o): In function `_fstat_r':
fstatr.c:(.text._fstat_r+0x1c): undefined reference to `_fstat'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-isattyr.o): In function `_isatty_r':
isattyr.c:(.text._isatty_r+0x18): undefined reference to `_isatty'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-lseekr.o): In function `_lseek_r':
lseekr.c:(.text._lseek_r+0x20): undefined reference to `_lseek'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-readr.o): In function `_read_r':
readr.c:(.text._read_r+0x20): undefined reference to `_read'
collect2: error: ld returned 1 exit status

The symbols __exidx_start and __exidx_end are required to indicate start and end of .ARM.exidx section. It is used for exception handling. They must be defined in the linker script:

.ARM.exidx :
{
    __exidx_start = .;
    *(.ARM.exidx* .gnu.linkonce.armexidx.*)
    __exidx_end = .;
} >RAM

The dynamic memory allocation will require implementation of _sbrk function which will be used to allocate chunks of memory for the C/C++ heap management.

All other symbols will be required to properly support exceptions which are used by C++ heap management system. Here is a good resource, that lists all the system calls, the developer may need to implement, to get the application compiled.

Now, after successful compilation, take a good look at the size of the images of two sample applications we compiled. The paths are <build_dir>/src/test_cpp/test_cpp_simple/kernel.img and <build_dir>/src/test_cpp/test_cpp_vector/kernel.img.

Side note: The image can be generated out of elf binary using the following instruction: > arm-none-eabi-objcopy <elf_executable> -O binary <binary_image_path>

You may notice that size of test_cpp_vector image is greater by approximately 100K than test_cpp_simple. It is due to C++ heap management and exceptions handling. Let’s try to see what happens to the size of the application if "C++" heap is replaced with “C” one without exceptions. You will have to override all the global C++ operators responsible for memory allocation/deallocation:

#include <cstdlib>
#include <new>

void* operator new(size_t size) noexcept
{
    return malloc(size);
}

void operator delete(void *p) noexcept
{
    free(p);
}

void* operator new[](size_t size) noexcept
{
    return operator new(size); // Same as regular new
}

void operator delete[](void *p) noexcept
{
    operator delete(p); // Same as regular delete
}

void* operator new(size_t size, std::nothrow_t) noexcept
{
    return operator new(size); // Same as regular new
}

void operator delete(void *p,  std::nothrow_t) noexcept
{
    operator delete(p); // Same as regular delete
}

void* operator new[](size_t size, std::nothrow_t) noexcept
{
    return operator new(size); // Same as regular new
}

void operator delete[](void *p,  std::nothrow_t) noexcept
{
    operator delete(p); // Same as regular delete
}

Please compile the test_cpp_vector application again, create its image and take a look at its size. It will be much closer to the size of the test_cpp_simple image. In fact, you may not even need majority of the system call functions you have implemented before. Try to remove them one by one and see whether linker still reports “undefined reference” to these symbols.

CONCLUSION: Usage of C++ heap brings a significant code size overhead. It is a good practice to override implementation of new and delete operators with usage of malloc and free when using C++ in bare metal development. Note that in this case, if memory allocation fails nullptr will be returned instead of throwing std::bad_alloc exception, so beware of third party C++ libraries that count on exception been thrown and do not check the returned value form operator new.

Excluding Usage of Dynamic Memory

The dynamic memory allocation is a core part of conventional C++. However, in some bare-metal products the usage of dynamic memory may be problematic and/or forbidden. The only way (I know of) to make to compilation fail, if dynamic memory is used, is to exclude standard library altogether. With gcc compiler it is achieved by using -nostdlib compilation option.

Excluding standard library from the compilation will remove the whole C++ run-time environment, which includes dynamic memory (heap) management and exception handling. The implication of using this compilation option will be described later in Removing Standard Library and C++ Runtime.

Exceptions

Exception handling is also a core feature of the conventional C++. However, this feature is considered to be too dangerous, because of unpredictable code execution time and too expensive (in terms of code size) for bare metal platforms. The usage of single throw statement in the source code will result in more than 120KB of extra binary code in the final binary image. Just try it yourself with your compiler and see the difference in size of the produced binary images.

It is possible to forbid usage of throw statements by providing certain options to the compiler. For GNU compiler (gcc) please use -fno-exceptions in conjunction with -fno-unwind-tables options. According to this page of gcc manual, all the throw statements are supposed to be replaced with call to abort(). Unfortunately this information seems to be outdated. The behaviour I see with my latest (at the moment of writing) gcc version 4.8 is a bit different.

When the compilation is performed with the options specified above and there is a throw statement in the code (for example throw std::runtime_error("Some error")), the compilation fails with error message:

main.cpp:34:42: error: exception handling disabled, use -fexceptions to enable
     throw std::runtime_error("Some error");

However, all the throw statements from standard library are compiled in and cause the whole exception handling support code overhead to be included in the final binary image, despite the compilation options forbidding the exceptions. The test application test_cpp_exceptions has simple code that causes the exceptions to be thrown:

std::vector<int> v;
v.at(100) = 0;

The generated code of the main function looks like this:

00015f60 <main>:
   15f60:	e92d4008 	push	{r3, lr}
   15f64:	e59f0000 	ldr	r0, [pc]	; 15f6c <main+0xc>
   15f68:	eb0000a8 	bl	16210 <_ZSt20__throw_out_of_rangePKc>
   15f6c:	00013868 	andeq	r3, r1, r8, ror #16

We also can see there are multiple exception related functions in the produced listing, such as __cxa_allocate_exception, __cxa_throw, _ZSt20__throw_out_of_rangePKc, _ZSt21__throw_bad_exceptionv, etc… The size of the binary image will also be huge (about 125KB) due to exceptions handling.

If you would like to use STL classes that may throw exceptions, such as std::string, std::vector, but refuse to pay the expensive price of extra code space for exceptions handling, you’ll have to do two things. First, make sure that exception conditions never occur in your code run, i.e. if throw statement is about to get executed, it means there is a bug in your code. Second, override the definition of all the "__throw_*" functions the compiler tries to use. In order to identify all these functions you’ll have to temporarily disable usage of standard library by passing -nostdlib compilation option to your gcc compiler. For the code example above the compilation without standard library will fail with error message:

main.cpp.o: In function `main':
main.cpp:(.text.startup+0x8): undefined reference to `std::__throw_out_of_range(char const*)'
collect2: error: ld returned 1 exit status

Let’s try to override std::__throw_out_of_range(char const*):

namespace std
{

void __throw_out_of_range(char const*)
{
    while (true) {}
}

}

This time the compilation will succeed. Let’s now compile the result code with standard library included (without using -nostdlib option) and check the binary image size. With my compiler the size is 1.3KB, which is much much better than 120KB when exception handling is used.

CONCLUSION: Excluding exception handling support is a well known and widely used practice in C++ bare metal development. Even when relevant compilation options are used (-fno-exceptions and -fno-unwind-tables in GNU compiler), there is still a need to override various __throw_* functions used by the compiler and provided by the standard library.

RTTI

Run Time Type Information is also one of the core features of conventional C++. It allows retrieval of the object type information (using typeid operator) as well as checking the inheritance hierarchy (using dynamic_cast) at run time. The RTTI is available only when there is a polymorphic behaviour, i.e. the classes have at least one virtual function.

Let’s try to analyse the generated code when RTTI is in use. The test_cpp_rtti application in embxx_on_rpi project contains the code listed below.

struct SomeClass
{
    virtual void someFunc();
};

Somewhere in *.cpp file:

void SomeClass::someFunc()
{
}

Somewhere in main function:

SomeClass someClass;
someClass.someFunc();

Let’s open the listing file and see what’s going on in there. The address of SomeClass::someFunc() seems to be 0x8300:

00008300 <_ZN9SomeClass8someFuncEv>:
    8300:	e12fff1e 	bx	lr

The virtual table for SomeClass class must be somewhere in .rodata section and contain address of SomeClass::someFunc(), i.e. it must have 0x8300 value inside:

Disassembly of section .rodata:

...
00009c10 <_ZTV9SomeClass>:
    9c10:	00000000 	andeq	r0, r0, r0
    9c14:	00009c04 	andeq	r9, r0, r4, lsl #24
    9c18:	00008300 	andeq	r8, r0, r0, lsl #6
    9c1c:	00000000 	andeq	r0, r0, r0

It is visible that compiler added some more entries to the virtual table in addition to the single virtual function we implemented. The address 0x9c04 is also located in .rodata section. It is some type related table:

00009c04 <_ZTI9SomeClass>:
    9c04:	00009c28 	andeq	r9, r0, r8, lsr #24
    9c08:	00009bf8 	strdeq	r9, [r0], -r8
    9c0c:	00000000 	andeq	r0, r0, r0

Both 0x9c28 and 0x9bf8 are addresses in .rodata* section(s). The 0x9bf8 address seems to contain some data:

00009bf8 <_ZTS9SomeClass>:
    9bf8:	6d6f5339 	stclvs	3, cr5, [pc, #-228]!	; 9b1c <strcmp+0x180>
    9bfc:	616c4365 	cmnvs	ip, r5, ror #6
    9c00:	00007373 	andeq	r7, r0, r3, ror r3

After a closer look we may decode this data to be 9SomeClass ascii string.

Address 0x9c28 is in the middle of some type related information table:

00009c20 <_ZTVN10__cxxabiv117__class_type_infoE>:
    9c20:	00000000 	andeq	r0, r0, r0
    9c24:	00009c50 	andeq	r9, r0, r0, asr ip
    9c28:	00009dc0 	andeq	r9, r0, r0, asr #27
    9c2c:	00009de4 	andeq	r9, r0, r4, ror #27
    9c30:	0000a114 	andeq	sl, r0, r4, lsl r1
    9c34:	0000a11c 	andeq	sl, r0, ip, lsl r1
    9c38:	00009e40 	andeq	r9, r0, r0, asr #28
    9c3c:	00009d48 	andeq	r9, r0, r8, asr #26
    9c40:	00009e10 	andeq	r9, r0, r0, lsl lr
    9c44:	00009e94 	muleq	r0, r4, lr
    9c48:	00009dac 	andeq	r9, r0, ip, lsr #27
    9c4c:	00000000 	andeq	r0, r0, r0

How these tables are used by the compiler is of little interest to us. What is interesting is a code size overhead. Lets check the size of the binary image. With my compiler it is a bit more than 13KB.

For some bare metal platforms it may be undesirable or even impossible to have this amount of extra binary code added to the binary image. The GNU compiler (gcc) provides an ability to disable RTTI by using -no-rtti option. Let’s check the virtual table of SomeClass class when this option is used:

Disassembly of section .rodata:

00008320 <_ZTV9SomeClass>:
	...
    8328:	00008300 	andeq	r8, r0, r0, lsl #6
    832c:	00000000 	andeq	r0, r0, r0

The virtual table looks much simpler now with single pointer to the SomeClass::someFunc() virtual function. There is no extra code size overhead needed to maintain type information. If the application above is compiled without exceptions (using -fno-exceptions and -fno-unwind-tables) as well as without RTTI support (using -no-rtti) the binary image size will be about 1.3KB which is much better.

However, if -no-rtti option is used, the compiler won’t allow usage of typeid operator as well as dynamic_cast. In this case the developer needs to come up with other solutions to differentiate between objects of different types (but having the same 'ancestor') at run time. There are multiple idioms that can be used, such as using simple C-like approach of switch-ing on some type enumerator member, or using polymorphic behaviour of the objects to perform double dispatch.

CONCLUSION: Disabling Run Time Type Information (RTTI) in addition to eliminating exception handling is very common in bare metal C++ development. It allows to save about 10KB of space overhead in final binary image.

Removing Standard Library and C++ Runtime

Due to platform RAM/ROM limitations it may be required to exclude not just support for exceptions and RTTI (compiling with -fno-exceptions -fno-unwind-tables -fno-rtti), but for dynamic memory allocation too. The latter includes passing -nostdlib option to the compiler. In case when standard library is excluded, there is no startup code help provided by the compiler, the developer will have to implement all the startup stages:

updating the interrupt vector table
setting up correct stack pointers for all the modes of execution
zeroing .bss section
calling initialisation functions for global objects
calling “main” function.

Here is an example of such startup code.

There also may be a need to provide an implementation of some functions or definition of some global symbols. For example, if std::copy algorithm is used to copy multiple objects from place to place, the compiler might decide to use memcpy function provided by the standard library, and as the result the build process will fail with “undefined reference” error. The same way, usage of std::fill algorithm may require memset function. Be ready to implement them when needed.

Another example is having call to std::bind function with std::placeholders::_1, std::placeholders::_2, etc. There will be a need to define these placeholders as global symbols:

#include <functional>
namespace std
{
namespace placeholders
{

decltype(std::placeholders::_1) _1;
decltype(std::placeholders::_2) _2;
decltype(std::placeholders::_3) _3;
decltype(std::placeholders::_4) _4;

}  // namespace placeholders
}  // namespace std

Even if there is a need for the standard library in the product being developed, it may be a good exercise as well as good debugging technique to temporarily exclude it from the compilation. The compilation will probably fail in the linking stage. The list of missing symbols and/or functions will provide a good indication of what missing functionality is provided by the library. The developer may notice that some components still require exceptions handling, for example, resulting int the binary image being too big.

Static Objects

Let’s analyse the code that initialises static objects. test_cpp_statics is a simple application that has two static objects, one is in the global scope, the other is in the function scope.

class SomeObj
{
public:
   static SomeObj& instanceGlobal();
   static SomeObj& instanceLocal();

private:
    SomeObj(int v1, int v2);
    int m_v1;
    int m_v2;

    static SomeObj globalObj;
};

SomeObj SomeObj::globalObj(1, 2);

SomeObj& SomeObj::instanceGlobal()
{
    return globalObj;
}

SomeObj& SomeObj::instanceLocal()
{
    static SomeObj localObj(3, 4);
    return localObj;
}

int main(int argc, const char** argv)
{
    static_cast<void>(argc);
    static_cast<void>(argv);

    auto& glob = SomeObj::instanceGlobal();
    auto& local = SomeObj::instanceLocal();
    static_cast<void>(glob);
    static_cast<void>(local);

    while (true) {};
    return 0;
}

Note, that compiler will try to inline the code above if implemented in the same file. To properly analyse the code that initialises global variables, you should put implementation of constructor and instanceGlobal()/instanceLocal() functions into separate files. If -nostdlib option is passed to the compiler to exclude linking with standard library, the compilation of the code above will fail with following error:

main.cpp:(.text.startup+0x1c): undefined reference to `__cxa_guard_acquire'
main.cpp:(.text.startup+0x3c): undefined reference to `__cxa_guard_release'

It means that compiler attempts to make static variables initialisation thread-safe. The get it compiled you have to either implement the locking functionality yourself or allow compiler to do it in an unsafe way by adding -fno-threadsafe-statics compilation option. I think it is quite safe to use this option in the bare-metal development if you make sure the statics are not accessed in the interrupt context or have been initialised at the beginning of main() function before any interrupts are enabled. To grab a reference to such object without any use is enough:

    auto& local = SomeObj::instanceLocal();
    static_cast<void>(local);

Now, let’s analyse the initialisation of globalObj. The .init.array section contains pointer to initialisation function _GLOBAL__sub_I__ZN7SomeObj9globalObjE.

Disassembly of section .init.array:

00008180 <__init_array_start>:
    8180:	00008154 	andeq	r8, r0, r4, asr r1

The initialisation function loads the address of the object and passes it to the constructor of SomeObj together with the initialisation parameters (“1” and “2” integer values).

00008154 <_GLOBAL__sub_I__ZN7SomeObj9globalObjE>:
    8154:	e59f0008 	ldr	r0, [pc, #8]	; 8164 <_GLOBAL__sub_I__ZN7SomeObj9globalObjE+0x10>
    8158:	e3a01001 	mov	r1, #1
    815c:	e3a02002 	mov	r2, #2
    8160:	eaffffee 	b	8120 <_ZN7SomeObjC1Eii>
    8164:	00008168 	andeq	r8, r0, r8, ror #2

00008168 <_ZN7SomeObj9globalObjE>:
	...

The code above loads the address of the global object (0x00008168) into r0, and initialisation parameters into r1 and r2, then invokes the constructor of SomeObj.

Please remember to call all the initialisation functions from .init.array section in your startup code before calling the main() function.

In the linker file:

    .init.array :
    {
        __init_array_start = .;
        *(.init_array)
        *(.init_array.*)
        __init_array_end = .;
    } > RAM

In the startup code:

    ;@ Call constructors of all global objects
    ldr r0, =__init_array_start
    ldr r1, =__init_array_end

globals_init_loop:
    cmp     r0,r1
    it      lt
    ldrlt   r2, [r0], #4
    blxlt   r2
    blt     globals_init_loop

    ;@ Main function
    bl main
    b reset ;@ restart if main function returns

However, if standard library is NOT excluded explicitly from the compilation, the __libc_init_array provided by the standard library may be used:

    ;@ Call constructors of all global objects
    bl	__libc_init_array

    ;@ Main function
    bl main
    b reset ;@ restart if main function returns

Let’s also perform analysis of initialisation of localObj in SomeObj::instanceLocal().

000080e4 <_ZN7SomeObj13instanceLocalEv>:
    80e4:	e92d4010 	push	{r4, lr}
    80e8:	e59f4028 	ldr	r4, [pc, #40]	; 8118 <_ZN7SomeObj13instanceLocalEv+0x34>
    80ec:	e5943008 	ldr	r3, [r4, #8]
    80f0:	e3130001 	tst	r3, #1
    80f4:	1a000005 	bne	8110 <_ZN7SomeObj13instanceLocalEv+0x2c>
    80f8:	e284000c 	add	r0, r4, #12
    80fc:	e3a01003 	mov	r1, #3
    8100:	e3a02004 	mov	r2, #4
    8104:	eb000005 	bl	8120 <_ZN7SomeObjC1Eii>
    8108:	e3a03001 	mov	r3, #1
    810c:	e5843008 	str	r3, [r4, #8]
    8110:	e59f0004 	ldr	r0, [pc, #4]	; 811c <_ZN7SomeObj13instanceLocalEv+0x38>
    8114:	e8bd8010 	pop	{r4, pc}
    8118:	00008168 	andeq	r8, r0, r8, ror #2
    811c:	00008174 	andeq	r8, r0, r4, ror r1

The code above loads the address of the flag that indicates that the object was already initialised into r4, then loads the value into r3 and checks it using tst instruction. If the flag indicates that the object wasn’t initialised, the constructor of the object is called and the flag value is updated prior to returning address of the object. Note that tst r3, #1 instruction performs binary AND between value r3 and integer value #1, then next bne instruction performs branch if result is not 0, i.e. the object was already initialised.

CONCLUSION: Access to global objects are a bit cheaper than access to local static ones, because access to the latter involves a check whether the object was already initialised.

Custom Destructors

And what about destruction of static objects with non-trivial destructors? Let’s add a destructor to the above class and try to compile:

class SomeObj
{
public:
   ~SomeObj();
    …
}

Somewhere in *.cpp file:

SomeObj::~SomeObj() {}

This time the compilation will fail with following errors:

CMakeFiles/03_test_statics.dir/SomeObj.cpp.o: In function `SomeObj::instanceLocal()':
SomeObj.cpp:(.text+0x44): undefined reference to `__aeabi_atexit'
SomeObj.cpp:(.text+0x58): undefined reference to `__dso_handle'
CMakeFiles/03_test_statics.dir/SomeObj.cpp.o: In function `_GLOBAL__sub_I__ZN7SomeObj9globalObjE':
SomeObj.cpp:(.text.startup+0x28): undefined reference to `__aeabi_atexit'
SomeObj.cpp:(.text.startup+0x34): undefined reference to `__dso_handle'

According to this document, the __aeabi_atexit function is used to register pointer to the destructor function together with pointer to the relevant static object to be destructed after main function returns. The reason for this behaviour is that these objects must be destructed in the opposite order to which they were constructed. The compiler cannot know the exact construction order for local static objects. There may even be some static objects are not constructed at all. The __dso_handle is a global pointer to the current address where the next {destructor_ptr, object_ptr} pair will be stored. The main function of most bare metal applications is not supposed to return and global/static objects will not be destructed. In this case it will be enough to implement the required function the following way:

extern "C" int __aeabi_atexit(
    void *object,
    void (*destructor)(void *),
    void *dso_handle)
{
    static_cast<void>(object);
    static_cast<void>(destructor);
    static_cast<void>(dso_handle);
    return 0;
}

void* __dso_handle = nullptr;

However, if your main function returns and then the code jumps back to the initialisation/reset routine, there is a need to properly perform destruction of global/static objects. You’ll have to allocate enough space to store all the necessary {destructor_ptr, object_ptr} pairs, then in __aeabi_atexit function store the pair in the area pointed by __dso_handle, while incrementing value of later. Note, that dso_handle parameter to the __aeabi_atexit function is actually a pointer to the global __dso_handle value. Then, when the main function returns, invoke the stored destructors in the opposite order while passing addresses of the relevant objects as their first arguments.

To verify all the stated above let’s take a look again at the generated code of initialisation function (after the destructor was added):

00008170 <_GLOBAL__sub_I__ZN7SomeObj9globalObjE>:
    8170:	e92d4010 	push	{r4, lr}
    8174:	e59f4020 	ldr	r4, [pc, #32]	; 819c <_GLOBAL__sub_I__ZN7SomeObj9globalObjE+0x2c>
    8178:	e3a01001 	mov	r1, #1
    817c:	e1a00004 	mov	r0, r4
    8180:	e3a02002 	mov	r2, #2
    8184:	ebffffeb 	bl	8138 <_ZN7SomeObjC1Eii>
    8188:	e1a00004 	mov	r0, r4
    818c:	e59f100c 	ldr	r1, [pc, #12]	; 81a0 <_GLOBAL__sub_I__ZN7SomeObj9globalObjE+0x30>
    8190:	e59f200c 	ldr	r2, [pc, #12]	; 81a4 <_GLOBAL__sub_I__ZN7SomeObj9globalObjE+0x34>
    8194:	e8bd4010 	pop	{r4, lr}
    8198:	eaffffe9 	b	8144 <__aeabi_atexit>
    819c:	000081a8 	andeq	r8, r0, r8, lsr #3
    81a0:	00008140 	andeq	r8, r0, r0, asr #2
    81a4:	000081bc 			; <UNDEFINED> instruction: 0x000081bc

00008140 <_ZN7SomeObjD1Ev>:
    8140:	e12fff1e 	bx	lr

000081bc <__dso_handle>:
    81bc:	00000000 	andeq	r0, r0, r0

Indeed, the call to the constructor immediately followed by the call to __aeabi_atexit with address of the object in r0 (first parameter), address of the destructor in r1 (second parameter) and address of __dso_handle in r2 (third parameter).

CONCLUSION: It is better to design the “main” function to contain infinite loop and never return to save the implementation of destructing global/static objects functionality.

Abstract Classes

The next thing to test is having abstract classes with pure virtual functions while excluding linkage to standard library (using -nostdlib compilation option). Below is an excerpt from test_cpp_abstract_class application.

class AbstractBase
{
public:
    virtual ~AbstractBase();
    virtual void func() = 0;
    virtual void nonOverridenFunc() final;
};

class Derived : public AbstractBase
{
public:
    virtual ~Derived();
    virtual void func() override;
};

AbstractBase::~AbstractBase()
{
}

void AbstractBase::nonOverridenFunc()
{
}

Derived::~Derived()
{
}

void Derived::func()
{
}
[source, c++]

Somewhere in the “main” function:

Derived obj;
AbstractBase* basePtr = &obj;
basePtr->func();

The compilation will fail with following errors:

CMakeFiles/04_test_abstract_class.dir/AbstractBase.cpp.o: In function `AbstractBase::~AbstractBase()':
AbstractBase.cpp:(.text+0x24): undefined reference to `operator delete(void*)'
CMakeFiles/04_test_abstract_class.dir/AbstractBase.cpp.o:(.rodata+0x10): undefined reference to `__cxa_pure_virtual'
CMakeFiles/04_test_abstract_class.dir/Derived.cpp.o: In function `Derived::~Derived()':
Derived.cpp:(.text+0x3c): undefined reference to `operator delete(void*)'

The __cxa_pure_virtual is a function, address of which compiler writes in the virtual table when the function is pure virtual. It may be called due to some unnatural pointer abuse or when trying to invoke pure virtual function in the destructor of the abstract base class. The call to this function should never happen in the normal application run. If it happens it means there is a bug. It is quite safe to implement this function with infinite loop or some way to report the error to the developer, by flashing leds for example.

extern "C" void __cxa_pure_virtual()
{
    while (true) {}
}

The requirement for operator delete(void*) is quite strange though, there is no dynamic memory allocation in the source code. It has to be investigated. Let’s stub the function and check the output of the compiler:

void operator delete(void *)
{
}

The virtual tables for the classes reside in .rodata section:

Disassembly of section .rodata:

000081a0 <_ZTV12AbstractBase>:
	...
    81a8:	000080d8 	ldrdeq	r8, [r0], -r8	; <UNPREDICTABLE>
    81ac:	000080ec 	andeq	r8, r0, ip, ror #1
    81b0:	0000815c 	andeq	r8, r0, ip, asr r1
    81b4:	000080e8 	andeq	r8, r0, r8, ror #1

000081b8 <_ZTV7Derived>:
	...
    81c0:	00008110 	andeq	r8, r0, r0, lsl r1
    81c4:	00008130 	andeq	r8, r0, r0, lsr r1
    81c8:	0000810c 	andeq	r8, r0, ip, lsl #2
    81cc:	000080e8 	andeq	r8, r0, r8, ror #1

The last entry for both classes has the address of AbstractBase::nonOverridenFunc function:

000080e8 <_ZN12AbstractBase16nonOverridenFuncEv>:
    80e8:	e12fff1e 	bx	lr

The third entry in the virtual table of Derived class has the address of Derived::func function, while the third entry in the virtual table of AbstractBase class has the address of __cxa_pure_virtual, just like expected.

0000810c <_ZN7Derived4funcEv>:
    810c:	e12fff1e 	bx	lr

0000815c <__cxa_pure_virtual>:
    815c:	eafffffe 	b	815c <__cxa_pure_virtual>

The first two entries in the virtual tables point to two different implementations of the destructor. The first entry has the address of normal destructor implementation, and the second one has an address of the second destructor implementation, that invokes operator delete (has _ZdlPv symbol) after the destruction of the object:

000080d8 <_ZN12AbstractBaseD1Ev>:
    80d8:	e59f3004 	ldr	r3, [pc, #4]	; 80e4 <_ZN12AbstractBaseD1Ev+0xc>
    80dc:	e5803000 	str	r3, [r0]
    80e0:	e12fff1e 	bx	lr
    80e4:	000081a8 	andeq	r8, r0, r8, lsr #3

000080ec <_ZN12AbstractBaseD0Ev>:
    80ec:	e59f3014 	ldr	r3, [pc, #20]	; 8108 <_ZN12AbstractBaseD0Ev+0x1c>
    80f0:	e92d4010 	push	{r4, lr}
    80f4:	e1a04000 	mov	r4, r0
    80f8:	e5803000 	str	r3, [r0]
    80fc:	eb000015 	bl	8158 <_ZdlPv>
    8100:	e1a00004 	mov	r0, r4
    8104:	e8bd8010 	pop	{r4, pc}
    8108:	000081a8 	andeq	r8, r0, r8, lsr #3

00008110 <_ZN7DerivedD1Ev>:
    8110:	e59f3014 	ldr	r3, [pc, #20]	; 812c <_ZN7DerivedD1Ev+0x1c>
    8114:	e92d4010 	push	{r4, lr}
    8118:	e1a04000 	mov	r4, r0
    811c:	e5803000 	str	r3, [r0]
    8120:	ebffffec 	bl	80d8 <_ZN12AbstractBaseD1Ev>
    8124:	e1a00004 	mov	r0, r4
    8128:	e8bd8010 	pop	{r4, pc}
    812c:	000081c0 	andeq	r8, r0, r0, asr #3

00008130 <_ZN7DerivedD0Ev>:
    8130:	e59f301c 	ldr	r3, [pc, #28]	; 8154 <_ZN7DerivedD0Ev+0x24>
    8134:	e92d4010 	push	{r4, lr}
    8138:	e1a04000 	mov	r4, r0
    813c:	e5803000 	str	r3, [r0]
    8140:	ebffffe4 	bl	80d8 <_ZN12AbstractBaseD1Ev>
    8144:	e1a00004 	mov	r0, r4
    8148:	eb000002 	bl	8158 <_ZdlPv>
    814c:	e1a00004 	mov	r0, r4
    8150:	e8bd8010 	pop	{r4, pc}
    8154:	000081c0 	andeq	r8, r0, r0, asr #3

00008158 <_ZdlPv>:
    8158:	e12fff1e 	bx	lr

It seems that when there is a virtual destructor, the compiler will have to support direct invocation of the destructor as well as usage of operator delete. In case of the former the compiler will use the first entry in the virtual table for the destructor invocation, and in case of the latter the compiler will use the second entry. Let’s try to add the following lines to our main function:

basePtr->~AbstractBase();
delete basePtr;

The compiler will add the following instructions to the main function:

    8190:	e59d3004 	ldr	r3, [sp, #4]
    8194:	e1a00004 	mov	r0, r4
    8198:	e5933000 	ldr	r3, [r3]
    819c:	e12fff33 	blx	r3
    81a0:	e59d3004 	ldr	r3, [sp, #4]
    81a4:	e1a00004 	mov	r0, r4
    81a8:	e5933004 	ldr	r3, [r3, #4]
    81ac:	e12fff33 	blx	r3

The address of the virtual table is written into r3, then value of r3 is overwritten with address of the destructor function to call, and the call is executed using blx instruction. The first invocation takes the address of destructor function from the first entry of virtual table, while the second invocation takes the address from second entry (offseted by #4). This is just like expected.

CONCLUSION: Having virtual destructor may require an implementation of operator delete(void*) even if there is no dynamic memory allocation.

Templates

Templates are notorious for the code bloating they produce. Some organisations explicitly forbid usage of templates in their internal C++ coding standards. However, templates is a very powerful tool, it is very difficult (if not impossible) to write generic source code, that can be reused in multiple independent projects/platforms without using templates, and without incurring any significant performance penalties. I think developers, who are afraid or not allowed to use templates, will have to implement the same concepts/modules over and over again with minor differences, which are project/platform specific. To properly master the templates we have to see the Assembler code duplication, that is generated by the compiler when templates are used. Let’s try to compile a simple application test_cpp_templates that uses templated function with different type of input parameters:

template <typename T>
void func(T startValue)
{
    for (volatile T i = startValue; i < startValue * 2; i += 1) {}
    for (volatile T i = startValue; i < startValue * 2; i += 2) {}
    for (volatile T i = startValue; i < startValue * 2; i += 3) {}
    for (volatile T i = startValue; i < startValue * 2; i += 4) {}
    for (volatile T i = startValue; i < startValue * 2; i += 5) {}
    for (volatile T i = startValue; i < startValue * 2; i += 6) {}
}

int main(int argc, const char** argv)
{
    static_cast<void>(argc);
    static_cast<void>(argv);

    int start1 = 100;
    unsigned start2 = 200;

    func(start1);
    func(start2);

    while (true) {};
    return 0;
}

You may notice that function func is called with two parameters, one of type int the other of type unsigned. These types have both the same size and should generate more or less identical code. Let’s take a look at the generated code of main function:

00008504 <main>:
    8504:	e92d4008 	push	{r3, lr}
    8508:	e3a00064 	mov	r0, #100	; 0x64
    850c:	ebfffefc 	bl	8104 <_Z4funcIiEvT_>
    8510:	e3a000c8 	mov	r0, #200	; 0xc8
    8514:	ebffff3a 	bl	8204 <_Z4funcIjEvT_>
    ...

Yes, indeed, there are two calls to two different functions. However, the assembler code of these functions is almost identical. Let’s also try to reuse the same function with the same types but from different source file:

void other()
{
    int start1 = 300;
    unsigned start2 = 500;

    func(start1);
    func(start2);
}

The generated code is:

000080d8 <_Z5otherv>:
    80d8:	e92d4008 	push	{r3, lr}
    80dc:	e3a00f4b 	mov	r0, #300	; 0x12c
    80e0:	eb000007 	bl	8104 <_Z4funcIiEvT_>
    80e4:	e3a00f7d 	mov	r0, #500	; 0x1f4
    80e8:	eb000045 	bl	8204 <_Z4funcIjEvT_>
    80ec:	e8bd8008 	pop	{r3, pc}

We see that the same functions at the same addresses are called, i. e. the linker does its job of removing duplicates of the same functions from different object files.

Let’s also try to wrap the same function with a class and add one more template argument:

template <typename T, std::size_t TDummy>
struct SomeTemplateClass
{
    static void func(T startValue)
    {
        for (volatile T i = startValue; i < startValue * 2; i += 1) {}
        for (volatile T i = startValue; i < startValue * 2; i += 2) {}
        for (volatile T i = startValue; i < startValue * 2; i += 3) {}
        for (volatile T i = startValue; i < startValue * 2; i += 4) {}
        for (volatile T i = startValue; i < startValue * 2; i += 5) {}
        for (volatile T i = startValue; i < startValue * 2; i += 6) {}
    }
};

Please note the dummy template parameter TDummy that is not used. Now, we add two more calls to the main function:

int main(int argc, const char** argv)
{
    ...
    SomeTemplateClass<int, 5>::func(500);
    SomeTemplateClass<int, 10>::func(500);

    while (true) {};
    return 0;
}

Note, that the functionality of the calls is identical. The only difference is the dummy template argument. Let’s take a look at the generated code:

00008504 <main>:
    ...
    8518:	e3a00f7d 	mov	r0, #500	; 0x1f4
    851c:	ebffff78 	bl	8304 <_ZN17SomeTemplateClassIiLj5EE4funcEi>
    8520:	e3a00f7d 	mov	r0, #500	; 0x1f4
    8524:	ebffffb6 	bl	8404 <_ZN17SomeTemplateClassIiLj10EE4funcEi>
    8528:	eafffffe 	b	8528 <main+0x24>

The compiler generated calls to two different functions, binary code of which is identical.

CONCLUSION: The templates indeed require extra care and consideration. It is also important not to overthink things. The well known notion of “Do not do premature optimisations. It is much easier to make correct code faster, than fast code correct.” is also applicable to code size. Do not try to optimise your template code before the need arises. Make it work and work correctly first.

Tag Dispatching

The tag dispatching is a widely used idiom in C++ development. It used extensively in the following chapters of this book.

Let’s try to compile test_cpp_tag_dispatch application in embxx_on_rpi project and take a look at the code generated by the compiler.

struct Tag1 {};
struct Tag2 {};

class Dispatcher
{
public:

    template <typename TTag>
    static void func()
    {
        funcInternal(TTag());
    }

private:
    static void funcInternal(Tag1 tag);
    static void funcInternal(Tag2 tag);

};

Somewhere in the main function:

Dispatcher::func<Tag1>();
Dispatcher::func<Tag2>();

The code generated by the compiler looks like this:

000080fc <main>:
    80fc:	e92d4008 	push	{r3, lr}
    8100:	e3a00000 	mov	r0, #0
    8104:	ebfffff3 	bl	80d8 <_ZN10Dispatcher12funcInternalE4Tag1>
    8108:	e3a00000 	mov	r0, #0
    810c:	ebfffff2 	bl	80dc <_ZN10Dispatcher12funcInternalE4Tag2>
    ...

Although the Tag1 and Tag2 are empty classes, the compiler still uses integer value 0 as a first parameter to the function.

Let’s try to optimise this redundant mov r0, #0 instruction away by making it visible to the compiler that the tag parameter is not used:

class Dispatcher
{
public:

    template <typename TTag>
    static void otherFunc()
    {
        otherFuncInternal(TTag());
    }

private:

    static void otherFuncInternal(Tag1 tag)
    {
        static_cast<void>(tag);
        otherFuncTag1();
    }

    static void otherFuncInternal(Tag2 tag)
    {
        static_cast<void>(tag);
        otherFuncTag2();
    }


    static void otherFuncTag1();
    static void otherFuncTag2();
};

Somewhere in the main function:

Dispatcher::otherFunc<Tag1>();
Dispatcher::otherFunc<Tag2>();

The code generated by the compiler looks like this:

000080fc <main>:
    ...
    8110:	ebfffff2 	bl	80e0 <_ZN10Dispatcher13otherFuncTag1Ev>
    8114:	ebfffff2 	bl	80e4 <_ZN10Dispatcher13otherFuncTag2Ev>

In this case the compiler optimises away the tag parameter.

Based on the above we may make a CONCLUSION: When the tag dispatching idiom is used, the function that receives a dummy (tag) parameter should be a simple inline wrapper around other function that implements the required functionality. In this case the compiler will optimise away the creation of tag object and will call the wrapped function directly.

Basic Needs

Prior to describing various embedded (bare metal) development concepts I’d like to cover several basic needs that, I think, most developers will have to use in their products.

Assertion

One of the basic needs during the development is having an ability to test various assumptions and invariants in runtime when compiling the application in DEBUG mode and remove the checks when compiling the application in RELEASE mode. The standard C++ reuses assert() macro from standard C library.

#include <cassert>
…
assert(some_condition);

The assert() macro evaluates to nothing in case NDEBUG symbol is defined, otherwise it evaluates the condition. If the condition doesn’t return true, it calls the __assert_fail function, provided by standard library, which in turn calls printf to print error message to standard output followed by the call to abort function, which is supposed to terminate an application.

Both printf and abort functions are provided by standard library. However, printf will require the implementation of _write function to print characters to the debug output terminal, and abort will require implementation of _exit function to terminate the application.

If standard library is excluded from the compilation (using -nostdlib compilation option), the compilation will fail with undefined reference to __assert_func error message. The developer will have to implement this function with correct signature. To retrieve the correct signature you will have to open assert.h standard header provided by your compiler. It will be something like this:

void __assert_fail (const char *expr, const char *file, unsigned int line, const char *function) __attribute__ ((__noreturn__));

The attribute specifies that this function doesn’t return, so the compiler will generate a call to it without setting any address to return to.

The conclusion from all the stated above is that using standard assert() macro is possible, but somewhat inflexible. It is possible to access only global variables from the functions described above, i.e. if there is a need to flash a led to indicate assertion failure, then its control must be accessible through global variables, which is a bit ugly. Another disadvantage of this approach is that there are no convenient means to change the behaviour of the assert failure functionality and after a while restore the original behaviour. Such behaviour may be helpful to better identify the location of the assert that has failed. For example, override the default assert failure behaviour with activating a specific led at the entrance of some function, and restore the original assertion failure behaviour when function returns.

Below is a short description of a better way to handle assert checks and failures. The code is in embxx library and can be reviewed here.

To resolve the problems described above and to handle the assertions C++ way we will have to create generic assertion failure handling abstract class:

class Assert
{
public:
    virtual void fail(
        const char* expr,
        const char* file,
        unsigned int line,
        const char* function) = 0;
};

When implementing custom project specific assertion failure behaviour inherit from the class above:

#include "embxx/util/Assert.h"

typedef ... Led;
class LedOnAssert : public embxx::util::Assert
{
public:

    LedOnAssert(Led& led)
        : led_(led)
    {
    }

    virtual void fail(
        const char* expr,
        const char* file,
        unsigned int line,
        const char* function)
    {
        led_.on();
        while (true) {;}
    }

private:
    Led& led_;
};

To manage an object of the class above, we will have to create a singleton class with static instance. It will store a pointer to the currently registered assertion failure behaviour:

class AssertManager
{
public:
   static AssertManager& instance()
    {
        static AssertManager mgr;
        return mgr;
    }

    Assert* reset(Assert* newAssert = nullptr)
    {
        auto prevAssert = assert_;
        assert_ = newAssert;
        return prevAssert;
    }

    Assert* getAssert()
    {
        return assert_;
    }

    bool hasAssertRegistered() const
    {
        return assert_ != nullptr;
    }

    void infiniteLoop()
    {
        while (true) {};
    }

private:
    AssertManager() : assert_(nullptr) {}

    Assert* assert_;
};

The reset member function registers new object that manages assertion failure behaviour and returns previous one, which can be used later to restore original behaviour.

We will require a new macro to check assertion condition and invoke registered failing behaviour:

#ifndef NDEBUG

#define GASSERT(expr) \
    ((expr)                               \
      ? static_cast<void>(0)                     \
      : (embxx::util::AssertManager::instance().hasAssertRegistered() \
            ? embxx::util::AssertManager::instance().getAssert()->fail( \
                #expr, __FILE__, __LINE__, GASSERT_FUNCTION_STR) \
            : embxx::util::AssertManager::instance().infiniteLoop()))

#else // #ifndef NDEBUG

#define GASSERT(expr) static_cast<void>(0)

#endif // #ifndef NDEBUG

Then in case of condition check failure, the GASSERT() macro checks whether any custom assertion failure functionality registered and invokes its virtual fail function. If not, then infinite loop is executed.

To complete the whole picture we have to provide a convenient way to register new assertion failure behaviours:

template < typename TAssert>
class EnableAssert
{
    static_assert(std::is_base_of<Assert, TAssert>::value,
        "TAssert class must be derived class of Assert");
public:
    typedef TAssert AssertType;

    template<typename... Params>
    EnableAssert(Params&&... args)
        : assert_(std::forward<Params>(args)...),
          prevAssert_(AssertManager::instance().reset(&assert_))
    {
    }

    ~EnableAssert()
    {
        AssertManager::instance().reset(prevAssert_);
    }

private:
    AssertType assert_;
    Assert* prevAssert_;
};

From now on, all we have do is to instantiate object of EnableAssert with the behaviour that we want. Note that constructor of EnableAssert class can receive any number of parameters and forwards them to the constructor of the internal assert_ object.

int main (int argc, const char* argv[])
{
    ...
    Led led;
    embxx::util::EnableAssert<LedOnAssert> assertion(led);

    ... // Rest of the code
}

If there is a need to temporarily override the previous assertion failure behaviour, just create another EnableAssert object. Once the latter is out of scope (the object is destructed), previous behaviour will be restored.

int main (int argc, const char* argv[])
{
    ...
    Led led;
    embxx::util::EnableAssert<LedOnAssert> assertion(led);

    ...
    {
        embxx::util::EnableAssert<OtherAssert> otherAssertion(.../* some params */);
        ...
    }  // restore previous registered behaviour – LedOnAssert.
}

SUMMARY: The approach described above provides a flexible and convenient way to control how the failures of various debug mode checks are reported to the developer. All the modules in embxx library use the GASSERT() macro to verify their pre- and post-conditions as well as internal assumptions.

Callback

As has been mentioned in the Benefits of C++ chapter, the main reason for choosing C++ over C is code reuse. When having some generic piece of code that tries to use platform specific code and needs to receive some kind of notifications from the latter, the need for some generic callback facility arises. C++ provides std::function class for this purpose, it is possible to provide any callable object, such as lambda function or std::bind expression:

class LowLevelPeripheral {
public:
    template <typename TFunc>
    void setEventCallback(TFunc&& func)
    {
        eventCallback_ = std::forward<TFunc>(func);
    }

    void eventHandler()
    {
        if (eventCallback_) {
            eventCallback_(); // invoke registered callback object
        }
    }
private:
    std::function<void ()> eventCallback_;
};

class SomeGenericControl
{
public:
    SomeGenericControl()
    {
        periph_.setEventCallback(
            std::bind(&SomeGenericControl::eventCallbackHandler, this));
    }

    void eventCallbackHandler()
    {
        … // Handle the reported event.
    }

private:
     LowLevelPeripheral periph_;
};

There are two problems with using std::function. It uses dynamic memory allocation and throws exception in case the function is invoked without assigning callable object to it first. As a result std::function may be not suitable for use in most of the bare metal projects. We will have to implement something similar, but without dynamic memory allocations and without exceptions. Below is some short explanation of how to implement such a function class. The implementation of the StaticFunction class is part of embxx library and its full code listing can be viewed here.

The restriction of inability to use dynamic memory allocation requires to use additional parameter of storage size:

template <typename TSignature, std::size_t TSize = sizeof(void*) * 3>
class StaticFunction;

It seems that in most cases the callback object will contain pointer to member function, pointer to handling object and some additional single parameter. This is the reason for specifying the default storage space as equal to the size of 3 pointers. The “signature” template parameter is exactly the same as with std::function plus an optional storage area size template parameter:

typedef embxx::util::StaticFunction<void (int)> MyCallback;
typedef embxx::util::StaticFunction<
    void (int, int), sizeof(void*) * 4> MyOtherCallback;

To properly implement operator(), there is a need to split the signature into the return type and rest of parameters. To achieve this the following template specialisation trick is used:

template <std::size_t TSize, typename TRet, typename... TArgs>
class StaticFunction<TRet (TArgs...), TSize>
{
public:
    ...
    TRet operator()(TArgs... args) const {...}
    ...
private:
    typedef … StorageType; // Type of the storage area,
                           // will be explained later.
    StorageType handler_; // Storage area where the callback object
                          // is stored
    bool valid_; // flag indicating whether storage are contains
                 // valid callback, initialised to false in
                 // default constructor
};

The StaticFunction object needs an ability to store any type of callable object as its internal data member and then invoke it in its operator() member function. To support this functionality we will require additional helper classes:

class StaticFunction<TRet (TArgs...), TSize>
{
    ...
private:

    class Invoker
    {
    public:
        virtual ~Invoker() {}

        // virtual invocation function
        virtual TRet exec(TArgs... args) const = 0;
    };


    template <typename TBound>
    class InvokerBound : public Invoker
    {
    public:

        template <typename TFunc>
        InvokerBound(TFunc&& func)
            : func_(std::forward<TFunc>(func))
        {
        }

        virtual ~InvokerBound() {}

        virtual TRet exec(TArgs... args) const
        {
            return func_(std::forward<TArgs>(args)...);
        }

    private:
        TBound func_;
    };

    ...
};

The callable object that will be stored in handler_ data area and it will be of type InvokerBound<…> while invoked through interface of its base class Invoker.

There is a need to properly define StorageType for the handler_ data member:

static const std::size_t StorageAreaSize = TSize + sizeof(Invoker);
typedef typename
    std::aligned_storage<
        StorageAreaSize,
        std::alignment_of<Invoker>::value
   >::type StorageType;

Note that StorageType is an uninitialised storage with alignment required to be able to store object of type Invoker. The InvokerBound<…> class will have the same alignment requirements as its base class Invoker, so it is safe to store any object of type InvokerBound<…> in the same area, as long as its size doesn’t exceed the size of the StorageType.

Also note that the actual size of the storage area is the requested TSize plus the area required to store the object of Invoker class. The size of InvokerBound<…> object is size of its private member plus the size of its base class Invoker, which will contain a single (hidden) pointer to its virtual table.

Any callable object may be assigned to StaticFunction using either constructor or assignment operator:

template <std::size_t TSize, typename TRet, typename... TArgs>
class StaticFunction<TRet (TArgs...), TSize>
{
public:
    ...

    template <typename TFunc>
    StaticFunction(TFunc&& func)
        : valid_(true)
    {
        assignHandler(std::forward<TFunc>(func));
    }

    StaticFunction& operator=(TFunc&& func)
    {
        destroyHandler();
        assignHandler(std::forward<TFunc>(func));
        valid_ = true;
        return *this;
    }

    ...

private:
    template <typename TFunc>
    void assignHandler(TFunc&& func)
    {
        typedef typename std::decay<TFunc>::type DecayedFuncType;
        typedef InvokerBound<DecayedFuncType> InvokerBoundType;

        static_assert(sizeof(InvokerBoundType) <= StorageAreaSize,
            "Increase the TSize template argument of the StaticFucntion");

        static_assert(alignof(Invoker) == alignof(InvokerBoundType),
            "Alignment requirement for Invoker object must be the same "
            "as alignment requirement for InvokerBoundType type object");

        new (&handler_) InvokerBoundType(std::forward<TFunc>(func));
    }

    void destroyHandler()
    {
        if (valid_) {
            auto invoker = reinterpret_cast<Invoker*>(&handler_);
            invoker->~Invoker();
        }
    }
};

Please pay attention that assignment operator has to call the destructor of previous function, that was assigned to it, before storing a new callable object in its place.

Also note that there are compile time checks using static_assert that the size of the object to store in the storage area doesn’t exceed the allocated size as well as alignment requirements still hold.

The invocation of the function will be implemented like this:

template <std::size_t TSize, typename TRet, typename... TArgs>
class StaticFunction<TRet (TArgs...), TSize>
{
public:
    ...
    TRet operator()(TArgs... args) const
    {
        GASSERT(valid_);
        auto invoker = reinterpret_cast<Invoker*>(&handler_);
        return invoker->exec(std::forward<TArgs>(args)...);
    }
    ...
};

Note that there are no exceptions in use and then the “must have” pre-condition for function invocation is that a valid callable object has been assigned to it. That is the reason for assertion check in the body of the function.

To complete the implementation of StaticFunction class the following logic must also be implemented:

Check whether the StaticFunction object is valid, i.e has any callable object assigned to it.
Default construction - the function is invalid and cannot be invoked.
Copy/move construction + copy/move assignment functionality.
Clearing the function (invalidating).
Supporting both const and non-const operator() in the assigned callable object. It requires both const and non-const operator() implementation of StaticFunction as well as its internal Invoker and InvokerBound<…> classes.

All this I leave as an exercise to to the reader. To see the complete implementation of the functionality described above open this link.

Data Serialisation

Another essential need in embedded development is an ability to serialise data. Most embedded products read data from some kind of sensors and/or communicate with the control centre via some wired or wireless serial interface.

Before data is sent via a communication link, it must be serialised into a buffer, and when received, deserialised from bytes also in a different buffer on the other end. The data may be serialised using big or little endian, based on the communication protocol used. The embxx library provides a generic code with an ability to read and write integral values from/to any buffer. Here is the source code for the functions described below.

The functions below (defined in namespace embxx::io) support read and write of an integral value using any type of iterator:

template <typename T, typename TIter>
void writeBig(T value, TIter& iter);

template <typename T, typename TIter>
T readBig(TIter& iter);

template <typename T, typename TIter>
void writeLittle(T value, TIter& iter);

template <typename T, typename TIter>
T readLittle(TIter& iter);

These functions receive reference to iterator of a buffer/container. When bytes are read/written from/to the buffer, the iterator is incremented. The iterator can be of any type as long as it supports dereferencing (operator*()), pre-increment (operator++) and assignment to dereferenced object. For example, serialising several values of various lengths into the array using big endian:

std::uint8_t buf[128];
auto iter = &buf[0];

std::uint16_t value1 = 0x0102;
std::uint32_t value2 = 0x03040506;
std::uint64_t value3 = 0x0708090a0b0c0d0e;

embxx::io::writeBig(value1, iter);
embxx::io::writeBig(value2, iter);
embxx::io::writeBig(value3, iter);

The contents of the buffer will be: {0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b, 0x0c 0x0d, 0x0e, …}

Similar code of reading values from the buffer would be:

std::uint8_t buf[128];
auto iter = &buf[0];

auto value1 = embxx::io::readBig<std::uint16_t>(iter);
auto value2 = embxx::io::readBig<std::uint32_t>(iter);
auto value3 = embxx::io::readBig<std::uint64_t>(iter);

Another example is serialising data into a container that has push_back() member functions, such as std::vector or circular buffer. The data will be added at the end of the existing one:

std::vector<std::uint8_t> buf;
auto iter = std::back_inserter(buf); // Will call push_back
                                     // on assignment
…
// The writes below will use push_back for every byte.
embxx::io::writeBig(value1, iter);
embxx::io::writeBig(value2, iter);
embxx::io::writeBig(value3, iter);

Depending on a communication protocol there may be a need to serialise only part of the value. For example some field of communication protocol is defined having only 3 bytes. In this case the value will probably be stored in a variable of std::uint32_t type. There is similar set of functions, but with additional template parameter that specifies how many bytes to read/write:

template <std::size_t TSize, typename T, typename TIter>
void writeBig(T value, TIter& iter);

template <typename T, std::size_t TSize, typename TIter>
T readBig(TIter& iter);

template <std::size_t TSize, typename T, typename TIter>
void writeLittle(T value, TIter& iter);

template <typename T, std::size_t TSize, typename TIter>
T readLittle(TIter& iter);

So to read/write 3 bytes will look like the following:

auto value = embxx::io::readBig<std::uint32_t, 3>(iter);
embxx::io::writeBig<3>(value, iter);

Sometimes the endianness of data serialisation may depend on some traits class parameters. In order to be able to choose “Little” or “Big” variant functions at compile time instead of runtime the tag parameter dispatch idiom must be used.

There are similar read/write functions, but instead of being differentiated by name they have additional tag parameter to specify the endianness of serialisation:

/// Same as writeBig<T, TIter>(value, iter);
template <typename T, typename TIter>
void writeData(
    T value,
    TIter& iter,
    const traits::endian::Big& endian);

/// Same as writeBig<TSize, T, TIter>(value, iter)
template <std::size_t TSize, typename T, typename TIter>
void writeData(
    T value,
    TIter& iter,
    const traits::endian::Big& endian);

/// Same as writeLittle<T, TIter>(value, iter)
template <typename T, typename TIter>
void writeData(
    T value,
    TIter& iter,
    const traits::endian::Little& endian);

/// Same as writeLittle<TSize, T, TIter>(value, iter)
template <std::size_t TSize, typename T, typename TIter>
void writeData(
    T value,
    TIter& iter,
    const traits::endian::Little& endian);

/// Same as readBig<T, TIter>(iter)
template <typename T, typename TIter>
T readData(TIter& iter, const traits::endian::Big& endian);

/// Same as readBig<TSize, T, TIter>(iter)
template <typename T, std::size_t TSize, typename TIter>
T readData(TIter& iter, const traits::endian::Big& endian);

/// Same as readLittle<T, TIter>(iter)
template <typename T, typename TIter>
T readData(TIter& iter, const traits::endian::Little& endian);

/// Same as readLittle<TSize, T, TIter>(iter)
template <typename T, std::size_t TSize, typename TIter>
T readData(TIter& iter, const traits::endian::Little& endian);

The traits::endian::Big and traits::endian::Little are defined as empty tag classes:

namespace traits
{

namespace endian
{

struct Big {};

struct Little {};

}  // namespace endian

}  // namespace traits

For example:

template <typename TTraits>
class SomeClass
{
public:
    typedef typename TTraits::Endianness Endianness;

    template <typename TIter>
    void serialise(TIter& iter) const
    {
        embxx::io::writeData(data_, iter, Endianness());
    }

private:
    std::uint32_t data_;
};

So the code above is not aware what endianness is used to serialise the data. It is provided as internal type of Traits class named Endianness. The compiler will generate the call to appropriate writeData() function, which in turn forward it to writeBig() or writeLittle().

To serialise data using big endian the traits should be defined as following:

struct MyTraits
{
    typedef embxx::io::traits::endian::Big Endianness;
};

SomeClass<MyTraits> someClassObj;
…
someClassObj.serialise(iter); // Will serialise using big endian

The interface described above is very easy and convenient to use and quite easy to implement using straightforward approach. However, any variation of template parameters create an instantiation of new binary code which may create significant code bloat if not used carefully. Consider the following:

Read/write of signed vs unsigned integer values. The serialisation/deserialisation code is identical for both cases, but won’t be considered as such when instantiating the functions. To optimise this case, there is a need to implement read/write operations only for unsigned value, while the “signed” functions become wrappers around the former. Don’t forget a sign extension operation when retrieving partial signed value.
The read/write operations are more or less the same for any length of the values, i.e of any types: (unsigned) char, (unsigned) short, (unsigned) int, etc… To optimise this case, there is a need for internal function that receives length of serialised value as a run time parameter, while the functions described above are mere wrappers around it.
Usage of the iterators also require caution. For example reading values may be performed using regular iterator as well as const_iterator, i.e. iterator pointing to const values. These are two different iterator types that will duplicate the “read” functionality if both of them are used:

char buf[128] = {…};
const char* iter1 = &buf[0];
char* iter2 = &buf[0];

// Instantiation 1
auto value1 = embxx::io::readBig<std::uint16_t>(iter1);

// Instantiation 2
auto value2 = embxx::io::readBig<std::uint16_t>(iter2);

It is possible to optimise the case above for random access iterator by using temporary pointers to unsigned characters to read the required value. After retrieval is complete, just increment the value of the passed iterator with number of characters read.

All the consideration points stated above require quite complex implementation of the serialisation/deserialisation functionality with multiple levels of abstraction which is beyond the scope of this book. It would be a nice exercise to try and implement it yourself. Another option is to use the code as is from embxx library.

Static (Fixed Size) Queue

There is almost always a need to have some kind of a queuing functionality. A circular buffer is a good compromise between speed of execution and memory consumption (vs std::deque for example). If your product allows usage of dynamic memory allocation and/or exceptions than boost::circular_buffer can be a good choice. However, if using dynamic memory allocation is not an option, then there is no other choice but to implement a circular buffer with maximum length known at compile time over C array or std::array. Here is the implementation of StaticQueue functionality from embxx library. I won’t go into too much details or explain every line of code. Instead I will emphasise several important points that must be taken into consideration.

Invalid operations

There can always be an attempt to perform an invalid operation, such as access an element outside the queue boundaries, or inserting new element when the queue is full, or popping an element when queue is empty, etc… The conventional way in C++ to handle these cases is to throw an exception. However, in embedded and especially in bare metal programming it’s not an option. The right way to handle these errors would be asserting on pre-conditions. The StaticQueue implementation in embxx library uses GASSERT() macro described earlier. The checks will be compiled only in non-Release mode (NDEBUG not defined) and in case of the failure it will invoke the project specific code the developer has written to report assertion failure.

template <typename T, std::size_t TSize>
class StaticQueue
{
public:
    ...
    void popFront()
    {
        GASSERT(!empty());
        ...
    }
};

Construction/Destruction of the elements

When the queue is created it doesn’t contain any elements. However, it must contain uninitialised space where elements can be created in the future. The space must be of sufficient size and be properly aligned.

template <typename T, std::size_t TSize>
class StaticQueue
{
public:
    typedef T ValueType;
    ...
private:
    typedef
        typename std::aligned_storage<
            sizeof(ValueType),
            std::alignment_of<ValueType>::value
        >::type StorageType;

    typedef std::array<StorageType, TSize> ArrayType;

    ArrayType array_;
    ...
};

When adding a new element to the queue, the “in-place” construction must be performed:

template <typename T, std::size_t TSize>
class StaticQueue
{
public:
    ...
    typedef T ValueType;
    ...

    template <typename U>
    void pushBack(U&& newElem)
    {
        auto* spacePtr = ...; // get pointer to the right place
        new (spacePtr) ValueType(std::forward<U>(newElem));
        ...
    }
};

When an element removed from the queue, explicit destruction must be performed:

template <typename T, std::size_t TSize>
class StaticQueue
{
public:
    ...
    typedef T ValueType;
    ...
   void popBack()
    {
        auto* spacePtr = ...; // get pointer to the right place
        auto* elemPtr = reinterpret_cast<ValueType*>(spacePtr);
        elemPtr->~T(); // call the destructor;
        ...
    }
};

Iteration

There is often a need to iterate over the elements of the queue. The standard sequential random access containers such as std::array, std::vector or std::deque may use a simple pointer (or a wrapper class around it) as iterator because address of every element is greater than address of its predecessor. Incrementing a pointer during the iteration would be enough to get an access to the next element. However, in circular queue/buffer there may be a case when address of the beginning of the queue is greater than address of the end of the queue:

In this case having a simple pointer as iterator is not enough. There is a need to check a wrap-around case when incrementing an iterator. However always using this kind of iterator may incur undesired performance penalties. That is when “leniarisation” concept pops up. When the queue is linearised, address of every element is greater than the address of its predecessor and simple pointer (linearised iterator) may be used to iterate over all the elements in the queue:

When the queue is not linearised, it either must be linearised (may be a bit expensive, depending on the size of the queue) or iterate over all the elements in two stages: first on the first (top) part, then on the second (bottom) part. The StaticQueue implementation in embxx library provides two functions arrayOne() and arrayTwo() that return these two ranges.

However, there may be a need to read/write data from/to the queue without worrying about the wrap-around case. Good example of such case would be having such circular queue/buffer to contain data read from some communication interface, such as serial port, and there is a need to deserialise 4 byte value from this buffer. The most convenient way would be to use embxx::io::readBig<4>(iter) described previously. To properly support this case we will need to have a bit more expensive iterator that properly handles wrap-around when incremented and/or dereferenced. This is the reason for having two types of iterators for StaticQueue: LinearisedIterator and Iterator. The former is a simple typedef for a pointer which can be used only on the linearised part of the queue and the latter may be used when iterating without any knowledge whether there is a wrap-around case during the iteration.

When defining a new custom iterator class, there is a need to properly support std::iterator_traits for it. The traits are used to implement functions such as std::advance or std::distance). The requirement is to define the following internal types:

template <typename T, std::size_t TSize>
class StaticQueue
{
public:
    class Iterator
    {
    public:
        typedef std::random_access_iterator_tag iterator_category;
        typedef T value_type;
        typedef T* pointer;
        typedef T& reference;
        typedef typename std::iterator_traits<pointer>::difference_type difference_type;
        ...
    };

    ...
};

Copying queues

Care must be taken when copying/moving elements between the queues. The compiler is not aware of the right type of the elements that are stored in the queue as well as number of valid elements in the queue is unknown at compile time. When using default copy/move constructor and/or assignment operator the compiler will generate a code that copies raw bytes in the storage space between the queues. It may work for the basic type or POD structs, but it is not the right way to do the copying. There is a need to use copy/move constructors in case of constructions or copy/move assignment operator in case of assignment of the valid elements and not copy/move garbage data from unused space.

In addition to regular copy/move constructors and assignment operators, there may also be a need to provide copy/move construction and/or copy/move assignment from the queue that contains elements of the same type, but has different capacity:

template <typename T, std::size_t TSize>
class StaticQueue
{
public:
    ...

    template <std::size_t TAnySize>
    StaticQueue(const StaticQueue<T, TAnySize>& queue)
        : Base(&array_[0], TSize)
    {
        ... // Copy all the elements from other queue
    }

    template <std::size_t TAnySize>
    StaticQueue(StaticQueue<T, TAnySize>&& queue)
        : Base(&array_[0], TSize)
    {
        ... // Move all the elements from other queue
    }

    template <std::size_t TAnySize>
    StaticQueue& operator=(const StaticQueue<T, TAnySize>& queue)
    {
        ... // Copy all the elements from other queueu
    }

    template <std::size_t TAnySize>
    StaticQueue& operator=(StaticQueue<T, TAnySize>&& queue)
    {
        ... // Move all the elements from other queue
    }
    ...
};

Optimising code generation

As we all know and confirmed in Templates chapter, any difference in the value of template parameter will create new instantiation of executable code. It means that having multiple queues of the same type, but different sizes may bloat the executable in an unacceptable way. The best way to solve this problem would be defining a base class that is templated only on the type of the stored values and implements the whole logic of the queue while the derived StaticQueue class will just provide the necessary storage area and reuse (wrap) all the functions implemented in the base class:

namespace details
{

template <typename T>
class StaticQueueBase
{
protected:
    typedef T ValueType;
    typedef
        typename std::aligned_storage<
            sizeof(ValueType),
            std::alignment_of<ValueType>::value
        >::type StorageType;
    typedef StorageType* StorageTypePtr;

    StaticQueueBase(StorageTypePtr data, std::size_t capacity)
        : data_(data),
          capacity_(capacity),
          startIdx_(0),
          count_(0)
    {
    }

    template <typename U>
    void pushBack(U&& value) {...}

    ... // All other API functions

private:
    StorageTypePtr data_; // Pointer to storage area
    std::size_t capacity_; // Capacity of the storage area
    std::size_t startIdx_; // Index of the beginning of the queue
    std::size_t count_; // Number of elements in the queue
};

} // namespace details

template <typename T, std::size_t TSize>
class StaticQueue : public details::StaticQueueBase<T>
{
    typedef details::StaticQueueBaseOptimised<T> Base;
    typedef typename Base::StorageType StorageType;

public:
   StaticQueue()
        : Base(&array_[0], TSize)
    {
    }

    template <typename U>
    void pushBack(U&& value)
    {
        Base::pushBack(std::forward<U>(value));
    }

    ... // Wrap all other API functions

private:
    typedef std::array<StorageType, TSize> ArrayType;
    ArrayType array_;
};

There are ways to optimise even more. Let’s take queues of int and unsigned values for example. They have the same size and from the queue implementation perspective there is no difference in handling them, so it would be a waste of code space to allow the instantiation of the same binary code for the queue to handle both of these types. Using template specialisation tricks we may implement queues of signed integral types to be a mere wrappers around queues that contain unsigned integral types. Additional example would be storage of the pointers to any types. It would be wise to specialise StaticQueue of pointers to be a wrapper around queue of void* pointers or even integral unsigned values of the same size as pointers (such as std::uint32_t on 32 bit architecture or std::uint64_t on 64 bit architecture).

Thanks to the template specialisation there are virtually no limits to optimisations we may apply. However I would like to remind you the well known saying “Premature optimisations are the root of all evil”. Please avoid optimising your StaticQueue implementation until the need arises.

Basic Concepts

As already mentioned in Overview, this book explains and shows examples of how to implement soft real time systems. This chapter will explain basic concepts of asynchronous event handling as well as how to implement required functionality without complex state machines, and/or task scheduing.

Event Loop

Most bare-metal embedded products require only two modes of operation:

Interrupt (or service) mode
Non-interrupt (or user) mode.

The job of the code, that is executed in interrupt mode, is to respond to hardware events (interrupts) by performing minimal job of updating various status registers and schedule proper handling of event (if applicable) to be executed in non-interrupt mode. In most projects the interrupt handlers are not prioritised, and the next hardware event (interrupt) won’t be handled until the previously called interrupt handler returns, i.e. CPU is ready to return to non-interrupt mode. Therefore, it is important for the interrupt handler to do its job as quickly as possible.

There are multiple ways to schedule the execution of event handling code in non-interrupt mode from code being executed in interrupt mode. One of the easiest and straightforward ones is to have some kind of global flag that indicates that event has occurred and the processing is required:

bool g_buttonPressed = false;

void gpioInterruptHandler()
{
    ...
    if (/*button_gpio_recognised*/) {
        g_buttonPressed = true;
    }
}

int main(int argc, const char* argv[])
{
    ...
    while (true) { // infinite event processing loop
        enableInterrupts();
        ...
        if (g_buttonPressed) {
            disableInterrupt(); // avoid races
            g_buttonPressed = false;
            enableInterrupts();
            ... // Handle button press
        }
        ...
        disableInterrupts();
        if (/* no_more_events */) {
            WFI(); // “Wait for interrupt” assembler instruction,
                   // instruction will exit when there is pending
                   // interrupt.
        }
    }
}

It is quite clear that this approach is not scalable, i.e. will quickly become a mess when number of hardware events the code needs to handle grows. The events may also be handled not in the same order they occurred, which may create undesired races and side effects on some systems.

Another widely used approach is to create a queue-like container (linked list or circular buffer) of event IDs which are handled in the similar event loop:

enum EventId
{
    EventId_ClockTick,
    EventId_ButtonPress,
    ....
}

Queue<EventId> events;


void gpioInterruptHandler()
{
    ...
    if (/*button_gpio_recognised*/) {
        events.push_back(EventId_ButtonPress);
    }
}


int main(int argc, const char* argv[])
{
    ...
    while (true) { // infinite event processing loop
        enableInterrupts();
        ...
        switch (events.front()) {
        case EventId_ClockTick:
            ... // handle clock tick
            break;

        case EventId_ButtonPress:
            ... // handle button press
            break;
            ...
        }
        ...
        disableInterrupts();
        events.pop_front(); // Remove processed event from queue
       if (events.empty()) {
            WFI(); // “Wait for interrupt” assembler instruction,
                   // instruction will exit when there is pending interrupt.
        }
    }
}

The approach above is a bit better, it processes events in the same order they occur, but still has its own disadvantages. Sometimes there is a need to attach some extra information for the processing of the event. Usually it is done using global variables, which introduces some extra complexity to the code and possibility for races. The handling of some events may have several internal stages and require busy wait(s) during the processing. These busy waits may significantly delay the processing of other pending events. The usual way to resolve this kind of problem is to create several state machines, that process this kind of events in stages. Most of Real-Time OSes provide an ability to create independent tasks (threads), that can be used to perform independent complex multiple staged workflows while the OS performs context switching between them. Still, the code can very quickly become too complex and difficult to maintain.

The approaches above are widely used in bare metal projects developed using C programming language. Using C++ language built-in features as well as ready to use classes from STL it is possible to simplify the complexity of the code and implement proper asynchronous handling of events, which is easier to debug and maintain.

I would recommend using a queue of callable objects created by std::bind() expressions or lambda functions. The conventional C++ way would be using std::list of std::function objects. However, these classes use dynamic memory allocation and throw exceptions, which may be not suitable for every bare metal project. Anyway, let’s just demonstrate the idea using these two classes:

typedef std::list<std::function<void ()> > Queue;
Queue handlers;

template <typename TFunc>
void addHandlerFromInterrupt(TFunc&& func)
{
    // No need to disable interrupts.
    handlers.push_back(std::forward<TFunc>(func));
}

template <typename TFunc>
void addHandler(TFunc&& func)
{
   // Protect against races with interrupt handlers
    disableInterrupts();
    handlers.push_back(std::forward<TFunc>(func));
    enableInterrupts();
}

void handleButtonPressStart()
{
    ...// Start handling of button press event
    handleButtonPressBusyWait();
}

void handleButtonPressBusyWait()
{
    if (/* some_condition */) {
        handleButtonPressFinish();
        return;
    }

    // The condition is not true, need to wait,
    // reschedule the execution of the same function.
    addHandler(
        []()
        {
             handleButtonPressBusyWait();
        });
}

void handleButtonPressFinish()
{
    ...// Finalise handling of button press event.
}

void gpioInterruptHandler()
{
    ...
    if (/*button_gpio_recognised*/) {
        addHandlerFromInterrupt(
            []()
            {
		 // Will be executed in non-interrupt event loop.
                 handleButtonPressStart();
            });
    }
}

int main(int argc, const char* argv[])
{
    ...
    while (true) { // infinite event processing loop
        enableInterrupts();
        ...
        auto& firstHandler = handlers.front();
        firstHandler(); // Execute scheduled callable object
        ...
        disableInterrupts();
        handlers.pop_front(); // Remove executed callable object
                              // (function) from queue of handlers.

        if (handlers.empty()) {
            WFI(); // “Wait for interrupt” assembler instruction,
                   // instruction will exit when there is pending
                   // interrupt.
        }
    }
}

This approach allows having complex processing of some events with many sub-stages and busy waits while still allowing other independent events being processed. All the handlers are executed in the same order they were pushed to the queue. There is an ability to bind multiples additional parameters together with the function call, which reduces a necessity to have global variables to pass values around. There is no need to maintain a list of various event IDs, explicitly define stages of state machine(s) or implement complex task switching between independent threads (tasks).

Now, let’s try to get rid of dynamic memory allocation and possible exceptions. The only way to achieve this is to have a compile time constant that specifies the maximal size of the queue. The naive implementation would be using StaticQueue of StaticFunction objects described in Basic Needs chapter. However, the StaticFunction class definition requires compile time constant to specify the size of the area to store all the data of the callable object. It must be big enough to contain any possible callable object that will be pushed to the queue. For example:

typedef embxx::util::StaticFunction<void (), sizeof(void*) * 10>  Func;
typedef embxx::container::StaticQueue<Func, 1024> Queue;

Queue handlers;
…
handlers.push_back(std::bind(&func1, param1, param2)); // Will require size of only 3 values
...
handlers.push_back(
    std::bind(
        &func2,
        param1,
        param2,
        param3,
        param4)); // Will require size of only 5 values

handlers.push_back(
    std::bind(
        &func3,
        param1,
        param2,
        param3,
        param4,
        param5,
        param6,
        param7,
        param8,
        param9)); // Will consume the whole available space.

The queue will look like this:

It is quite clear that lots of space may be wasted and this approach must be optimised. What if we could push the callable object to the queue one after another regardless of their actual size with a bit of extra space overhead (such as pointer to v-table), that will help us to retrieve size of the object at runtime and remove appropriate number of bytes from such queue after the callable object did its job?

It looks much better. The space consumption is much more efficient.

To properly support this type of queue we must:

implement polymorphic behaviour when calling every handler with same interface.
implement polymorphic behaviour to retrieve the size of single handler in order to know how many bytes are to be removed from the queue after the handler has been called.
properly handle wrap-around cases when the pushed handler cannot fit into the area between the end of the queue and end of the allocated space.

The code of required classes will be like this:

class Task
{
public:
    virtual ~Task() {}

    virtual std::size_t getSize() const
    {
        return 1U;
    }

    virtual void exec() {}
};

template <typename TTask>
class TaskBound : public Task
{
public:

    // Size is minimal number of elements of size equal to sizeof(Task)
    // that will be able to store this TaskBound object
    static const std::size_t Size =
        ((sizeof(TaskBound<typename std::decay<TTask>::type>) - 1) /
                                                     sizeof(Task)) + 1;

    explicit TaskBound(const TTask& task)
      : task_(task)
    {
    }

    explicit TaskBound(TTask&& task)
      : task_(std::move(task))
    {
    }

    virtual ~TaskBound() {}

    virtual std::size_t getSize() const
    {
        return Size;
    }

    virtual void exec()
    {
        task_();
    }

private:
    TTask task_;
};

The definition of the Queue type will be:

typedef typename
    std::aligned_storage<
        sizeof(Task),
        std::alignment_of<Task>::value
   >::type ArrayElemType;

static const std::size_t ArraySize = TSize / sizeof(Task);
typedef embxx::container::StaticQueue<ArrayElemType, ArraySize> Queue;

TSize is a template parameter that specifies maximum size (in bytes) of the queue storage area.

The code of pushing new handler to the queue will look like this:

template <typename TTask>
bool addHandler(TTask&& task)
{
    typedef TaskBound<typename std::decay<TTask>::type> TaskBoundType;
    static_assert(
        std::alignment_of<Task>::value == std::alignment_of<TaskBoundType>::value,
        "Alignment of TaskBound must be same as alignment of Task");

    static const std::size_t requiredQueueSize = TaskBoundType::Size;

    auto placePtr = getAllocPlace(requiredQueueSize);
    if (placePtr == nullptr) {
        return false;
    }

    new (placePtr) TaskBoundType(std::forward<TTask>(task));
    return true;
}

Note, that job of getAllocPlace() function is to make sure that continuous storage area that is able to store the required callable object is created (by resizing the queue) and return pointer to this area.

ArrayElemType* getAllocPlace(std::size_t requiredQueueSize)
{
    auto invalidIter = queue_.invalidIter();
    while (true)
    {
        if ((queue_.capacity() - queue_.size()) < requiredQueueSize) {
            return nullptr;
        }

        auto curSize = queue_.size();
        if (queue_.isLinearised()) {
            auto dist =
                static_cast<std::size_t>(
                    std::distance(queue_.arrayTwo().second, invalidIter));
            if ((0 < dist) && (dist < requiredQueueSize)) {
                queue_.resize(curSize + 1);
                auto placePtr = static_cast<void*>(&queue_.back());
                new (placePtr) Task();
                continue;
            }
        }

        queue_.resize(curSize + requiredQueueSize);
        return &queue_[curSize];
    }
}

In case of wrap-around, when there is not enough space between the end of the queue and end of its storage area, number of simple Task objects which do nothing (the body of exec() function is empty) are pushed to fill the space till the end of storage area to make the queue non-linearised, which in turn will allow creation of continuous area of required size in the second half of the circular queue.

The event handling loop will be something like this:

while (true) {
    ...
    // Get an access pointer to next handler
    auto taskPtr = reinterpret_cast<Task*>(&queue_.front());
    auto sizeToRemove = taskPtr->getSize();

    // Execute the handler while allowing interrutps
    enableInterrupts();
    taskPtr->exec();

    // Remove the handler information from the queue
    taskPtr->~Task();
    disableInterrupts();
    queue_.popFront(sizeToRemove);

    ...
}

The only remaining thing is to create a convenient and generic interface to be able to add new handlers for execution from both interrupt and non-interrupt contexts.

Analogy with Threads

Before diving into implementation of such interface, I’d like to make an analogy between interrupt/non-interrupt execution modes and two threads. The inter-threads communication is managed using locks (such as std::mutex) and condition variables (such as std::condition_variable_any). Using this analogy the handlers execution loop (executed in non-interrupt thread) can be implemented like this:

std::mutex lock_;
std::condition_variable_any cond_;
...

while (true) {
    lock_.lock();

    while (!queue_.isEmpty()) {
        auto taskPtr = reinterpret_cast<Task*>(&queue_.front());
        auto sizeToRemove = taskPtr->getSize();
        lock_.unlock();

        // Executed with interrupts enabled
        taskPtr->exec();
        taskPtr->~Task();

        lock_.lock();
        queue_.popFront(sizeToRemove);
    }

    // Still locked prior to wait
    cond_.wait(lock_);
    lock_.unlock();
}

And adding new execution handler from any thread can be:

template <typename TTask>
bool addHandler(TTask&& task)
{
   std::lock_guard<decltype(lock_)> guard(lock_);
   ... // adding handler functionality
   cond_.notify_all(); // notify the condition variable
}

If we think about interrupt and non-interrupt execution modes as two threads, the locking in non-interrupt thread is equivalent to disabling interrupts; and waiting for condition variable to be notified is equivalent for waiting for interrupts (using WFI or WFE instructions in ARM architecture) while notification can be automatic due to pending interrupts or implemented using SEV instruction. However, our interrupt and non-interrupt mode threads differ slightly from conventional threads. The non-interrupt mode one can be interrupted at any time by interrupt mode, while the interrupt mode “thread” won’t be interrupted and doesn’t actually need to protect itself from other thread’s intervention.

The whole logic of event handling loop in non-interrupt context described above is generic except locking (disabling interrupts) and waiting for new handlers to be added (waiting for interrupts) which are platform and architecture specific. As I’ve mentioned before, the whole idea of using C++ instead of C in bare metal development is to be able to write and reuse generic code while providing minimal platform specific hardware control functionality. The embxx library provides EventLoop class that receives the locking and condition variable classes as template parameters and manages safe addition of new handlers and in-order execution of the latter in non-interrupt context.

The class definition looks like this:
template <std::size_t TSize, typename Tlock, typename TCond>
class EventLoop
{
    ...
};

The TLock class must expose the following public interface:

class PlatformLock
{
public:
    // Locks out interrupt "thread". The function is called
    // in non-interrupt context
    void lock() {...}

    // Restore previous state changed by "lock()" function, i.e.
    // allow interrupts if they were disabled by lock().
    void unlock() {...}

    // Same as lock(), but will be called when new handler is about to
    // be added from interrupt handler. In normal case it should be an
    // empty function, unless the interrupts are prioritised and there
    // is a need to disable other interrupts from an interrupt handler
    void lockInterruptCtx() {...}

    // Same as unlock, but will be called in interrupt context. Should
    // also be empty function when interrupts are not prioritised.
    void unlockInterruptCtx() {...}
};

The TCond class must expose the following public interface:

class PlatformCond
{
public:
    // Receives the reference to lockable object that is locked
    // (has lock() and unlock() member functions) and
    // responsible to release the lock if needed and wait for
    // notifications from other thread(s). After the notification
    // occurs it must re-acquire the lock prior to returning.
    template <typename TLock>
    void wait(TLock& lock) {...}

    // This function is used to notify condition that wait should
    // be terminated.
    void notify() {...}
};

The example of such classes for Raspberry Pi platform may be found here.

class InterruptLock
{
public:
    InterruptLock()
        : flags_(0) {}

    void lock()
    {
        __asm volatile("mrs %0, cpsr" : "=r" (flags_)); // store flags
        __asm volatile("cpsid i"); // disable interrupts
    }

    void unlock()
    {
        if ((flags_ & IntMask) == 0) {
            // Was previously enabled
            __asm volatile("cpsie i"); // enable interrupts
        }
    }

    void lockInterruptCtx()
    {
        // Nothing to do
    }

    void unlockInterruptCtx()
    {
        // Nothing to do
    }

private:
    volatile std::uint32_t flags_;
    static const std::uint32_t IntMask = 1U << 7;
};

class WaitCond
{
public:
    template <typename TLock>
    void wait(TLock& lock)
    {
        // no need to unlock (re-enable interrupts)
        static_cast<void>(lock);
        __asm volatile("wfi");
    }

    void notify()
    {
        // Nothing to do, pending interrupt will cause wfi
        // to exit even with interrupts disabled
    }
};

The EventLoop class exposes the following public interface:

template <std::size_t TSize, typename Tlock, typename TCond>
class EventLoop
{
public:
    ...
    /// @brief Post new handler for execution.
    /// @details Acquires regular context lock. The task is added to
    ///          the execution queue. If the execution queue is empty
    ///          before the new handler is added, the condition
    ///          variable is signalled by calling its notify() member
    ///          function.
    /// @param[in] task R-value reference to new handler functor.
    /// @return true in case the handler was successfully posted,
    ///         false if there is not enough space in the execution
    ///         queue.
    template <typename TTask>
    bool post(TTask&& task);

    /// @brief Post new handler for execution from interrupt context.
    /// @details Acquires interrupt context lock. The task is added to
    ///          the execution queue. If the execution queue is empty
    ///          before the new handler is added, the condition variable
    ///          is signalled by calling its notify() member function.
    /// @param[in] task R-value reference to new handler functor.
    /// @return true in case the handler was successfully posted, false
    ///         if there is not enough space in the execution queue.
    template <typename TTask>
    bool postInterruptCtx(TTask&& task);

    /// @brief Event loop execution function.
    /// @details The function keeps executing posted handlers until
    ///          none are left. When execution queue becomes empty the
    ///          wait(...) member function of the condition variable
    ///          gets called to execute blocking wait for new handlers.
    ///          When new handler is added, the condition variable will
    ///          be signalled and blocking wait is expected to be
    ///          terminated to continue execution of the event loop.
    ///          This function never exits unless stop() was called to
    ///          terminate the execution. After stopping the main
    ///          loop, use reset() member function to enable the loop
    ///          to be executed again.
    void run();

    /// @brief Stop execution of the event loop.
    /// @details The execution may not be stopped immediately. If there
    ///          is an event handler being executed, the loop will be
    ///          stopped after the execution of the handler is finished.
    void stop();

    /// @brief Reset the state of the event loop.
    /// @details Clear the queue of registered event handlers and
    ///          resets the "stopped" flag to allow new event loop
    ///          execution.
    void reset();
}:

I’ll leave the implementation of the functions above as an exercise to the reader. Don’t forget to call notify() member function of condition variable when adding new handler to the empty queue.

If needed, the reference implementation can be found here.

Busy Loops

The event loop described above is an easy and convenient way to implement soft real-time systems. However, the main rule with such architecture is: DON’T DO BUSY LOOPS! It means, if there is a real need to perform a busy wait before proceeding to the next stage, do it by letting other events being handled as well. The EventLoop class also provides busyWait() member function that does exactly that.

template <std::size_t TSize, typename Tlock, typename TCond>
class EventLoop
{
public:
    ...
    /// @brief Perform busy wait.
    /// @details Executes busy wait while allowing other event handlers
    ///          posted by interrupt handlers being processed.
    /// @tparam TPred Predicate class type, must define
    ///         @code bool operator()(); @endcode
    ///         that return true in case busy wait must be terminated.
    /// @tparam TFunc Functor class that will be executed when wait is
    ///         complete. It must define
    ///         @code void operator()(); @endcode
    /// @param pred Any type of reference to predicate object
    /// @param func Any type of reference to "wait complete" function.
    /// @pre The event loop must have enough space to repost the call
    ///      to busyWait(). Note that there is no wait to notify the
    ///      caller if post operation fails. In debug compilation mode
    ///      there will be an assertion failure in case call to post()
    ///      returned false, in release compilation mode the failure
    ///      will be silent.
    template <typename TPred, typename TFunc>
    void busyWait(TPred&& pred, TFunc&& func)
    {
        if (pred()) {
            bool result = post(std::forward<TFunc>(func));
            GASSERT(result);
            static_cast<void>(result);
            return;
        }

        bool result = post(
            [this, pred, func]()
            {
                busyWait(std::move(pred), std::move(func));
            });
        GASSERT(result);
        static_cast<void>(result);
    }
};

Device-Driver-Component

Now, after understanding what the event loop is and how to implement it in C++, I’d like to describe Device-Driver-Component stack concept before proceeding to practical examples.

The Device is a platform specific peripheral(s) control layer. Sometimes it is called HAL - Hardware Abstraction Layer. It has an access to platform specific peripheral control registers. Its job is to implement predefined interface required by upper Driver layer, handle the relevant interrupts and report them to the Driver via callbacks.

The Driver is a generic platform independent layer. Its job is to receive requests for asynchronous operation from the Component layer and forward the request to the Device. It is also responsible for receiving notifications about the interrupts from the Device via callbacks, perform minimal processing of the hardware event if necessary and schedule the execution of proper event handling callback from the Component in non interrupt context using Event Loop.

The Component is a generic or product specific layer that works fully in event loop (non-interrupt) context. It initiates asynchronous operations using Driver while providing a callback object to be called in event loop context when the asynchronous operation is complete.

There are several main operations required for any asynchronous event handling:

Start the operation.
Complete the operation.
Cancel the operation.
Suspend the operation.
Resume suspended operation.

All the peripherals described in Peripherals chapter will follow the same scheme for these operations with minor changes, such as having extra parameters or intermediate stages.

Starting Asynchronous Operation

Any non-interrupt context operation is initiated from some event handler executed by the Event Loop or from the main() function before the event loop started its execution. The handler being executed invokes some function in some Component, which requests the Driver to perform some asynchronous operation while providing a callback object to be executed when such operation is complete. The Driver stores the provided callback object and other parameters in its internal data structures, then forwards the request to the Device, which configures the hardware accordingly and enables all the required interrupts.

Completing Asynchronous Operation

The first entity, that is aware of asynchronous operation completion, is Device when appropriate interrupt occurs. It must report the completion to the Driver somehow. As was described earlier, the Device is a platform specific layer that resides at the bottom of the Device-Driver-Component stack and is not aware of the generic Driver layer that uses it. The Device must provide a way to set an operation completion report object. The Driver will usually assign such object during construction/initialisation stage:

When the expected interrupt occurs, the Device reports operation completion to the Driver, which in turn schedules execution of the callback object from the Component in non-interrupt context using Event Loop.

Image: Completing Asynchronous Operation

Note that the operation may fail, due to some hardware faults, This is the reason to have status parameter reporting success and/or error condition in both callback invocations.

Canceling Asynchronous Operation

There must be an ability to cancel asynchronous operations in progress. For example some Component activates asynchronous operation request on some hardware peripheral together with asynchronous wait request to the timer to measure the operation timeout. If timeout callback is invoked first, then there is a need to cancel the outstanding asynchronous operation. Or the opposite, once the read is successful, the timeout measure should be canceled. However, the cancellation may be a bit tricky. One of the main requirements for asynchronous events handling is that the Component's callback MUST be called and called only ONCE. It creates a situation when cancellation may become unsuccessful. For instance, the callback of the asynchronous operation was posted for execution in Event Loop, but hasn’t been executed by the latter yet. It brings us to the necessity to provide an indication whether the cancellation request was successful. Simple boolean return value is enough.

When the cancellation is successful the Component's callback object is invoked with status specifying that operation was Aborted.

One possible case of unsuccessful cancellation is when callback was posted for execution in event loop, but hasn’t been executed yet when cancellation is attempted. In this case Driver is aware that there is no pending asynchronous operation and can return false immediately.

Image: Canceling Asynchronous Operation Unsuccessful1

Another possible case of unsuccessful cancellation is when completion interrupt occurs in the middle of cancellation request:

Image: Canceling Asynchronous Operation Unsuccessful2

In this case the Device must be able to handle such race condition appropriately, by temporarily disabling interrupts before checking whether the completion callback was executed. The Driver must also be able to handle interrupt context execution in the middle on non-interrupt one.

Suspend / Resume Asynchronous Operation

There may be a Driver, that is required to support multiple asynchronous operations at the same time, while managing internal queue of such requests and issuing them one by one to the Device. In this case there is a need to prevent "operation complete" callback being invoked in interrupt mode context, while trying to access the internal data structures in the event loop (non-interrupt) context. The Device must provide both suspendOp() and resumeOp() to suppress invocation of the callback and allow it back again respectively. Usually suspension means disabling the interrupts without stopping current operation, while resume means re-enabling them again.

Image: Suspending Asynchronous Operation

Note that the suspendOp() request must also indicate whether the suspension was successful or the completion callback has been already invoked in interrupt mode, just like with the cancellation. After the operation being successfully suspended, it must be either resumed or canceled.

Device Function Invocation Context

Let’s think about the case when Driver supports multiple asynchronous operations at the same time and queuing them internally while issueing start requests to the Device one by one.

Image: Starting Multiple Asynchronous Operations

The reader may notice that the startOp() member function of the Device was invoked in event loop (non-interrupt) context while the second time it was in interrupt context right after the completion of the first operation was reported. There may be a need for the Device's implementation to differentiate between these calls.

One of the ways to do so is to have different names and make the Driver use them depending on the current execution context:

class MyDevice
{
public:
    void startOp();
    void startOpInterruptCtx();
}

Another way is to use a tag dispatching idiom, which I decided to use in embxx library.

It defines two extra tag structs in embxx/device/context.h:

namespace embxx
{

namespace device
{

namespace context
{

// Event loop context tag class.
struct EventLoop {};

// Interrupt context tag class.
struct Interrupt {};

} // namespace context

} // namespace device

} // namespace embxx

Then, almost every member function defined by Device class has to specify extra tag parameter indicating context:

class MyDevice
{
public:
    typedef embxx::device::context::EventLoop EventLoopCtx;
    typedef embxx::device::context::Interrupt InterruptCtx;

    void startOp(EventLoopCtx context)
    {
        static_cast<void>(context); // unused parameter
        ... // Perform operation when called in event loop context
    }

    void startOp(InterruptCtx context)
    {
        static_cast<void>(context); // unused parameter
        ... // Perform operation when called in interrupt context
    }
};

The Driver class will invoke the Device functions using relevant temporary context object passed as the last parameter:

class MyDriver
{
public:
    typedef embxx::device::context::EventLoop EventLoopCtx;
    typedef embxx::device::context::Interrupt InterruptCtx;

    // Invoked by some Component object in Event Loop Context
    void asyncOp(...)
    {
        ...
        device_.startOp(EventLoopCtx());
        ...
    }

private:

   // Some registered event callback handler,
   // invoked in interrupt context
   void interruptCallbackHandler()
   {
       ...
       device_.startOp(InterruptCtx());
   }
};

If some function needs to be called only in, say EventLoop context, and not supported in Interrupt context, then it is enough to implement only supported variant. If Driver layer tries to invoke the function with unsupported context tag parameter, the compilation will fail:

class MyDevice
{
public:
    typedef embxx::device::context::EventLoop EventLoopCtx;

    void cancelOp(EventLoopCtx context)
    {
        static_cast<void>(context); // unused parameter
        ... // Cancel recent operation
    }
};

If there is no need to differentiate between the contexts the function is invoked in, then it is quite easy to unify them:

class SomeDevice
{
public:

    template <typename TContext>
    void startOp(TContext context)
    {
        static_cast<void>(context); // unused parameter
        startOpInternal();
    }

private:
    void startOpInternal()
    {
        ...
    }
};

Reporting Errors

When issuing asynchronous operation request to the Driver and/or Component, there must be a way to report success / failure status of the operation, and if it failed provide some extra information about the reason of the failure. Providing such information as first parameter to the callback functor object is a widely used convention among the developers.

In most cases, the numeric value of error code is good enough.

The embxx library provides a short list of such values in enumeration class defined in embxx/error/ErrorCode.h:

namespace embxx
{

namespace error
{

enum class ErrorCode
{
    Success, ///< Successful completion of operation.
    Aborted, ///< The operation was cancelled/aborted.
    BufferOverflow, /// The buffer is full with read termination condition being false
    HwProtocolError, ///< Hardware peripheral reported protocol error.
    Timeout, ///< The operation takes too much time.
    NumOfStatuses ///< Number of available statuses. Must be last
};

}  // namespace error

}  // namespace embxx

There is also a wrapper class around the embxx::error::ErrorCode, called embxx::error::ErrorStatus (defined in embxx/error/ErrorStatus.h):

namespace embxx
{

namespace error
{

template <typename TErrorCode = ErrorCode>
class ErrorStatusT
{
public:
    /// @brief Error code enum type
    typedef TErrorCode ErrorCodeType;

    /// @brief Default constructor.
    /// @details The code value is 0, which is "success".
    ErrorStatusT();

    /// @brief Constructor
    /// @details This constructor may be used for implicit
    ///          construction of error status object out
    ///          of error code value.
    /// @param code Numeric error code value.
    ErrorStatusT(ErrorCodeType code);

    /// @brief Copy constructor is default
    ErrorStatusT(const ErrorStatusT&) = default;

    /// @brief Destructor is default
    ~ErrorStatusT() = default;

    /// @brief Copy assignment is default
    ErrorStatusT& operator=(const ErrorStatusT&) = default;

    /// @brief Retrieve error code value.
    const ErrorCodeType code() const;

    /// @brief boolean conversion operator.
    /// @details Returns true if error code is not equal 0,
    ///          i.e. any error will return true, success
    ///          value will return false.
    operator bool() const;

    /// @brief Same as !(static_cast<bool>(*this)).
    bool operator!() const;

private:
    ErrorCodeType code_;
};

typedef ErrorStatusT<ErrorCode> ErrorStatus;

}  // namespace error

}  // namespace embxx

It allows implicit conversion from embxx::error::ErrorCode to embxx::error::ErrorStatus and convenient evaluation whether error has occurred in if sentences:

embxx::error::ErrorStatus es;
GASSERT(!es); // No error
...
if (/* some condition */) {
    es = embxx::error::ErrorCode::BufferOverflow;
}
...
if (es) {
    ... // Error occurred, access the arror code by calling es.code()
}

By convention every callback function provided with any asynchronous request to any Driver and/or Component implemented in embxx library will receive const embxx::error::ErrorStatus& as its first argument:

void callback(const embxx::error::ErrorStatus& es, ... /* some other parameters */)
{
    if (es == embxx::error::ErrorCode::Aborted) {
        return; // Nothing to do
    }

    if (es) {
        ... // Error occurred
        return;
    }
    ... // Success
}

Cooperation

As it is seen in the charts above, the Driver must have an access to the Device as well as Event Loop objects. However, the former is not aware of the exact type of the latter. In order to write fully generic code, the Device and Event Loop types must be provided as template arguments:

template <typename TDevice, typename TEventLoop>
class MyDriver
{
public:
    // During the construction store references to Device
    // and Event Loop objects.
    MyDriver(TDevice& device, TEventLoop& el)
      : device_(device),
        el_(el)
    {
    }

    ...

private:

    TDevice& device_;
    TEventLoop& el_;
};

The Component needs an access only to the Device and maybe Event Loop. The reference to the latter may be retrieved from the Device object itself:

template <typename TDevice, typename TEventLoop>
class MyDriver
{
public:
    TEventLoop& getEventLoop()
    {
        return el_;
    }

private:
    TEventLoop& el_;
};

template <typename TDriver>
class MyComponent
{
public:
    MyComponent(TDriver& driver)
      : driver_(driver)
    {
    }

    void someFunc()
    {
        auto& el = driver_.getEventLoop();
        el.post(...);
    }

private:
    TDriver& driver_;
};

Storing Callback Object

The Driver needs to provide a callback object to the Device to be called when appropriate interrupt occurs. The Component also provides a callback object to be invoked in non-interrupt context when the asynchronous operation is complete, aborted or terminated due to some error condition. These callback objects need to be stored somewhere. The best way to do so in conventional C++ is using std::function.

template <typename TDevice, typename TEventLoop>
class MyDriver
{
public:
    template <typename TFunc>
    void asyncOp(TFunc&& callbackObj)
    {
        callback_ = std::forward<TFunc>(callbackObj);
        ... // Start the operation
    }

private:
    typedef std::function<void embxx::error::ErrorStatus&> CallbackType;

    void opCompleteInterruptCallback(void embxx::error::ErrorStatus& es)
    {
        ... // Complete the operation
        el_.postInterruptCtx(std::bind(std::move(callback_), es));
    }

    EventLoop& el_;
    CallbackType callback_;
};

There are two problems with using std::function: exceptions and dynamic memory allocation. It is possible to suppress the usage of exceptions by making sure that function object is never invoked without proper object being assigned to it, and by overriding appropriate __throw_* function(s) to remove exception handling code from binary image (described in Exceptions chapter). However, it is impossible to get rid of dynamic memory allocation in this case, which reduces number of bare metal products the Driver code can be reused in, i.e. it makes the Driver class not fully generic.

The problem is resolved by defining the callback storage type as a template parameter to the Driver:

template <typename TDevice,
          typename TEventLoop,
          typename TCallbackType>
class MyDriver
{
private:
    ...
    TCallbackType callback_;
};

For projects that allow dynamic memory allocation std::function<…> can be passed, for others embxx::util::StaticFunction<…> or similar must be used.

Peripherals

It this chapter I will describe and give multiple examples of how to drive and control multiple hardware peripherals while using Device-Driver-Component model in conjunction with Event Loop.

All the generic, platform independent code provided here is implemented as part of embxx library while platform (Raspberry Pi) specific code is taken from embxx_on_rpi project.

All the platform specific peripheral control classes reside in src/device directory.

The src/app directory contains several simple applications, such as flashing the led or responding to button presses.

There are also common Component classes shared between the applications. They reside in src/component directory.

In order to compile all the applications please follow the instructions described in Contents of This Book.

Function Configuration

In ARM platform every pin needs to be configured as either gpio input, gpio output or having one of several alternative functions the microcontroller supports. The device::Function class defined in src/device/Function.h and src/device/Function.cpp implements simple interface which allows every Device class configure the pins it uses.

class Function
{
public:
    enum class FuncSel {
        Input,  // b000
        Output, // b001
        Alt5,   // b010
        Alt4,   // b011
        Alt0,   // b100
        Alt1,   // b101
        Alt2,   // b110
        Alt3    // b111
    };

    typedef unsigned PinIdxType;

    static const std::size_t NumOfLines = 54;

    void configure(PinIdxType idx, FuncSel sel);
};

Every implemented Device class will receive reference to Function object in its constructor and will have to use it to configure the pins as required.

Interrupts Management

There is one more componenet that every Device will use. It’s device::InterruptMgr defined in src/device/InterruptMgr.h. The main responsibility of the object of this class is to control global level interrupts, register interrupt handlers from various Devices and invoke the appropriate handler when interrupt occurs.

The interface of the device::InterruptMgr is defined as following:

template <typename THandler = embxx::util::StaticFunction<void ()> >
class InterruptMgr
{
public:
    typedef THandler HandlerFunc;
    enum IrqId {
        IrqId_Timer,
        IrqId_AuxInt,
        IrqId_Gpio1,
        IrqId_Gpio2,
        IrqId_Gpio3,
        IrqId_Gpio4,
        IrqId_I2C,
        IrqId_SPI,
        IrqId_NumOfIds // Must be last
    };

    InterruptMgr();

    template <typename TFunc>
    void registerHandler(IrqId id, TFunc&& handler);

    void enableInterrupt(IrqId id);

    void disableInterrupt(IrqId id);

    void handleInterrupt();

private:
    typedef std::uint32_t EntryType;

    struct IrqInfo {
        ... // Contains interrupt related information
            // per single IrqId
    };

    typedef std::array<IrqInfo, IrqId_NumOfIds> IrqsArray;

    IrqsArray irqs_;
};

Every Driver will use registerHandler() member function to register its member function as the handler for its IrqId. The enableInterrupt() and disableInterrupt() are also used by the Device objects to control their interrupts on global level.

In order to use the Interrupt Manager described above every application has to implement proper interrupt handler that will retrieve the reference to device::InterruptMgr object (via global/static variables) and invoke its handleInterrupt() function, which in turn check the appropriate status register(s) and invoke registered handler(s). Please note, that the handler will be executed in interrupt context.

The code will look something like this:

extern "C"
void interruptHandler()
{
    System::instance().interruptMgr().handleInterrupt();
}

There may also be a need to enable/disable all the interrupts by toggling i flag in CPS register. The same src/device/InterruptMgr.h file provides two function for this purpose:

namespace device
{

namespace interrupt
{

inline
void enable()
{
    __asm volatile("cpsie i");
}

inline
void disable()
{
    __asm volatile("cpsid i");
}

}  // namespace interrupt

}  // namespace device

Timer

It is customary in bare metal development to flash leds in the first application (instead of writing "Hello world"). However most tutorials show how to do it synchronously using loops to wait some time before changing state of the led. I’m going to describe how to do it asynchronously using timer interrupt in conjunction with Event Loop.

Almost every embedded platform has usually one or two timer peripherals. One such peripheral can be programmed to provide an interrupt after some period of time. However, there may be a need to have multiple timers that can be activated independently at the same time. It is quite clear that there should be an entity that receives all the wait requests from various Components in non-interrupt context, then queues the wait requests internally, programs the timer peripheral to provide an interrupt after some time, and finally reports the completion to appropriate Component via callback also in non-interrupt (event loop) context.

Such entity can be a generic (platform independent) Driver, if it is provided with platform specific Device object, that exposes some predefined public interface and controls the actual platform specific hardware.

The asynchronous timer event handling follows the same pattern described in Device-Driver-Component chapter.

Assigning Wait Complete Callback

Just like described in Device-Driver-Component chapter the Driver needs to provide the "Wait Complete" callback object to be called when timer interrupt occurs. The assignment is usually performed during initialisation/construction stage of the Driver:

Starting Asynchronous Wait

The Driver must be able to support multiple wait requests from various Components and manage the internal queue accordingly. In the chart above the timer peripheral activated on the first asyncWait() request. When the second request is issued (assuming timeout1 < timeout2 and existing wait mustn’t be stopped), the Driver must prevent the completion of the currently scheduled timer countdown being reported in interrupt context while interfering with an update to internal data structures. The interrupts are disabled by calling suspendWait() member function of the Device. The call to the suspendWait() returns true, which means the interrupts are successfully disabled and it is safe to update internal data structures. If the call to suspendWait() returns false, it means that the interrupt has already occurred and there is no existing wait in progress, i.e. the second asyncWait() actually becomes a first one in the new sequence.

There also may be a case when timeout2 < timeout1 which means the order of the timeout requests must be re-evaluated, and new wait re-programmed.

The Driver must be able to cancel the existing timer countdown, evaluate how much time has passed since the first request, evaluate the new values to reprogram the timer Device countdown again.

Completing Asynchronous Wait

Due to the fact that Driver may receive multiple independent wait requests, it must reprogram the next wait (if such exists) while running in interrupt mode. Please pay attention to InterruptCtx() tag parameter passed to the startWait() member function of the Device. It indicates that the request is executed in interrupt context, while the same request used EventLoopCtx() as the tag parameter to specify that the call was performed in event loop (non-interrupt) context.

Canceling Asynchronous Wait

If there is a request to cancel the currently executed wait, the Driver must receive the information about the elapsed time and reprogram the next wait if such exists.

If the cancellation request to some other wait, that hasn’t been forwarded to the Device, the Driver just needs to update its internal data structures without canceling currently performed timer countdown.

The unsuccessful attempts to cancel wait is performed in exactly the same way as described in Device-Driver-Component chapter.

Identifying Wait Requests

There is obviously a need to have some kind of identification of the wait requests in order to be able to cancel some specific request while keeping the rest in waiting queue. One approach would be to have some kind of a handle which can be used during the cancellation request:

class MyTimerDriver
{
public:
    typedef ... Handle;

    Handle asyncWait(...);

    void cancelWait(Handle handle);
};

Another one is to hide the handle in some wrapper class, which makes it a bit safer to use:

class MyTimerDriver
{
public:

    typedef ... Handle;

    class Timer
    {
    public:
        Timer(MyTimerDriver& mgr, Handle handle)
          : mgr_(mgr),
            handle_(handle)
        {
        }

        ~Timer()
        {
            ... // Invalidate the allocated handle
        }

        void asyncWait(...)
        {
            mgr_.asyncWait(handle_, ...)
        }

        void cancelWait()
        {
            mgr_.cancelWait(handle_);
        }

    private:
	MyTimerDriver& mgr_;
        Handle handle_;
    };

    Timer allocTimer()
    {
        auto someHandle = ...;
        return Timer(*this, someHandle)
    }

private:

    friend class TimerMgr::Timer;

    void asyncWait(Handle handle, ...);

    void cancelWait(Handle handle);
};

The Driver itself has only one public function allocTimer(). It is used to allocate the Timer object. All the wait and/or cancel requests are issued to this timer object directly, which is declared to be a friend of the Driver class, i.e. it is able to call private functions of the latter using the handle it has. The destructor of the Timer makes sure that the handle is properly invalidated.

MyTimerDriver driver(...);
auto timer = driver.allocTimer();
timer.asyncWait(...);
...
timer.cancelWait();
...

The second approach is a bit safer than the first one and it is used in the implementation of such generic "Timer Management Driver" in embxx library.

Specifying the Wait Duration

The timer Device is platform specific. Some platforms may support wait duration granularity of a microsecond, others can achieve only a millisecond. It usually depends on the system clock speed. However, when using generic Driver and/or Component there is a need to be able to write platform independent code that performs wait of the specified duration regardless of the Device in use. The Standard Template Library (STL) of C++11 standard provides convenient Date and Time Utilities that make such usage possible.

In case the Device declares a minimal wait duration unit using std::chrono::duration type, the Driver may use std::chrono::duration_cast to convert the requested wait duration to supported duration units.

class MyTimerDevice
{
public:
    typedef std::chrono::duration<unsigned, std::milli>
                                                WaitTimeUnitDuration;

    typedef embxx::device::context::EventLoop EventLoopCtx;

    void startWait(WaitTimeUnitDuration::rep count, EventLoopCtx) {...}
    ...
};

In the example above the minimal supported duration unit (WaitTimeUnitDuration) is declared to be 1 millisecond. Please note that startWait() member function expects to receive number of wait units, i.e. milliseconds as its first parameter.

Then the definition of the asyncWait() member function of the Driver may be defined like this:

template <typename TDevice, ...>
class MyTimerDriver
{
public:
    typedef typename TDevice::WaitTimeUnitDuration WaitTimeUnitDuration
    class Timer
    {
    public:
        template <typename TRep, typename TPeriod, typename TFunc>
        void asyncWait(
            const std::chrono::duration<TRep, TPeriod>& waitTime,
            TFunc&& func)
        {
            auto castedWaitDuration =
                std::chrono::duration_cast<WaitTimeUnitDuration>(waitTime);
            auto waitUnits = castedWaitDuration.count();
            ... // Call the asyncWait() of the driver with waitUnits as
                // first parameter.
        }

    };
};

In the example above the call below will perform correct adjustment of the duration and will measure the same timeout with any Device whether the latter expects milliseconds or microseconds in its startWait() member function.

timer.asyncWait(std::chrono::seconds(5), ...);

In case the developer tries to execute a wait of several microseconds when Driver supports only milliseconds granularity, the compilation will fail.

timer.asyncWait(std::chrono::microseconds(5), ...);

Driver Implementation

The timer management Driver is a generic layer. It must work on any platform with any timer Device object that exposes the right interface.

Such Driver is already implemented in embxx library as embxx::driver::TimerMgr and resides in embxx/driver/TimerMgr.h while platform specific (Raspberry Pi) peripheral control object is implemented in embxx_on_rpi project as device::Timer and resides in src/device/Timer.h.

The embxx::driver::TimerMgr is defined like this:

template <typename TDevice,
          typename TEventLoop,
          std::size_t TMaxTimers,
          typename TTimeoutHandler = embxx::util::StaticFunction<void (const embxx::error::ErrorStatus&)> >
class TimerMgr
{
public:
    TimerMgr(TDevice& device, TEventLoop& el);
      : device_(device),
        el_(el)
    {
        ...
    }

    ...

private:
    struct TimerInfo {
        TTimeoutHandler handler_; //
        ...;                      // Some other internal data
    }

    // Internal data structures to track all the scheduled
    // wait requests.
    std::array<TimerInfo, TMaxTimers> infos_;

    TDevice& device_;
    TEventLoop& el_;
    ...
};

The TDevice template parameter is Platform specific control class for timer peripheral.

The TEventLoop template parameter is the class of the Event Loop.

The TMaxTimers template parameters specifies the maximal number of timer objects the TimerMgr will be able to allocate. This parameter is required because embxx::driver::TimerMgr was designed to be used in the systems without dynamic memory allocation. If dynamic memory allocation is allowed, then it is quite easy to implement similar functionality without this limitation.

The TTimeoutHandler template parameter specifies type of the timeout callback object. This object must have void (const embxx::error::ErrorStatus&) signature and expose similar interface to std::function or embxx::util::StaticFunction.

The embxx::driver::TimerMgr exposes the following public interface:

template <...>
class TimerMgr
{
public:
    class Timer
    {
    public:
        // Destructor, removes Timer record from internal
        // data structures of TimerMgr
        ~Timer() {...}

        // Activates asyncrhonous wait
        void asyncWait(...) {...}

        // Cancels scheduled asynchronous wait
        void cancel() {...}
    };

    // Allocate timer object
    Timer allocTimer() {...}

private:
    // Allows usage of non-exposed private functions of
    // TimerMgr
    friend class TimerMgr::Timer;
    ...
};

The reader may notice that embxx::driver::TimerMgr exposes only one public function: Timer allocTimer();. This function returns simple TimerMgr::Timer object which can be used to schedule new wait as well as cancel the previous wait request. Also note that TimerMgr::Timer class is declared to be a friend of TimerMgr. This is required to allow seamless delegation of the wait/cancel request from TimerMgr::Timer to TimerMgr which is responsible for managing multiple simultaneous wait requests and delegating them one by one to the the actual hardware control object.

Then the led flashing application (implemented in src/app/app_led_flash) can be as simple as the code below:

namespace
{

const auto LedChangeStateTimeout = std::chrono::milliseconds(500);

template <typename TTimer>
void ledOff(
    TTimer& timer,
    System::Led& led);

template <typename TTimer>
void ledOn(
    TTimer& timer,
    System::Led& led)
{
    led.on();

    timer.asyncWait(
        LedChangeStateTimeout,
        [&timer, &led](const embxx::error::ErrorStatus& status)
        {
            static_cast<void>(status);
            ledOff(timer, led);
        });
}

template <typename TTimer>
void ledOff(
    TTimer& timer,
    System::Led& led)
{
    led.off();

    timer.asyncWait(
        std::chrono::milliseconds(LedChangeStateTimeout),
        [&timer, &led](const embxx::error::ErrorStatus& status)
        {
            static_cast<void>(status);
            ledOn(timer, led);
        });
}

}  // namespace

int main() {
    // Get reference to TimerMgr object
    auto& system = System::instance();
    auto& timerMgr = system.timerMgr();

    // Allocate timer
    auto timer = timerMgr.allocTimer();

    // Start flashing with initial state to be OFF
    device::interrupt::enable();
    ledOff(timer, led);

    // Run the event loop
    auto& el = system.eventLoop();
    el.run();

    GASSERT(0); // Mustn't exit
    return 0;
}

Platform Specific Timer Device

As it was already mentioned earlier, the embxx::driver::TimerMgr is a generic Driver class that does most of the work of managing and scheduling independent wait requests. It requires support from low level timer Device object to program the actual hardware of the platform the code runs on. The embxx::driver::TimerMgr is defined to receive the Device class as template parameter as well as reference to the Device timer object in the constructor. The Driver doesn’t know the exact Device type, but expects it to expose certain public interface:

template <typename TDevice, typename TEventLoop, ...>
class TimerMgr
{
public:
    TimerMgr(TDevice& device, TEventLoop& el);
    ...
};

The timer control Device class must expose the following public interface:

Define WaitTimeUnitDuration type as variation of std::chrono::duration that specifies duration of single wait unit supported by the Device.
```
typedef std::chrono::duration<...> WaitTimeUnitDuration;
```

Function to set the callback object to be invoked from timer interrupt:

template <typename TFunc>
void setWaitCompleteCallback(TFunc&& func);

Functions to start timer countdown in both event loop (non-interrupt) and interrupt contexts:

void startWait(
    WaitTimeUnitDuration::rep waitTime, // num of wait units
    embxx::device::context::EventLoop context);
void startWait(
    WaitTimeUnitDuration::rep waitTime, // num of wait units
    embxx::device::context::Interrupt context);

Function to cancel timer countdown in event loop (non-interrupt) context. The function must return true in case the wait was actually canceled and false when there is no wait in progress.
```
bool cancelWait(embxx::device::context::EventLoop context);
```
Function to suspend countdown (disable interrupts while the actual wait countdown is not stopped) in event loop (non-interrupt) context. The function must return true in case the wait was actually suspended and false when there is no wait in progress. The call to this function will be followed either by resumeWait() or by cancelWait().
```
bool suspendWait(embxx::device::context::EventLoop context);
```
Function to resume countdown in event loop (non-interrupt) context.
```
void resumeWait(embxx::device::context::EventLoop context);
```
Function to retrieve elapsed time of the last executed wait. It will be called right after the cancelWait().
```
WaitTimeUnitDuration::rep getElapsed(embxx::device::context::EventLoop context) const;
```

The definition and implementation of such timer device for Raspberry Pi platform can be found in src/device/Timer.h file of embxx_on_rpi project.

UART

Our next stage will be to support debug logging via UART interface. In conventional C++ logging is performed using either printf function or output streams (such as std::cout or std::cerr).

If printf is used the compilation may fail at the linking stage with following errors:

/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-sbrkr.o): In function `_sbrk_r':
sbrkr.c:(.text._sbrk_r+0x18): undefined reference to `_sbrk'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-writer.o): In function `_write_r':
writer.c:(.text._write_r+0x20): undefined reference to `_write'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-closer.o): In function `_close_r':
closer.c:(.text._close_r+0x18): undefined reference to `_close'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-fstatr.o): In function `_fstat_r':
fstatr.c:(.text._fstat_r+0x1c): undefined reference to `_fstat'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-isattyr.o): In function `_isatty_r':
isattyr.c:(.text._isatty_r+0x18): undefined reference to `_isatty'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-lseekr.o): In function `_lseek_r':
lseekr.c:(.text._lseek_r+0x20): undefined reference to `_lseek'
/usr/bin/../lib/gcc/arm-none-eabi/4.8.3/../../../../arm-none-eabi/lib/libc.a(lib_a-readr.o): In function `_read_r':
readr.c:(.text._read_r+0x20): undefined reference to `_read'
collect2: error: ld returned 1 exit status

Once these functions are stubbed with empty bodies, the compilation will succeed, but the image size will be quite big (around 45KB).

The _sbrk function is required to support dynamic memory allocation. The printf function probably uses malloc() to allocate some temporary buffers. If we open the assembly listing file we will see calls to <malloc> and <free>.

The _write function is used to write characters into the standard output consol, which doesn’t exist in embedded product. The developer must use this function implementation to write all the provided characters to UART serial interface. Many developers implement this function in a straightforward synchronous way with busy loop:

extern "C" int _write(int file, char *ptr, int len)
{
    int count = len;
    if (file == 1) { // stdout
        while (count > 0) {
            while (... /* poll the status bit */) {} // just wait
            TX_REG = *ptr;
            ++ptr;
            --count;
        }
    }
    return len;
}

In this case the call to printf function will be blocking and won’t return until all the characters are written one by one to UART, which takes a lot of execution time. This approach is suitable for quick and dirty debugging, but will quickly become impractical when the project grows.

In order to make the execution of printf quick, there must be some kind of interrupt driven component that is responsible to buffer all the provided characters and forward it to UART asynchronously one by one using "TX buffer register is free" kind of interrupts.

One of disadvantages in using printf for logging is a necessity to specify an output format of the printed variables:

std::int32_t i = ...; // some value
printf("Value = %d\n");

In case the type of the printed variable changes, the developer must remember to update type in the format string too. This is the reason why many C++ developers prefer using streams instead of printf:

std::int32_t i = ...; // some value
std::cout << "Value = " << i << std::endl;

Even if type of printed variable changes the compiler will generate a call to appropriate overloaded operator<< of std::ostream and the value will be printed correctly. The developer will also have to implement the missing _write function to write provided characters somewhere (UART interface in our case).

However using C++ streams in bare metal development is often not an option. They use exceptions to handle error cases as well as locales for formatting. The compilation of simple output statement with streams above created image of more than 500KB using GNU Tools for ARM Embedded Processors compiler.

To summarise all the stated above, there may be a problem to use standard printf function or output streams for debug logging, especially in systems with small memory and where dynamic memory allocations and exceptions mustn’t be used. Our ultimate goal will be creation of standard output stream like interface for debug logging while using asynchronous event handling with Device-Driver-Component model and Event Loop where most of the code is generic and only smal part of managing write of a single character to the UART interface is platform specific.

Asyncrhonous read and write operations on the UART interface are very similar to the generic way of programming and handling asynchronous events described earlier in Device-Driver-Component chapter.

Writing to UART

Stage1 - Sending asynchronous buffer write request from the Component layer to Driver in event loop (non-interrupt) context.

The Component calls asyncWrite() member function of the Driver and provides pointer to the buffer, size of the buffer and the callback object to invoke when the write is complete. The asyncWrite() function needs to be able to receive any type of callable object, such as std::bind expression or lambda function. To achieve this the function must be templated:

class CharacterDriver
{
public:
    typedef ... CharType;

    template <typename TCallbackFunc>
    void asyncWrite(
        const CharType* buf,
        std::size_t bufSize,
        TCallbackFunc&& func);
};

According to the convention mentioned earlier, the callback must receive an error status of whether the operation is successful as its first parameter. When performing asynchronous operation on the buffer, it can be required to know how many characters have been read / written before the error occurred, in case the operation wasn’t successful. For this purpose such callback object must receive number of bytes written as the second parameter, i.e. expose the void (const embxx::error::ErrorStatus& err, std::size_t bytesTransferred) signature.

When the Driver receives the asynchronous operation request, it forwards it to the Device, letting the latter know how many bytes will be written during the whole process. Please note that Driver uses embxx::device::context::EventLoop tag parameter to specify that startWrite() member function of Device is invoked in event loop (non-interrut) context. The job of the Device object is to enable appropriate interrupts and return immediately. Once the interrupt occurs, the stage of writing the data begins.

Stage2 - Writing provided data.

Once the interrupt of "TX available" occurs, the Device must let the Driver know. There must obviously be some kind of callback involved, which Driver must provide during its construction / initialisation stage. Let’s assume at this moment that such assignment was successfully done, and Device is capable of successfully notifying the Driver, that there is an ability to write character to TX FIFO of the peripheral.

When the Driver receives such notification, it attempts to write as many characters as possible:

typedef embxx::device::context::Interrupt InterruptContext;

void canWriteCallback()
{
    // Executed in interrupt context, must be quick
    while(device_.canWrite(InterruptContext())) {
        if ((writeBufStart_ + writeBufSize_) <= currentWriteBufPtr_) {
            break;
        }

        device_.write(*currentWriteBufPtr_, InterruptContext());
        ++currentWriteBufPtr_;
    }
}

This is because when "TX available" interrupt occurs, there may be a place for multiple characters to be sent, not just one. Doing checks and writes in a loop may save many CPU cycles.

Please note, that all these calls are performed in interrupt context. They are marked in red in the picture above.

Once the Tx FIFO of the underlying Device is full or there are no more characters to write, the callback returns. The whole cycle described above is repeated on every "TX available" interrupt until the whole provided buffer is sent to the Device for writing.

Stage3 - Notifying caller about completion:

Once the whole buffer is sent to the Device for writing, the Driver is aware that there will be no more writes performed. However it doesn’t report completion until the Device itself calls appropriate callback indicating that the operation has been indeed completed. Shifting the responsibility of identifying when the operation is complete to Device will be needed later when we will want to reuse the same Driver for I2C and SPI peripherals. It will be important to know when internal Tx FIFO of the peripheral becomes empty after all the characters from previous operation have been written.

Image: Notifying caller about completion

Once the Driver receives notification from the Device (still in interrupt context), that the write operation is complete, it bundles the callback object, provided with initial asyncWrite() request, together with error status and number of actual bytes transferred using std::bind expression and sends the callable object to Event Loop for execution in event loop (non-interrupt) context.

Reading from UART

The reading from UART is done in a very similar manner.

Stage1 - Sending asynchronous buffer read request from the Component layer to Driver in event loop (non-interrupt) context.

The asyncRead() member function of the Driver should allow callback to be callable object of any type (but one that exposes predefined signature of course).

class CharacterDriver
{
public:
    typedef ... CharType;

    template <typename TCallbackFunc>
    void asyncRead(
        CharType* buf,
        std::size_t bufSize,
        TCallbackFunc&& func);
};

Stage2 - Reading data into the buffer.

The callback’s implementation will be something like:

    void canReadCallback()
    {
        while(device_.canRead(InterruptContext())) {
            if ((readBufStart_ + readBufSize_) <= currentReadBufPtr_) {
                break;
            }

            auto ch = device_.read(InterruptContext());
            *currentReadBufPtr_ = ch;
            ++currentReadBufPtr_;
        }
    }

Stage3 - Notifying caller about completion:

Cancelling Asynchronous Operations

The cancellation flow is very similar to the one described in Device-Driver-Component chapter:

If the cancellation is successful, the callback must be invoked with error code indicating that the operation was aborted (embxx::error::ErrorCode::Aborted).

Another possible case of unsuccessful cancellation is when completion interrupt occurs in the middle of cancellation request:

Reading "Until"

There may be a case, when partial read needs to be performed, for example until specific character is encountered. In this case the Driver is responsible to monitor incoming characters and cancel the read into the buffer operation before its completion:

Note, that previously Driver called cancelRead() member function of the Device in event loop (non-interrupt) context, while in "read until" situation the cancellation happens in interrupt mode. That requires Device to implement these functions for both modes:

class MyDevice
{
public:
    bool cancelRead(embxx::device::context::EventLoop) {...}
    bool cancelRead(embxx::device::context::Interrupt) {...}
};

The asyncReadUntil() member function of the Driver should be able to receive any stateless predicate object that defines bool operator()(CharType ch) const. The predicate invocation should return true when expected character is received and reading operation must be stopped.

class MyDriver
{
public:
    template <typename TPred, typename TFunc>
    void asyncReadUntil(
        CharType* buf,
        std::size_t size,
        TPred&& pred,
        TFunc&& func)
    {
        ...
    }
};

It allows using complex conditions in evaluating the character. For example, stopping when either '\r' or '\n' is encountered:

typedef embxx::error::ErrorStatus EmbxxErrorStatus;

driver_.asyncReadUntil(
    buf,
    bufSize,
    [](CharType ch) -> bool
        {
            return (ch == '\r') || (ch == '\n');
        },
    [](const EmbxxErrorStatus& es, std::size_t bytesTransferred)
        {
            ...
        });

Device Implementation

In this section I will try to describe in more details what Device class needs to provide for the Driver to work correctly. First of all it needs to define the type of characters used:

class MyDevice
{
public:
    typedef std::uint8_t CharType;
};

The Driver layer will reuse the definition of the character in its internal functions:

template<typename TDevice, ...>
class MyDriver
{
public:
    typedef typename TDevice::CharType CharType;

    void asyncRead(CharType* buf, std::size_t bufSize, ...) {}
};

There is a need for Device to be able to record callback objects from the Driver in order to notify the latter about an ability to read/write next character and about operation completion.

class MyDevice
{
public:
    template <typename TFunc>
    void setCanReadHandler(TFunc&& func)
    {
        canReadHandler_ = std::forward<TFunc>(func);
    }

    template <typename TFunc>
    void setCanWriteHandler(TFunc&& func)
    {
        canWriteHandler_ = std::forward<TFunc>(func);
    }

    template <typename TFunc>
    void setReadCompleteHandler(TFunc&& func)
    {
        readCompleteHandler_ = std::forward<TFunc>(func);
    }

    template <typename TFunc>
    void setWriteCompleteHandler(TFunc&& func)
    {
        writeCompleteHandler_ = std::forward<TFunc>(func);
    }

private:
    typedef ... OpAvailableHandler;
    typedef ... OpCompleteHandler;

    OpAvailableHandler canReadHandler_;
    OpCompleteHandler readCompleteHandler_;

    OpAvailableHandler canWriteHandler_;
    OpCompleteHandler writeCompleteHandler_;

};

The OpAvailableHandler and OpCompleteHandler type may be either hard coded to be std::function<void ()> and std::function<void (const embxx::error::ErrorStatus&)> respectively or passed as template parameters:

template <typename TCanReadHandler,
          typename TCanWriteHandler,
          typename TReadCompleteHandler,
          typename TWriteCompleteHandler>
class MyDevice
{
public:
    ... // setters are as above

private:

    TCanReadHandler canReadHandler_;
    TReadCompleteHandler readCompleteHandler_;

    TCanWriteHandler canWriteHandler_;
    TWriteCompleteHandler writeCompleteHandler_;
};

Choosing the "template parameters option" is useful when the same Device class is reused between multiple applications for the same product line.

The next stage would be implementing all the required functions:

class MyDevice
{
public:

    typedef embxx::device::context::EventLoop EventLoopContext;
    typedef embxx::device::context::Interrupt InterruptContext;

    // Start read operation - enables interrupts
    void startRead(std::size_t length, EventLoopContext context);

    // Cancel read in event loop context
    bool cancelRead(EventLoopContext context);

    // Cancel read in interrupt context - used only if
    // asyncReadUntil() function was used in Device
    bool cancelRead(InterruptContext context);

    // Start write operation - enables interrupts
    void startWrite(std::size_t length, EventLoopContext context);

    // Cancell write operation
    bool cancelWrite(EventLoopContext context);

    // Check whether there is a character available to be read.
    bool canRead(InterruptContext context);

    // Check whether there is space for one character to be written.
    bool canWrite(InterruptContext context);

    // Read the available character from Rx FIFO of the peripheral
    CharType read(InterruptContext context);

    // Write one more character to Tx FIFO of the peripheral
    void write(CharType value, InterruptContext context);
};

Note, that there may be extra configuration functions specific for the peripheral being controlled. For example baud rate, parity, flow control for UART. Such configuration is almost always platform and/or product specific and usually performed at application startup. It is irrelevant to the Device-Driver-Component model introduced in this book.

class MyDevice
{
public:
    void configBaud(unsigned value) { ... }
    ...
};

The embxx_on_rpi project has multiple applications that use UART1 interface for logging. The peripheral control code is the same for all of them and is implemented in src/device/Uart1.h.

Driver Implementation

Driver must be a generic piece of code, that can be reused with any Device control object (as long as it exposed right public interface) and in any application, including ones without dynamic memory allocation.

First of all, we will need references to Device as well as Event Loop objects:

template <typename TDevice, typename TEventLoop>
class MyDriver
{
public:
    // Reuse definition of character type from the Device
    typedef TDevice::CharType CharType;

    // During the construction store references to Device
    // and Event Loop objects.
    MyDriver(TDevice& device, TEventLoop& el)
      : device_(device),
        el_(el)
    {
        // Register appropriate callbacks with device
        device_.setCanReadHandler(
            std::bind(
                &MyDriver::canReadInterruptHandler, this));
        device_.setReadCompleteHandler(
            std::bind(
                &MyDriver::readCompleteInterruptHandler,
                this,
                std::placeholders::_1));

        device_.setCanWriteHandler(
            std::bind(
                &MyDriver::canWriteInterruptHandler, this));
        device_.setWriteCompleteHandler(
            std::bind(
                &MyDriver::writeCompleteInterruptHandler,
                this,
                std::placeholders::_1));

    }

    ...

private:

    void canReadInterruptHandler() {...}
    void readCompleteInterruptHandler(
        const embxx::error::ErrorStatus& es) {...}

    void canWriteInterruptHandler() {...}
    void writeCompleteInterruptHandler(
        const embxx::error::ErrorStatus& es) {...}

    TDevice& device_;
    TEventLoop& el_;
};

We will also need to store callbacks provided with any asynchronous operation. Note that the "read" and "write" are independent operations and it should be possible to perform asyncRead() and asyncWrite() calls at the same time.

The only way to make Driver generic is to move responsibility of specifying callback storage type up one level, i.e. we must put them as template parameters:

template <typename TDevice,
          typename TEventLoop,
          typename TReadCompleteCallback,
          typename TWriteCompleteCallback>
class MyDriver
{
public:
    ...

    typedef embxx::device::context::EventLoop EventLoopContext;

    template <typename TFunc>
    void asyncRead(
        CharType* buf,
        std::size_t bufSize,
        TFunc&& func)
    {
        readBufStart_ = buf;
        currentReadBufPtr = buf;
        readBufSize_ = bufSize;
        readCompleteCallback_ = std::forward<TFunc>(func);
        driver_.startRead(bufSize, EventLoopContext());
    }

    template <typename TFunc>
    void asyncWrite(
        const CharType* buf,
        std::size_t bufSize,
        TFunc&& func)
    {
        writeBufStart_ = buf;
        currentWriteBufPtr = buf;
        writeBufSize_ = bufSize;
        writeCompleteCallback_ = std::forward<TFunc>(func);
        driver_.startWrite(bufSize, EventLoopContext());
    }

private:
    ...

    // Read info
    CharType* readBufStart_;
    CharType* currentReadBufPtr_;
    std::size_t readBufSize_;
    TReadCompleteCallback readCompleteCallback_;

    // Write info
    const CharType* writeBufStart_;
    const CharType* currentWriteBufPtr_;
    std::size_t writeBufSize_;
    TWriteCompleteCallback writeCompleteCallback_;
};

As it was mentioned earlier in Reading "Until" section, there is quite often a need to stop reading characters into the provided buffer when some condition evaluates to true. It means there is also a need to provide storage for the character evaluation predicate:

template <typename TDevice,
          typename TEventLoop,
          typename TReadCompleteCallback,
          typename TWriteCompleteCallback,
          typename TReadUntilPred>
class MyDriver
{
public:
    ...

    typedef embxx::device::context::EventLoop EventLoopContext;

    template <typename TPred, typename TFunc>
    void asyncReadUntil(
        CharType* buf,
        std::size_t bufSize,
        TPred&& pred,
        TFunc&& func)
    {
        readBufStart_ = buf;
        currentReadBufPtr = buf;
        readBufSize_ = bufSize;
        readCompleteCallback_ = std::forward<TFunc>(func);
        readUntilPred_ = std::forward<TPred>(pred)
        driver_.startRead(bufSize, EventLoopContext());
    }

private:
    ...

    // Read info
    CharType* readBufStart_;
    CharType* currentReadBufPtr_;
    std::size_t readBufSize_;
    TReadCompleteCallback readCompleteCallback_;
    TReadUntilPred readUntilPred_;

    ...
};

The example code above may work, but it contradicts to one of the basic principles of C++: "You should pay only for what you use". In case of using UART for logging, there is no input from the peripheral and it is a waist to keep data members for "read" required to manage "read" operations. Let’s try to improve the situation a little bit by using template specialisation as well as reduce number of template parameters by using "Traits" aggregation struct.

struct MyOutputTraits
{
    // The "read" handler storage type.
    typedef std::nullptr_t ReadHandler;

    // The "write" handler storage type.
    // The valid handler must have the following signature:
    //  "void handler(const embxx::error::ErrorStatus&, std::size_t);"
    typedef embxx::util::StaticFunction<
        void(const embxx::error::ErrorStatus&, std::size_t)> WriteHandler;

    // The "read until" predicate storage type
    typedef std::nullptr_t ReadUntilPred;

    // Read queue size
    static const std::size_t ReadQueueSize = 0;

    // Write queue size
    static const std::size_t WriteQueueSize = 1;
};

Please note, that allowed number of pending "read" requests is specified as 0 in the traits struct above, i.e. the read operations are not allowed. The "read complete" and "read until predicate" types are irrelevant and specified as std::nullptr_t. The instantiation of the Driver object must take it into account and not include any "read" related functionality. In order to achieve this the Driver class needs to have two independent sub-functionalities of "read" and "write". It may be achieved by inheriting from two base classes.

template <typename TDevice,
          typename TEventLoop,
          typename TTraits = MyOutputTraits>
class MyDriver :
    public ReadSupportBase<
                TDevice,
                TEventLoop,
                typename TTraits::ReadHandler,
                typename TTraits::ReadUntilPred,
                TTraits::ReadQueueSize>,
    public WriteSupportBase<
                TDevice,
                TEventLoop,
                typename TTraits::WriteHandler,
                TTraits::WriteQueueSize>
{
    typedef ReadSupportBase<...> ReadBase;
    typedef WriteSupportBase<...> WriteBase;
public:
    template <typename TPred, typename TFunc>
    void asyncRead(
        CharType* buf,
        std::size_t bufSize,
        TFunc&& func)
    {
        ReadBase::asyncRead(buf, bufSize, std::forward<TFunc>(func);
    }

    template <typename TPred, typename TFunc>
    void asyncWrite(
        const CharType* buf,
        std::size_t bufSize,
        TFunc&& func)
    {
        WriteBase::asyncWrite(buf, bufSize, std::forward<TFunc>(func);
    }
};

Now, the template specialisation based on queue size should do the job:

template <typename TDevice,
          typename TEventLoop,
          typename TReadHandler,
          typename TReadUntilPred,
          std::size_t ReadQueueSize>;
class ReadSupportBase;


template <typename TDevice,
          typename TEventLoop,
          typename TReadHandler,
          typename TReadUntilPred>;
class ReadSupportBase<TDevice, TEventLoop, TReadHandler, TReadUntilPred, 1>
{
public:
    ReadSupportBase(TDevice& device, TEventLoop& el) {...}
    ... // Implements the "read" related API
private:
    ... // Read related data members
};

template <typename TDevice,
          typename TEventLoop,
          typename TReadHandler,
          typename TReadUntilPred>;
class ReadSupportBase<TDevice, TEventLoop, TReadHandler, TReadUntilPred, 0>
{
public:
    ReadSupportBase(TDevice& device, TEventLoop& el) {}
    // No need for any "read" related API and data members
};

template <typename TDevice,
          typename TEventLoop,
          typename TWriteHandler,
          std::size_t WriteQueueSize>;
class WriteSupportBase;


template <typename TDevice,
          typename TEventLoop,
          typename TReadHandler>;
class WriteSupportBase<TDevice, TEventLoop, TWriteHandler, 1>
{
public:
    WriteSupportBase(TDevice& device, TEventLoop& el) {...}
    ... // Implements the "write" related API
private:
    ... // Write related data members
};

template <typename TDevice,
          typename TEventLoop,
          typename TWriteHandler>;
class WriteSupportBase<TDevice, TEventLoop, TWriteHandler, 0>
{
public:
    WriteSupportBase(TDevice& device, TEventLoop& el) {}
    // No need for any "write" related API and data members
};

Note, that it is possible to implement general case when read/write queue size is greater than 1. It will require some kind of request queuing (using Static (Fixed Size) Queue for example) and will allow issuing multiple asynchronous read/write requests at the same time.

In order to support this extension, the Device class must implement some extra functionality too:

The new read/write request can be issued by the Driver in interrupt context, after previous operation reported completion.

class MyDevice
{
public:
    void startRead(std::size_t length, InterruptContext context);
    void startWrite(std::size_t length, InterruptContext context);
};

When new asynchronous read/write request is issued to the Driver it must be able to prevent interrupt context callbacks from being invoked to avoid races on the internal data structure:
```
class MyDevice
{
public:
    bool suspendRead(EventLoopContext context);
    void resumeRead(EventLoopContext context)
    bool suspendWrite(EventLoopContext context);
    void resumeWrite(EventLoopContext context);
};
```
Please pay attention to the boolean return value of suspend*() functions. They are like cancel*() ones, there is an indication whether the invocation of the callbacks is suspended or there is no operation currently in progress.

Such generic Driver is already implemented in embxx/driver/Character.h file of embxx library. The Driver is called "Character", because it reads/writes the provided buffer one character at a time.

Character Echo Application

Now, it is time to do something practical. The app_uart1_echo application in embxx_on_rpi project implements simple single character echo.

The System class in System.h file defines the Device and Driver layers:

class System
{
public:
    static const std::size_t EventLoopSpaceSize = 1024;
    typedef embxx::util::EventLoop<
        EventLoopSpaceSize,
        device::InterruptLock,
        device::WaitCond> EventLoop;

    typedef device::InterruptMgr<> InterruptMgr;

    typedef device::Uart1<InterruptMgr> Uart;

    typedef embxx::driver::Character<Uart, EventLoop> UartSocket;

    ...

private:

    ...
    EventLoop el_;
    Uart uart_;
    UartSocket uartSocket_;
};

Note that UartSocket uses default "TTraits" template parameter of embxx::driver::Character, which is defined to be:

struct DefaultCharacterTraits
{
    typedef embxx::util::StaticFunction<
        void(const embxx::error::ErrorStatus&, std::size_t)> ReadHandler;
    typedef embxx::util::StaticFunction<
        void(const embxx::error::ErrorStatus&, std::size_t)> WriteHandler;
    typedef std::nullptr_t ReadUntilPred;
    static const std::size_t ReadQueueSize = 1;
    static const std::size_t WriteQueueSize = 1;
};

It allows usage of both "read" and "write" operations at the same time. Having the definitions in place it is quite easy to implement the "echo" functionality:

// Forward declaration
void writeChar(System::UartSocket& uartSocket, System::Uart::CharType& ch);

void readChar(System::UartSocket& uartSocket, System::Uart::CharType& ch)
{
    uartSocket.asyncRead(&ch, 1,
        [&uartSocket, &ch](const embxx::error::ErrorStatus& es, std::size_t bytesRead)
        {
            GASSERT(!es);
            GASSERT(bytesRead == 1);
            static_cast<void>(es);
            static_cast<void>(bytesRead);
            writeChar(uartSocket, ch);
        });
}

void writeChar(System::UartSocket& uartSocket, System::Uart::CharType& ch)
{
    uartSocket.asyncWrite(&ch, 1,
        [&uartSocket, &ch](const embxx::error::ErrorStatus& es, std::size_t bytesWritten)
        {
            GASSERT(!es);
            GASSERT(bytesWritten == 1);
            static_cast<void>(es);
            static_cast<void>(bytesWritten);
            readChar(uartSocket, ch);
        });
}

int main() {
    auto& system = System::instance();
    auto& uart = system.uart();

    // Configure serial interface
    uart.configBaud(115200);
    uart.setReadEnabled(true);
    uart.setWriteEnabled(true);

    // Start with asynchronous read
    auto& uartSocket = system.uartSocket();
    System::Uart::CharType ch = 0;
    readChar(uartSocket, ch);

    // Run the event loop
    device::interrupt::enable();
    auto& el = system.eventLoop();
    el.run();

    GASSERT(0); // Mustn't exit
    return 0;

Stream-like Printing Interface

As was mentioned earlier, our ultimate goal would be having standard output stream like interface for debug output, which works asynchronously without any blocking busy waits. Such interface must be a generic Component, which works in non-interrupt context, while using recently covered generic "Character" Driver in conjunction with platform specific "Uart" Device.

Such Component should be implemented as two sub-Components. One is "Stream Buffer" which is responsible to maintain circular buffer of written characters and flush them to the peripheral using "Character" Driver when needed. The characters, that have been successfully written, are removed from the internal buffer. The second one is "Stream" itself, which is responsible to convert various values into characters and write them to the end of the "Stream Buffer".

Let’s start with "Output Stream Buffer" first. It needs to receive reference to the Driver it’s going to use:

template <typename TDriver>
class OutStreamBuf
{
public:
    OutStreamBuf(TDriver& driver)
      : driver_(driver)
    {
    }

private:
   TDriver& driver_;
   ...
};

There is also a need to have a buffer, where characters are stored before they are written to the device. Remember that we are trying to create a Component, which can be reused in multiple independent projects, including ones that do not support dynamic memory allocation. Hence, Static (Fixed Size) Queue may be a good choice for it. It means, there is a need to provide size of the buffer as one of the template arguments:

template <typename TDriver,
          std::size_t TBufSize>
class OutStreamBuf
{
public:
   typedef typename TDriver::CharType CharType;
   typedef embxx::container::StaticQueue<CharType, BufSize> Buffer;

private:
   ...
   Buffer buf_;
};

The "Output Stream Buffer" needs to support two main operations:

Pushing new single character at the end of the buffer.
Flushing all (or part of) written characters, i.e. activate asynchronous write with Driver.

When pushing a new character, there may be a case when the internal buffer is full. In this case, the pushed character needs to be discarded and there must be an indication whether "push" operation was successful. The function may return either bool to indicate success of the operation or std::size_t to inform the caller how may characters where written. If 0 is returned, the character wasn’t written.

template <...>
class OutStreamBuf
{
public:
   // Add new character at the end of the buffer
   std::size_t pushBack(CharType ch);

   // Get number of written, not-flushed characters
   std::size_t size();

   // Flush number of characters
   void flush(std::size_t count = size());
   ...
};

This limited number of operations is enough to implement "Output Stream" - like interface. However, "Output Stream Buffer" can be useful in writing any serialised data into the peripheral, not only the debug output. For example using standard algorithms:

OutStreamBuf<...> outStreamBuf(...);
std::array<std::uint8_t, 128> data = {{.../* some data*/}};

std::copy(data.begin(), data.end(), std::back_inserter(outStreamBuf));
outStreamBuf.flush();

In the example above, std::back_inserter requires a container to define push_back() member function:

template <...>
class OutStreamBuf
{
public:
   // Wrap pushBack()
   void push_back(CharType ch)
   {
       pushBack(ch);
   }
   ...
};

There also may be a need to iterate over written, but still not flushed, characters and update some of them before the call to flush(). In other words the "Output Stream Buffer" must be treated as random access container:

template <...>
class OutStreamBuf
{
public:
    typedef embxx::container::StaticQueue<CharType, BufSize> Buffer;
    typedef typename Buffer::Iterator Iterator;
    typedef typename Buffer::ConstIterator ConstIterator;
    typedef typename Buffer::ValueType ValueType;
    typedef typename Buffer::Reference Reference;
    typedef typename Buffer::ConstReference ConstReference;

    bool empty() const;
    void clear();
    void resize(std::size_t newSize);

    Iterator begin();
    Iterator end();

    ConstIterator begin() const;
    ConstIterator end() const;

    ConstIterator cbegin() const;
    ConstIterator cend() const;

    Reference operator[](std::size_t idx);
    ConstReference operator[](std::size_t idx) const;
   ...
};

As was mentioned earlier, the OutStreamBuf uses Static (Fixed Size) Queue as its internal buffer and any characters pushed beyond the capacity gets discarded. There must be a way to identify available capacity as well as request asynchronous notification via callback when requested capacity becomes available:

template <typename TDriver,
          std::size_t TBufSize,
          typename TWaitHandler =
              embxx::util::StaticFunction<void (const embxx::error::ErrorStatus&)> >
class OutStreamBuf
{
public:
    std::size_t availableCapacity() const;

    template <typename TFunc>
    void asyncWaitAvailableCapacity(
        std::size_t capacity,
        TFunc&& func)
    {
        if (capacity <= availableCapacity()) {
            ... // invoke callback via post() member function of Event Loop
        }
        waitAvailableCapacity_ = capacity;
        waitHandler_ = std::forward<TFunc>(func);

        // The Driver is writing some portion of flushed characters,
        // evaluate the capacity again when Driver reports completion.
    }

private:
    ...
    std::size_t waitAvailableCapacity_;
    WaitHandler waitHandler_;
};

Such "Output Stream Buffer" is already implemented in embxx/io/OutStreamBuf.h file of embxx library.

The next stage would be defining the "Output Stream" class, which will allow printing of null terminated strings as well as various integral values.

template <typename TStreamBuf>
class OutStream
{
public:
    typedef typename TStreamBuf::CharType CharType;

    explicit OutStream(TStreamBuf& buf)
    : buf_(buf)
    {
    }

    OutStream(OutStream&) = delete;
    ~OutStream() = default;

    void flush()
    {
        buf_.flush();
    }

    OutStream& operator<<(const CharType* str)
    {
        while (*str != '\0') {
            buf_.pushBack(*str);
            ++str;
        }
        return *this;
    }

    OutStream& operator<<(char ch)
    {
        buf_.pushBack(ch);
        return *this;
    }

    OutStream& operator<<(std::uint8_t value)
    {
        // Cast std::uint8_t to unsigned and print.
        return (*this << static_cast<unsigned>(value));
    }

    OutStream& operator<<(std::int16_t value)
    {
        ... // Cast std::int16_t to int type and print.
        return *this;
    }

    OutStream& operator<<(std::uint16_t value)
    {
        // Cast std::uint16_t to unsigned and print
        return (*this << static_cast<std::uint32_t>(value));
    }

    OutStream& operator<<(std::int32_t value)
    {
        ... // Print signed value
        return *this;
    }

    OutStream& operator<<(std::uint32_t value)
    {
        ... // Print unsigned value
        return *this;
    }

    OutStream& operator<<(std::int64_t value)
    {
        ... // Print 64 bit signed value
        return *this
    }

    OutStream& operator<<(std::uint64_t value)
    {
        ... // Print 64 bit signed value
        return *this
    }

private:
    TStreamBuf& buf_;
};

We will also require the numeric base representation and manipulator. Unfortunately, usage of std::oct, std::dec`or `std::hex manipulators will require inclusion of standard library header <ios>, which in turn includes other standard stream related headers, which define some static objects, which in turn are defined and instantiated in standard library. It contradicts our main goal of writing generic code that doesn’t require standard library to be used. It is better to define such manipulators ourselves:

enum Base
{
    bin, ///< Binary numeric base stream manipulator
    oct, ///< Octal numeric base stream manipulator
    dec, ///< Decimal numeric base stream manipulator
    hex, ///< Hexadecimal numeric base stream manipulator
    Base_NumOfBases ///< Must be last
};

template <typename TStreamBuf>
class OutStream
{
public:
    explicit OutStream(TStreamBuf& buf)
    : buf_(buf)
      base_(dec)
    {
    }

    OutStream& operator<<(Base value)
    {
        base_ = value;
        return *this
    }

private:
    TStreamBuf& buf_;
    Base base_;
};

The value of the numeric base representation must be taken into account when creating string representation of numeric values. The usage is very similar to standard:

OutStream<...> stream;

stream << "var1=" << dec << var1 << "; var2=" << hex << var2 << '\n';
stream.flush();

It may be convenient to support a little bit of formatting, such as specifying minimal width of the output as well as fill character:

class WidthManip : public ValueManipBase<std::size_t>
{
public:
    WidthManip(std::size_t value) : value_(value) {}
    std::size_t value() const { return value_;}
private:
    std::size_t value_;
};

inline
WidthManip setw(std::size_t value)
{
    return WidthManip(value);
}

template <typename T>
class FillManip
{
public:
    FillManip(T value) : value_(value) {}
    T value() const { return value_;}
private:
    T value_;
};

template <typename T>
inline
FillManip<T> setfill(T value)
{
    return FillManip<T>(value);
}

template <typename TStreamBuf>
class OutStream
{
public:
    explicit OutStream(TStreamBuf& buf)
    : buf_(buf)
      base_(dec),
      width_(0),
      fill_(static_cast<CharType>(' ');
    {
    }

    OutStream& operator<<(WidthManip manip)
    {
        width_ = manip.value();
        return *this;
    }

    template <typename T>
    OutStream& operator<<(details::FillManip<T> manip)
    {
        fill_ = static_cast<CharType>(manip.value());
        return *this;
    }

private:
    TStreamBuf& buf_;
    Base base_;
    std::size_t width_;
    CharType fill_;
};

The usage is very similar to the base manipulator:

OutStream<...> stream;

stream << "var1=" << dec << setw(4) << var1 << "; var2=" << hex
       << setfill('0') << var2 << '\n';
stream.flush();

Another useful manipulator is adding '\n' at the end as well as calling flush(), just like std::endl does when using standard output streams:

enum Endl
{
    endl ///< End of line stream manipulator
};

template <typename TStreamBuf>
class OutStream
{
public:

    OutStream& operator<<(Endl manip)
    {
        static_cast<void>(manip);
        buf_.pushBack(static_cast<CharType>('\n');
        flush();
        return *this;
    }


private:
    ...
};

Then usage example may be changed to:

OutStream<...> stream;

stream << "var1=" << dec << setw(4) << var1 << "; var2=" << hex
       << setfill('0') << var2 << endl;

To summarise: The "Output Stream" object converts given integer value into the printable characters and uses pushBack() member function of "Output Stream Buffer" to pass these characters further. The request to flush() is also passed on. When "Output Stream Buffer" receives a request to flush internal buffer it activates the "Character" Driver, which it turn uses "UART" Device to write characters to serial interface one by one. As the result of such cooperation, the "printing" statement is very quick, there is no wait for all the characters to be written before the function returns, like it is usually done with printf(). All the characters are written at the background using interrupts, while the main thread of the application continues its execution without stalling.

Such "Output Stream" is already implemented in embxx/io/OutStream.h file of embxx library.

Logging

In general, debug logging should be under conditional compilation, for example only in DEBUG mode, while the printing code is excluded when compiling in RELEASE mode.

#ifndef NDEBUG
    stream << "Some info massage" << endl;
#endif

Sometimes there is a need to easily change the amount of debug messages being printed. For that purpose, the concept of logging levels is widely used:

namespace log
{

enum Level
{
    Trace, ///< Use for tracing enter to and exit from functions.
    Debug, ///< Use for debugging information.
    Info, ///< Use for general informative output.
    Warning, ///< Use for warning about potential dangers.
    Error, ///< Use to report execution errors.
    NumOfLogLevels ///< Number of log levels, must be last
};

}  // namespace log

The logging statement becomes a macro:

const auto MinLogLevel = log::Info;

#define LOG(stream__, level__, output__) \
    do { \
        if (MinLevel <= (level__)) { \
            (stream__).stream() << output__; \
        } \
    } while (false)

In this case all the logging attempts for level below log::Info get optimised away by the compiler, because the if statement known to evaluate to false at compile time:

LOG(stream, log::Debug, "This message is not printed." << endl);
LOG(stream, log::Info, "This message IS printed." << endl);
LOG(stream, log::Warning, "This message IS printed also." << endl);

It would be nice to be able to add some automatic formatting to the logged statements, such as printing the log level and/or adding '\n' and flushing at the end. For example, the code below

LOG(stream, log::Debug, "This is DEBUG message.");
LOG(stream, log::Info, "This is INFO message.");
LOG(stream, log::Warning, "This is WARNING message.");

to produce the following output

[DEBUG]: This is DEBUG message.
[INFO]: This is INFO message.
[WARNING]: This is WARNING message.

with '\n' character and call to flush() at the end.

It is easy to achieve when using some kind of wrapper logging class around the output stream as well as relevant formatters. For example:

template <log::Level TLevel, typename TStream>
class StreamLogger
{
public:

    typedef TStream Stream;

    static const log::Level MinLevel = TLevel;

    explicit StreamLogger(Stream& outStream)
      : outStream_(outStream)
    {
    }

    Stream& stream()
    {
        return outStream_;
    }

    // Begin output. This function is called before requested
    // output is redirected to stream. It does nothing.
    void begin(log::Level level)
    {
        static_cast<void>(level);
    }

    // End output. This function is called after requested
    // output is redirected to stream. It does nothing.
    void end(log::Level level)
    {
        static_cast<void>(level);
    }

private:
    Stream& outStream_;
};

The logging macro will look like this:

#define SLOG(log__, level__, output__) \
    do { \
        if ((log__).MinLevel <= (level__)) { \
            (log__).begin(level__); \
            (log__).stream() << output__; \
            (log__).end(level__); \
        } \
    } while (false)

A formatter can be defined by exposing the same interface, but wraps the original StreamLogger or another formatter. For example let’s define formatter that calls flush() member function of the stream when output is complete:

template <typename TNextLayer>
class StreamFlushSuffixer
{
public:

    // Constructor, forwards all the other parameters to the constructor
    // of the next layer.
    template<typename... TParams>
    StreamFlushSuffixer(TParams&&... params)
      : nextLavel_(std::forward<TParams>(params)...)
    {
    }

    Stream& stream()
    {
        return nextLavel_.stream();
    }

    void begin(log::Level level)
    {
        nextLavel_.begin(level);
    }

    void end(log::Level level)
    {
        nextLavel_.end(level);
        stream().flush();
    }

private:
    TNextLavel nextLavel_;
};

The definition of such logger would be:

typedef ... OutStream; // type of the output stream
typedef
    StreamFlushSuffixer<
        StreamLogger<
            log::Debug,
            OutStream
        >
    > Log;

The same SLOG() macro will work for this logger with extra formatting:

OutStream stream(... /* construction params */);
Log log(stream);
SLOG(log, log::Debug, "This is DEBUG message.\n");

Let’s also add a formatter that capable of printing any value (and '\n' in particular) at the end of the output.

template <typename T, typename TNextLayer>
class StreamableValueSuffixer
{
public:

    template<typename... TParams>
    explicit StreamableValueSuffixer(T&& value, TParams&&... params)
      : value_(std::forward<T>(value)),
        nextLevel_(std::forward<TParams>(params)...)
    {
    }

    Stream& stream()
    {
        return nextLavel_.stream();
    }

    void begin(log::Level level)
    {
        nextLavel_.begin(level);
    }

    void end(log::Level level)
    {
        nextLavel_.end(level);
        stream() << value_;
    }


private:
    T value_;
    TNextLavel nextLavel_;
};

The definition of the logger that adds '\n' character and then calls flush() member function of the underlying stream would be:

typedef embxx::io::OutStream<...> OutStream;
typedef
    StreamFlushSuffixer<
        StreamableValueSuffixer<
            char,
            StreamLogger<
                log::Debug,
                OutStream
            >
        >
    > Log;

While the construction will require to specify the character which is going to be printed at the end, but before call to flush().

OutStream stream(...);
Log log('\n', stream);
SLOG(log, log::Debug, "This is DEBUG message.");

As the last formatter, let’s do the one that prefixes the output with log level information:

template <typename TNextLayer>
class LevelStringPrefixer
{
public:
    template<typename... TParams>
    LevelStringPrefixer(TParams&&... params);
      : next_value(std::forward<TParams>(params)...)
    {
    }

    Stream& stream()
    {
        return nextLavel_.stream();
    }

    void begin(Level level)
    {
        static const char* const Strings[NumOfLogLevels] = {
            "[TRACE] ",
            "[DEBUG] ",
            "[INFO] ",
            "[WARNING] ",
            "[ERROR] "
        };

        if ((level < NumOfLogLevels) && (Strings[level] != nullptr)) {
            stream() << Strings[level];
        }

        nextLavel_.begin(level);
    }

    void end(log::Level level)
    {
        nextLavel_.end(level);
    }

private:
    TNextLavel nextLavel_;
};

The definition of the logger that prints such a prefix at the beginning and '\n' at the end together with call to flush() would be:

typedef
    StreamFlushSuffixer<
        StreamableValueSuffixer<
            char,
            LevelStringPrefixer<
                StreamLogger<
                    log::Debug,
                    OutStream
                >
            >
        >
    > Log;

Such StreamLogger together with multiple formatters is already implemented in embxx/util/StreamLogger.h file of embxx library.

Logging Application

The app_uart1_logging application in embxx_on_rpi project implements logging of simple counter that gets incremented once a second:

namespace log = embxx::util::log;
template <typename TLog, typename TTimer>
void performLog(TLog& log, TTimer& timer, std::size_t& counter)
{
    ++counter;

    SLOG(log, log::Info,
        "Logging output: counter = " <<
        embxx::io::dec << counter <<
        " (0x" << embxx::io::hex << counter << ")");

    // Perform next logging after a timeout
    static const auto LoggingWaitPeriod = std::chrono::seconds(1);
    timer.asyncWait(
        LoggingWaitPeriod,
        [&](const embxx::error::ErrorStatus& es)
        {
            GASSERT(!es);
            static_cast<void>(es);
            performLog(log, timer, counter);
        });
}

int main() {
    auto& system = System::instance();
    auto& log = system.log();

    // Configure UART
    auto& uart = system.uart();
    uart.configBaud(115200);
    uart.setWriteEnabled(true);

    // Timer allocation
    auto timer = system.timerMgr().allocTimer();
    GASSERT(timer.isValid());

    // Start logging
    std::size_t counter = 0;
    performLog(log, timer, counter);

    // Run event loop
    device::interrupt::enable();
    auto& el = system.eventLoop();
    el.run();

    GASSERT(0); // Mustn't exit
    return 0;
}

The System.h file defines the whole output stack:

class System
{
public:
    static const std::size_t EventLoopSpaceSize = 1024;
    typedef embxx::util::EventLoop<
        EventLoopSpaceSize,
        device::InterruptLock,
        device::WaitCond> EventLoop;

    // Devices
    typedef device::Uart1<InterruptMgr> Uart;
    ...

    // Drivers
    struct CharacterTraits
    {
        typedef std::nullptr_t ReadHandler;
        typedef embxx::util::StaticFunction<
            void(const embxx::error::ErrorStatus&, std::size_t)> WriteHandler;
        typedef std::nullptr_t ReadUntilPred;
        static const std::size_t ReadQueueSize = 0;
        static const std::size_t WriteQueueSize = 1;
    };
    typedef embxx::driver::Character<
        Uart, EventLoop, CharacterTraits> UartDriver;
    ...

    // Components
    static const std::size_t OutStreamBufSize = 1024;
    typedef embxx::io::OutStreamBuf<
        UartDriver, OutStreamBufSize> OutStreamBuf;

    typedef embxx::io::OutStream<OutStreamBuf> OutStream;
    typedef embxx::util::log::StreamFlushSuffixer<
            embxx::util::log::StreamableValueSuffixer<
                const OutStream::CharType*,
                embxx::util::log::LevelStringPrefixer<
                    embxx::util::StreamLogger<
                        embxx::util::log::Debug,
                        OutStream
                    >
                >
            >
        > Log;

    ...
private:

    EventLoop el_;

    // Devices
    Uart uart_;
    ...

    // Drivers
    UartDriver uartDriver_;
    ...

    // Components
    OutStreamBuf buf_;
    OutStream stream_;
    Log log_;
    ...
};

This application will produce the following output to the UART interface with new line appearing every second:

[INFO] Logging output: counter = 1 (0x1)
[INFO] Logging output: counter = 2 (0x2)
[INFO] Logging output: counter = 3 (0x3)
...

Buffered Input

In many systems the UART interfaces are also used to communicate between various microcontrollers on the same board or with external devices. When there are incoming messages, the characters must be stored in some buffer before they can be processed by some Component. Just like we had "Output Stream Buffer" for buffering outgoing characters, we must have "Input Stream Buffer" for buffering incoming ones.

It must obviously have an access to the Character Driver and will probably have a circular buffer to store incoming characters.

template <typename TDriver, std::size_t TBufSize>
class InStreamBuf
{
public:
    typedef typename TDriver::CharType CharType;
    typedef embxx::container::StaticQueue<CharType, TBufSize> Buffer;

    explicit
    InStreamBuf(TDriver& driver)
      : driver_(driver)
    {
    }

private:
    TDriver& driver_;
    Buffer buf_;
};

The Driver won’t perform any read operations unless it is explicitly requested to do so with its asyncRead() member function. Sometimes, there is a need to keep characters flowing in and being stored in the buffer, even when the Component responsible for processing them is not ready. In order to make this happen, the "Input Stream Buffer" must be responsible for constantly requesting the Driver to perform asynchronous read while providing space where these characters are going to be stored.

template <typename TDriver, std::size_t TBufSize>
class InStreamBuf
{
public:
    // Start data accumulation in the internal buffer.
    void start();

    // Stop data accumulation in the internal buffer.
    void stop();

    // Inquire whether characters are being accumulated.
    bool isRunning() const;
};

Most of the times the responsible Component will require some number of characters to be accumulated before their processing can be started. There is a need to provide asynchronous notification callback request when appropriate number of characters becomes available. The callback must be stored in the internal data structures of the "Input Stream Buffer" and invoked when needed. Due to the latter being developed as a generic class, there is a need to provide callback storage type as a template parameter.

template <typename TDriver, std::size_t TBufSize, typename TWaitHandler>
class InStreamBuf
{
public:

    template <typename TFunc>
    void asyncWaitDataAvailable(std::size_t reqSize, TFunc&& func)
    {
        callback_ = std::forward<TFunc>(func)
        ...
    }

private:
    TWaitHandler callback_;
};

Once the required number of characters is accumulated, the Component must be able to access and process them. It means that "Input Stream Buffer" must also be a container with random access iterators.

template <typename TDriver, std::size_t TBufSize, typename TWaitHandler>
class InStreamBuf
{
public:
    typedef typename Buffer::ConstIterator ConstIterator;
    typedef ConstIterator const_iterator;
    typedef typename Buffer::ValueType ValueType;
    typedef ValueType value_type;
    typedef typename Buffer::ConstReference ConstReference;
    typedef ConstReference const_reference;

    // Get size of available for read data.
    std::size_t size() const;

    // Check whether number of available characters is 0.
    bool empty() const;

    //Get full capacity of the buffer.
    constexpr std::size_t fullCapacity() const;

    ConstIterator begin() const;
    ConstIterator end() const;
    ConstIterator cbegin() const;
    ConstIterator cend() const;
    ConstReference operator[](std::size_t idx) const;
};

Please note, that all the access to the characters are done using const iterator. It means we do not allow external and uncontrolled update of the characters inside of the buffer.

When the characters inside the buffer got processed and aren’t needed any more, they need to be discarded to free the space inside the buffer for new ones to come.

template <typename TDriver, std::size_t TBufSize, typename TWaitHandler>
class InStreamBuf
{
public:
    // Consume part or the whole buffer of the available data for read.
    void consume(std::size_t consumeSize = size());
};

Morse Code Application

The app_uart1_morse application in embxx_on_rpi project implements buffering of incoming characters in the "Input Stream Buffer" and uses the Morse Code method to display them by flashing the on-board led.

First of all there is a need to have an access to the led to flash, input buffer to store the incoming characters and timer manager to allocate a timer to measure timeouts.

template <typename TLed, typename TInBuf, typename TTimerMgr>
class Morse
{
public:
    typedef TLed Led;
    typedef TInBuf InBuf;
    typedef TTimerMgr TimerMgr;
    typedef typename TimerMgr::Timer Timer;

    Morse(Led& led, InBuf& buf, TimerMgr& timerMgr)
      : led_(led),
        buf_(buf),
        timer_(timerMgr.allocTimer())
    {
        GASSERT(timer_.isValid());
    }

    ~Morse() = default;

private:
    Led& led_;
    InBuf& buf_;
    Timer timer_;
};

Second, there is a need to define a Morse code sequences in terms of dots and dashes duration as well as mapping an incoming character to the respective sequence.

template <...>
class Morse
{
public:
    typedef typename InBuf::CharType CharType;
    ...
private:
    typedef unsigned Duration;
    static const Duration Dot = 200;
    static const Duration Dash = Dot * 3;
    static const Duration End = 0;
    static const Duration Spacing = Dot;
    static const Duration InterSpacing = Spacing * 2;

    const Duration* getLettersSeq(CharType ch) const
    {
        static const Duration Seq_A[] = {Dot, Dash, End};
        static const Duration Seq_B[] = {Dash, Dot, Dot, Dot, End};
        ...
        static const Duration Seq_Z[] = {Dash, Dash, Dot, Dot, End};

        static const Duration Seq_0[] = {
            Dash, Dash, Dash, Dash, Dash, End};
        static const Duration Seq_1[] = {
            Dot, Dash, Dash, Dash, Dash, End};
        ...
        static const Duration Seq_9[] = {
            Dash, Dash, Dash, Dash, Dot, End};

        static const Duration* Letters[] = {
            Seq_A,
            Seq_B,
            ...
            Seq_Z
        };

        static const Duration* Numbers[] = {
            Seq_0,
            ...
            Seq_9
        };


        if ((static_cast<CharType>('A') <= ch) &&
            (ch <= static_cast<CharType>('Z'))) {
            return Letters[ch - 'A'];
        }

        if ((static_cast<CharType>('a') <= ch) &&
            (ch <= static_cast<CharType>('z'))) {
            return Letters[ch - 'a'];
        }

        if ((static_cast<CharType>('0') <= ch) &&
            (ch <= static_cast<CharType>('9'))) {
            return Numbers[ch - '0'];
        }

        return nullptr;
    }
};

Now, the code that is responsible to flash a led is quite simple:

template <...>
class Morse
{
public:

    void start()
    {
        buf_.start();
        nextLetter();
    }

private:
    void nextLetter()
    {
        buf_.asyncWaitDataAvailable(
            1U,
            [this](const embxx::error::ErrorStatus& es)
            {
                if (es) {
                    GASSERT(buf_.empty());
                    nextLetter();
                    return;
                }

                GASSERT(!buf_.empty());
                auto ch = buf_[0];
                buf_.consume(1U);

                auto* seq = getLettersSeq(ch);
                if (seq == nullptr) {
                    nextLetter();
                    return;
                }

                nextSyllable(seq);
            });
    }

    void nextSyllable(const Duration* seq)
    {
        GASSERT(seq != nullptr);
        GASSERT(*seq != End);

        auto duration = *seq;
        ++seq;

        led_.on();
        timer_.asyncWait(
            std::chrono::milliseconds(duration),
            [this, seq](const embxx::error::ErrorStatus& es)
            {
                static_cast<void>(es);
                GASSERT(!es);

                led_.off();

                if (*seq != End) {
                    timer_.asyncWait(
                        std::chrono::milliseconds(Duration(Spacing)),
                        [this, seq](const embxx::error::ErrorStatus& es)
                        {
                            static_cast<void>(es);
                            GASSERT(!es);
                            nextSyllable(seq);
                        });
                    return;
                }

                timer_.asyncWait(
                    std::chrono::milliseconds(Duration(InterSpacing)),
                    [this](const embxx::error::ErrorStatus& es)
                    {
                        static_cast<void>(es);
                        GASSERT(!es);
                        nextLetter();
                    });
            });
    }

};

The nextLetter() member function waits until one character becomes available in the buffer, then maps it to the sequence and removes it from the buffer. If the mapping exists it calls the nextSyllable() member function to start the flashing sequence. The function activates the led and waits the relevant amount of time, based on the provided dot or dash duration. After the timeout, the led goes off and new wait is activated. However if the end of sequence is reached, the wait will be of InterSpacing duration and nextLetter() member function will be called again, otherwise the wait will be of Spacing duration and nextSyllable() will be called again to activate the led and wait for the next period in the sequence.

Summary

After this quite a significant effort we’ve created a full generic stack to perform asynchronous input/output operations over serial interface, such as UART. It may be reused in multiple independent projects while providing platform specific low level device control object at the bottom of this stack.

GPIO

In many cases, the GPIO input doesn’t need to be processed at the same time the interrupt has occured. It can easilily be scheduled for execution in event loop (non-interrupt) context using Device-Driver-Component model.

According to what was written in Device-Driver-Component chapter and to what we’ve seen so far, the Component provides a callback object together with the asynchronous operation request. The callback is executed only once when the operation is compete, canceled or terminated due to some error. If the operation needs to be repeated, another asynchronous operation needs to be issued to the Driver while providing another callback object to be called on operation completion.

The need for GPIO input handling is a bit different though. The line may change its value multiple times between the reporting of the event to the Component and the latter re-requesting asynchronous wait on value change. The Driver must preserve the callback object, provided by the Component, and invoke it every time the GPIO input value changes until the Component cancels the operation.

Let’s go through all the stages in more detail.

Configuration

The Device must provide a callback object to handle GPIO interrupts on all the requested input lines.

The hardware must also be configured properly: input/output lines, the interrupts on the rising/falling edges, etc. Such configuration is platform/product specific and is not part of the generic Device-Driver-Component model presented in this book. Hence, the product specific Component must get an access to the device object and configure it as needed.

Start Continuous Asynchronous Read Operation

The Driver must be able to support multiple asynchronous read operations on different inputs. It means that it must protect an access to the internal data structures by requesting the Device to suspend the callback invocation (i.e. disable interrupts). Also to follow the pattern we used so far, there must be a request to start or enable the Device's operation on the first read request and cancel or disable it on the last.

The reader may notice that on the first asyncReadCont() request, the Driver issued suspend() request to the Device and got false in return. It means that the Device's monitoring of the GPIO inputs hasn’t been started yet. That’s the reason for the following call to enable(). On the second asyncReadCont() request the call to suspend() returned true which was followed by the resume() later.

Reporting GPIO Input Event

Now, every time the relevant GPIO interrupt occurs, the Driver's handler is invoked in interrupt mode context. It is responsible to schedule the execution of Component's handler in event loop (non-interrupt) context.

Cancel Continuous Read Operation

When the there is no need to monitor some input any more, the Component may request the Driver to cancel the continuous asynchronous read operation. In case of last recorded asynchronous read operation being canceled, the Driver is responsible to let the Device know that no more GPIO interrupts are needed:

GPIO Device

Based on the information above, the platform specific GPIO control Device object must provide the following public interface:

Define pin identification type.
```
typedef unsigned PinIdType;
```
Function to provide a callback object to be called when interrupt occurs. The callback parameters must provide an information of pin as well as final input value that caused the interrupt. The callback object must implement the following signature: "void (PinIdType, bool)" where the first parameter is pin and second parameter is input value.
```
template <typename TFunc>
void setHandler(TFunc&& func);
```

Function to start / enable the GPIO input monitoring.

void start(embxx::device::context::EventLoop context);

Function to cancel / disable the GPIO input monitoring.

bool cancel(embxx::device::context::EventLoop context);

Function to enable/disable gpio interrupts for single pin.

void setEnabled(
    PinIdType pin,
    bool enabled,
    embxx::device::context::EventLoop context);

Function to suspend invocation of callback in interrupt mode, i.e. disable gpio interrupts.
```
bool suspend(embxx::device::context::EventLoop context);
```
Function to resume suspended invocation of callback in interrupt mode, i.e. enable gpio interrupts.
```
void resume(embxx::device::context::EventLoop context);
```

Such GPIO control Device class for RaspberryPi platform is implemented in src/device/Gpio.h file of embxx_on_rpi project.

GPIO Driver

First of all, we will need references to Device as well as Event Loop objects:

template <typename TDevice, typename TEventLoop>
class MyGpioDriver
{
public:
    // During the construction store references to Device
    // and Event Loop objects.
    MyGpioDriver(TDevice& device, TEventLoop& el)
      : device_(device),
        el_(el)
    {
        // Register appropriate interrupt callbacks with device
        device_.setHandler(...);
    }

    ...

private:

    TDevice& device_;
    TEventLoop& el_;
};

The Driver must also provide an ability to perform and cancel continuous asynchronous read operations for multiple pins:

template <typename TDevice, typename TEventLoop>
class MyGpioDriver
{
public:
    typedef typename TDevice::PinIdType PinIdType;

    template <typename TFunc>
    void asyncReadCont(PinIdType id, TFunc&& func) { ... }

    bool cancelReadCont(PinIdType id) { ... }
};

Like with any asynchronous operation so far the callback must receive status information as its first parameter and probably the value of the input as the second one. When the operation canceled with cancelReadCont(), the callback must be invoked one last time with status specifying that operation was Aborted.

The Driver is supposed to be a generic piece of code that can be reused in multiple independent products, including ones without dynamic memory allocation and/or exceptions. It means that the Driver class must receive maximum number of the pins it is going to support and type of the callback storage.

template <typename TDevice,
          typename TEventLoop,
          std::size_t TNumOfLines,
          typename THandler =
              embxx::util::StaticFunction<void (const embxx::error::ErrorStatus&, bool)> >
class MyGpioDriver
{
public:
    template <typename TFunc>
    void asyncReadCont(PinIdType id, TFunc&& func)
    {
        ...
        auto* node = ...; // Locate or allocate appropriate node
        node->id_ = id;
        node->handler_ = std::forward<TFunc>(func);
        ...
    }

private:
    struct Node
    {
        Node() : id_(PinIdType()) {}

        PinIdType id_;
        THandler handler_;
    };

    typedef std::array<Node, TNumOfLines> Infos;

    Infos infos_;
    ...
};

The Driver doesn’t do anything special, it just receives the notification from the Device that gpio interrupt has occurred, locates the appropriate registered Component's callback object (based on the pin information provided by the Device), and uses Event Loop to schedule an execution of the Component's callback together with information about input’s value in event loop (non-interrupt) context.

Such generic GPIO Driver is already implemented in embxx/driver/Gpio.h file of embxx library.

Button Component

The embxx_on_rpi project has a simple button Component, implemented in src/component/Button.h. It configures provided GPIO line to be an input and to have both rising and falling edges interrupts. It also exposes simple interface to be able to monitor button presses and releases.

template <typename TDriver,
          bool TActiveState,
          typename THandler = embxx::util::StaticFunction<void ()> >
class Button
{
public:
    typedef TDriver Driver;
    typedef typename Driver::PinIdType PinIdType;

    Button(Driver& driver, PinIdType pin);
    ~Button();

    bool isPressed() const;

    template <typename TFunc>
    void setPressedHandler(TFunc&& func);

    template <typename TFunc>
    void setReleasedHandler(TFunc&& func);
};

Button Press Monitoring Application

The embxx_on_rpi project also contains a simple application called app_button. It monitors presses and releases of a single button connected to one of the GPIO lines. When the button is pressed, the led is turned on for 1 second and "Button Pressed" string is logged to UART. When the button is released, just "Button Released" string is logged to UART without influencing the led state. If new button press is recognised prior to 1 second timeout for the led being on, the led stays on and a new 1 second timer countdown is started.

Thanks to the Device-Driver-Component model and all levels of abstractions, the application code is quite simple.

int main() {
    auto& system = System::instance();

    // Configure uart
    auto& uart = system.uart();
    uart.configBaud(9600);
    uart.setWriteEnabled(true);

    // Allocate timer
    auto& timerMgr = system.timerMgr();
    auto timer = timerMgr.allocTimer();
    GASSERT(timer.isValid());

    // Set handlers for button press / release
    auto& button = system.button();
    button.setPressedHandler(
        std::bind(
            &buttonPressed,
            std::ref(timer)));

    button.setReleasedHandler(&buttonReleased);

    // Run event loop with enabled interrupts
    device::interrupt::enable();
    auto& el = system.eventLoop();
    el.run();

    GASSERT(0); // Mustn't exit
	return 0;
}

The code for "button pressed" is as following:

void buttonPressed(System::TimerMgr::Timer& timer)
{
    static_cast<void>(timer);
    auto& system = System::instance();
    auto& el = system.eventLoop();
    auto& led = system.led();
    auto& log = system.log();

    SLOG(log, embxx::util::log::Info, "Button Pressed");

    timer.cancel();
    auto result = el.post(
        [&led]()
        {
            led.on();
        });
    GASSERT(result);
    static_cast<void>(result);

    static const auto WaitTime = std::chrono::seconds(1);
    timer.asyncWait(
        WaitTime,
        [&led](const embxx::error::ErrorStatus& es)
        {
            if (es == embxx::error::ErrorCode::Aborted) {
                return;
            }
            led.off();
        });
}

The code for "button release" is very simple:

void buttonReleased()
{
    auto& system = System::instance();
    auto& log = system.log();

    SLOG(log, embxx::util::log::Info, "Button Released");
}

I2C

I2C is serial communication bus. It is very popular in embedded development and mostly used to communicate to various low speed peripherals, such as eeproms and various sensors.

The control and use of I2C fits nicely into the Device-Driver-Component model described in this book. It is a serial interface and the controlling Device object will have to read/write characters one by one, just like it was with UART. It would be nice if we coud reuse the Character Driver we implemented before. However, the I2C is multi-master / multi-slave bus and there is a need to specify the slave ID (or address) when initiating read and/or write operation.

ID Adaptor

It is quite clear that some kind of ID Device Adaptor is needed. It will be constructed with additional ID parameter and will be responsible to forward all the API calls from the Character Driver to I2C Device while adding one extra parameter of ID.

The implementation of such adaptor is very simple and straightforward:

template <typename TDevice>
class IdAdaptor
{
public:
    // Type of the underlaying device.
    typedef TDevice Device;

    // Character type defined in the wrapped device
    typedef typename TDevice::CharType CharType;

    // Device identification type defined in the wrapped device class.
    typedef typename TDevice::DeviceIdType DeviceIdType;

    IdAdaptor(Device& device, DeviceIdType id)
      : device_(device),
        id_(id)
    {
    }

    template <typename TFunc>
    void setCanReadHandler(TFunc&& func)
    {
        device_.setCanReadHandler(id_, std::forward<TFunc>(func));
    }

    template <typename TFunc>
    void setCanWriteHandler(TFunc&& func)
    {
        device_.setCanWriteHandler(id_, std::forward<TFunc>(func));
    }

    template <typename TFunc>
    void setReadCompleteHandler(TFunc&& func)
    {
        device_.setReadCompleteHandler(id_, std::forward<TFunc>(func));
    }

    template <typename TFunc>
    void setWriteCompleteHandler(TFunc&& func)
    {
        device_.setWriteCompleteHandler(id_, std::forward<TFunc>(func));
    }

    template <typename... TArgs>
    void startRead(TArgs&&... args)
    {
        device_.startRead(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    bool cancelRead(TArgs&&... args)
    {
        return device_.cancelRead(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    void startWrite(TArgs&&... args)
    {
        device_.startWrite(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    bool cancelWrite(TArgs&&... args)
    {
        return device_.cancelWrite(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    bool suspend(TArgs&&... args)
    {
        return device_.suspend(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    void resume(TArgs&&... args)
    {
        device_.resume(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    bool canRead(TArgs&&... args)
    {
        return device_.canRead(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    bool canWrite(TArgs&&... args)
    {
        return device_.canWrite(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    CharType read(TArgs&&... args)
    {
        return device_.read(id_, std::forward<TArgs>(args)...);
    }

    template <typename... TArgs>
    void write(TArgs&&... args)
    {
        device_.write(id_, std::forward<TArgs>(args)...);
    }

private:
    Device& device_;
    DeviceIdType id_;
};

The same adaptor class is implemented in embxx/device/IdDeviceCharAdapter.h file of embxx library.

Operations Queue

The I2C protocol allows existence of multiple independent slaves on the same bus. It means there may be several independent Components that communicate to different I2C devices (for example EEPROM and temperature sensor), but must share the same Device control object and may issue read/write requests to it in parallel. To resolve this problem, there must be some kind of operation queuing facility that is responsible to queue all the read/write requests to the Device and issue them one by one.

The objects' usage map looks like this:

Such queue is a platform/product independent piece of code and it should be implemented without using dynamic memory allocation and/or exceptions. It means that it should receive number of various Driver objects, that may issue independent read/write requests to it (i.e. size of the internal queue), as a template parameter and probably use Static (Fixed Size) Queue to queue all the requests that are coming in. It should also receive callback storage types to report when a new character can be read/written, as well as when read/write operation is complete.

template <typename TDevice,
          std::size_t TSize,
          typename TCanDoOpHandler = embxx::util::StaticFunction<void()>,
          typename TOpCompleteHandler =
              embxx::util::StaticFunction<void (const embxx::error::ErrorStatus&)> >
class DeviceOpQueue
{
public:
    DeviceOpQueue(TDevice& device);
    ...
private:
    typedef embxx::container::StaticQueue<..., TSize> Queue;
    Queue queue_;
};

When the TSize template parameter is set to 1, there is no need for all the queuing facility and the DeviceOpQueue class may become a simple pass-through inline class using template specialisation:

template <typename TDevice>
class DeviceOpQueue<TDevice, 1>
{
public:

    typedef typename TDevice::PinIdType PinIdType;

    template <typename... TArgs>
    void startRead(TArgs&&... args)
    {
        device_.startRead(std::forward<TArgs>(args)...)
    }

    template <typename... TArgs>
    bool cancelRead(PinIdType id, TArgs&&... args)
    {
        static_cast<void>(id); // No use for id in the Device itself
        return device_.cancelRead(std::forward<TArgs>(args)...)
    }

    template <typename... TArgs>
    bool suspend(PinIdType id, TArgs&&... args)
    {
        static_cast<void>(id); // No use for id in the Device itself
        return device_.suspend(std::forward<TArgs>(args)...)
    }

    ...
};

Such queue is also implemented in embxx library. It resides in the embxx/device/DeviceOpQueue.h file.

Please note that ID Adaptor and [peripherals-i2c-operations_queue] are both Device layer classes. The serve as wrappers to actual peripheral control Device in order to expose the right interface to the upper layer Driver.

I2C Device

The only thing that remains is to properly implement I2C control device, which can be used by the DeviceOpQueue, which in turn is used by the IdAdaptor. The IdAdaptor object can be used with the existing Character Driver implemented to be used with the UART peripheral.

Based on the information above, the platform specific I2C control Device object must provide the following public interface:

class I2CDevice
{
public:
    // Single character type
    typedef std::uint8_t CharType;

    // ID type
    typedef std::uint8_t DeviceIdType;

    // Context types
    typedef embxx::device::context::EventLoop EventLoopContext;
    typedef embxx::device::context::Interrupt InterruptContext;

    // Set various interrupt handlers
    template <typename TFunc>
    void setCanReadHandler(TFunc&& func);

    template <typename TFunc>
    void setCanWriteHandler(TFunc&& func);

    template <typename TFunc>
    void setReadCompleteHandler(TFunc&& func);

    template <typename TFunc>
    void setWriteCompleteHandler(TFunc&& func);

    // Start read for both contexts.
    void startRead(DeviceIdType address, std::size_t length, EventLoopContext);
    void startRead(DeviceIdType address, std::size_t length, InterruptContext);

    // Cancel read for both contexts.
    bool cancelRead(EventLoopContext);
    bool cancelRead(InterruptContext);

    // Start write for both contexts.
    void startWrite(DeviceIdType address, std::size_t length, EventLoopContext);
    void startWrite(DeviceIdType address, std::size_t length, InterruptContext);
        TContext context);

    // Cancel write for both contexts.
    bool cancelWrite(EventLoopContext);
    bool cancelWrite(InterruptContext);

    // Suspend/Resume
    bool suspend(EventLoopContext);
    void resume(EventLoopContext);

    // Helper functions to manage read/write during the interrupt
    bool canRead(InterruptContext);
    bool canWrite(InterruptContext);
    CharType read(InterruptContext);
    void write(CharType value, InterruptContext);
};

Such device to control I2C0 interface on RaspberryPi platform is implemented in src/device/I2C0.h file of embxx_on_rpi project.

EEPROM Access Application

The embxx_on_rpi project contains an application called app_i2c0_eeprom. It implements a parallel access to 2 EEPROMs connected to the same I2C0 bus, but having different addresses. The EEPROMs are accessed independently at the same time with read/write operations. These operations are queued and managed by the DeviceOpQueue object that wraps actual I2C control Device and forwards the requests one by one.

SPI

SPI is also quite popular serial communication interface. It is very similar to I2C in terms of using it the Device-Driver-Component model described in this book. The main differences are:

SPI uses "chip select" identification method instead of "address" of the peripheral.
SPI is a double direction link - there are always read and write operations that are executed in parallel (instead of only read or only write).

The "chip select" slave identefication will require the same "ID Adaptor" that was used for I2C integration.

Just like with I2C, the SPI is a multi-slave bus. It allows connection of multiple independent devices to the same MISO/MOSI/CLK lines of the SPI interface. It means there is a need for the same "Operations Queue" that was used for I2C integration. Due to the fact that SPI is a double direction link, the "Operations Queue" must be able to forward, say, read operation request to the actual Device even if "write" operation to the same slave device is already in progress.

It means that the objects' usage map is exactly the same as with I2C.

All the intermediate layers (Character Driver, ID Adaptor, Operations Queue) in the map above must allow issuing read and write operations at the same time. It becomes a responsibility of the product specific Component to be aware what kind of the Device is used and not to issue these requests in parallel if the actual Device (such as I2C) doesn’t support it.

SPI Device

Based on the information above, the platform specific SPI control Device object must provide and implement exactly the same interface as I2C Device:

class SpiDevice
{
public:
    // Single character type
    typedef std::uint8_t CharType;

    // ID type - chip select index
    typedef unsigned DeviceIdType;

    // Context types
    typedef embxx::device::context::EventLoop EventLoopContext;
    typedef embxx::device::context::Interrupt InterruptContext;

    // Set various interrupt handlers
    template <typename TFunc>
    void setCanReadHandler(TFunc&& func);

    template <typename TFunc>
    void setCanWriteHandler(TFunc&& func);

    template <typename TFunc>
    void setReadCompleteHandler(TFunc&& func);

    template <typename TFunc>
    void setWriteCompleteHandler(TFunc&& func);

    // Start read for both contexts.
    void startRead(DeviceIdType chipSelect, std::size_t length, EventLoopContext);
    void startRead(DeviceIdType chipSelect, std::size_t length, InterruptContext);

    // Cancel read for both contexts.
    bool cancelRead(EventLoopContext);
    bool cancelRead(InterruptContext);

    // Start write for both contexts.
    void startWrite(DeviceIdType chipSelect, std::size_t length, EventLoopContext);
    void startWrite(DeviceIdType chipSelect, std::size_t length, InterruptContext);
        TContext context);

    // Cancel write for both contexts.
    bool cancelWrite(EventLoopContext);
    bool cancelWrite(InterruptContext);

    // Suspend/Resume
    bool suspend(EventLoopContext);
    void resume(EventLoopContext);

    // Helper functions to manage read/write during the interrupt
    bool canRead(InterruptContext);
    bool canWrite(InterruptContext);
    CharType read(InterruptContext);
    void write(CharType value, InterruptContext);
};

Such device to control SPI0 interface on RaspberryPi platform is implemented in src/device/Spi0.h file of embxx_on_rpi project.

Other Nuances

SPI is quite often used with external persistent storage, such as SD card. Such devices may have some significant delays between the block write operation on the MOSI line and the time they send an acknowledgement about operation completion on the MISO line. The SPI Device must constantly read the incoming bytes until the expected ACK/NACK byte is received without de-asserting the CS (chip select). If the Component, responsible for managing SPI flash memory, issues only single "read" operation to wait for such an acknowledgement, the provided buffer may get full before the required byte is received. In this case the SPI control Device object is not aware that the new "read" request may follow and has to de-assert the CS, which is undesireble.

In order to solve this problem, the Character Driver described in UART chapter must be extended to support issuing multiple read/write operations at the same time. Such extension is based on the values of ReadQueueSize/WriteQueueSize in the provided Traits class. These values indicate maximal number of simultaneous read/write operations that may be issued to the Driver. The responsible Component, in turn, must perform 2 or 3 "read until" operations at the same time to wait for the expected response. Once the first buffer is full, the Driver will post the Component's callback object for execution in the event loop context, while calling startRead() member function of the Device for the next pending "read until" operation still in interrupt context to fill the second buffer. The Device is responsible to continue its read operation without de-asserting the CS line. While the second buffer being filled, the Component has enough time to identify that there is no response in the filled buffer and re-issue the "read until" request to the Driver while reusing the same buffer. This circle of "read until" requests must continue until expected response is encountered or until operation timeout, which is measured independently by the asynchronous wait request to the [Timer](timer.md). It is up to the responsible Component object to manage the operations to the Character Driver as well as the Timer in event loop context and cancel one upon execution of callback from another.

External Storage

As was mentioned in previous section, SPI is often used with external persistent storage, such as SD card. In order to properly support it, there must be some kind of SpiFlash management Component, that is responsible to implement proper communication protocol while providing necessary public interface. The minimal required interface will have to be able to:

Asynchronously initialise the device.
Asynchronously read block of data.
Asynchronously write block of data.

Once such Component is implemented and tested, the next stage would be implementing proper file system (FAT32) management Component, using the asynchronous functions of the former. It will allow processing time consuming file system reads and writes while still allowing processing of all other events without creating any performance bottlenecks and without requiring any complex independent task scheduling.

Other

There are many other peripherals and/or protocols (such as I2S, USB, one wire). The implementation and the main concepts should be pretty similar to the peripherals covered so far. At this stage I do not plan to do it in this book. At least not in the near future.

Various micro-controllers may also support DMA access to some peripherals. In this case the Character Driver that was covered in UART chapter must be replaced with some kind of Block Driver, that will allow issuing of multiple read/write requests at the same time and will receive only "operation complete" notifications from the Device. I leave implementation of it as an excercise for the reader. At least for now.