Architecture


Introduction


Ultibo core is, by design, a development environment for embedded devices. It currently supports all models of the Raspberry Pi, and this will most likely expand to include other popular development boards as time and resources permit. Because it is not an operating system and is not derived from an existing operating system, the design decisions made reflect the needs of the environment and intentionally leave out certain general purpose operating system concepts.

The unit of execution in Ultibo core is the thread. There is no concept of a process, but multi threading is fundamental to the design and cannot be removed. There is also no concept of a user space or kernel space; all threads run within a single address space and all threads have equal privileges.

All elements are intended to be dynamic in their implementation. For example, devices are intended to be added and removed dynamically, with notifications used to signal interested parties about the addition or removal. Only a few internal items have a fixed limit, a specific example being TLS index slots which are statically allocated during boot.

The design is also intended to be modular so that only the required elements are included in any given application. A small number of mandatory elements are included in every application, such as memory management, threading and scheduling, interrupt handling, device management and the core functionality of the FPC run time library. Most other features, such as the filesystem and network interfaces, USB, MMC/SD and hardware specific drivers, can be omitted if they are not required.

Platform support


With an intention to support a range of embedded devices and development boards over time, the architecture of Ultibo core tries to separate platform specific functionality from common functionality. Each supported board has both a boot module and a platform module which are specific to that board and provide the necessary initialization and platform functionality to support the common modules without requiring conditional compilation.

For the Raspberry Pi (Models A, B, A+, B+, Zero and ZeroW*) the board specific modules are:

BootRPi.pas
PlatformRPi.pas

* Each of these models uses the BCM2835 System on Chip, which means they are identical from a code perspective; differences can be detected by model or revision numbers.

For the Raspberry Pi 2B the board specific modules are:

BootRPi2.pas
PlatformRPi2.pas

For the Raspberry Pi 3 models (which includes 3A+, 3B, 3B+, CM3, CM3+ and Zero2W) the board specific modules are:

BootRPi3.pas
PlatformRPi3.pas

And for the Raspberry Pi 4 models (which includes 4B, 400 and CM4) the board specific modules are:

BootRPi4.pas
PlatformRPi4.pas

In addition there are architecture specific modules for the ARM processor and for the ARM processor type, these include:

PlatformARM.pas (for functionality common to all ARM processors)
PlatformARMv6.pas (this includes the ARM1176JZF-S used in the Raspberry Pi A, B, A+, B+, Zero and ZeroW)
PlatformARMv7.pas (this includes the ARM Cortex-A7 MPCore used in the Raspberry Pi 2B, and the ARM Cortex-A53 and Cortex-A72 used by the Raspberry Pi 3 and 4 respectively in 32-bit mode)
PlatformARMv8.pas (this includes the ARM Cortex-A53 MPCore used in the Raspberry Pi 3 and the Cortex-A72 used in the Raspberry Pi 4 in 64-bit* mode)

The appropriate boot module is included in an application based on the Controller Type selected and passed to the compiler (-Wp). The boot module then includes the relevant platform and architecture modules as well as the mandatory common modules for platform, memory, threads, devices etc.

* Full 64-bit support in Ultibo is not yet complete.
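
As a rough illustration of how these pieces fit together, a minimal program for a Raspberry Pi 3 might look like the sketch below (based on the standard Hello World example; the grouping unit RaspberryPi3 pulls in the boot, platform and mandatory common modules, while the console related names are taken from that example and should be checked against the Console unit):

  program HelloUltibo;

  {$mode objfpc}{$H+}

  uses
   RaspberryPi3, {Includes BootRPi3, PlatformRPi3 and the other mandatory units}
   GlobalConst,
   GlobalTypes,
   Threads,
   Console;

  var
   WindowHandle:TWindowHandle;

  begin
   {Create a console window on the default console device}
   WindowHandle:=ConsoleWindowCreate(ConsoleDeviceGetDefault,CONSOLE_POSITION_FULL,True);
   ConsoleWindowWriteLn(WindowHandle,'Hello from Ultibo core');

   {Halt the main thread, all other threads continue to run}
   ThreadHalt(0);
  end.

The program would then be built with the matching Controller Type passed to the compiler, for example -WpRPI3B for a Raspberry Pi 3B.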

Memory Management


The HeapManager unit is a mandatory inclusion in all applications. During boot it registers itself with the RTL so that memory allocation via GetMem, FreeMem etc. functions normally. The platform specific module for the board then registers with the HeapManager all of the available blocks of memory that can be allocated. How this occurs is specific to the board, but on the Raspberry Pi it is passed by the boot loader in a tag (or ATAG) parameter or as part of the DTB (device tree blob) provided by the firmware.

All memory allocation must go via the HeapManager, either by calling the standard RTL functions or by calling the additional functions it provides. Depending on the platform, the command line and information passed by the boot loader, various types of memory are registered and made available for allocation. These include Normal, Shared, Non Cached, Code and Device memory, which are required by certain Ultibo core components or can be requested by calling HeapManager functions.
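
For example, a normal allocation via the standard RTL and a request for one of the additional memory types might look like the sketch below (GetMem and FreeMem are the standard RTL entry points; GetSharedMem is an assumed name for one of the additional HeapManager functions described above and should be checked against the HeapManager unit):

  uses
   HeapManager;

  procedure AllocateBuffers;
  var
   Normal:Pointer;
   Shared:Pointer;
  begin
   {Normal memory via the standard RTL, which routes through the HeapManager}
   GetMem(Normal,4096);

   {Shared memory, suitable for data accessed by multiple CPUs or by DMA capable devices}
   {GetSharedMem is an assumed function name}
   Shared:=GetSharedMem(4096);

   {...use the buffers...}

   FreeMem(Shared);
   FreeMem(Normal);
  end;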

Memory Mapping


Even though all threads share a single address space and all memory is mapped one to one between physical and virtual that doesn't mean there are no access controls in Ultibo core. During startup the platform module sets up a default page table before enabling the memory management unit (MMU) and caching.

The default page table specifies that code memory (defined by the linker as the TEXT section) is executable but read only, whereas data memory (the DATA and BSS sections) is not executable and is read write. All other normal memory available for allocation is defined as read write and not executable, so when a stack is allocated for a thread it will not be executable, in order to avoid buffer overflow issues executing code on the stack.

A number of other sections of the memory map are defined with specific settings as well; for example, the vector table is executable and read only, and the page tables themselves are read write but not executable. Of particular importance is the page located at address zero (0x00000000), which is defined with no access at all so as to provide a "guard page". Using this, a read or write to a nil pointer or a call to an unassigned procedure variable will result in a hardware exception that can be caught by exception handling in an application, avoiding what would otherwise be a crash.
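
As a simple illustration of the guard page, the sketch below dereferences a nil pointer inside a try..except block; instead of silently corrupting memory the access faults in the zero page and surfaces as an exception the application can handle (the exact exception class raised is platform specific, so the generic Exception type is used here):

  uses
   SysUtils;

  procedure TestGuardPage;
  var
   Value:PLongWord;
  begin
   Value:=nil;
   try
    {A write through a nil pointer lands in the zero page which has no access}
    Value^:=1;
   except
    on E:Exception do
     begin
      {Recover or report the error here rather than crashing}
     end;
   end;
  end;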

Based on the board type and other parameters, sections of the memory map are also defined as shared, device or per CPU and there is even an allowance for allocating executable memory to keep open the possibility of loadable modules in future.

The page table layout can be viewed using the Page Tables link in the WebStatus unit which is also included in the demo image.

Memory Layout


Like almost everything else in Ultibo, the memory map is not fixed but is created dynamically during boot, depending on things like the size of your image, the amount of RAM in the board and where the peripherals are located.

On the Raspberry Pi the bootloader always loads our image at $00008000, and since we are not doing any kernel to user separation or any virtual memory mapping there is currently no reason to move the code from where it was loaded. Based on that, the general outline of memory looks like this on a Raspberry Pi 2 or 3:

Start       End         Size   Description
$00000000   $00001000   4KB    Zero page (no access at all) Used to make nil pointer reads or writes generate an exception
$00001000   $00002000   4KB    Vector Table (interrupt vectors)
$00002000   $00004000   8KB    Unused at present (reserved for future use)
$00004000   $00008000   16KB   First level page table (the coarse page table with a granularity of 1MB on ARM)
$00008000   $XXXXXXXX   -      Executable code loaded with the image (size determined by the value of _etext from the RTL)
$XXXXXXXX   $XXXXXXXX   -      Initialized data loaded with the image (size determined by the values of _data and _edata from the RTL)
$XXXXXXXX   $XXXXXXXX   -      Uninitialized data loaded with the image (size determined by the values of _bss_start and _bss_end from the RTL)
$XXXXXXXX   $XXXXXXXX   -      Second level page tables (size calculated during boot based on the size of the image) Normally >256KB
$XXXXXXXX   $XXXXXXXX   32KB   Initial stack (stack used during boot)
$XXXXXXXX   $XXXXXXXX   64KB   Initial heap (starting heap used during boot)
$XXXXXXXX   $XXXXXXXX   -      Main heap (determined by the ATAGs values passed by the boot loader)
$XXXXXXXX   $XXXXXXXX   -      Non shared memory (default 8MB)
$XXXXXXXX   $XXXXXXXX   -      Non cached memory (default 16MB)
$XXXXXXXX   $XXXXXXXX   -      Device memory (default 0MB)
$XXXXXXXX   $XXXXXXXX   -      Shared memory (default 32MB)
$XXXXXXXX   $XXXXXXXX   -      Local memory (default 8MB per CPU)
$XXXXXXXX   $XXXXXXXX   -      IRQ memory (default 8MB per CPU)
$XXXXXXXX   $XXXXXXXX   -      FIQ memory (default 8MB per CPU)
$XXXXXXXX   $3F000000   -      GPU memory (size determined by config.txt settings) Default 64MB starting at $3B000000
$3F000000   $40000000   16MB   Peripherals
$40000000   $40100000   1MB    Local Peripherals

On Raspberry Pi A/B/Zero the Peripherals are from $20000000 to $21000000 and the Local Peripherals do not exist whereas on the Raspberry Pi 4 the Peripherals range has moved to $FE000000 with the Local Peripherals immediately following at $FF800000.

You can see the actual calculation of the memory layout by looking at the function RPiPageTableInit in PlatformRPi (or RPi2PageTableInit / RPi3PageTableInit / RPi4PageTableInit depending on which model you are using). This function uses information calculated during the early stages of boot to set up the actual memory layout.

Threads and Scheduling


Even the simplest "Hello World" example is multi threaded in Ultibo core and threading is an integral part of the design. When Ultibo core starts it creates a number of threads with special purposes such as the Idle thread which tracks CPU utilization but also provides an always ready thread for the scheduler to select. There are also IRQ, FIQ and SWI threads which are used to handle interrupts so that RTL elements like threadvars and exceptions also work during interrupt handling.

Timer and worker threads are used by many common components to perform callbacks and notification or response to events. Timers and workers are actually just normal threads, but their sole purpose is to wait for registered timer and worker events to be triggered and process them as quickly as possible.
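
A sketch of handing a task to a worker thread is shown below (WorkerSchedule is provided by the Threads unit; the exact signature shown, an interval plus task, data and completion callback parameters, is an assumption based on the description above and should be checked against the unit):

  uses
   Threads;

  procedure BackgroundTask(Data:Pointer);
  begin
   {Runs on one of the worker threads, not on the calling thread}
  end;

  procedure QueueBackgroundTask;
  begin
   {An interval of 0 is assumed to request execution as soon as a worker is available,
    with no data and no completion callback}
   WorkerSchedule(0,@BackgroundTask,nil,nil);
  end;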

The other thread that is always created is the Main thread, which runs the application code between begin and end in the program file. The main thread is also just a normal thread and, as you can see in the examples, can be halted without affecting the operation of the rest of the program.

There is no specific limit on the number of threads that can be created; the only limit is available memory. During testing we routinely use between 50 and 60 threads and have tested up to several hundred without issues.

Ultibo core allows multiple priority levels for threads, which start at IDLE and go up to CRITICAL. The Idle thread itself uses IDLE priority so it will only run when there is nothing else ready, but you can also set other threads to IDLE for background processing. The CRITICAL priority is used by subsystems like USB which must respond immediately to the hardware, and should be used very carefully to avoid starving other threads of CPU time. Most threads run at the NORMAL priority level, but any thread can change priority dynamically by calling the appropriate API.
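
For example, a background thread could be created with the standard RTL BeginThread and then dropped to IDLE priority (ThreadSetPriority and the THREAD_PRIORITY_ constants come from the Threads unit and GlobalConst; the housekeeping loop is illustrative only):

  uses
   SysUtils,
   GlobalConst,
   Threads;

  function BackgroundExecute(Parameter:Pointer):PtrInt;
  begin
   Result:=0;
   while True do
    begin
     {Perform low priority housekeeping here}
     Sleep(1000);
    end;
  end;

  procedure StartBackgroundThread;
  var
   Background:TThreadID;
  begin
   Background:=BeginThread(@BackgroundExecute,nil);

   {The thread will now only be scheduled when nothing of higher priority is ready}
   ThreadSetPriority(Background,THREAD_PRIORITY_IDLE);
  end;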

The scheduler in Ultibo core is preemptive and is driven by an interrupt (2000 times per second on the Raspberry Pi) which checks whether the current thread has quantum (or time slice) remaining and whether other higher priority threads are waiting. Each thread receives a base quantum of time plus a priority quantum determined by the thread priority level; these can be adjusted dynamically by changing the SCHEDULER_THREAD_QUANTUM and SCHEDULER_PRIORITY_QUANTUM values in the GlobalConfig unit. Threads can also yield their remaining quantum at any time if they have no further work to do, and the use of wait states and sleep or yield is strongly encouraged in favor of endless loops.
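
A sketch of adjusting those values is shown below (the numbers are purely illustrative, and indexing SCHEDULER_PRIORITY_QUANTUM by the thread priority constants is an assumption based on the description above):

  uses
   GlobalConfig,
   GlobalConst;

  procedure TuneScheduler;
  begin
   {Give every thread a larger base quantum (values are in scheduler ticks and purely illustrative)}
   SCHEDULER_THREAD_QUANTUM:=10;

   {Give CRITICAL priority threads a larger additional quantum}
   {Indexing by the THREAD_PRIORITY_ constants is an assumption}
   SCHEDULER_PRIORITY_QUANTUM[THREAD_PRIORITY_CRITICAL]:=8;
  end;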

The implementation of the scheduler is based on a set of queues, one for each priority level, and a bitmap of threads ready to run at each level. The scheduling algorithm is not revolutionary at all but provides a stable and reliable way to manage and coordinate a varying number of threads without adding the complexities of a process model to the architecture.

As with many of the internal elements of Ultibo core, the scheduler can actually be replaced by assigning new functions to a number of procedure variables declared in the Threads unit. While there is a small overhead in checking for assignment of these variables during each call, this is more than offset by the reduced overhead of the overall environment.

Multiple CPUs


Given the commodity nature of multi core or multi CPU systems available today, it would seem almost pointless to design a new environment without support for this technology. Ultibo core supports multiple CPUs as a fundamental aspect of the design; in fact it is always multi CPU enabled, even on a system that has only one CPU.

During the boot process, the platform specific module defines the number of CPUs available and which one is being used to start the system. The common platform and threading modules then use this information to allocate an appropriate number of each resource to support those CPUs. Many items are simply multiplied, so on a Raspberry Pi 2, 3 or 4, which has 4 CPUs, there are 4 Idle threads, 4 IRQ/FIQ/SWI threads and 4 sets of scheduler queues. Some of this is done to minimize concurrency issues between the CPUs; the rest is to simplify the design and reduce the number of points in the architecture that have to be reworked or refined as the number of CPUs grows.

Current support is theoretically for up to 32 CPUs in a single device; however, without testing it is unknown what the behavior would be at that point. In practice 4 CPUs are supported on the Raspberry Pi 2, 3 and 4 and each is assigned threads on a round robin basis.

When multiple CPUs are present the scheduler supports migration of threads between CPUs, either on demand via an API call or as a result of load balancing by the scheduler itself. Threads can also be assigned affinity to one or more specific CPUs, which allows a specific thread or threads to be allocated to a specific CPU for handling real time workloads.
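
As a hedged sketch, assigning a thread to a single CPU might look like this (ThreadSetAffinity, ThreadSetCPU and the CPU_AFFINITY_/CPU_ID_ constants are assumed names from the Threads unit and GlobalConst and should be checked against the actual API):

  uses
   GlobalConst,
   Threads;

  function RealTimeExecute(Parameter:Pointer):PtrInt;
  begin
   Result:=0;
   {Time critical processing would go here}
  end;

  procedure StartRealTimeThread;
  var
   RealTime:TThreadID;
  begin
   RealTime:=BeginThread(@RealTimeExecute,nil);

   {Restrict the thread to CPU 1 only (affinity is a mask of allowed CPUs)}
   ThreadSetAffinity(RealTime,CPU_AFFINITY_1);

   {Migrate the thread to CPU 1 so the new affinity takes effect immediately}
   ThreadSetCPU(RealTime,CPU_ID_1);
  end;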

Preemption and thread allocation can be enabled or disabled per CPU and thread migration can be enabled or disabled globally giving maximum flexibility in possible uses for Ultibo core.

Locking and Synchronization


Unlike cooperative multitasking environments where thread switches can be predicted, preemptive multi threading as used in Ultibo core places a critical reliance on synchronization of access to data in order to prevent multiple simultaneous accesses colliding with each other.

The Threads unit provides a full range of locking and synchronization primitives including spin locks, mutexes, critical sections and semaphores, as well as more advanced concepts like multiple reader single writer locks, message queues, buffer lists, mailboxes and events. At their core all of the locking and synchronization mechanisms make use of the hardware level features in the ARM processor to support atomic access to memory locations across multiple CPUs. A primary element of the design of all except simple spin locks and mutexes is the use of wait lists, so threads consume no processor resources while waiting to acquire a lock or for an event to occur. This greatly enhances the scalability of the design since a waiting or sleeping thread does not attract the attention of the scheduler at all until it is ready to run again.

The downside of preemptive scheduling combined with multiple CPUs is that the programmer must make no assumptions about the order in which code will be executed. In practice, consider what would happen to the functionality of your code if it could be interrupted at absolutely any point; if the results would be unpredictable (or catastrophic) then you need to include a locking and synchronization strategy in the code. The mindset required to write code that can safely handle preemption and multi processor execution takes time to learn and often only comes by making the mistakes yourself.
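
A minimal sketch of such a strategy, protecting a shared counter with a mutex from the Threads unit (MutexCreate, MutexLock and MutexUnlock are assumed to follow the same handle based style as the other primitives):

  uses
   Threads;

  var
   CounterLock:TMutexHandle;
   Counter:LongWord;

  procedure InitCounter;
  begin
   {Create the lock once before any threads use the counter}
   CounterLock:=MutexCreate;
  end;

  procedure IncrementCounter;
  begin
   {Serialize access so threads, possibly on different CPUs, cannot update Counter simultaneously}
   MutexLock(CounterLock);
   try
    Inc(Counter);
   finally
    MutexUnlock(CounterLock);
   end;
  end;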

Global Configuration


As noted already, many of the internal elements of Ultibo core can be dynamically adjusted while running, many others can be changed at startup by setting command line parameters, and yet more can be configured in code. The heart of this configurability lies in the GlobalConfig unit, which holds variables that define a host of settings used across a wide range of units.

The types of items defined in the global configuration include things like the default size of a thread stack, how many worker threads are created at startup and even what color the console window background should be. While the decision to define many of these things as variables, rather than use constants with conditional compilation, may be considered unusual by some, the potential benefits of dynamic configuration and reconfiguration via command line or code are enough to justify the choice so far. It is possible that for pure performance some heavily used variables may become constants in future; for now there are still plenty of performance gains to be achieved simply by code enhancements.
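
For instance, a setting could be adjusted in code before the relevant threads are created (THREAD_STACK_DEFAULT_SIZE is assumed to be the GlobalConfig variable that controls the default thread stack size):

  uses
   GlobalConfig;

  procedure ConfigureStackSize;
  begin
   {Raise the default stack size for newly created threads to 512KB}
   {THREAD_STACK_DEFAULT_SIZE is assumed to be the relevant GlobalConfig variable}
   THREAD_STACK_DEFAULT_SIZE:=512 * 1024;
  end;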

It is intended that over time a much broader range of global configuration settings will be available as command line parameters; right now the focus is on making sure the most important ones are available.

Boot process


The Ultibo core boot process always begins with the Startup function in the boot module for the selected platform. This module is responsible for performing the low level work of setting up an initial stack, zeroing the BSS section, invalidating caches and ensuring the device is in a ready state to continue booting.

The ARM processor starts in secure supervisor mode by default; however, on the Raspberry Pi 2, 3 and 4 recent versions of the firmware switch to non secure hypervisor mode during boot in order to support virtualization under Linux. The Raspberry Pi 2, 3 and 4 boot modules in Ultibo core switch the CPU back to secure supervisor mode very early in the boot process to allow access to all features.

The basic boot sequence goes like this:

  1. Boot Startup
    1. Save information passed by the boot loader
    2. Invalidate caches to ensure they are clean
    3. Set CPU to an appropriate mode and disable interrupts
    4. Setup interrupt vectors and vector table
    5. Clear BSS section of memory to zero
    6. Setup initial stack to allow function calls
    7. Setup initial heap to allow memory allocation
    8. Calculate the size of the page tables and allocate memory for them
  2. Platform specific initialization
    1. Setup IO and peripheral addresses
    2. Setup vector and page table addresses and sizes
    3. Setup machine type
    4. Setup CPU type, count and boot
    5. Setup FPU type
    6. Setup GPU type
    7. Setup interrupt count and configuration
    8. Setup clock configuration
    9. Setup scheduler configuration
    10. Register handlers for platform common functions
  3. Architecture specific initialization
    1. Setup alignment and shift values
    2. Register handlers for platform common functions
  4. Platform common initialization
    1. Initialize internal structures
    2. Register handlers for RTL functions
    3. Initialize the CPU
    4. Initialize the FPU
    5. Register the memory manager
    6. Register initial heap allocations
    7. Initialize the GPU
    8. Initialize the MMU
    9. Initialize multi processor support
    10. Initialize data and instruction caches
    11. Initialize board specific information
    12. Initialize available memory
    13. Initialize interrupts
    14. Initialize the clock
    15. Parse boot tags, command line and environment
    16. Initialize mailboxes, power management and peripherals
    17. Register hardware exception handling
  5. Threads, locks and scheduler initialization
    1. Setup thread stacks and handles
    2. Create thread stacks
    3. Register RTL functions and parameters
    4. Initialize locale information
    5. Initialize unicode tables
    6. Initialize lock tables and create first spin lock
    7. Create first thread
    8. Initialize mailbox, buffer, event and timer tables
    9. Create locks for heap, clock, power and interrupts
    10. Initialize and allocate thread variables
    11. Initialize first thread and standard input/output
    12. Initialize exception handling
    13. Initialize the scheduler
    14. Create IRQ/FIQ/Idle/Main threads
    15. Initialize timers and create timer threads
    16. Initialize workers and create worker threads
    17. Start secondary CPUs
    18. Enable interrupts and start the scheduler

At the end of the initialization process, the Main thread will begin at the function PascalMain, which performs unit initialization before starting execution at the first line of code in the begin..end section of the program file.

Interrupt handling


Ultibo core supports both IRQ and FIQ interrupts and allows registration of a handler for each available interrupt. Some common interrupts are already registered by default including a timer peripheral suitable for use as the system clock and another timer for the scheduler interrupt.

On the Raspberry Pi 2, 3 and 4 each CPU has a number of local timers available and one of these is used for the scheduler (each CPU gets a scheduler interrupt 2000 times per second); one of the local timers on CPU0 is used for the system clock. The BCM283X system on chip allows the main group of interrupts (those not local to each CPU) to be routed to only one CPU; by default CPU0 is allocated to this but it can be changed in the Ultibo core configuration.

Registered interrupt handlers receive a parameter when called; this can be used to pass a reference to a device or an instance of a structure so that the handler can service multiple instances simultaneously without using global variables that may present concurrency issues.
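
A hedged sketch of registering such a handler is shown below (RequestIRQ and the CPU_ID_0 constant are assumed from the Platform unit and GlobalConst, the handler signature taking a single Parameter follows the description above, and the IRQ number and device record are purely illustrative):

  uses
   GlobalConst,
   Platform;

  const
   MY_DEVICE_IRQ = 96; {Illustrative only, real IRQ numbers come from the SoC specific units}

  type
   PMyDevice = ^TMyDevice;
   TMyDevice = record
    InterruptCount:LongWord;
   end;

  var
   MyDevice:TMyDevice;

  procedure MyDeviceInterruptHandler(Parameter:Pointer);
  var
   Device:PMyDevice;
  begin
   {The parameter passed at registration identifies the instance, avoiding shared globals}
   Device:=PMyDevice(Parameter);
   Inc(Device^.InterruptCount);

   {Acknowledge and service the hardware here, never sleep, yield or wait in a handler}
  end;

  procedure RegisterMyDeviceInterrupt;
  begin
   {Route the interrupt to CPU 0 and pass the device record as the parameter}
   RequestIRQ(CPU_ID_0,MY_DEVICE_IRQ,@MyDeviceInterruptHandler,@MyDevice);
  end;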

Ultibo core makes use of the IRQ and FIQ modes of the ARM processor for handling interrupts; each mode is assigned a stack during startup and a thread is created to represent that mode. The IRQ and FIQ threads are never scheduled after boot, as they are set to priority level NONE, but their thread ID and thread variable allocations are used by the IRQ and FIQ modes to allow proper RTL support during interrupts.

Interrupt handlers must obey strict rules about the use of locks and other synchronization primitives in order to avoid deadlocks. Interrupt handlers must never sleep, yield or wait, and must not call any function which will cause them to do these things. The heap manager has allocations of IRQ and FIQ memory available to allow interrupt handlers to allocate and free memory without causing a deadlock.

The USB and MMC/SD subsystems and the drivers for DWCOTG, BCM2708 and BCM2709 are good sources to understand the use of interrupt handlers and how to avoid deadlock scenarios.

Final thoughts


If you're thinking about a project that could be suited to the Ultibo core environment and would benefit from having complete access to hardware without the restrictions of a full operating system, then we encourage you to explore further and take some time to try out what it offers. Don't be put off by concepts like a shared address space or the lack of kernel to user separation; after all, bad programming is bad programming and always ends in bad results, even on Windows or Linux.

If you are interested in learning more about the internals of how operating systems and devices work but find that most are just too complex to know where to start, feel free to get involved with Ultibo core. Read the wiki, post questions on the forum, try out your own code and experiment with your ideas; the best way to learn is to try.

And if you just want to make things then we hope you've found what you need to get started; everything you see today in the world was once just an idea.

Ultibo, Make the future.