Difference between revisions of "Architecture"
Revision as of 05:30, 29 January 2016
Introduction
Ultibo core is, by design, a development environment for embedded devices. It currently supports all models of the Raspberry Pi, and this will most likely expand to include other popular development boards as time and resources permit. Because it is not an operating system and is not derived from an existing operating system, the design decisions reflect the needs of the environment and intentionally leave out certain general purpose operating system concepts.
The unit of execution in Ultibo core is the thread. There is no concept of a process, but multi threading is fundamental to the design and cannot be removed. There is also no concept of a user space or kernel space; all threads run within a single address space and all threads have equal privileges.
All elements are intended to be dynamic in their implementation. For example, devices are intended to be added and removed dynamically, with notifications used to signal interested parties about the addition or removal. Only a few internal items have a fixed limit, a specific example being TLS index slots, which are statically allocated during boot.
The design is also intended to be modular so that only the required elements are included in any given application. A small number of mandatory elements are included in every application, such as memory management, threading and scheduling, interrupt handling, device management and the core functionality of the FPC run time library. Most other features, such as the filesystem and network interfaces, USB, MMC/SD and hardware specific drivers, can be omitted if they are not required.
Platform support
With an intention to support a range of embedded devices and development boards over time, the architecture of Ultibo core tries to separate platform specific functionality from common functionality. Each supported board has both a boot module and a platform module which are specific to that board and provide the necessary initialization and platform functionality to support the common modules without requiring conditional compilation.
For the Raspberry Pi (Models A, B, A+, B+ and Zero*) the board specific modules are:
- BootRPi.pas
- PlatformRPi.pas
* Each of these models uses the BCM2835 System on Chip, which means they are identical from a code perspective. Differences can be detected by model or revision numbers.
For the Raspberry Pi 2B the board specific modules are:
- BootRPi2.pas
- PlatformRPi2.pas
In addition there are architecture specific modules for the ARM processor and for the ARM processor type. These include:
- PlatformARM.pas (for functionality common to all ARM processors)
- PlatformARMv6.pas (this includes the ARM1176JZF-S used in the Raspberry Pi A, B, A+, B+ and Zero)
- PlatformARMv7.pas (this includes the ARM Cortex-A7 MPCore used in the Raspberry Pi 2B)
The appropriate boot module is included in an application based on the Controller Type selected and passed to the compiler (-Wp). The boot module then includes the relevant platform and architecture modules as well as the mandatory common modules for platform, memory, threads, devices and so on.
Memory Management
The HeapManager unit is a mandatory inclusion in all applications. During boot it registers itself with the RTL so that memory allocation via GetMem, FreeMem etc. functions normally. The platform specific module for the board then registers with the HeapManager all of the available blocks of memory that can be allocated. How this occurs is specific to the board, but on the Raspberry Pi the information is passed by the boot loader in a tag (or ATAG) parameter.
All memory allocation must go via the HeapManager, either by calling the standard RTL functions or by calling the additional functions it provides. Depending on the platform, the command line and information passed by the boot loader, various types of memory are registered and made available for allocation. These include Normal, Shared, Non Cached, Code and Device memory, which are required by certain Ultibo core components or can be requested by calling HeapManager functions.
Memory Mapping
Even though all threads share a single address space and all memory is mapped one to one between physical and virtual, that doesn't mean there are no access controls in Ultibo core. During startup the platform module sets up a default page table before enabling the memory management unit (MMU) and caching.
The default page table specifies that code memory (defined by the linker as the TEXT section) is executable but read only, whereas data memory (the DATA and BSS sections) is read write and not executable. All other normal memory available for allocation is defined as read write and not executable, so a stack allocated for a thread is not executable, which prevents a buffer overflow from executing code placed on the stack.
A number of other sections of the memory map are defined with specific settings as well. For example, the vector table is executable and read only, and the page tables themselves are read write and not executable. Of particular importance is the page located at address zero (0x00000000), which is defined with no access at all* so as to provide "guard page" functionality. With this in place, a read or write through a null pointer or a call to an unassigned procedure variable results in a hardware exception that can be caught by exception handling in an application, avoiding what would otherwise be a crash.
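The permission scheme described above can be modelled in a few lines. The following is a minimal sketch in Python, used here purely as a neutral modelling language rather than Ultibo core's own Pascal API. The region addresses, the 4 KB page size and the names Region, MEMORY_MAP and check_access are all assumptions made for illustration; only the read, write and execute permissions follow the text, and the guard page models the intended no access setting.

```python
from dataclasses import dataclass

PAGE_SIZE = 0x1000  # assumed 4 KB pages, for illustration only


@dataclass(frozen=True)
class Region:
    """One entry in a simplified, hypothetical memory map."""
    name: str
    start: int
    end: int  # exclusive
    readable: bool
    writable: bool
    executable: bool


# Illustrative addresses; only the permissions follow the description.
MEMORY_MAP = [
    Region("guard page", 0x00000000, PAGE_SIZE, False, False, False),
    Region("code (TEXT)", 0x00008000, 0x00100000, True, False, True),
    Region("data (DATA/BSS)", 0x00100000, 0x00200000, True, True, False),
    Region("heap and stacks", 0x00200000, 0x10000000, True, True, False),
]


def check_access(address: int, kind: str) -> bool:
    """Return True if an access of the given kind ('read', 'write' or
    'execute') is permitted, False if it would fault."""
    for region in MEMORY_MAP:
        if region.start <= address < region.end:
            return {
                "read": region.readable,
                "write": region.writable,
                "execute": region.executable,
            }[kind]
    return False  # addresses outside the map are treated as no access
```

In this model a null pointer read (check_access(0, "read")) is rejected, code can be executed but not overwritten, and a stack address can be written but never executed.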
Based on the board type and other parameters, sections of the memory map are also defined as shared, device or per CPU and there is even an allowance for allocating executable memory to keep open the possibility of loadable modules in future.
The page table layout can be viewed using the Page Tables link in the WebStatus unit which is also included in the demo image.
*Due to ongoing refinement the zero page is currently marked as read only instead of no access, so a read of a null pointer will not currently produce an exception.
Threads and Scheduling
Even the simplest "Hello World" example is multi threaded in Ultibo core and threading is an integral part of the design. When Ultibo core starts it creates a number of threads with special purposes such as the Idle thread which tracks CPU utilization but also provides an always ready thread for the scheduler to select. There are also IRQ, FIQ and SWI threads which are used to handle interrupts so that RTL elements like threadvars and exceptions also work during interrupt handling.
Timer and worker threads are used by many common components to perform callbacks and notifications or to respond to events. Timers and workers are actually just normal threads, but their sole purpose is to wait for registered timer and worker events to be triggered and process them as quickly as possible.
The other thread that is always created is the Main thread, which starts running the application code between begin and end in the program file. The main thread is also just a normal thread and, as you can see in the examples, can be halted without affecting the operation of the rest of the program.
There is no specific limit on the number of threads that can be created other than available memory. During testing we routinely use between 50 and 60 threads and have tested up to several hundred without issues.
Ultibo core allows multiple priority levels for threads, which start at IDLE and go up to CRITICAL. The Idle thread itself uses IDLE priority so it will only run when there is nothing else ready, but you can also set other threads to IDLE for background processing. The CRITICAL priority is used by subsystems like USB which must respond immediately to the hardware, and it should be used very carefully to avoid starving other threads of CPU time. Most threads run at the NORMAL priority level but any thread can change priority dynamically by calling the appropriate API.
The scheduler in Ultibo core is preemptive and is driven by an interrupt (2000 times per second on the Raspberry Pi) which checks if the current thread has quantum (or time slice) remaining and if there are other higher priority threads waiting. Each thread receives a base quantum of time plus a priority quantum determined by the thread priority level. These can be adjusted dynamically by changing the SCHEDULER_THREAD_QUANTUM and SCHEDULER_PRIORITY_QUANTUM array values in the GlobalConfig unit. Threads can also yield their remaining quantum at any time if they have no further work to do, and the use of wait states and sleep or yield is strongly encouraged in favor of endless loops.
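The quantum arithmetic can be illustrated with a small model. This is a sketch in Python rather than Ultibo core's Pascal, and it rests on several assumptions: the numeric values are invented (the real defaults are in the GlobalConfig unit), the priority quantum is shown as a lookup by level name, and the tick logic (decrement the quantum, then switch when it reaches zero or a higher priority thread is ready) is one plausible reading of the description, not the actual implementation.

```python
from typing import Tuple

# Hypothetical values; the real defaults live in the GlobalConfig unit.
SCHEDULER_THREAD_QUANTUM = 6
SCHEDULER_PRIORITY_QUANTUM = {"IDLE": 0, "NORMAL": 4, "CRITICAL": 10}


def initial_quantum(priority: str) -> int:
    """Base quantum plus the extra quantum granted by the priority level."""
    return SCHEDULER_THREAD_QUANTUM + SCHEDULER_PRIORITY_QUANTUM[priority]


def scheduler_tick(remaining: int, higher_priority_ready: bool) -> Tuple[int, bool]:
    """Model one scheduler interrupt for the running thread.

    Returns the quantum left after this tick and whether the scheduler
    should switch to another thread.
    """
    remaining = max(remaining - 1, 0)
    switch = remaining == 0 or higher_priority_ready
    return remaining, switch
```

With these made-up numbers a NORMAL thread starts with a quantum of 10 ticks, and it keeps the CPU on each tick until either the quantum is used up or a higher priority thread becomes ready.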
The implementation of the scheduler is based on a set of queues, one for each priority level, and a bitmap of threads ready to run at each level. The scheduling algorithm is not revolutionary at all but provides a stable and reliable way to manage and coordinate a varying number of threads without adding the complexities of a process model to the architecture.
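A structure of this kind can be sketched as follows. This is a Python model, not Ultibo core's Pascal code, and it makes two assumptions: only three priority levels are shown (numbered 0 for IDLE, 1 for NORMAL and 2 for CRITICAL), and the bitmap is interpreted as one bit per priority level, set whenever that level's queue is not empty. The class and method names are hypothetical.

```python
from collections import deque

PRIORITY_COUNT = 3  # illustrative: 0 = IDLE, 1 = NORMAL, 2 = CRITICAL


class ReadyQueues:
    """One FIFO queue per priority level plus a bitmap of non-empty levels."""

    def __init__(self):
        self.queues = [deque() for _ in range(PRIORITY_COUNT)]
        self.bitmap = 0  # bit n set means queue n holds a ready thread

    def enqueue(self, thread, priority):
        """Mark a thread as ready to run at the given priority."""
        self.queues[priority].append(thread)
        self.bitmap |= 1 << priority

    def dequeue_highest(self):
        """Remove and return the next thread from the highest non-empty level."""
        if self.bitmap == 0:
            return None
        priority = self.bitmap.bit_length() - 1  # index of the highest set bit
        thread = self.queues[priority].popleft()
        if not self.queues[priority]:
            self.bitmap &= ~(1 << priority)  # level is now empty
        return thread
```

The point of the bitmap is that finding the highest priority level with work to do is a single bit operation rather than a scan of every queue, so the cost of selecting the next thread does not grow with the number of threads.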
As with many of the internal elements of Ultibo core, the scheduler can actually be replaced by assigning new functions to a number of procedure variables declared in the Threads unit. While there is a small overhead in checking for assignment of variables during each call this is more than offset by the reduced overhead of the overall environment.
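The replaceable handler pattern looks roughly like this. The sketch is in Python, with a module level variable standing in for a Pascal procedure variable and None standing in for "not assigned". All names here (set_select_handler, default_select, scheduler_select) are hypothetical and are not the names declared in the Threads unit.

```python
from typing import Callable, List, Optional

# Stand-in for a procedure variable; None means no handler is assigned.
_select_handler: Optional[Callable[[List[str]], str]] = None


def set_select_handler(handler: Optional[Callable[[List[str]], str]]) -> None:
    """Install a replacement selection function, or None to restore the default."""
    global _select_handler
    _select_handler = handler


def default_select(ready: List[str]) -> str:
    """Built-in behavior: pick the first ready thread."""
    return ready[0]


def scheduler_select(ready: List[str]) -> str:
    """Public entry point: check for an assigned handler on every call."""
    handler = _select_handler
    if handler is not None:
        return handler(ready)
    return default_select(ready)
```

The check for an assigned handler in scheduler_select is the small per call overhead mentioned above.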
Multiple CPUs
Given the commodity nature of multi core or multi CPU systems available today it would seem almost pointless to design a new environment without support for this technology. Ultibo core supports multiple CPUs as a fundamental aspect of the design. In fact it is always multi CPU enabled, even on a system that only has one CPU.
During the boot process, the platform specific module defines the number of CPUs available and which one is being used to start the system. The common platform and threading modules then use this information to allocate an appropriate number of each resource to support those CPUs. Many items are simply multiplied, so on a Raspberry Pi 2 which has 4 CPUs there are 4 Idle threads, 4 IRQ/FIQ/SWI threads and 4 sets of scheduler queues. Some of this is done to minimize concurrency issues between the CPUs; the rest is just to simplify the design and reduce the number of points in the architecture that have to be reworked or refined as the number of CPUs grows.
Current support is theoretically for up to 32 CPUs in a single device, however without testing it is unknown what the behavior would be at that point. In practice 4 CPUs are supported on the Raspberry Pi 2 and each is assigned threads on a round robin basis.
When multiple CPUs are present the scheduler supports migration of threads between CPUs either on demand via an API call or as a result of balancing of the workload by the scheduler itself. Threads can also be assigned affinity to one or more specific CPUs which allows the option for a specific thread or threads to be allocated to a specific CPU for handling real time workloads.
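Round robin assignment combined with affinity can be modelled with a bit mask in which bit n means the thread may run on CPU n. The following Python sketch is an illustration only: the CpuAllocator class, its assign method and the mask representation are assumptions, not the Ultibo core API, and the real scheduler also balances workload, which this model ignores.

```python
CPU_COUNT = 4  # as on the Raspberry Pi 2


class CpuAllocator:
    """Assign new threads to CPUs in round robin order, honoring affinity."""

    def __init__(self, cpu_count=CPU_COUNT):
        self.cpu_count = cpu_count
        self.next_cpu = 0

    def assign(self, affinity_mask=None):
        """Return the CPU for a new thread.

        affinity_mask has bit n set if the thread may run on CPU n;
        None means the thread may run on any CPU.
        """
        all_cpus = (1 << self.cpu_count) - 1
        if affinity_mask is None:
            affinity_mask = all_cpus
        if affinity_mask & all_cpus == 0:
            raise ValueError("affinity mask selects no available CPU")
        # Start at the round robin position and skip disallowed CPUs.
        for offset in range(self.cpu_count):
            cpu = (self.next_cpu + offset) % self.cpu_count
            if affinity_mask & (1 << cpu):
                self.next_cpu = (cpu + 1) % self.cpu_count
                return cpu
```

Under this model, a real time thread could be given a mask with a single bit set so that it always lands on the same CPU, while all other threads are given masks that exclude that CPU.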
Preemption and thread allocation can be enabled or disabled per CPU and thread migration can be enabled or disabled globally, giving maximum flexibility in possible uses for Ultibo core.
Locking and Synchronization
Unlike cooperative multitasking environments where thread switches can be predicted, preemptive multi threading as used in Ultibo core places a critical reliance on synchronization of access to data in order to prevent multiple simultaneous accesses colliding with each other.
The Threads unit provides a full range of locking and synchronization primitives including spin locks, mutexes, critical sections and semaphores as well as more advanced concepts like multiple reader single writer locks, message queues, buffer lists, mailboxes and events. At their core all of the locking and synchronization mechanisms make use of the hardware level features in the ARM processor to support atomic access to memory locations across multiple CPUs. A primary element of the design of all except simple spin locks and mutexes is the use of wait lists, so threads consume no processor resources while waiting to acquire a lock or for an event to occur. This greatly enhances scalability of the design since a waiting or sleeping thread does not attract the attention of the scheduler at all until it is ready to run again.
The downside of preemptive scheduling combined with multiple CPUs is that the programmer must make no assumptions about the order in which code will be executed. In practice, consider what would happen to the functionality of your code if it could be interrupted at absolutely any point. If the results would be unpredictable (or catastrophic) then you need to include a locking and synchronization strategy in the code. The mindset required to write code that can safely handle preemption and multi processor execution takes time to learn and often only comes by making the mistakes yourself.
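The classic case is a read, modify, write sequence on shared data. The sketch below uses Python's standard threading module to show the general technique of guarding such a sequence with a lock; it is not Ultibo core code, and in an Ultibo application the same role would be played by one of the primitives from the Threads unit. The SharedCounter and run_workers names are invented for this example.

```python
import threading


class SharedCounter:
    """A counter whose update is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, a thread could be preempted between the read
        # and the write, and another thread's update would be lost.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_workers(counter, thread_count=4, iterations=10000):
    """Increment the counter from several threads and return the final value."""

    def worker():
        for _ in range(iterations):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

Because every update happens while the lock is held, the final value always equals thread_count multiplied by iterations, regardless of how the threads are interleaved.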
Global Configuration
Boot process
Initialization process

