A big.LITTLE scheduler update

ARM's big.LITTLE architecture is an example of asymmetric multiprocessing where all CPUs are instruction-set compatible, but where different CPUs have very different performance and energy-efficiency characteristics. In the case of big.LITTLE, the big CPUs are Cortex-A15 CPUs with deep pipelines and numerous functional units, providing maximal performance. In contrast, the LITTLE CPUs are Cortex-A7 with short pipelines and few functional units, which optimizes for energy efficiency. Linaro is working on two methods of supporting big.LITTLE systems.

The first method pairs up each big CPU with a LITTLE CPU, so that user applications are aware of only one CPU corresponding to each such pair. A given application thread can then be transparently migrated between the big and LITTLE CPUs making up the pair, essentially providing extreme dynamic voltage/frequency scaling. Instead of merely varying the voltage and frequency, at some point the application thread is migrated between the big and LITTLE CPUs making up the corresponding pair, as was described in an earlier LWN article by Nicolas Pitre. This approach is termed “big.LITTLE Switcher.”

The other way to support big.LITTLE systems is to have all CPUs, both big and LITTLE, visible in a multiprocessor configuration. This approach offers greater flexibility, but also poses special challenges for the Linux kernel. For example, the scheduler assumes that CPUs are interchangeable, which is anything but the case on big.LITTLE systems. These big.LITTLE systems were therefore the subject of a scheduler minisummit at last February's Linaro Connect in the Bay Area which was reported on at LWN.

This article summarizes progress since then, both on the relevant mailing lists and as reported during the Linaro Connect sessions in Hong Kong, and is divided into the following topics:

  1. Emulate asymmetric computation on commodity systems
  2. Document system configuration for big.LITTLE work
  3. Define mobile workload and create simulator
  4. Address kthread performance issues
  5. Wean CPU hotplug from stopping all CPUs
  6. Add minimal support to scheduler for asymmetric cores
  7. Conclusions

Following this are the inevitable Answers to Quick Quizzes.

Emulating asymmetric computation on commodity systems

The need to develop software concurrently with the corresponding hardware has led to heavy reliance on various forms of emulation, and ARM's asymmetric big.LITTLE systems are no exception. Unfortunately, full emulation is quite slow, especially given that core kernel code and user-level code really only need a reasonable approximation of big.LITTLE's performance characteristics. What we need instead is some way to make commodity systems roughly mimic big.LITTLE, for example, by artificially slowing down the CPUs designated as LITTLE CPUs.

There are a number of ways of slowing down CPUs, the first three of which were discussed at the February scheduler minisummit:

  1. Assign a high-priority SCHED_FIFO cycle-stealing thread to each designated LITTLE CPU; this thread consumes a predetermined fraction of that CPU's bandwidth (a sketch appears after this list).

  2. Constrain clock frequencies of the LITTLE CPUs.

  3. Make use of Intel's T-state capability.

  4. Use perf to place a load on the designated LITTLE CPUs.
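
As a rough sketch of the first approach, the following user-space program pins a SCHED_FIFO thread to a designated LITTLE CPU and busy-waits for a fixed fraction of each period, stealing that fraction of the CPU from all normal work. The CPU number, period, and duty cycle are invented for illustration; this is not one of the scripts mentioned below.

    /*
     * Cycle-stealing sketch: a SCHED_FIFO thread pinned to a designated
     * LITTLE CPU busy-waits for STEAL_MS out of every PERIOD_MS,
     * preempting all SCHED_OTHER work for that fraction of the time.
     * Requires root (or CAP_SYS_NICE).
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define PERIOD_MS 10
    #define STEAL_MS   6    /* make the "LITTLE" CPU roughly 40% as fast */

    static long long now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    int main(int argc, char **argv)
    {
        int cpu = argc > 1 ? atoi(argv[1]) : 0;    /* designated LITTLE CPU */
        struct sched_param sp = { .sched_priority = 50 };
        struct timespec sleep_ts = { .tv_nsec = (PERIOD_MS - STEAL_MS) * 1000000L };
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) ||
            sched_setscheduler(0, SCHED_FIFO, &sp)) {
            perror("setup");
            return 1;
        }

        for (;;) {
            long long start = now_ns();

            /* Steal STEAL_MS worth of cycles... */
            while (now_ns() - start < STEAL_MS * 1000000LL)
                continue;
            /* ...then sleep, letting normal tasks run. */
            nanosleep(&sleep_ts, NULL);
        }
    }

Calibration then amounts to choosing STEAL_MS to approximate the desired big-to-LITTLE performance ratio, which, as the table below indicates, becomes difficult for short duty cycles.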

Rob Lee presented the results of experiments comparing the cycle-stealing and clock-frequency approaches. Morten Rasmussen proposed the perf-based approach, in which he configured perf to interrupt the designated LITTLE CPUs every few thousand cycles. Each of these approaches has advantages and disadvantages, as laid out in the following table:

  Cycle stealing
      Transparent to scheduler:  No; the SCHED_FIFO threads are visible
      Portability:               Portable
      Calibration:               Requires calibration for duty cycles less than about 10 milliseconds
      Scope:                     Process-level only

  Clock frequency
      Transparent to scheduler:  Yes
      Portability:               Requires per-CPU clock domains
      Calibration:               Auto-calibrating, but limited number of settings
      Scope:                     All execution

  T-state
      Transparent to scheduler:  Yes
      Portability:               Intel only
      Calibration:               Auto-calibrating
      Scope:                     All execution

  perf
      Transparent to scheduler:  Yes
      Portability:               Requires fine-grained perf
      Calibration:               Requires calibration
      Scope:                     All code subject to perf exceptions

Quick Quiz 1: How can you tell whether multiple CPUs on your system are in the same clock domain? Answer

As can be seen from the table, none of these mechanisms is perfect. For example, many embedded systems-on-a-chip (SoCs) have multiple CPUs (often all of them) in a single clock domain, which limits the applicability of the clock-frequency approach. Rob has placed scripts for the cycle-stealing and clock-frequency approaches in a git tree located at git://git.linaro.org/people/rob_lee/asymm_cpu_emu.git; he plans to add Morten's perf-based approach as well.
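
For the clock-frequency approach, nothing special is strictly needed beyond cpufreq's standard sysfs interface. The following hedged sketch (not one of the scripts from the tree above) caps a designated LITTLE CPU at a frequency given on the command line; a real setup would first consult scaling_available_frequencies to pick a legal value.

    /* Cap one CPU's maximum frequency via cpufreq sysfs; run as root. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        char path[128];
        FILE *f;

        if (argc < 3) {
            fprintf(stderr, "usage: %s <cpu> <max-freq-kHz>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%s/cpufreq/scaling_max_freq",
                 argv[1]);
        f = fopen(path, "w");
        if (!f || fprintf(f, "%s\n", argv[2]) < 0) {
            perror(path);
            return 1;
        }
        fclose(f);
        return 0;
    }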

Quick Quiz 2: Given that these emulation approaches are never going to provide a perfect emulation of big.LITTLE systems, why bother? Wouldn't it be simpler to just wait until real hardware is available? Answer

At this point, it seems likely that more than one of these will be required in order to cover the full range of development hardware and workloads.

Documenting system configuration for big.LITTLE work

The Linux kernel as it is today can handle big.LITTLE systems, give or take hardware bring-up issues. However, the current kernel does need some help to get good results from a big.LITTLE system. Chris Redpath's presentation is a first step towards determining the best division of labor between current kernels and user applications.

Chris described an experiment running an Android workload on emulated big.LITTLE hardware. He made use of Android's bg_non_interactive cpuset, which holds low-priority threads (ANDROID_PRIORITY_BACKGROUND) and is limited to 10% of CPU usage. Chris further constrained this cpuset to run only on CPU 0, which on his system is a LITTLE CPU. Chris then created two new cpusets, default and fg_boost. The default cpuset is constrained to the LITTLE CPUs, and contains all non-background tasks that are not SCHED_BATCH. The SCHED_BATCH tasks remain in the bg_non_interactive cpuset called out above. The new fg_boost cpuset contains tasks that are SCHED_NORMAL and that have a priority higher than ANDROID_PRIORITY_NORMAL.
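
To make the mechanics concrete, the following hedged sketch creates an fg_boost cpuset restricted to the big CPUs and moves one thread into it. The /dev/cpuset mount point and unprefixed file names follow Android's convention of mounting cpusets with the noprefix option, and the choice of CPUs 2-3 as the big CPUs is invented; none of these details are taken from Chris's configuration.

    /* Create an fg_boost cpuset on the big CPUs and move a thread into it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static void write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
            perror(path);
            exit(1);
        }
    }

    int main(int argc, char **argv)
    {
        mkdir("/dev/cpuset/fg_boost", 0755);
        write_str("/dev/cpuset/fg_boost/cpus", "2-3");   /* assumed big CPUs */
        write_str("/dev/cpuset/fg_boost/mems", "0");

        /* Move the thread whose TID is given on the command line. */
        if (argc > 1)
            write_str("/dev/cpuset/fg_boost/tasks", argv[1]);
        return 0;
    }

The default and bg_non_interactive cpusets described above would be constrained to the LITTLE CPUs in the same way, by writing the appropriate CPU list to their cpus files.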

Chris used a combination of sysbench and cyclictest as a rough-and-ready mobile-like workload, where sysbench mimics high-CPU-consumption mobile tasks (e.g., rendering a complex web page) and cyclictest mimics interactive workloads (e.g., user input and communications). Chris's configuration uses four sysbench compute threads and eight cyclictest threads. The sysbench threads all run for 15 seconds and then block for 10 seconds, and repeat this on a 25-second cycle. The eight cyclictest threads run throughout the full benchmark run. Without Chris's additional cgroups, the scheduler scattered the threads across the system, slowing execution of the sysbench threads and wasting power. With the additional cgroups, the sysbench threads were confined to the big CPUs, and the cyclictest threads to a single LITTLE CPU.

In short, use of cpusets to constrain whether given threads run on big or LITTLE CPUs works quite well, at least as long as we know a priori which threads should run where.

Chris also tested sched_mc, which can be set up to keep short-lived tasks off of the big CPUs. Although sched_mc was able to keep an additional CPU free of work, it was unable to help the remaining CPUs to reach deeper sleep states, and was nowhere near as effective as use of cpusets. These big.LITTLE results support the decision taken at the February Linaro Connect to remove sched_mc from the kernel.

Finally, Chris tested userspace tools that dynamically migrate tasks among the CPUs in response to Android's priority boosting of foreground user-interface (UI) threads, in contrast to his first approach, which assigned threads statically. The dynamic approach required minor changes to the Android software stack. Although it was effective in getting high-consumption UI threads running on big CPUs, the migration latency rivaled the duration of the bursts of UI CPU-bound activity, which in many cases defeats the purpose. The migration latency would need to decrease by about a factor of two for this approach to be worthwhile, though your mileage may vary on real hardware. These results underscore the importance of the planned scheduler changes for dynamic general-purpose workloads.

Several other potential approaches were discussed:

  1. Use a modified cpufreq governor to switch between the big and LITTLE CPUs. One potential issue with this approach is that the governor currently runs the CPU at full frequency most of the time, in accordance with its race-to-idle strategy. Some tweaking would therefore be required for workloads for which race-to-idle is inappropriate.

  2. Treat specific applications specially, for example, make big.LITTLE scheduling decisions based on the ->comm name. This is likely to work well in some cases, but is unable to optimally handle applications that have different threads that want to run on different CPUs.

  3. Rather than migrating threads from one cpuset to another, add and remove CPUs to given cpusets. This should reduce the sysfs overhead in some cases.

In short, it is possible to get decent performance out of big.LITTLE systems on current Linux kernels for at least some workloads. As more capabilities are added to the scheduler, we can expect more workloads will run well on big.LITTLE systems, and also that it will become easier to tune workloads that can already be made to run well on big.LITTLE.

Define mobile workload and create simulator

An easy-to-use mobile-workload simulator is needed to allow people without access to the relevant hardware, emulators, and SDKs to evaluate the effects of changes on mobile workloads. The good news is that the work presented at Linaro Connect used a number of synthetic workloads, but the not-so-good news is that the setups are not yet generally useful. Shortcomings include:

  1. Much work is required to interpret the output of the workloads.

  2. The figures of merit are not constant; different figures of merit are chosen in different circumstances.

  3. They are, of course, not (yet) nicely packaged.

Hopefully we will see additional progress going forward.

Address kthread performance issues

At the February Linaro Connect, Vincent Guittot presented work showing that the bulk of CPU-hotplug latency was due to creation, teardown, and migration of the per-CPU kthreads that carry out housekeeping tasks for various kernel subsystems. Further discussion indicated that any mechanism that successfully clears all current and future work from a given CPU will likely incur similar latencies.

In an ideal world, these per-CPU kthreads would automatically quiesce when the corresponding CPU went offline, and then automatically spring back into action when the CPU came back online, but without the overheads of kthread creation, teardown, or migration. Unfortunately, the scheduler does not take kindly to runnable tasks whose affinity masks only allow them to run on CPUs that are currently offline. However, Thomas Gleixner noticed one exception to this rule: the idle tasks, a set of per-CPU tasks that already have exactly this property. This key insight led Thomas to propose that the per-CPU kthreads should have the same property, which should completely eliminate the overhead of kthread creation, teardown, and migration.

The idle tasks are currently created by each individual architecture, so Thomas posted a patchset that moves idle-task creation to common code. This patchset has been well received thus far, and went into mainline during the 3.5 merge window. Thomas has since followed up with a new patch that allows kthreads to be parked and unparked. The new kthread_create_on_cpu(), kthread_should_park(), kthread_park(), and kthread_unpark() APIs can be applied to the per-CPU kthreads that are now created and destroyed on each CPU-hotplug cycle.
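
To see how the parking API is intended to be used, here is a rough sketch of a per-CPU housekeeping kthread that parks itself across CPU-offline periods instead of being destroyed and re-created. The function and thread names are invented for illustration, and this is not code from Thomas's patchset.

    #include <linux/err.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    /* Invented example of a per-CPU housekeeping kthread's main loop. */
    static int housekeeping_thread(void *data)
    {
        while (!kthread_should_stop()) {
            if (kthread_should_park()) {
                /* CPU going offline: sleep here until unparked. */
                kthread_parkme();
                continue;
            }
            /* ... do this subsystem's per-CPU housekeeping work ... */
            schedule_timeout_interruptible(HZ);
        }
        return 0;
    }

    /* Called once per possible CPU when the subsystem initializes. */
    static struct task_struct *start_housekeeper(unsigned int cpu)
    {
        struct task_struct *t;

        t = kthread_create_on_cpu(housekeeping_thread, NULL, cpu,
                                  "housekeep/%u");
        if (!IS_ERR(t))
            wake_up_process(t);
        return t;
    }

CPU-hotplug code would then call kthread_park() on the thread as its CPU goes down and kthread_unpark() when the CPU returns, avoiding the creation, teardown, and migration overheads entirely.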

Quick Quiz 3: This sounds way simpler than CPU hotplug, so why not just get rid of CPU hotplug and replace it with this mechanism? Answer

At the Q2 Linaro Connect in Hong Kong, Nicolas Pitre suggested a different way to leverage the properties of idle tasks, namely to provide a set of high-priority per-CPU idle tasks that would normally be blocked. When it was time to quiesce a given CPU, its high-priority idle task would become runnable, preempting all of the rest of that CPU's tasks. The high-priority idle task could then power down the CPU. This approach can be thought of as a variant of idle-cycle injection. There are some limitations to this approach, but it might work quite well in situations where the CPU is to be quiesced for short periods of time.

In the meantime, Vincent Guittot has been experimenting with various userspace approaches to reduce the per-kthread overhead. One promising approach is to use cgroups to ensure that kthreadd (the parent of all other kernel threads) is guaranteed the CPU bandwidth needed during the hotplug operation. Preliminary results indicate large improvements in latency. Although this approach cannot achieve a five-millisecond CPU-hotplug operation (a target adopted at the February meeting), it is nevertheless likely to be useful for projects that are required to use current kernels with the existing slow CPU-hotplug code.
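
One way to implement this kind of policy (a hedged sketch, not necessarily Vincent's actual configuration) is to give kthreadd (normally PID 2) its own cpu-controller cgroup with a generous cpu.shares value, so that kthreads created during a hotplug operation get CPU time promptly. The cgroup name, shares value, and v1-style mount point below are assumptions for illustration.

    /* Give kthreadd (and thus its children) a high-share cpu cgroup. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static void write_str(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
            perror(path);
            exit(1);
        }
    }

    int main(void)
    {
        mkdir("/sys/fs/cgroup/cpu/hotplug_boost", 0755);
        write_str("/sys/fs/cgroup/cpu/hotplug_boost/cpu.shares", "10240");
        write_str("/sys/fs/cgroup/cpu/hotplug_boost/tasks", "2");  /* kthreadd */
        return 0;
    }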

Wean CPU hotplug from stopping all CPUs

Another source of CPU-hotplug slowness—and of real-time latency degradation—is its use of __stop_machine(). The problem is that the __stop_machine() function causes each CPU to run a special per-CPU stop-machine kthread, which brings all other processing on the system to a grinding halt for the duration. This system-wide halt is bad for performance and fatal to real-time response. Given that other systems have managed to remove CPUs from service without requiring such a halt, it should be possible to wean Linux's CPU-hotplug process from its __stop_machine() habit.

Quick Quiz 4: Why bother to fix CPU hotplug? Wouldn't it be better to just rip it out and replace it with something better? Answer

This is a large change, and will take considerable time and effort. However, it is likely to provide good improvements in both performance and robustness of CPU hotplug.

Add minimal support to scheduler for asymmetric cores

There has been great progress in a number of areas. First, Paul Turner posted a new version of his entity load-tracking patchset. This patchset should allow the scheduler to make better (and more power-aware) task-placement and load-balancing decisions. Second, Morten Rasmussen ran some experiments (including experimental patches) on top of Paul Turner's patchset. See below for more information. Third, Peter Zijlstra posted a pair of patches removing sched_mc and also posted an RFC email proposing increased scheduler awareness of hardware topology. This should allow the scheduler to better handle asymmetric architectures such as ARM's big.LITTLE architecture. Finally, Juri Lelli posted an updated version of a prototype SCHED_DEADLINE patchset, which Juri has taken over from Dario Faggioli. Please see below for a discussion of how SCHED_DEADLINE might be used for energy efficiency on big.LITTLE systems.

Morten Rasmussen presented prototype enhancements to the Linux scheduler that better accommodate big.LITTLE systems. As noted above, these enhancements are based on Paul Turner's entity load-tracking patchset. The key point behind Morten's enhancements is that work should be distributed unevenly on big.LITTLE systems in order to maximize power efficiency with minimal performance degradation. Unlike SMP systems, on big.LITTLE systems it matters deeply where a given task is run, not just when a task is run. Morten therefore created a pair of cgroups, one for the big CPUs and another for the LITTLE CPUs, but with no load balancing between them. The select_task_rq_fair() function checks task load history in order to make the correct big.LITTLE selection when a given task starts running. High-priority tasks that have tended to be CPU bound are placed on big CPUs, and all other tasks are placed on LITTLE CPUs.
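
The effect of this placement policy can be caricatured with a toy fragment. The thresholds and field names below are invented for illustration and are not taken from Morten's patches, which work from the scheduler's internal entity-load-tracking data.

    /*
     * Toy illustration of the placement heuristic: a high-priority task
     * whose tracked load is high goes to the big cluster, and everything
     * else defaults to the LITTLE cluster.  Thresholds are invented.
     */
    enum cluster { CLUSTER_LITTLE, CLUSTER_BIG };

    struct task_stats {
        int nice;            /* conventional nice value; <= 0 means high priority */
        int tracked_load;    /* load-tracking average, scaled to 0-100 */
    };

    #define BIG_LOAD_THRESHOLD 80    /* assumed threshold, percent */

    static enum cluster select_cluster(const struct task_stats *t)
    {
        if (t->nice <= 0 && t->tracked_load >= BIG_LOAD_THRESHOLD)
            return CLUSTER_BIG;
        return CLUSTER_LITTLE;
    }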

Of course, a high-priority task that has historically been I/O bound might enter a CPU-bound phase. Therefore, Morten's patchset also periodically migrates tasks via load balancing. There are of course issues with global load balancing, and addressing these issues is important future work. Nevertheless, when running the Bbench browser benchmark on an emulated big.LITTLE Android system, Morten's patches provided response times very nearly equal to those of an SMP system made up entirely of big CPUs, both when running on SMP hardware emulating big.LITTLE and when running on Paul Turner's LinSched. This is clearly an impressive achievement. In contrast, big.LITTLE response times on a vanilla kernel were 10-20% slower, with much greater variability.

Obviously, considerably more performance evaluation is required, but preliminary results are quite promising. In addition, more work will be required to handle thermal issues and also to do load balancing of tasks back onto LITTLE CPUs when all the big CPUs are saturated.

Juri Lelli's SCHED_DEADLINE is quite important for some types of real-time computing, but some thought is required to see its possible application to non-real-time big.LITTLE systems. The key thing about SCHED_DEADLINE is that it provides the scheduler with information about the likely future needs of the various applications running on the system. This information should allow the scheduler to make better big.LITTLE task-placement decisions, and also potentially better frequency-selection decisions.

Of course, this raises the question of where the deadlines and CPU requirements come from. In fact, one reason that deadline schedulers remain the exception rather than the rule even for real-time applications is that it is difficult to generate the needed information for a given real-time application, especially given the non-deterministic nature of modern hardware and variations in rendering time for different media—or even for different segments of a given item being played back. One way to solve this problem for some mobile workloads might be to use feedback-directed profiling techniques, where the application's CPU requirements are measured rather than derived.
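
To make the needed information concrete, the following fragment shows the runtime/deadline/period triple that a deadline scheduler consumes, expressed through the sched_setattr() interface of the SCHED_DEADLINE implementation that eventually reached mainline; the prototype patchset discussed here used a different system call, and the numbers are invented, so treat this purely as an illustration of what the application (or a profiling tool) must supply.

    /* Declare that this thread needs 3 ms of CPU every 16 ms (invented values).
     * Requires a kernel and headers with SCHED_DEADLINE support, and root. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;      /* these three are in nanoseconds */
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    int main(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  = 3 * 1000 * 1000,
            .sched_deadline = 16 * 1000 * 1000,
            .sched_period   = 16 * 1000 * 1000,
        };

        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
            perror("sched_setattr");
            return 1;
        }
        /* ... periodic work would run here under the deadline scheduler ... */
        return 0;
    }

The hard part, as noted above, is knowing what to put in sched_runtime and sched_period in the first place; this is where measurement-based, feedback-directed profiling could help.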

Regardless of how things turn out for the use of deadline scheduling for mobile workloads, there is clearly some important and innovative work ahead for power-aware scheduling of mobile workloads.

Finally, Vincent Guittot led a discussion on additional inputs to the scheduler, which centered mainly around accurate load tracking (e.g., Paul Turner's patchset), which CPUs share a clock (and thus must all run at the same frequency), which CPUs share idle state, which CPUs share each level of cache, and thermal constraints. There was also a call to keep all this simple, which is clearly extremely important—and also will likely be extremely difficult. But if keeping things simple was easy, where would be the challenge?

Conclusions

Although a great deal of work remains before mobile workloads can fully exploit the performance and energy efficiency of big.LITTLE, the past few months have seen some impressive progress in that direction. With appropriate user-mode help, some important workloads can make good use of big.LITTLE systems even given the existing scheduler, and some modest modifications (in conjunction with Paul Turner's entity load-tracking patchset) greatly improve the scheduler's ability to support big.LITTLE with minimal user-mode help. A promising design for streamlining CPU hotplug has been identified, and work in that direction has started.

This work gives some hope that the future will be bright for asymmetric multiprocessing in general and ARM's big.LITTLE architecture in particular.

Acknowledgments

We all owe a debt of gratitude to Srivatsa Bhat, Juri Lelli, Daniel Lezcano, Nicolas Pitre, Christian Daudt, Roger Teague, and George Grey for helping to make this article human readable. I owe thanks to Dave Rusling and Jim Wasko for their support of this effort.

Answers to quick quizzes

Quick Quiz 1: How can you tell whether multiple CPUs on your system are in the same clock domain?

Answer: Look at the contents of this sysfs file:

    /sys/devices/system/cpu/cpu0/cpufreq/affected_cpus

This file lists all of the CPUs that are in the same clock domain as CPU 0. Similar files are present for the rest of the CPUs.

Back to Quick Quiz 1.

Quick Quiz 2: Given that these emulation approaches are never going to provide a perfect emulation of big.LITTLE systems, why bother? Wouldn't it be simpler to just wait until real hardware is available?

Answer: Indeed, none of these emulation methods will perfectly emulate big.LITTLE systems. Then again, the various big.LITTLE implementations will not be exactly the same as each other anyway, so significant emulation error is quite acceptable. More importantly, these emulation approaches allow big.LITTLE software work to proceed long before real big.LITTLE hardware is available.

Back to Quick Quiz 2.

Quick Quiz 3: This sounds way simpler than CPU hotplug, so why not just get rid of CPU hotplug and replace it with this mechanism?

Answer: First, CPU hotplug is still needed to isolate failing hardware, which some expect to become more prevalent as transistors continue to shrink. Second, there are a number of limitations of Nicolas's approach compared to CPU hotplug:

  1. Interrupts must be directed away from the CPU.

  2. Timers must be migrated off of the CPU.

  3. If a CPU is left in this state for more than two minutes, the softlockup subsystem will start reporting errors if there are runnable tasks affinitied to this CPU. One workaround is to migrate tasks away from this CPU, but that approach starts looking a lot like CPU hotplug.

  4. If one of the preempted threads is hard-affinitied to the CPU and is holding a mutex, then that mutex cannot be acquired on any other CPU. If the CPU remains in this state very long, a system hang could result.

  5. The per-CPU kthreads would need to be informed of this transition if it persists for very long.

In other words, as noted above, Nicolas's approach is appropriate only for cases where the CPU is to be quiesced for a short time.

Back to Quick Quiz 3.

Quick Quiz 4: Why bother to fix CPU hotplug? Wouldn't it be better to just rip it out and replace it with something better?

Answer: Although there is always a strong urge to rip and replace, the fact is that any mechanism that successfully removes all current and future work from a CPU will really need to remove all current and future work from that CPU. And it is the removal of all such work that makes CPU hotplug difficult, not the actual powering down of the electronics. Of course, this suggests the possibility of removing all work without actually powering the CPU down, a possibility that is likely to be supported by Thomas's ongoing CPU-hotplug rework.

Back to Quick Quiz 4.

Index entries for this article
Kernel: Architectures/Arm
Kernel: big.LITTLE
Kernel: Scheduler/big.LITTLE
GuestArticles: McKenney, Paul E.