topicnews · September 24, 2024

Linux 6.12: Scheduler now extendable and EEVDF conversion complete

Linux 6.12, expected on November 18 or 25, brings three major changes to the code that controls when and for how long processes use the processor. The most hotly anticipated is the “Extensible Scheduler Class” known as “Sched_Ext,” which allows the process scheduler to delegate many decisions about allocating processor time to BPF programs. Experts can write these themselves and load them into the kernel to adjust the time distribution to their needs without having to change Linux source code.


The kernel developers have also completed the redesign of the time distribution algorithm that began in Linux 6.6, with the process scheduler now using the “Earliest Eligible Virtual Deadline First” (EEVDF) method. This fine-tuning makes it possible to reduce the latency of mostly short-running applications.

The third innovation is the “SCHED_DEADLINE Server Infrastructure”: On systems with real-time processes, it is designed to better ensure that low-priority applications still get sufficient processor time. This feature is the last major change by a key and respected developer who died in June at the age of 37.

The Extensible Scheduler Class was driven forward by key developers at Meta: The company already uses Sched_Ext itself to better adapt the allocation of processor time to the needs of its huge data centers. To do this, developers write BPF programs for the workloads of the various server classes; these are then loaded into the kernel at runtime, which executes them in kernel context using the BPF virtual machine. Such BPF programs are less isolated than regular application programs, although they are subject to some security restrictions. In return, they can interact with the kernel much more quickly and directly access the data it processes. BPF programs are already used in a variety of ways in the Linux environment, for example in systemd security mechanisms, for performance analysis or for high-performance control of network data streams.
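Because such schedulers are loaded and unloaded at runtime, it can be useful to check whether one is currently active. The following is a minimal sketch, assuming the sysfs files `/sys/kernel/sched_ext/state` and `/sys/kernel/sched_ext/root/ops` described in the kernel's Sched_Ext documentation; the function names are made up for this illustration.

```python
from pathlib import Path

# Assumption: sysfs layout as described in the kernel's sched_ext documentation.
SCHED_EXT_SYSFS = Path("/sys/kernel/sched_ext")


def read_sysfs(path):
    """Return the stripped content of a sysfs file, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def sched_ext_status(base=SCHED_EXT_SYSFS):
    """Report the sched_ext state and the name of the loaded BPF scheduler, if any."""
    return {
        "state": read_sysfs(base / "state"),
        "scheduler": read_sysfs(base / "root" / "ops"),
    }


if __name__ == "__main__":
    status = sched_ext_status()
    if status["state"] is None:
        print("Sched_Ext is not available on this kernel")
    else:
        print(f"state: {status['state']}, scheduler: {status['scheduler'] or 'none'}")
```

On kernels without Sched_Ext support the files do not exist, so both values come back as `None`.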

Some developers and companies are already working on Sched_Ext programs that optimize the allocation of processor time for larger user groups, for example to spare gamers stuttering in resource-intensive games. It is to be expected that some distributions will ship such BPF programs in the future and temporarily activate them when games are started. Distributors must disclose the source code, because Sched_Ext programs, like the kernel code, must be licensed under the GPLv2 or a compatible license.

There will probably soon be all sorts of Sched_Ext programs in circulation. As is usual with such extension interfaces, some of them are likely to work around problems and functional gaps in the process scheduler that would be better addressed in the scheduler’s C code. In the worst case, this could cause problems for users, for example if they want to use a Sched_Ext extension for gaming that works poorly or not at all alongside another one that optimizes performance on processors whose CPU cores run at different speeds.

Because of these and many other issues, several developers of Linux’s process scheduler spoke out against the inclusion of Sched_Ext, often very strongly. The other side argued, among other things, that Sched_Ext makes it easier to experiment with scheduling techniques and might thus encourage further development of the regular scheduler. Linus Torvalds’ stance was unclear for a while. For many years he advocated the approach that “the kernel should have only one process scheduler that covers all areas of use”, which is why alternatives such as Con Kolivas’ “Brain Fuck Scheduler” (BFS) and other schedulers stayed out of the official kernel for years, despite being extremely popular in certain circles.

However, around a year ago, at the Kernel Maintainer Summit, Torvalds clearly spoke out in favor of including Sched_Ext. But that did not happen at first. A few months ago, he indicated that, if no agreement was in sight, he would overrule the developers of the regular process scheduler and merge Sched_Ext for Linux 6.11 despite their criticism. Both sides then reworked a few details so that everyone could at least come to terms with the result a little better. As a result, Sched_Ext ended up in 6.12 instead of 6.11.

Some background on the whole controversy and on Sched_Ext in general is provided by the LWN.net articles “The extensible scheduler class” and “Another push for sched_ext”. Further insights can be found in the cover letter of the Sched_Ext patches, the description in the merge commit and the Sched_Ext technical documentation.

Coincidentally, at the same time, developers of the regular scheduler completed and refined the switch, begun in Linux 6.6, to distributing CPU time using the “Earliest Eligible Virtual Deadline First” (EEVDF) method. Among other things, this brings advantages for applications that are supposed to react quickly but usually only run for a short time.

The kernel can now give such processes preferential treatment and, if necessary, take the CPU away from other applications to do so, even if they have not yet used up their current time slice. Up to now, the kernel only did this when it wanted to run processes with real-time priority. However, the kernel does not choose this path on its own, but only for processes that explicitly request shorter time slices via sched_setattr() and sched_attr::sched_runtime. These processes get their turn sooner, which reduces latency, but they also run for a shorter time on each turn: in the end, they receive the same amount of CPU time as other processes with the same priority, to avoid unfairness.
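How a process might request a shorter slice can be sketched as follows. Python has no wrapper for sched_setattr(), so this example goes through the raw system call via ctypes. It is a sketch under several assumptions: the syscall numbers cover only x86_64 and arm64, the structure mirrors the original 48-byte layout of `struct sched_attr`, and the helper names `build_attr` and `request_slice` are invented for this illustration.

```python
import ctypes
import os
import platform

# Assumption: syscall numbers of sched_setattr() on x86_64 and arm64 only.
SYS_SCHED_SETATTR = {"x86_64": 314, "aarch64": 274}
SCHED_OTHER = 0  # policy of regular (non-real-time) processes


class SchedAttr(ctypes.Structure):
    """Original 48-byte layout of the kernel's struct sched_attr."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),   # requested slice in nanoseconds
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
    ]


def build_attr(slice_ns, nice=0):
    """Fill a sched_attr that asks for a custom time slice."""
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_policy = SCHED_OTHER
    attr.sched_nice = nice
    attr.sched_runtime = slice_ns
    return attr


def request_slice(slice_ns, pid=0):
    """Ask the kernel for a shorter slice for `pid` (0 means the caller)."""
    number = SYS_SCHED_SETATTR.get(platform.machine())
    if number is None:
        raise NotImplementedError("syscall number unknown for this architecture")
    libc = ctypes.CDLL(None, use_errno=True)
    # Keep the current nice value; sched_setattr() would otherwise reset it.
    attr = build_attr(slice_ns, nice=os.getpriority(os.PRIO_PROCESS, pid))
    result = libc.syscall(ctypes.c_long(number), ctypes.c_int(pid),
                          ctypes.byref(attr), ctypes.c_uint(0))
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A latency-sensitive program could call `request_slice(100_000)` at startup to ask for slices of 0.1 milliseconds; whether the request has any effect depends on the running kernel.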

The description of the change that implements this explains the whole thing in detail using clever ASCII art; LWN.net provides further details on the changes and on time distribution with shorter time slices in the article “Completing the EEVDF Scheduler”. The merge commit message for the major scheduler changes also roughly outlines these and other improvements to EEVDF.
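The trade-off between latency and run length can be illustrated with a toy model. This is not the kernel's algorithm, only a heavily simplified sketch: all tasks have the same weight, a task is eligible when it has received no more CPU time than the average, the eligible task with the earliest virtual deadline runs next, and it always runs for its full slice. The names `Task` and `simulate` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    slice_ms: int        # requested time slice
    vruntime: int = 0    # CPU time received so far (equal weights assumed)
    deadline: int = 0    # virtual deadline of the current request
    dispatches: int = 0  # how often the task was put on the CPU


def simulate(tasks, duration_ms):
    """Run a toy EEVDF-style pick loop for `duration_ms` of CPU time."""
    for task in tasks:
        task.deadline = task.vruntime + task.slice_ms
    clock = 0
    while clock < duration_ms:
        total = sum(t.vruntime for t in tasks)
        # Eligible: received no more than the average service so far.
        eligible = [t for t in tasks if t.vruntime * len(tasks) <= total]
        # Earliest virtual deadline first; ties go to the first task listed.
        current = min(eligible, key=lambda t: t.deadline)
        current.vruntime += current.slice_ms
        current.deadline = current.vruntime + current.slice_ms
        current.dispatches += 1
        clock += current.slice_ms
    return tasks


if __name__ == "__main__":
    for task in simulate([Task("short", 1), Task("a", 3), Task("b", 3)], 18):
        print(f"{task.name}: {task.vruntime} ms in {task.dispatches} dispatches")
```

In this run all three tasks end up with 6 ms of CPU time, but `short` is dispatched six times while `a` and `b` are each dispatched twice: the task with the shorter slice gets its turn more often and runs more briefly each time, without receiving more time overall.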

The merge comment also mentions the third major change: the SCHED_DEADLINE Server Infrastructure. It is intended for systems that run real-time applications and regulate their time allocation using the deadline scheduling class of the regular scheduler. In the past, real-time applications could largely monopolize the processor, meaning that processes with regular priorities were not given enough time. An approach called “real-time throttling” was supposed to prevent this from happening, but it often worked rather poorly. The new infrastructure takes a different approach and, in the standard configuration, ensures that regular processes receive at least five percent of the processor time. The LWN.net article “Deadline server as a real-time throttling replacement” provides further insights into the process.
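The five percent figure can be reproduced with simple arithmetic. The sketch below assumes that the default reservation is 50 milliseconds of runtime per one-second period and that the per-CPU values are exposed in debugfs under `/sys/kernel/debug/sched/fair_server/`; neither detail is stated in this article, and the function names are invented for illustration.

```python
from pathlib import Path

# Assumed defaults (in nanoseconds): 50 ms of runtime in every 1 s period.
DEFAULT_RUNTIME_NS = 50_000_000
DEFAULT_PERIOD_NS = 1_000_000_000

# Assumed debugfs location of the per-CPU settings; reading it needs root.
FAIR_SERVER_DEBUGFS = Path("/sys/kernel/debug/sched/fair_server")


def reserved_share(runtime_ns, period_ns):
    """Fraction of CPU time guaranteed to regular processes."""
    return runtime_ns / period_ns


def fair_server_share(cpu, base=FAIR_SERVER_DEBUGFS):
    """Return the reserved share for one CPU, or None if it cannot be read."""
    cpu_dir = base / f"cpu{cpu}"
    try:
        runtime_ns = int((cpu_dir / "runtime").read_text())
        period_ns = int((cpu_dir / "period").read_text())
    except (OSError, ValueError):
        return None
    return reserved_share(runtime_ns, period_ns)


if __name__ == "__main__":
    share = reserved_share(DEFAULT_RUNTIME_NS, DEFAULT_PERIOD_NS)
    print(f"assumed default share: {share:.0%}")
    print(f"share on CPU 0: {fair_server_share(0)}")
```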

The driving force behind this new infrastructure was Daniel Bristot de Oliveira. It is his last significant contribution to Linux, as he died in June at the age of 37. Bristot was a highly committed and respected developer in the Linux real-time area for many years.



Portrait photo of the late Daniel Bristot de Oliveira. (Image: bristot.me / Daniel Bristot de Oliveira)

Several dozen developers remembered him last week at a conference in a “Celebration of Life”. Less than two hours later and a few meters away, Linus Torvalds received the pull request that, after 20 years of painstaking work, now gives Linux real-time capabilities out of the box, an achievement that is also largely thanks to Bristot.


(dmk)