Modern Operating Systems - Processes, Threads, and Their Scheduling

Processes#

A process moves through the following states during its lifetime:

  • new: The process has just been created but has not finished initialization.
  • ready: The process can be scheduled for execution.
  • running: The process is currently executing on the CPU.
  • blocked: The process is blocked and needs to wait for an external event to complete.
  • terminated: The process has finished execution and will not be scheduled again.
Process Address Space (listed from high to low addresses):

  • Kernel code and data
  • Kernel stack
  • User stack (grows downward ⬇️)
  • Shared libraries, read-only (e.g., libc)
  • User heap (grows upward ⬆️)
  • Data segment (global variables)
  • Code segment

Linux uses fork to create a new child process from the current process: for the parent, fork returns the PID of the child; for the child, fork returns 0. The child copies data from the parent's Process Control Block (PCB), such as the open file descriptor table. However, the entries in both FD tables point to the same underlying open-file structures (only the pointers are copied), so the two processes share the same open files, including their offsets. Reading a file from both processes can therefore produce unexpected results, because the shared offset advances for both of them.
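
A minimal sketch (my own illustration, not from the original post) of this FD sharing: the file name data.txt is a placeholder and error handling is omitted.

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.txt", O_RDONLY);   // hypothetical file, opened before fork
    char buf[4];
    pid_t pid = fork();
    if (pid == 0) {                        // child: fork returned 0
        read(fd, buf, sizeof(buf));        // advances the shared file offset by 4 bytes
        _exit(0);
    }
    waitpid(pid, NULL, 0);                 // parent: fork returned the child's PID
    read(fd, buf, sizeof(buf));            // continues at offset 4, not 0, because both
                                           // FD table entries point to the same open file
    printf("parent now at offset %ld\n", (long)lseek(fd, 0, SEEK_CUR));
    return 0;
}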

With fork, a child process that is almost identical to the parent can be obtained. To run a new program in the current process, Linux provides the exec family of interfaces. When execve is called, the operating system performs the following operations (a minimal fork-and-exec sketch follows the list):

  1. Load the data and code segments into the address space based on the specified executable file path.
  2. Reinitialize the heap and stack, randomizing their starting addresses (address space layout randomization).
  3. Set the PC register to the entry point of the executable file's code segment.
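
As mentioned above, a minimal fork-and-exec sketch (my own illustration; the /bin/ls program and its arguments are arbitrary, and error handling is kept minimal):

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        // Child: replace the copied address space with a new program.
        char *argv[] = {"/bin/ls", "-l", NULL};
        char *envp[] = {NULL};
        execve("/bin/ls", argv, envp);   // on success this never returns
        perror("execve");                // reached only if execve failed
        _exit(127);
    }
    int status;
    waitpid(pid, &status, 0);            // parent waits for the new program to finish
    return 0;
}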

The first process in Linux is the init process, and all other processes are created from it, directly or indirectly, through fork. Each process keeps track of its parent and children, forming a process tree. The parent can monitor a child's exit and inspect its status, for example whether it exited abnormally. If the child exits before the parent has called (or has had a chance to call) wait, the child's resources are not fully reclaimed; in Linux, such a process is called a zombie process.
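
A small sketch (assumed example, not from the original post) of a parent reaping its child with waitpid so that no zombie is left behind:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0)
        exit(42);                          // child exits; it stays a zombie until reaped
    int status;
    waitpid(pid, &status, 0);              // parent reaps the child and collects its status
    if (WIFEXITED(status))
        printf("child %d exited with code %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}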

The Linux kernel defines the concept of a process group, a collection of processes. An application can send a signal to a process group using killpg, and the signal is delivered to every process in the group. A collection of process groups forms a session. Within a session, process groups are divided into a foreground process group and background process groups; the foreground group receives input from the controlling terminal, such as standard input and terminal-generated signals. Sessions and process groups are mainly used for process management (job control) in the shell environment.
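
A short sketch (assumed example, not from the original post) in which the child starts its own process group and the parent signals the whole group with killpg:

#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {
        setpgid(0, 0);           // child becomes the leader of a new process group
        pause();                 // wait here until a signal arrives
        _exit(0);
    }
    setpgid(pid, pid);           // set it from the parent as well, avoiding a race
    sleep(1);
    killpg(pid, SIGTERM);        // deliver SIGTERM to every process in the child's group
    waitpid(pid, NULL, 0);
    return 0;
}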

Threads#

A thread is an independent unit that can be executed within a process. All threads within the same process share the same address space but have different execution contexts, such as register states and stacks. The distribution of address space in a multi-threaded application is as follows:

Process Address Space with Multiple Threads (listed from high to low addresses):

  • Kernel code and data
  • Kernel stack 1, kernel stack 2, kernel stack 3, ... (one per thread)
  • User stack 1, user stack 2, user stack 3, ... (one per thread, each growing downward)
  • Shared libraries, read-only (e.g., libc)
  • User heap (grows upward ⬆️)
  • Data segment (global variables)
  • Code segment

Threads in the operating system are divided into user-level threads and kernel-level threads. When a user-level thread makes a system call, the call must be executed on a kernel-level thread, so user-level threads have to be mapped onto kernel-level threads. There are several mapping models:

  • Many-to-One: Multiple user threads correspond to one kernel thread. The disadvantage is that only one user thread can enter the kernel at a time.
  • One-to-One: Each user thread corresponds to one kernel thread. The disadvantage is that the number of kernel threads grows with the number of user threads.
  • Many-to-Many: N user threads correspond to M kernel threads (N > M). The kernel needs to manage and schedule the mapping relationship.

Each thread has a Thread Control Block (TCB) data structure to save the thread's context information.

User-level threads can declare special Thread Local Storage (TLS) variables. Multiple threads refer to the same TLS variable by name, but each thread only reads and writes its own copy of the data. For example:

__thread int count; // declare a TLS variable: each thread gets its own instance

// Thread 1 executes:
count = 1;
// Thread 2 executes:
count = 2;
// Afterwards, thread 1 still reads count == 1 and thread 2 still reads count == 2,
// because each thread accesses its own copy.

On the x86-64 architecture, libpthread sets up a TLS block for each thread, and the FS segment register holds the starting address of the currently running thread's TLS block. A TLS variable is accessed as the FS base plus the variable's offset, so the same variable name resolves to a different address in each thread, and the threads read and write disjoint memory.
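
A runnable sketch (my own illustration) showing that the same __thread variable resolves to a different address in each thread, as the FS-relative addressing above implies; build with gcc -pthread:

#include <pthread.h>
#include <stdio.h>

__thread int count;                          // one separate instance per thread

static void *worker(void *arg) {
    count = (int)(long)arg;                  // write this thread's own copy
    printf("thread %lu: &count=%p count=%d\n",
           (unsigned long)pthread_self(), (void *)&count, count);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}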

In Linux, thread creation, termination, and related operations are provided through the libpthread interface (a short usage sketch follows the list):

  • pthread_create: Creates a new thread in the current process to run the specified function. It is implemented through the clone system call.
  • pthread_exit: Terminates the current thread. It is called implicitly when the thread's start function returns; calling it explicitly lets the thread set its return value.
  • pthread_yield: Voluntarily suspends the current thread, yielding resources and waiting for the next scheduling. The thread remains in the ready state.
  • pthread_join: Waits for a specified thread to finish and retrieves its return value.
  • pthread_cond_wait: Waits for a specified condition variable, used for thread synchronization.
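
As noted above, a short usage sketch (assumed example; the worker function and its argument are arbitrary) of pthread_create and pthread_join; build with gcc -pthread:

#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long n = (long)arg;                      // argument passed through pthread_create
    printf("worker running with n=%ld\n", n);
    return (void *)(n * 2);                  // value later collected by pthread_join
}

int main(void) {
    pthread_t tid;
    void *ret;
    pthread_create(&tid, NULL, worker, (void *)21L);   // new thread in the same address space
    pthread_join(tid, &ret);                           // wait for it and fetch its return value
    printf("worker returned %ld\n", (long)ret);
    return 0;
}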

Scheduling Concepts#

Scheduling Metrics#

  • Throughput: The number of tasks processed per unit of time.
  • Turnaround time: The time from when a task arrives in the system (is submitted) to when it completes.
  • Response time: The time it takes for an interactive task to receive a response.
  • Real-time performance: The degree to which real-time tasks meet their deadlines (some tasks must be completed before a specified deadline).
  • Power consumption
  • Resource utilization
  • Fairness
  • Overhead of the scheduler itself

Long-Term, Medium-Term, and Short-Term Scheduling#

Long-Term Scheduling#

Long-term scheduling mainly controls the total number of tasks that can be scheduled in the system. For example, when a large number of processes are created in a short period of time, the long-term scheduler does not immediately move all processes to the ready state (except for interactive tasks with high real-time requirements). Otherwise, it would increase the number of tasks and the scheduling overhead. Additionally, the long-term scheduler balances the schedulable processes based on the current CPU and I/O usage to avoid having a large number of I/O tasks or CPU tasks simultaneously, which would cause intense resource competition and low utilization. In general, the long-term scheduler controls the total number of tasks by deciding whether to include a task in the scheduling, reducing the pressure on short-term scheduling.

Short-Term Scheduling#

Short-term scheduling makes specific scheduling decisions to transition the state of tasks between blocked/ready/running. For example:

  • If a process in the running state exceeds its estimated time slice, it will be preempted and transitioned to the ready state, waiting for the next scheduling.
  • When a task issues an I/O request and must wait for it, the task is transitioned to the blocked state; once the I/O event is ready, it is transitioned back to the ready state.
  • Software/hardware interrupts, system calls, and other events can interrupt the execution of a task.

Medium-Term Scheduling#

Medium-term scheduling is part of the paging mechanism and mainly controls the memory usage of runnable and running tasks to prevent memory resources from being exhausted. In practice, if the tasks in the system already occupy a large amount of memory, the medium-term scheduler selects certain tasks (e.g., tasks that frequently trigger page faults) and suspends them. The paging mechanism then tends to swap the suspended processes' memory out to disk, relieving memory pressure. Suspended tasks are not considered by the short-term scheduler, which further reduces its load.

Overall scheduling diagram:

Operating System Process Scheduling

Scheduling Mechanisms#

Priority Scheduling#

Round-robin: Each task is executed for a specified time slice, and then switched to the next task. This scheduling strategy can lead to longer average turnaround time (e.g., if there are only two tasks in the system, round-robin will consume a lot of time on task switching). It is also necessary to balance the size of the time slice: if it is too small, the overhead of scheduling and state transitions will be significant; if it is too large, it may cause high latency for some tasks.

Considering the real-time requirements of interactive tasks, each process can be assigned a priority, and tasks with higher priority (such as interactive or real-time tasks) will be scheduled first by the scheduler.

In practice, there are often multiple tasks with high priority. Therefore, a multi-level queue can be used to assign all tasks to different priority levels. Within the same priority level, multiple tasks are placed in a queue and scheduled using round-robin or other strategies. This strategy may result in long delays for low-priority tasks.

Based on the multi-level queue, the priority can be dynamically adjusted based on the task's runtime, known as multi-level feedback queue (MLFQ). This strategy tends to assign shorter tasks (such as I/O-bound and interactive tasks that have a short runtime on the CPU) higher priority to achieve better average turnaround time.

Dynamic priority adjustment strategy: MLFQ assigns new tasks a high priority when they start (which benefits interactive and real-time tasks). Each priority level has a maximum running-time budget (the total time a task may run at that level). If a task exceeds this budget on a queue, MLFQ treats it as a long task and demotes it to the next lower level. After several runs, a task settles at a priority level that matches its execution time.

Since low-priority tasks are generally long tasks, low-priority queues use longer time slices, while high-priority queues use correspondingly shorter ones.
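
A minimal demotion sketch (the level count, budgets, time slices, and the single-task driver are all illustrative assumptions, not values from the original post):

#include <stdio.h>

#define LEVELS 3

static const int budget_ms[LEVELS]    = {10, 50, 200};  // total CPU budget allowed per level
static const int timeslice_ms[LEVELS] = {5, 20, 100};   // longer slices at lower priority

struct task {
    const char *name;
    int level;      // current priority level, 0 = highest
    int used_ms;    // CPU time consumed at this level so far
};

// Called when a task uses up one time slice at its current level.
static void account_timeslice(struct task *t) {
    t->used_ms += timeslice_ms[t->level];
    if (t->used_ms >= budget_ms[t->level] && t->level < LEVELS - 1) {
        t->level++;         // ran too long here: treat it as a long task and demote it
        t->used_ms = 0;
    }
}

int main(void) {
    struct task t = {"cpu-bound", 0, 0};
    for (int i = 0; i < 6; i++) {
        account_timeslice(&t);
        printf("after slice %d: priority level %d\n", i + 1, t.level);
    }
    return 0;
}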

Fair-Share Scheduling#

In addition to improving scheduling metrics, in some cases, it is necessary to achieve fair scheduling of resources. For example, if User A and User B share a machine at the same cost, but User A runs three processes while User B only runs one process, the system needs to ensure that User A and User B each occupy 50% of the resources, namely CPU time.

Lottery Scheduling#

Lottery scheduling assigns a certain number of tickets to each task. Each time the scheduler selects a winning ticket number and chooses the corresponding task to execute, similar to weighted random selection. The pseudocode is as follows:

loop:
    R = random(0, total_tickets)     // draw the winning ticket, 0 <= R < total_tickets
    sum = 0
    for task in task_list {
        sum += task->tickets
        if sum > R {
            schedule(task)           // the task holding ticket R runs for one time slice
            goto loop
        }
    }

By assigning each user the same number of tickets, CPU time can be distributed fairly in expectation. However, the actual distribution depends on randomness: when the number of scheduling decisions is small, the observed shares may deviate noticeably from the ticket ratio.

Stride Scheduling#

Stride scheduling assigns deterministic time slices to tasks to achieve fair distribution of resources, eliminating the randomness of lottery scheduling.

Assuming two tasks A and B should be allocated resources in a ratio of 6:5, stride scheduling uses the inverse ratio as their strides: A's stride is 5 and B's stride is 6. Stride scheduling uses a fixed time slice; each round it selects the task with the smallest accumulated distance (its pass value), runs it, and then adds the task's stride to its pass. Because the strides differ, A and B end up running in a 6:5 ratio, giving a stable and fair distribution. The pseudocode is as follows:

loop:
    task = queue.pop_min()       // task with the smallest pass (distance traveled) so far
    schedule(task)               // run it for one fixed time slice
    task.pass += task.stride     // advance its pass by its stride
    queue.push(task)
    goto loop

Multi-Core Scheduling#

In a multi-core scenario, where the system can run multiple tasks simultaneously, the scheduler needs to decide which tasks run at the same time and on which cores. It also has to account for the overhead of frequent task switching on a core (such as page-table switches and saving/restoring register state).

Cooperative Scheduling#

Some tasks depend on each other, for example two tasks with a request-response relationship. If they cannot run at the same time (A sends a message while B is not running), the message has to wait until the next time B is scheduled before it can be received; and by the time B runs, A may already have been scheduled out, so the reply is not received immediately either. Each request and response therefore incurs at least one time slice of scheduling delay, reducing the program's efficiency. For such scenarios, cooperative scheduling is needed: a group of tasks is scheduled and executed in parallel. Tasks that need to run simultaneously (such as the request-response pair above) are assigned to the same group so they execute in parallel, while tasks where one must wait for another to finish are assigned to different groups, avoiding idle waiting and wasted CPU.

Cooperative scheduling is generally used for parallel computing, where a large task is divided into multiple sub-tasks for parallel execution on multiple cores to speed up computation. However, in more general scenarios, using cooperative scheduling may result in CPU resource waste due to tasks in the same group waiting for each other (note: cooperative scheduling is based on groups, not individual tasks).

A classic strategy for cooperative scheduling is group scheduling (gang scheduling). The basic idea is to divide all tasks into groups, assign related tasks to the same group, and schedule each group onto multiple cores as a unit. For example, with 4 CPU cores and three task groups A/B/C containing 4/3/2 tasks respectively, the scheduling would be as follows:

| Core 0 | Core 1 | Core 2 | Core 3 |
| ------ | ------ | ------ | ------ |
| A0     | A1     | A2     | A3     |
| B0     | B1     | B2     | C1     |
| C2     | A0     | A1     | A2     |

Two-Level Scheduling#

In a multi-core scenario, considering the overhead of task switching between cores, a task can be fixed to a specific core for scheduling and execution. That is, new tasks are first assigned to a specific core by a global scheduler, and each core then uses a single-core scheduling strategy to execute the tasks assigned to it. During system operation, the global scheduler also balances the task distribution among different cores based on the load of each core. This scheduling method is called two-level scheduling.

Processor Affinity#

Some operating systems provide user-level interfaces related to processor affinity on top of their own schedulers, allowing application processes to control CPU core usage. For example, Linux provides the sched_setaffinity interface, which allows user processes to control their own execution on specific cores.
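
A small sketch (assumed example) that pins the calling process to core 0 using sched_setaffinity; the glibc helper sched_getcpu is used only to show where the process ends up running:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                                     // allow only CPU core 0
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  // pid 0 = the calling process
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 0, currently running on CPU %d\n", sched_getcpu());
    return 0;
}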
