Open Source For Geeks: 03/16/14

Sunday, 16 March 2014

Processes and Threads in Linux

Linux Processes

In a very basic form, Linux process can be visualized as running instance of a program. For example, just open a text editor on your Linux box and a text editor process will be born.

First command (gedit &) opens gedit window while second ps command (ps -aef | grep gedit) checks if there is an associated process(ps command gives running processes output of which is piped to grep command which searches the required token i.e gedit in iur case). In the result you can see that there is a process associated with gedit.

You can see two entries corresponding to search of word gedit in processes. Each process in Linux is associated with a unique PID(process ID). You can see the output in the screenshot above number in 2nd column is the PID of the process. SO gedit has a pid of 2343. So whats 2039 ? Is is called PPID(Parents Process ID). We have run the gedit command/process in a terminal instance. Hence the terminal forms the parent of all the processes that we run via that terminal and gedit being one of them. So how do we verify that 2039 is indeed the PID of parent terminal process. To find the PID of the terminal you can simply type ps in your terminal.

You can see 2039 PID corresponding to bash process which is our terminal. I use bash but you may be using other shells like ksh, csh etc. To find which shell you are using you can refer to one of my earlier posts

How to find current Shell in Linux?

How PIDs are assigned to process?

As per the Wiki

Under Unix, process IDs are usually allocated on a sequential basis, beginning at 0 and rising to a maximum value which varies from system to system. Once this limit is reached, allocation restarts at zero and again increases. However, for this and subsequent passes any PIDs still assigned to processes are skipped.

But there is a small update in above. For user processes PIDs to be assigned generally start from a number RESERVED_PIDS and go till PID_MAX_DEFAULT. PIDs from 1 to RESERVED_PIDS are reserved for kernel processes. Also know that these numbers can be configured.

Processes have priority based on which kernel context switches them. A process can be pre-empted if a process with higher priority is ready to be executed.

For example, if a process is waiting for a system resource like some text from text file kept on disk then kernel can schedule a higher priority process and get back to the waiting process when data is available. This keeps the ball rolling for an operating system as a whole and gives user a feeling that tasks are being run in parallel.
Processes can talk to other processes using Inter process communication methods and can share data using techniques like shared memory.

How processes are created in Linux?

In Linux, fork() is used to create new processes. These new processes are called as child processes and each child process initially shares all the segments like text, stack, heap etc until child tries to make any change to stack or heap. In case of any change, a separate copy of stack and heap segments are prepared for child so that changes remain child specific. The text segment is read-only so both parent and child share the same text segment. C fork function article explains more about fork().

Step By Step

The fork ( ) system call does the following in a UNIX system

Allocates slot in the process table for the new process.
Assigns a unique process id to the new process.
Make a copy of the process image of the parent, with the exception of shared memory.
Increases counters for any files owned by the parent, to reflect that an additional process now also owns these files.
Assigns the child process to a ready to run state.
Returns the Process ID number (PID) of the child to the parent process and a 0 value to the child process.

Note : All these works is done in Kernel space of parent process.

Above diagram shows the process table and how each entry in it points to a process image.

A Process image consists of

User Data
User program
System Stack(Kernel space).
Process control block (PCB) containing process attributes.

PCB looks like following

It has the process state(Eb. ready to run, sleeping, preempted etc), process number or PID which we talked about earlier, registers, PC, File descriptors etc.

Process States

From forking(birth) of a process to it's end(resources being freed up and entry removed from process table), a process goes through various states. Below diagram shows the state chart of a process in UNIX.

Threads in Linux

Threads in Linux are nothing but a flow of execution of the process. A process containing multiple execution flows is known as multi-threaded process.

For a non multi-threaded process there is only execution flow that is the main execution flow and hence it is also known as single threaded process. For Linux kernel , there is no concept of thread. Each thread is viewed by kernel as a separate process but these processes are somewhat different from other normal processes. I will explain the difference in following paragraphs.

Threads are often mixed with the term Light Weight Processes or LWPs. The reason dates back to those times when Linux supported threads at user level only. This means that even a multi-threaded application was viewed by kernel as a single process only. This posed big challenges for the library that managed these user level threads because it had to take care of cases that a thread execution did not hinder if any other thread issued a blocking call.

Later on the implementation changed and processes were attached to each thread so that kernel can take care of them. But, as discussed earlier, Linux kernel does not see them as threads, each thread is viewed as a process inside kernel. These processes are known as light weight processes.

The main difference between a light weight process (LWP) and a normal process is that LWPs share same address space and other resources like open files etc. As some resources are shared so these processes are considered to be light weight as compared to other normal processes and hence the name light weight processes.

So, effectively we can say that threads and light weight processes are same. It’s just that thread is a term that is used at user level while light weight process is a term used at kernel level.

From implementation point of view, threads are created using functions exposed by POSIX compliant pthread library in Linux. Internally, the clone() function is used to create a normal as well as alight weight process. This means that to create a normal process fork() is used that further calls clone() with appropriate arguments while to create a thread or LWP, a function from pthread library calls clone() with relevant flags. So, the main difference is generated by using different flags that can be passed to clone() function.

PIDs - User and Kernel View

In the kernel, each thread has it's own ID, called a PID (although it would possibly make more sense to call this a TID) and they also have a TGID (thread group ID) which is the PID of the thread that started the whole process.

Simplistically, when a new process is created, it appears as a thread where both the PID and TGID are the same (new) number.

When a thread starts another thread, that started thread gets its own PID (so the scheduler can schedule it independently) but it inherits its TGID from the thread that created it.

That way, the kernel can happily schedule threads independent of what process they belong to, while processes (thread group IDs) are reported to you.

Fer example refer to following diagram

You can see that starting a new process gives you a new PID and a new TGID (both set to the same value), while starting a new thread gives you a new PID while maintaining the same TGID as the thread that started it.

PS : I picked up some basic knowledge from thegeekstuff and added some extra points and diagrams to make it more easily understandable.