Performance Tuning – RHEL


The performance of a server basically depends on four parameters: CPU, memory, input/output (I/O), and network. So, to troubleshoot a performance issue on your server, you need to focus primarily on these four things.

At a very high level, the following are the four subsystems that need to be monitored to make sure your machine does not run into performance issues:

  • CPU Usage
  • Memory Usage
  • I/O Transactions
  • Network Traffic

CPU Usage :

CPU usage is a picture of how the processors in your machine are being utilized. A single CPU refers to a single hardware (possibly virtualized) hyper-thread. There might be multiple physical processors in a machine, each with multiple cores, and each core with multiple hyper-threads.
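
To see how many sockets, cores, and hyper-threads your machine actually presents, you can use the lscpu command (part of util-linux); the exact labels can vary slightly between RHEL releases:

# lscpu

Multiplying Socket(s) × Core(s) per socket × Thread(s) per core gives the total number of logical CPUs the kernel sees.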

Four critical performance metrics for CPU are :

Context Switch

  • When the CPU switches from one process (or thread) to another, it is called a context switch.
  • When a context switch happens, the kernel stores the current CPU state of the running process (or thread) in memory.
  • The kernel also retrieves the previously stored state of the next process (or thread) from memory and loads it into the CPU.
  • Context switching is essential for CPU multitasking.
  • However, an excessive rate of context switching can cause performance issues.
  • Context switching can be checked with the “sar -w” command (see the example below).
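
As a minimal sketch (assuming the sysstat package, which provides sar, is installed), you can sample the context-switch rate once per second, three times:

# sar -w 1 3

The cswch/s column shows context switches per second (proc/s shows process creations per second). What counts as “too high” depends on your workload, so compare against a baseline taken when the system was behaving normally.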

 

Run Queue

  • The run queue indicates the total number of active processes currently queued for the CPU.
  • When the CPU is ready to execute a process, it picks one from the run queue based on process priority.
  • Please note that processes in the sleep state or I/O wait state are not in the run queue.
  • So, a consistently high number of processes in the run queue can cause performance issues.
  • The run queue can be checked with the “sar -q” command (see the example below).
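
A quick way to watch the run queue (again assuming sysstat is installed) is:

# sar -q 1 3

The runq-sz column is the run-queue length, plist-sz is the total number of processes and threads in the process list, and ldavg-1/ldavg-5/ldavg-15 repeat the load averages discussed below.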

 

CPU Utilization

  • This indicates how much of the CPU is currently being used.
  • This is fairly straightforward, and you can view CPU utilization with the top command.
  • 100% CPU utilization means the system is fully loaded.
  • So, a consistently high percentage of CPU utilization will cause performance issues.
  • CPU utilization can be checked with the “top” command (see the example below).
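
For a one-shot, scriptable snapshot you can run top in batch mode; the header line labelled “%Cpu(s)” (or “Cpu(s)” on older releases) breaks utilization down by category:

# top -b -n 1 | head -15

Watch us (user), sy (system), id (idle), and wa (I/O wait); a high wa value points at the disk subsystem rather than the CPU itself.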

 

Load Average

  • This indicates the average CPU load over a specific time period.
  • On Linux, load average is displayed for the last 1 minute, 5 minutes, and 15 minutes. This is helpful to see whether the overall load on the system is going up or down.
  • For example, a load average of “0.75 1.70 2.10” indicates that the load on the system is coming down. 0.75 is the load average in the last 1 minute. 1.70 is the load average in the last 5 minutes. 2.10 is the load average in the last 15 minutes.
  • Please note that the load average is calculated by combining the total number of processes in the run queue and the total number of processes in the uninterruptible task state.
  • Load average information is stored in the /proc/loadavg file.
  • Load average can be checked with the “uptime” command (see below).
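
You can also read the raw values straight from the file mentioned above:

# cat /proc/loadavg

The first three fields are the 1-, 5-, and 15-minute load averages, the fourth field shows the number of currently runnable entities over the total number of scheduling entities, and the fifth is the PID of the most recently created process.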

 

Now, to further understand system load, let’s make a few assumptions. Say we have the load averages below:

[root@devopscheetah ~]# uptime
02:14:26 up 0 min, 2 users, load average: 1.00, 0.40, 3.35

On a single-core system this would mean:
  • The CPU was fully (100%) utilized on average; 1 process was running on the CPU (1.00) over the last 1 minute.
  • The CPU was 60% idle on average; no processes were waiting for CPU time (0.40) over the last 5 minutes.
  • The CPU was overloaded by 235% on average; 2.35 processes were waiting for CPU time (3.35) over the last 15 minutes.
On a dual-core system this would mean:
  • On average, one CPU was fully used while the other was idle; no processes were waiting for CPU time (1.00) over the last 1 minute.
  • The CPUs were 80% idle on average; no processes were waiting for CPU time (0.40) over the last 5 minutes.
  • The CPUs were overloaded by roughly 68% on average; 1.35 processes were waiting for CPU time (3.35) over the last 15 minutes (see the note on per-core load below).
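
Because the load average is not divided by the number of CPUs, a useful rule of thumb is to normalise it yourself. A purely illustrative one-liner (using nproc and awk) would be:

# awk -v cores="$(nproc)" '{printf "1-min load per core: %.2f\n", $1/cores}' /proc/loadavg

A per-core value that stays above 1.00 means processes are regularly waiting for CPU time.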

Memory Usage :

  • As you know, RAM is your physical memory. If you have 4GB RAM installed on your system, you have 4GB of physical memory.
  • Virtual memory = Swap space available on the disk + Physical memory. The virtual memory contains both user space and kernel space.
  • The unused RAM will be used as file system cache by the kernel.
  • The Linux system will swap when it needs more memory than is physically available. When it swaps, it writes the least used memory pages from physical memory to the swap space on the disk.
  • A lot of swapping can cause performance issues, as the disk is much slower than physical memory, and it takes time to move memory pages between RAM and disk.
  • System memory information is stored in the “/proc/meminfo” file.

Please note that whenever we load a file or program, it consumes some physical memory (RAM). Since RAM is limited, the system sometimes temporarily moves some of those memory pages to swap space (located on disk); this is called a swap-out. When physical memory (RAM) frees up again, those pages are pulled back into RAM; this is called a swap-in. So, in layman’s terms, a high swap-out rate will result in performance issues.
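
A simple way to see whether the system is actively swapping is to watch the si and so columns of vmstat for a short while:

# vmstat 1 5

si is memory swapped in from disk and so is memory swapped out to disk (in KB by default); sustained non-zero so values indicate memory pressure, while occasional small spikes are usually harmless.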

There are a few commands that can be used to check the status of your physical and swap memory. Some of them are listed below, followed by a short example:

  • # sar -S
  • # free -m
  • # sar -R
  • # vmstat
  • # top
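
For example, free -m prints physical and swap memory in megabytes, and the same raw numbers can be pulled out of /proc/meminfo (the column names shown by free differ slightly between RHEL 6 and RHEL 7 and later):

# free -m
# grep -E '^(MemTotal|MemFree|SwapTotal|SwapFree)' /proc/meminfo

Remember that memory used for buffers and cache is reclaimable, so a low “free” figure on its own is not necessarily a problem.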

I/O Transactions :

  • I/O wait is the amount of time the CPU spends waiting for I/O. If you see consistently high I/O wait on your system, it indicates a problem in the disk subsystem.
  • You should also monitor reads per second and writes per second. These are measured in blocks, i.e. the number of blocks read/written per second, and are also referred to as bi and bo (blocks in and blocks out).
  • tps indicates the total transactions per second, which is the sum of rtps (read transactions per second) and wtps (write transactions per second).
  • I/O can be checked with the “iostat” command (see the example below).
  • You can check per-disk transactions with the “sar -d” command; it will show you the transactions on your individual disks.
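
A minimal disk check with extended statistics (iostat comes from the same sysstat package as sar) looks like:

# iostat -dx 1 3

The await columns (r_await/w_await on newer sysstat versions) show the average time in milliseconds a request spends being serviced, and %util shows how busy the device is; a device sitting near 100% util is saturated.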

Network Traffic :

For network interfaces, you should monitor the total number of packets (and bytes) received and sent through the interface, the number of packets dropped, and so on. Identifying the traffic on individual ports and network cards can be useful in narrowing down performance issues on your machine.
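
For a quick per-interface view (assuming sysstat is installed) you can sample network statistics with sar:

# sar -n DEV 1 3

rxpck/s and txpck/s are packets received and transmitted per second per interface, and rxkB/s and txkB/s are the corresponding kilobytes per second.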

netstat (network statistics) is a command-line tool for monitoring incoming and outgoing network connections, as well as viewing routing tables, interface statistics, etc. Netstat is available on all Unix-like operating systems and also on Windows. It is very useful for network troubleshooting and performance measurement. Netstat is one of the most basic network service debugging tools, telling you which ports are open and whether any programs are listening on them. You can use netstat to gather various statistics like the ones below:

Listing all ports (both TCP and UDP) using netstat -a option.

# netstat -a

Listing only TCP (Transmission Control Protocol) port connections using netstat -at.

# netstat -at

Listing only UDP (User Datagram Protocol) port connections using netstat -au.

# netstat -au

Listing all active listening ports with netstat -l.

# netstat -l

Listing all active listening TCP ports by using option netstat -lt.

# netstat -lt

Listing all active listening UDP ports by using option netstat -lu.

# netstat -lu

Listing all active UNIX listening ports using netstat -lx.

# netstat -lx

Displaying statistics by protocol with netstat -s. By default, statistics are shown for protocols such as IP, ICMP, TCP, and UDP; combine -s with -t or -u to limit the output to a single protocol.

# netstat -s

Showing statistics of only the TCP protocol by combining the -s and -t options.

# netstat -st

Displaying the service name along with its PID: the netstat -tp option adds a “PID/Program name” column.

# netstat -tp

Displaying the kernel IP routing table with netstat -r (similar to the output of the route command).

# netstat -r

Checking which process is using which port (and which port is used by which process) with netstat -anlp.

# netstat -anlp
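
Note that on newer RHEL releases the net-tools package (which provides netstat) is deprecated in favour of ss from iproute; a roughly equivalent listing of listening TCP/UDP sockets with their owning processes is:

# ss -tulpn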

 

Following is a four-step approach to identify and solve a performance issue.

  • Step 1 – Understand (and reproduce) the problem:

The more time you spend understanding and defining the problem, the more detail you will have to look for answers in the right place. If possible, try to reproduce the problem.

  • Step 2 – Monitor and collect data: 

Monitor the system and try to collect as much data as possible on the various subsystems. Based on this data, come up with a list of potential issues. A simple way to capture such data over a time window is shown below.
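
As a rough sketch of such a collection run (the intervals, sample counts, and file paths here are only placeholders, and sysstat is assumed to be installed), you could leave a few collectors running in the background while the problem is occurring:

# vmstat 5 120 > /tmp/vmstat_$(date +%F_%H%M).log &
# iostat -dx 5 120 > /tmp/iostat_$(date +%F_%H%M).log &
# sar -n DEV 5 120 > /tmp/sar_net_$(date +%F_%H%M).log &

This gives CPU/memory, disk, and network samples covering the same time window, which makes it much easier to correlate them afterwards.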

  • Step 3 – Eliminate and narrow down issues:

After having a list of potential issues, dive into each one of them and eliminate any non issues. Narrow it down further to see whether it is an application issue, or an infrastructure issue. If it is an infrastructure issue, narrow it down and identify the subsystem that is causing the issue. If it is an I/O subsystem issue, narrow it down to a specific partition, or raid group, or LUN, or disk. Basically, keep drilling down until you put your finger on the root cause of the issue.

  • Step 4 – One change at a time:

Don’t make multiple changes at the same time. If you make multiple changes, you won’t know which one fixed the original issue.
