How to do Linux NFS Performance Tuning and Optimization
First of all, this article is not my own work. It’s a copied and amended version of the original. Full credits go to Sarath Pillai for his article.
Our introductory guide to NFS left out some major topics that require special attention when we talk about NFS. These topics must always be given extra care while configuring NFS. We purposely skipped them in that tutorial because they are serious topics that deserve to be discussed separately. If you have not read our NFS introductory guide, I recommend reading it before beginning this tutorial.
The things we skipped in that tutorial are 1. NFS Performance Tuning Guidelines, and 2. Securing NFS. We will do a separate post for the security related material. In this post we will discuss topics that in one way or another affect the performance of NFS.
NFS performance tuning can be classified into three different areas. We will discuss them separately in this tutorial. Let’s have a look at these classifications first.
- Underlying Disk Related Performance that affects NFS
- NFS Application based Performance; and finally
- Network Related NFS tuning (NFS is a technology that relies heavily on network)
Tuning both the NFS server and the NFS client is important, because both take part in the network file system communication. So let’s begin with some mount command options that can be used to tune NFS performance, primarily from the client side.
Mount command Block Size Settings to improve NFS performance
The size of the data chunks that the server and the client use for passing data between them is very important. Most NFS versions have default values for these settings. However, you can always tune these values to suit your needs. We will be working with the same NFS server and client that we used for our previous tutorial.
Assume that you have an NFS share mounted on one of your NFS client systems. Let’s have a look at the default properties of this mount.
[root@slashroot2 ~]# mount 192.168.0.103:/data /mnt
[root@slashroot2 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              38G  5.6G   31G  16% /
tmpfs                 252M     0  252M   0% /dev/shm
192.168.0.103:/data    38G  2.8G   34G   8% /mnt
Let’s have a look at the properties and options that the NFS client selected by default to mount this share. We can easily get that information from the file /proc/mounts.
[root@slashroot2 ~]# cat /proc/mounts
192.168.0.103:/data /mnt nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.0.103 0 0
The same file also lists every other file system mounted on your system. To avoid confusion, I have only shown the entry related to our NFS share.
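If you would rather not pick the NFS entry out by eye, a small filter can do it. The following is a minimal sketch, not part of the original procedure: the function name nfs_mounts is hypothetical, and it assumes the usual /proc/mounts layout in which the third field is the file system type.

```shell
#!/bin/sh
# nfs_mounts [FILE]
# Print device, mount point and options for NFS entries only.
# FILE defaults to /proc/mounts (hypothetical helper, for illustration).
nfs_mounts() {
    # Field 3 is the file system type; match "nfs" and "nfs4" exactly
    # so that unrelated types such as "nfsd" are skipped.
    awk '$3 ~ /^nfs4?$/ { print $1, $2, $4 }' "${1:-/proc/mounts}"
}
```

On a client you would simply run nfs_mounts with no argument; passing a file name is only there to make the function easy to try against saved output.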
The options in that entry are the defaults that were used when mounting that particular share on the client.
- rw tells that the file system is mounted in read/write mode
- vers=3 means we are using NFS version 3 for this mount
- rsize=32768 and wsize=32768 specify the size of the data chunks that each RPC packet carries while reading and writing
Tuning rsize and wsize will sometimes increase performance and can also sometimes reduce it. Let’s see why.
Tuning rsize and wsize must always be done keeping in mind the capacity of your network, as well as the processing power of your client and server. Let’s say you have decided to decrease rsize and wsize in your mount. The amount of data that needs to be sent is the same regardless. Decreasing the size of the RPC packets will increase the total number of network IP packets that need to be sent in order to deliver that same amount of data.
For example, if there is 1 MB of data to send, dividing it into equal chunks of 32KB means that 32 chunks need to be sent ( 32 * 32 = 1024 ). If you divide that same 1MB of data into 64KB chunks, 16 chunks would need to be sent (64 * 16 = 1024 ). Every packet sent over the network incurs some overhead due to how TCP/IP works, so it may be more efficient to send the 16x 64KB packets than the 32x 32KB packets.
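The arithmetic above can be reproduced for any chunk size. This is a minimal sketch; the function name rpc_chunks is made up for illustration, and it only counts chunks, it does not model the per-packet overhead itself.

```shell
#!/bin/sh
# rpc_chunks TOTAL_KB CHUNK_KB
# Number of chunks needed to carry TOTAL_KB of data in CHUNK_KB pieces.
rpc_chunks() {
    total_kb=$1
    chunk_kb=$2
    # Ceiling division, so a partial last chunk still counts as one chunk.
    echo $(( (total_kb + chunk_kb - 1) / chunk_kb ))
}

# 1 MB (1024 KB) of data at a few candidate rsize/wsize values.
for size in 8 32 64; do
    echo "${size}KB chunks: $(rpc_chunks 1024 "$size")"
done
```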
So the decision to modify these parameters must always depend on the network capability. If you have 1 Gigabit ports on your NFS server and client, and the network switches connecting them are also capable of 1 Gigabit on those ports, I would suggest raising these parameters to a higher value.
You can easily set the rsize and wsize values when mounting the volume, as shown below. The maximum accepted value depends on your kernel version: it is 65536 on the kernel used here, while recent Linux kernels accept up to 1048576 (1 MB), and the client and server negotiate down to the largest size both support.
[root@slashroot2 ~]# mount 192.168.0.104:/data /mnt -o rsize=65536,wsize=65536
As the above mount command shows, you can modify the rsize and wsize options at mount time. Alternatively, you can set them permanently in the /etc/fstab mount entry.
The best method to select good rsize and wsize values is to try different values and do a read/write performance test for each. You can then select the values that give you the best performance. You can refer to our post read/write performance test in linux, to test the speed.
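One way to organise such a test run is sketched below. It is a dry run only: the hypothetical function print_rsize_trials prints the commands for each candidate size instead of executing them, so you can review them first. The dd invocation, the 256 MB test size and the file name ddtest are assumptions chosen for illustration, not values from the original article.

```shell
#!/bin/sh
# print_rsize_trials SHARE MOUNTPOINT SIZE...
# Print (do not run) a remount and a simple dd write test per size.
print_rsize_trials() {
    share=$1
    mnt=$2
    shift 2
    for size in "$@"; do
        echo "umount $mnt"
        echo "mount $share $mnt -o rsize=$size,wsize=$size"
        # conv=fsync makes dd flush the data before reporting its speed.
        echo "dd if=/dev/zero of=$mnt/ddtest bs=1M count=256 conv=fsync"
    done
}

print_rsize_trials 192.168.0.103:/data /mnt 8192 32768 65536
```

Run the printed commands as root on a test client, note the throughput dd reports for each size, and keep the setting that performs best.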
Modifying Network MTU Size for NFS
MTU stands for Maximum Transmission Unit. It’s the largest amount of data that can be passed in one Ethernet frame. Most machines have it configured to the default value of 1500 bytes.
To get the current MTU value of your network interfaces, you can run the command below.
[root@slashroot2 ~]# netstat -i
Kernel Interface table
Iface   MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0   227      0      0      0   235      0      0      0
Alternatively, you can get the MTU value from the ifconfig command in Linux.
Suppose your rsize and wsize values are 8 kilobytes and you are using a 1500 byte MTU. The data will still be fragmented while sending, because the maximum frame payload is 1500 bytes. If you change your MTU to 9000 bytes, the whole 8 kilobytes can be sent in one frame without fragmenting. Doing this, however, means that every device between the server and client also needs to be configured for a 9000 byte MTU. This includes network switches and firewalls. If you don’t do this, you can end up with truncated packets and communication loss between devices.
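To get a feel for the numbers, the sketch below estimates how many Ethernet frames one 8 KB chunk needs at each MTU. It is an approximation under stated assumptions: the function name frames_needed is hypothetical, it assumes 40 bytes of IPv4 and TCP headers per frame with no options, and it ignores the RPC and NFS headers that travel with the data.

```shell
#!/bin/sh
# frames_needed PAYLOAD_BYTES MTU_BYTES
# Rough count of frames needed to carry one payload.
frames_needed() {
    payload=$1
    mtu=$2
    # Assumption: 20 bytes IPv4 header + 20 bytes TCP header per frame.
    per_frame=$(( mtu - 40 ))
    # Ceiling division: a partly filled last frame is still a frame.
    echo $(( (payload + per_frame - 1) / per_frame ))
}

echo "MTU 1500: $(frames_needed 8192 1500) frames"
echo "MTU 9000: $(frames_needed 8192 9000) frames"
```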
Changing the MTU is quite simple in Linux. You can specify the MTU size in the configuration file of the interface. Suppose you need to change the MTU for your eth0 interface. You simply need to edit the file /etc/sysconfig/network-scripts/ifcfg-eth0 and add the line MTU=9000.
Otherwise, you can also change the MTU with the help of the ifconfig command as shown below.
[root@slashroot2 ~]# ifconfig eth0 mtu 9000 up
[root@slashroot2 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:55:D1:CC
          inet6 addr: fe80::a00:27ff:fe55:d1cc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
Note: Changing the MTU size is quite risky on a production network, because it can affect your currently running applications. Also, some ISPs do not accept frames that are larger than their specified MTU size.
timeo and retrans options in NFS
The above two options affect the number of retry attempts made by the client to the server in case of a delayed response from the server or sometimes no response from the server.
The timeo option decides how long the client waits before concluding that it must retransmit the packet. The value is given in tenths of a second, so timeo=5 means the client waits 5/10 of a second before deciding that it needs to resend the packet. The default is 7 (0.7 seconds) over UDP and 600 (60 seconds) over TCP, which is why the /proc/mounts output earlier shows timeo=600.
The second option, retrans, decides the total number of attempts made by the client in case it gets a timeout (after waiting for the timeo period you provided).
So if you give retrans a value of 3, the client will resend the RPC packet 3 times (waiting for the timeo period each time) before concluding that the server is not available and giving you the message “Server not responding”. After the message the counter resets and the client keeps on trying, with the same timeo and retrans values.
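Because timeo is expressed in tenths of a second, it is easy to misread. The sketch below converts a timeo value to seconds and estimates how long the client waits before reporting the server as unresponsive. Both function names are hypothetical, and the second one follows the simplified description above (retrans waits of timeo each); the real client may lengthen the wait between retries, so treat the result as a rough lower bound.

```shell
#!/bin/sh
# timeo_seconds TIMEO
# Convert a timeo value (tenths of a second) to seconds, one decimal.
timeo_seconds() {
    printf '%d.%d\n' $(( $1 / 10 )) $(( $1 % 10 ))
}

# simple_retry_window TIMEO RETRANS
# Simplified model: RETRANS waits of TIMEO each, with no backoff.
simple_retry_window() {
    timeo_seconds $(( $1 * $2 ))
}

echo "timeo=5 waits $(timeo_seconds 5) seconds per attempt"
echo "timeo=5,retrans=3 gives up after about $(simple_retry_window 5 3) seconds"
```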
You can modify timeo and retrans values as an option in mount command as shown below.
[root@slashroot2 ~]# mount 192.168.0.102:/data /mnt -o timeo=5,retrans=4
[root@slashroot2 ~]#
If you want to see the current NFS statistics for retransmission of packets, you can use the nfsstat command as shown below.
[root@slashroot2 ~]# nfsstat -r
Client rpc stats:
calls      retrans    authrefrsh
5          0          0
On a congested network, where your client gets a reply from the server but the reply is a little delayed (so that retransmissions happen too often), you can increase the timeo value. This will result in a small increase in performance.
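To judge whether retransmissions happen "too often", it helps to turn the two counters into a percentage. The following is a minimal sketch that reads nfsstat -r style output on standard input; the function name retrans_percent is hypothetical, and it assumes the client section has a "calls retrans ..." header line followed by one line of values, as in the output shown earlier.

```shell
#!/bin/sh
# retrans_percent
# Read "nfsstat -r" output on stdin and print retransmissions as a
# percentage of total client RPC calls.
retrans_percent() {
    awk '
        # The client header is the one whose second column is "retrans".
        $1 == "calls" && $2 == "retrans" { header = 1; next }
        header {
            if ($1 > 0) printf "%.2f\n", 100 * $2 / $1
            else print "0.00"
            exit
        }
    '
}

# Sample input copied from the output above; on a real client use:
#   nfsstat -r | retrans_percent
printf 'Client rpc stats:\ncalls retrans authrefrsh\n5 0 0\n' | retrans_percent
```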
Number of NFS threads on the NFS server
Another important factor that needs to be taken care of while working with NFS is the total number of NFS threads that are available on the NFS server. If you have a large number of clients that access your NFS server, then it will be better to increase the number of threads on the NFS server.
You can have a look at the current number of threads on your NFS server by the below command.
[root@slashroot1 ~]# ps aux | grep nfs
root      4794  0.0  0.0      0     0 ?        S<   03:18   0:00 [nfsd4]
root      4795  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4796  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4797  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4798  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4799  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4800  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4801  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4802  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
If you count the nfsd processes you will get 8 (which is the default number). This means that if you have a large number of clients accessing this NFS server, they will experience some lag in their operations while they wait for an available thread.
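Counting by eye is error prone, since grep nfs also matches the [nfsd4] thread and possibly the grep process itself. A minimal sketch of a stricter count follows; the function name count_nfsd is hypothetical, and it assumes the thread name appears as the last field of each ps line, as in the listing above.

```shell
#!/bin/sh
# count_nfsd
# Read "ps aux" output on stdin and count the kernel threads whose
# command field is exactly [nfsd].
count_nfsd() {
    # $NF is the last field; "n + 0" prints 0 when nothing matched.
    awk '$NF == "[nfsd]" { n++ } END { print n + 0 }'
}

# On the server you would run:  ps aux | count_nfsd
```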
Let’s increase this to a higher number such as 16. You can modify this value in /etc/sysconfig/nfs (its location on Red Hat style systems such as the one used here).
# Number of nfs server processes to be started.
# The default is 8.
RPCNFSDCOUNT=16
After modifying that value, you need to restart the nfs service. You should now get 16 instead of 8 in the process list.
Async and Sync in NFS mount
These are the two values that determine how data is written on the server on a client request.
Both have their own advantages and disadvantages. Let’s first understand what async and sync mean in an NFS mount.
Whatever you do on an NFS client is converted to an equivalent RPC operation, so that it can be sent to the server using the RPC protocol. If you are using the async option, when the server receives an RPC operation for writing, it first converts that operation to a VFS (Virtual File System) operation to write the data to the underlying disk system.
As soon as the VFS hands the write operation to the underlying disk, even before getting an acknowledgement that the write is completed, the server becomes ready to accept further RPC write operations. In this case the NFS server increases write performance by reducing the time needed to complete each write operation.
But this method can sometimes cause data loss and corruption, because the NFS server starts to accept more write operations even before the underlying disk system has finished its job.
Using the sync option does the reverse. In this case the server replies only after a write operation has successfully completed, which means only after the data is completely written to the disk.
If you are dealing with critical data then I would never suggest using the async option. However, async is a good choice where your data is not that critical.
[root@slashroot2 ~]# mount 192.168.0.101:/data /mnt -o rw,async
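Similarly, you can use the sync option according to your requirement. Note that the server-side behavior described above is set per export with the sync or async option in /etc/exports on the server; the mount option of the same name controls whether the client delays its own writes. You can make this mount permanent by making an entry in fstab.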
192.168.0.101:/data /mnt nfs rw,async 0 0
Tuning Input and output Socket Queue for NFS performance
Transferring large files over the network requires a lot of memory on the server as well as the client. However, by default a Linux machine never allocates a large amount of memory for this purpose, as it needs memory for other applications as well.
You can tune this and allocate more memory if you have heavy input and output through the network.
There are two values that can be modified. One is the socket input queue and the other is the socket output queue. The input queue is the place where incoming requests that need to be processed queue up.
The output queue is the place where outgoing requests queue up. We have already seen that increasing the number of NFS threads on the server can improve performance. Imagine you have 16 threads on your server, each processing requests from separate clients. All of them use the same socket input and output queues (and other applications on the server also use these queues for processing their requests). This means that with larger input and output socket queues, all of your threads can send and receive data effectively.
You can change those values by modifying the sysctl.conf file or, if you want, by directly modifying the files in /proc (you need to restart the nfs server after modifying this).
echo 219136 > /proc/sys/net/core/rmem_default
echo 219136 > /proc/sys/net/core/rmem_max
And you can also modify the output queue by modifying the wmem_default & wmem_max values as shown below.
echo 219136 > /proc/sys/net/core/wmem_default
echo 219136 > /proc/sys/net/core/wmem_max
Anything that you modify in the /proc file system is temporary, because the value is stored in RAM and does not persist across reboots. You can make these settings permanent by adding entries to sysctl.conf as shown below.
[root@slashroot2 ~]# echo 'net.core.wmem_max=219136' >> /etc/sysctl.conf
[root@slashroot2 ~]# echo 'net.core.rmem_max=219136' >> /etc/sysctl.conf
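The commands above persist only the two _max values, while the /proc examples also changed the two _default values. A minimal sketch that produces all four entries in one go is shown below; the function name socket_queue_sysctl is hypothetical, and it only prints the lines so that you can inspect them before appending anything.

```shell
#!/bin/sh
# socket_queue_sysctl SIZE
# Print sysctl.conf entries for the four net.core socket queue settings.
socket_queue_sysctl() {
    size=$1
    for key in rmem_default rmem_max wmem_default wmem_max; do
        echo "net.core.$key=$size"
    done
}

socket_queue_sysctl 219136
# To apply for real (as root), something like:
#   socket_queue_sysctl 219136 >> /etc/sysctl.conf && sysctl -p
```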
Underlying Disk Configuration in NFS server
The configuration and make of the underlying disk that you expose as an NFS share on the server play a significant role in performance. If you have your NFS share on a RAID array, that can improve read and write performance, depending on the RAID level configured.
RAID 10 is usually the best choice for performance, but it is costly because of the number of disks it uses. If you don’t write often, or mostly do large sequential writes, then RAID 5/50 can be a better choice. Disks over 2TB in size are best served by RAID 6/60 because of the rebuild time required if a disk fails: a second failure during a long rebuild would destroy a RAID 5 array.
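The cost difference between these levels is easy to quantify. The sketch below prints how many disks' worth of capacity remain usable at each level. The function name raid_usable_disks is hypothetical, and it assumes a single array with two-way mirrors for RAID 10, one parity disk for RAID 5 and two for RAID 6; nested layouts such as RAID 50/60 lose one or two disks per sub-array instead.

```shell
#!/bin/sh
# raid_usable_disks LEVEL DISKS
# Number of disks' worth of usable capacity in a single array.
raid_usable_disks() {
    level=$1
    disks=$2
    case $level in
        10) echo $(( disks / 2 )) ;;   # half the disks hold mirror copies
        5)  echo $(( disks - 1 )) ;;   # one disk's worth of parity
        6)  echo $(( disks - 2 )) ;;   # two disks' worth of parity
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

for level in 10 5 6; do
    echo "RAID $level with 8 disks: $(raid_usable_disks "$level" 8) disks of usable capacity"
done
```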
Tune each and every parameter suggested in this article, by continuously performing the read/write performance test, to reach an optimum level of tuning.