Nginx Reverse Proxy DNS Issue
This article documents the DNS issue encountered when using Nginx as a Reverse Proxy.
The core problem is that Nginx uses DNS to locate its backend servers, but when the DNS records are updated, Nginx does not re-query DNS for the latest results. As a result, all client connections are directed to the addresses from the old DNS records and time out. The article reproduces this behavior, identifies the cause, and discusses the available solutions.
Environment Setup
To simplify DNS operations and configuration, Kubernetes is used to set up a test environment. The environment consists of three services:
- Backend: A web server based on Python, representing the backend service, with multiple replicas deployed.
- Client: Attempts to access the backend through Nginx.
- Nginx: Forwards requests received from the client to the backend using the proxy method.
A Kubernetes Service is used to generate the DNS records for the four backend pods.
Note: The Nginx here is a simple Nginx service and does not involve any Ingress Controller components.
The deployed services in the environment are as follows:
Note: The problem can only be reproduced with a Headless Service; with a ClusterIP Service there is no issue, because the ClusterIP does not change when pods restart, so from Nginx's perspective nothing changes.
The actual situation with IP/DNS is as follows:
Backend-related resources:
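A minimal sketch of what the backend Deployment and Headless Service could look like follows; the image, labels, and command are assumptions, while the name python-www, the four replicas, port 8000, and clusterIP: None follow from the rest of the article:

```yaml
# Sketch of the backend resources; image, labels, and command are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-www
spec:
  replicas: 4
  selector:
    matchLabels:
      app: python-www
  template:
    metadata:
      labels:
        app: python-www
    spec:
      containers:
      - name: www
        image: python:3.11                             # assumed image
        command: ["python", "-m", "http.server", "8000"]
        ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: python-www
spec:
  clusterIP: None          # Headless: DNS returns the pod IPs directly
  selector:
    app: python-www
  ports:
  - port: 8000
    targetPort: 8000
```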
Nginx configuration:
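The part of the configuration that matters is the proxy_pass directive pointing at the Service name; a minimal sketch of the server block (the listen port is an assumption) looks like:

```nginx
server {
    listen 80;   # assumed listen port

    location / {
        # Forward all requests to the backend Service by DNS name.
        proxy_pass http://python-www:8000;
    }
}
```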
Client attempting to access Nginx:
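The client simply issues plain HTTP requests against Nginx in a loop; a sketch, assuming the Nginx Service is reachable under the name nginx:

```bash
# Hypothetical client loop; the Service name "nginx" is an assumption.
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 5 http://nginx/
  sleep 1
done
```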
Problem Simulation
As mentioned at the beginning, the problem occurs when there is any change in the DNS content. Currently, querying python-www returns the IP addresses of the four backend pods:
$ nslookup python-www
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: python-www.default.svc.cluster.local
Address: 192.168.215.159
Name: python-www.default.svc.cluster.local
Address: 192.168.215.162
Name: python-www.default.svc.cluster.local
Address: 192.168.215.158
Name: python-www.default.svc.cluster.local
Address: 192.168.215.160
Now restart all four pods and observe whether the DNS response changes:
$ kubectl rollout restart deploy python-www
deployment.apps/python-www restarted
$ nslookup python-www
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: python-www.default.svc.cluster.local
Address: 192.168.215.164
Name: python-www.default.svc.cluster.local
Address: 192.168.215.166
Name: python-www.default.svc.cluster.local
Address: 192.168.215.165
Name: python-www.default.svc.cluster.local
Address: 192.168.215.163
After the DNS change, when the client attempts to access the backend services through Nginx, it experiences timeouts:
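For example, a request that previously returned immediately now hangs until the client-side timeout fires (the Nginx Service name nginx is again an assumption):

```bash
# After the DNS change, requests through Nginx hang and eventually time out.
curl -v --max-time 10 http://nginx/   # expected to fail with a timeout (curl exit code 28)
```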
Problem Investigation
From the Kubernetes perspective, the DNS content has been correctly updated. Without using Nginx, accessing these services works fine. Therefore, the issue is primarily with Nginx caching the DNS results. When DNS changes occur, Nginx does not attempt to re-resolve.
The Nginx version used here is 1.10.3, with the following configuration:
location / {
    proxy_pass http://python-www:8000;
}
Using tcpdump with a port 53 filter to observe the network traffic at the moment the Nginx server starts, the DNS requests can be captured:
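A capture along these lines can be obtained with a simple port 53 filter inside the Nginx pod's network namespace:

```bash
# Watch DNS traffic only; -n avoids reverse lookups that would add noise.
tcpdump -ni any port 53
```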
It can be observed that when the Nginx server starts, a DNS request is immediately made to resolve python-www. Because this environment runs in a Kubernetes Pod, the search domain default.svc.cluster.local is appended to the query.
With tcpdump still running, no new DNS packets are seen, either when the client sends requests or when the four pods are restarted to change the DNS records.
From this packet behavior it can be inferred that, under the current configuration, Nginx resolves the DNS name once when the server starts and never re-resolves it. If the DNS content changes later, Nginx keeps connecting to the old addresses, and connections time out.
Problem Solution
Solution One
Since Nginx resolves DNS at startup, forcing it to re-resolve with a reload (nginx -s reload) is a viable workaround. The test results are as follows:
After the reload, the client can connect successfully again, but this approach is not very convenient in practice: it is hard to know when Nginx actually needs a reload, making it an impractical solution.
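A crude sketch of driving such a reload externally, assuming a fixed interval is acceptable for the workload:

```bash
# Force Nginx to re-resolve the backend name by reloading periodically.
# The 30-second interval is an arbitrary assumption.
while true; do
  nginx -s reload
  sleep 30
done
```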
Solution Two
This issue has been discussed extensively online, and a common solution is to use a variable inside proxy_pass. Configured this way, Nginx is forced to re-resolve DNS for every new connection, which avoids the timeout issue. The following configuration is an example using resolver and set:
location / {
    resolver 10.96.0.10;
    set $target http://python-www;
    proxy_pass $target:8000;
}
With this configuration, tcpdump shows a new DNS request for every new connection, but the client still experiences timeouts:
From the capture above, it can be seen that the DNS request no longer has the search domains appended; the name is queried as the FQDN python-www., which Kubernetes cannot resolve, so no valid answer is returned.
The correct usage within Kubernetes requires setting the variable to the full FQDN, as shown below:
location / {
    resolver 10.96.0.10;
    set $target http://python-www.default.svc.cluster.local;
    proxy_pass $target:8000;
}
With this configuration, the client can access the backend successfully.
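As a related note, the resolver directive also accepts a valid parameter that overrides how long Nginx caches a resolved answer; combined with the configuration above, it bounds how stale an address can get (the 10s value is only an example):

```nginx
# Re-use a resolved answer for at most 10 seconds, regardless of the record's TTL.
resolver 10.96.0.10 valid=10s;
```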
Conclusion
- When using proxy_pass with a DNS name in Nginx, it is crucial to pay attention to the usage. Otherwise, if the DNS response changes, Nginx will retain the old data, causing all new connections to fail and eventually time out.
- In Kubernetes, using Nginx with Headless Services has the potential to encounter this problem.
- Solutions include periodic reloads or using variables to force Nginx to query DNS for each connection.
- When a variable is used, Nginx queries the name in the variable as an FQDN. The search domains configured inside the Kubernetes Pod are therefore not applied, and the DNS query fails. So when using variables, the complete name must be used, appending something like “svc.cluster.local” (the exact suffix depends on how the cluster is deployed).