---
title: Nginx nchan module causes batch SSL certificate renewal failures
createTime: 2026/04/22 08:43:00
tags:
- nginx
- ssl
- certbot
- nchan
---
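## Background
During a routine check of automatic SSL certificate renewal, `certbot renew --dry-run` reported a large number of failures. Of 13 domains, the first 6 renewed successfully and the remaining 7 all failed with a **504 Gateway Timeout** error.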
## Environment
| Component | Version/Configuration |
| -------- | --------- |
| Nginx | 1.24.0 |
| Certbot | 2.9.0 |
| Number of domains | 13 |
## Error message
```text
Certbot failed to authenticate some domains (authenticator: nginx).
The Certificate Authority reported these problems:
Domain: example.a.com
Type: unauthorized
Detail: 12.34.56.78: Invalid response from
http://example.a.com/.well-known/acme-challenge/xxx: 504
```
## Initial investigation
### 1. Check Nginx status
The Nginx service reported itself as running normally, but one thing stood out:
```bash
$ pgrep nginx | wc -l
147
```
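**147 nginx processes**, far more than normal (usually 1 master + N workers).
On top of that, none of the services proxied by nginx were reachable, which suggested the processes were blocked.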
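
A raw process count does not say which kind of process is piling up. The sketch below is one possible way to break the count down by role. It assumes the process titles nginx normally sets for itself (`nginx: master process ...`, `nginx: worker process`, and `nginx: worker process is shutting down` for workers draining after a reload). The function name `summarize_nginx_procs` is made up for this example.

```bash
#!/usr/bin/env bash
# Summarize nginx processes by role from `ps` output read on stdin.
# Assumption: process titles follow nginx's usual "nginx: <role> process" form.
summarize_nginx_procs() {
  awk '
    /nginx: master process/                  { master++ }
    /nginx: worker process is shutting down/ { draining++; next }
    /nginx: worker process/                  { worker++ }
    END { printf "master=%d worker=%d shutting_down=%d\n", master, worker, draining }
  '
}

# Usage on a live Linux host (procps `ps`; -C selects by command name,
# so the awk process itself is not counted):
#   ps -C nginx -o args= | summarize_nginx_procs
```

A large `shutting_down` count would point at old workers that never exit after a reload, while a large `worker` count would point elsewhere.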
### 2. Check the error log
```bash
$ grep "23:18" /var/log/nginx/error.log
```
This turned up a large number of worker process crash records:
```text
2026/04/21 23:18:20 [alert] 149662#149662: worker process 153306 exited on signal 6 (core dumped)
2026/04/21 23:18:20 [alert] 149662#149662: shared memory zone "memstore" was locked by 153306
2026/04/21 23:18:20 [alert] 149662#149662: worker process 153307 exited on signal 6 (core dumped)
2026/04/21 23:18:20 [alert] 149662#149662: shared memory zone "memstore" was locked by 153307
...
```
Counting the crashes:
```bash
$ grep -c "exited on signal 6" /var/log/nginx/error.log
2361
```
**2361 worker process crashes!**
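
A single total hides when the crashes happened. As a rough sketch, the same log can be bucketed per minute to see whether the crashes cluster around the renewal run. The helper name `crashes_per_minute` is hypothetical, and the code assumes the default error log timestamp format shown above (`YYYY/MM/DD HH:MM:SS`).

```bash
#!/usr/bin/env bash
# Count "exited on signal 6" entries per minute in an nginx error log.
# $1: path to the error log
crashes_per_minute() {
  awk '
    # Key = date plus the HH:MM part of the time field
    /exited on signal 6/ { n[$1 " " substr($2, 1, 5)]++ }
    END { for (k in n) print k, n[k] }
  ' "$1" | sort
}

# Usage:
#   crashes_per_minute /var/log/nginx/error.log
```

Each output line has the form `date HH:MM count`, sorted chronologically.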
## Analysis
```text
certbot renew → modifies nginx config → nginx reload
→ nchan module bug → workers crash
→ requests cannot be served → Let's Encrypt waits more than 60 seconds → 504 timeout
```
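certbot processes certificates sequentially. For each certificate it has to:
1. Modify the nginx configuration to add a temporary validation path
2. Reload nginx
3. Wait for Let's Encrypt to validate
4. Restore the configuration

By the time it reached the 7th certificate, the frequent reloads had triggered a bug in the nchan module, and worker processes crashed in bulk. From that point on nginx could no longer respond to requests normally, so every subsequent certificate validation timed out and failed.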
According to [Nginx Ticket #1135](https://trac.nginx.org/nginx/ticket/1135), the nchan module has a known problem when nginx reloads:
> After upgrading from 1.10.1 without ALPN support to 1.10.2 with ALPN support... we've been getting into situations where Nginx completely stops serving connections without any warning.
>
> The nginx error log on the affected hosts gets these odd messages:
>
> ```text
> worker process exited on signal 6 (core dumped)
> shared memory zone "memstore" was locked by xxx
> ```
## Solution
Disable the nchan module:
```bash
# 1. Locate the nchan module configuration
ls -la /etc/nginx/modules-enabled/ | grep nchan
# lrwxrwxrwx 1 root root 49 Apr 20 18:50 50-mod-nchan.conf -> /usr/share/nginx/modules-available/mod-nchan.conf
# 2. Remove the symlink (this disables the module)
sudo rm /etc/nginx/modules-enabled/50-mod-nchan.conf
# 3. Test the configuration
sudo nginx -t
# nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
# nginx: configuration file /etc/nginx/nginx.conf test is successful
# 4. Restart nginx
sudo service nginx restart
```
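
Removing the module only works cleanly when no site configuration still uses nchan directives. Otherwise `nginx -t` would reject them as unknown directives. A minimal sketch of a pre-check follows, assuming that nchan directives all start with the `nchan_` prefix and that the configuration lives under `/etc/nginx`. The function name `find_nchan_directives` is made up for this example.

```bash
#!/usr/bin/env bash
# List config lines that use an nchan directive.
# $1: nginx config root (defaults to /etc/nginx)
# Returns non-zero when nothing is found, like grep.
find_nchan_directives() {
  # -r: recurse, -n: show line numbers, -E: extended regex
  grep -rnE '^[[:space:]]*nchan_[a-z_]+' "${1:-/etc/nginx}"
}

# Usage:
#   find_nchan_directives || echo "no nchan directives in use"
```

If this prints any lines, those locations need to be removed or rewritten before the module is disabled.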
Rerun the certbot renewal test:
```bash
sudo certbot renew --dry-run
```
This time all 13 certificates renewed successfully.
## References
- [Nginx Ticket #1135 - Connections timing out after upgrading to 1.10.2](https://trac.nginx.org/nginx/ticket/1135)
- [nchan official documentation](https://nchan.io/)
- [Certbot documentation](https://eff-certbot.readthedocs.io/)