---
title: Nginx nchan module causes batch SSL certificate renewal failures
createTime: 2026/04/22 08:43:00
tags:
- nginx
- ssl
- certbot
- nchan
---

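## Background

During a routine check of automatic SSL certificate renewal, `certbot renew --dry-run` reported a large number of failures. Of the 13 domains, the first 6 renewed successfully and the remaining 7 all failed with a **504 Gateway Timeout** error.
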
## Environment

| Component | Version/Config |
| --------- | -------------- |
| Nginx     | 1.24.0         |
| Certbot   | 2.9.0          |
| Domains   | 13             |

## Error message

```text
Certbot failed to authenticate some domains (authenticator: nginx).
The Certificate Authority reported these problems:
Domain: example.a.com
Type: unauthorized
Detail: 12.34.56.78: Invalid response from
http://example.a.com/.well-known/acme-challenge/xxx: 504
```

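## Initial investigation

### 1. Check Nginx status

The Nginx service reported that it was running normally, but one thing looked off:

```bash
$ pgrep nginx | wc -l
147
```

There were **147 nginx processes**, far more than normal (typically 1 master + N workers).

In addition, none of the services proxied by nginx were reachable, which suggested that the processes were blocked.
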
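A raw process count does not show what kind of processes are piling up. The sketch below groups them by role, assuming the usual nginx process titles (`nginx: master process`, `nginx: worker process`, `nginx: worker process is shutting down`) and a Linux `ps` from procps. The function name `summarize_nginx_procs` is hypothetical and not part of the original investigation.

```bash
# Summarize nginx processes by role.
# Reads process titles (one per line) on stdin.
summarize_nginx_procs() {
  awk '
    /nginx: master process/                  { master++ }
    /nginx: worker process is shutting down/ { draining++; next }
    /nginx: worker process/                  { worker++ }
    END {
      printf "master=%d worker=%d shutting_down=%d\n", master, worker, draining
    }
  '
}

# Usage on a live host (assumes procps ps):
#   ps -C nginx -o args= | summarize_nginx_procs
```

A large `shutting_down` count would point at old workers that never exit after a reload, while a large `worker` count would point at something else, so the split may help narrow things down before reading the logs.
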
### 2. Check the error log

```bash
$ grep "23:18" /var/log/nginx/error.log
```

This turned up a large number of worker process crash records:

```text
2026/04/21 23:18:20 [alert] 149662#149662: worker process 153306 exited on signal 6 (core dumped)
2026/04/21 23:18:20 [alert] 149662#149662: shared memory zone "memstore" was locked by 153306
2026/04/21 23:18:20 [alert] 149662#149662: worker process 153307 exited on signal 6 (core dumped)
2026/04/21 23:18:20 [alert] 149662#149662: shared memory zone "memstore" was locked by 153307
...
```

Counting the crashes:

```bash
$ grep -c "exited on signal 6" /var/log/nginx/error.log
2361
```

**2361 worker process crashes!**

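A single total hides when the crashes happened. As a rough sketch, the crash lines can be bucketed per minute so that bursts can be compared with the times certbot was running. It assumes the log line format shown above (date in the first field, time in the second); `crashes_per_minute` is a hypothetical helper name.

```bash
# Print "<date> <HH:MM> <count>" for every minute that contains
# at least one "exited on signal 6" line.
# $1: path to an nginx error log
crashes_per_minute() {
  awk '
    /exited on signal 6/ { n[$1 " " substr($2, 1, 5)]++ }
    END { for (k in n) print k, n[k] }
  ' "$1" | sort
}

# Usage:
#   crashes_per_minute /var/log/nginx/error.log
```
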
## Analysis

```text
certbot renew → modify nginx config → nginx reload
→ nchan module bug → workers crash
→ requests cannot be served → Let's Encrypt waits more than 60 seconds → 504 timeout
```

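certbot processes certificates one at a time. For each certificate it has to:

1. Modify the nginx configuration to add a temporary validation path
2. Reload nginx
3. Wait for Let's Encrypt to validate
4. Restore the configuration

By the time it reached the 7th certificate, the frequent reloads had triggered the nchan module bug and worker processes crashed in bulk. From that point on nginx could not respond to requests normally, so every subsequent certificate validation timed out and failed.
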
According to [Nginx Ticket #1135](https://trac.nginx.org/nginx/ticket/1135), the nchan module has a known issue when nginx reloads:

> After upgrading from 1.10.1 without ALPN support to 1.10.2 with ALPN support... we've been getting into situations where Nginx completely stops serving connections without any warning.
>
> The nginx error log on the affected hosts gets these odd messages:
>
> ```text
> worker process exited on signal 6 (core dumped)
> shared memory zone "memstore" was locked by xxx
> ```

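One way to test the reload hypothesis outside of certbot is to compare the crash count in the error log before and after a single reload. The following is a minimal sketch under stated assumptions: the helper names `crash_count` and `reload_and_check` and the `SETTLE_SECONDS` variable are hypothetical, the reload command is passed in by the caller, and the log file is assumed to exist.

```bash
# Count signal-6 crash lines in an nginx error log ($1).
# grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
crash_count() {
  grep -c 'exited on signal 6' "$1" || true
}

# Run a reload command and report whether new crash lines appeared.
# $1: error log path; remaining arguments: the reload command to run.
# Returns 0 if no new crashes, 1 if new crashes, 2 if the command failed.
reload_and_check() {
  log=$1
  shift
  before=$(crash_count "$log")
  "$@" || return 2
  # Give workers a moment to start up (or crash) before re-counting.
  sleep "${SETTLE_SECONDS:-2}"
  after=$(crash_count "$log")
  if [ "$after" -gt "$before" ]; then
    echo "new crashes after reload: $((after - before))"
    return 1
  fi
  echo "no new crashes"
}

# Usage:
#   reload_and_check /var/log/nginx/error.log sudo nginx -s reload
```

Running this a few times in a row, with and without the nchan module loaded, would give a direct comparison instead of relying on the certbot run to surface the problem.
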
## Solution

Disable the nchan module:

```bash
# 1. Locate the nchan module configuration
ls -la /etc/nginx/modules-enabled/ | grep nchan
# lrwxrwxrwx 1 root root 49 Apr 20 18:50 50-mod-nchan.conf -> /usr/share/nginx/modules-available/mod-nchan.conf

# 2. Remove the symlink (this disables the module)
sudo rm /etc/nginx/modules-enabled/50-mod-nchan.conf

# 3. Test the configuration
sudo nginx -t
# nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
# nginx: configuration file /etc/nginx/nginx.conf test is successful

# 4. Restart nginx
sudo service nginx restart
```

Re-run the certbot renewal test:

```bash
sudo certbot renew --dry-run
```

All 13 certificates then renewed successfully.

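Removing the module is only safe if no site configuration still uses it: with the module unloaded, any remaining nchan directive would make `nginx -t` fail as an unknown directive. As a precaution, a sketch like the one below can list such lines before the symlink is removed. It assumes nchan directives start with the `nchan_` prefix; `find_nchan_directives` is a hypothetical helper name.

```bash
# List config lines that begin with an nchan directive.
# $1: nginx configuration directory (e.g. /etc/nginx)
# Prints "file:line:content" for each match, or nothing if none are found.
find_nchan_directives() {
  grep -rnE '^[[:space:]]*nchan_[a-z_]+' "$1" || true
}

# Usage:
#   find_nchan_directives /etc/nginx
```

Empty output suggests the module is loaded but unused, which matches the situation here where disabling it had no side effects on the proxied services.
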
## References

- [Nginx Ticket #1135 - Connections timing out after upgrading to 1.10.2](https://trac.nginx.org/nginx/ticket/1135)
- [nchan official documentation](https://nchan.io/)
- [Certbot documentation](https://eff-certbot.readthedocs.io/)