# v0.7.0 Hotfix #63 Rollback Plan

> **Scenario**: The production backend container `wecom_it_backend` has been rolled back to the `v0.7.0-backup-pre-qrfix` image (because `v0.7.0.1-hotfix1` failed). The hotfix is now being re-applied by hand: `auth_qrcode.py` and `qrcode_service.py` are uploaded to `/tmp/` through the jumpserver terminal as base64 chunks written with `echo`, copied into the container with `docker cp`, followed by `pip install qrcode[pil]` and a `restart`. This document is the **rollback plan for when that fails**.
>
> **Audience**: Ops beginners (the user). Every step is commented and has a fallback.
>
> **When it applies**: Only when `curl /api/auth_qrcode/create` misbehaves.
>
> **Rollback goal**: Get the backend back onto the `v0.7.0-backup-pre-qrfix` image within 1 minute, without interrupting business.

---

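## 0. Current-State Snapshot (Read Before Rolling Back)

Before rolling back, confirm which image is actually running, what state the 2 files are in, and what pip has installed. **3 commands, 30 seconds**:

```bash
# 1) Show the image ID the current container uses.
#    `.Image` is printed as "sha256:<hex>", so skip the 7-character prefix
#    and keep the 12-character short ID that `docker images` shows.
docker inspect wecom_it_backend --format '{{.Image}}' | cut -c8-19
# Expected: right now (after the rollback) this should be the v0.7.0-backup-pre-qrfix image ID
# If the hotfix is installed, it may be wecom-it-desk-backend:patched or latest

# 2) Show the modification time of the 2 files inside the container
#    (confirms whether the hotfix really took effect; %y prints a readable timestamp)
docker exec wecom_it_backend stat -c '%y %n' \
  /app/app/api/auth_qrcode.py \
  /app/app/services/qrcode_service.py
# Expected once the hotfix is installed: a recent timestamp (today / just now); otherwise an old one from around 6/15

# 3) Check whether qrcode is really installed
docker exec wecom_it_backend pip show qrcode 2>&1 | head -5
# Expected when installed: Name: qrcode    Version: 7.4.2
# Not installed:           WARNING: Package(s) not found: qrcode
```

Send screenshots of these 3 outputs to Claude so that later diagnosis can pinpoint the problem directly.
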
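
`docker inspect ... --format '{{.Image}}'` prints the full ID with a `sha256:` prefix, while `docker images` shows a 12-character short ID, so comparing the two by eye takes a small normalization step. The following is a minimal sketch of that step, assuming bash on the jump host; `short_image_id` is a hypothetical helper name used only for illustration and is not part of Docker.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn "sha256:<hex digest>" into the 12-character
# short ID. Input without the prefix is simply truncated to 12 characters.
short_image_id() {
  local full="${1#sha256:}"    # drop the "sha256:" prefix if present
  printf '%s\n' "${full:0:12}" # keep the first 12 characters
}

# Example use on the jump host (needs Docker, so it is left commented out):
# short_image_id "$(docker inspect wecom_it_backend --format '{{.Image}}')"
```
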
---

## 1. Failure Modes (7 Types + Rollback Commands)

| # | Failure mode | Symptom | Detection command | Rollback command |
|---|--------------|---------|-------------------|------------------|
| F1 | `qrcode` pip install fails | Container exits right after `restart` | `docker ps -a \| grep wecom_it_backend` shows `Restarting` or `Exited` | See §1.1 |
| F2 | Container fails to start (module import error) | Backend is stuck in a restart loop | `docker logs wecom_it_backend --tail 30` shows `ModuleNotFoundError` / `ImportError` / `SyntaxError` | See §1.2 |
| F3 | `curl` to `/api/auth_qrcode/create` returns 500 | Container is healthy but the endpoint is broken | `curl -k -X POST https://itsupport.servyou.com.cn/api/auth_qrcode/create` | See §1.3 |
| F4 | `curl` returns 200 but the `qrcode_png_base64` field is **missing** | Code was not fully overwritten (still the old files) | `curl ... \| python -m json.tool \| grep qrcode_png_base64` | See §1.4 |
| F5 | `curl` returns 502/504 | nginx cannot reach the backend container | `docker ps \| grep backend` | See §1.5 |
| F6 | Port conflict (8000 already taken) | Container keeps restarting | `docker logs wecom_it_backend --tail 50 \| grep -i "address already"` | See §1.6 |
| F7 | Image ID mix-up / tag drift | After `restart`, the running image is not the expected one | `docker images \| grep wecom-it-desk-backend` | See §1.7 |

### 1.1 F1: Rolling Back a Failed qrcode pip Install

**Cause**: `pip install qrcode[pil]` hit a network glitch / the slim image has no gcc / a version conflict.

**Rollback commands** (run in the jumpserver terminal as root):

```bash
# 1) Stop the container
docker stop wecom_it_backend

# 2) Remove the container (data volumes / networks are kept)
docker rm wecom_it_backend

# 3) Start a new container from the rollback image
#    (key point: the command line must match the current production container exactly)
# Record the current container's full run arguments before step 2,
# so that no environment variables / mounts get lost
# Note: with the bind mount below, /app comes from the host directory, so swapping
# the image does not revert files changed there; restore those per §1.2 if needed
docker run -d \
  --name wecom_it_backend \
  --restart=always \
  --network wecom_it_network \
  -e DATABASE_URL='...' \
  -e REDIS_URL='...' \
  -e WECOM_CORP_ID='...' \
  -v /opt/wecom-it-desk/backend:/app:rw \
  wecom-it-desk-backend:v0.7.0-backup-pre-qrfix

# 4) Verify
docker ps | grep wecom_it_backend
# Expected: STATUS = Up X seconds (healthy)
```

> **Shortcut** (if you recorded the full run command earlier):
>
> ```bash
> # Use the image created with docker commit directly
> docker run -d --name wecom_it_backend <full original arguments> \
>   wecom-it-desk-backend:v0.7.0-backup-pre-qrfix
> ```

### 1.2 F2: Rolling Back a Module Import Error

**Cause**: `auth_qrcode.py` or `qrcode_service.py` was corrupted during base64 decoding on upload / has a Python indentation error.

**Detection**:

```bash
docker logs wecom_it_backend --tail 30 2>&1 | grep -E "(ModuleNotFoundError|ImportError|SyntaxError|IndentationError)"
```

**Rollback commands** (simpler than F1: just overwrite the files and restart, no image swap needed):

```bash
# 1) Copy the original files out of the backup image
docker create --name tmp_rollback wecom-it-desk-backend:v0.7.0-backup-pre-qrfix
docker cp tmp_rollback:/app/app/api/auth_qrcode.py /tmp/auth_qrcode.py.bak
docker cp tmp_rollback:/app/app/services/qrcode_service.py /tmp/qrcode_service.py.bak
docker rm tmp_rollback

# 2) Overwrite with the originals
#    (note: if /app is a bind mount and the change does not stick,
#     write to the host path instead, see §1.4)
docker cp /tmp/auth_qrcode.py.bak wecom_it_backend:/app/app/api/auth_qrcode.py
docker cp /tmp/qrcode_service.py.bak wecom_it_backend:/app/app/services/qrcode_service.py

# 3) Restart
docker restart wecom_it_backend

# 4) Verify
sleep 5
docker ps | grep wecom_it_backend
curl -k -X POST https://itsupport.servyou.com.cn/api/auth_qrcode/create | python -m json.tool
```

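
Because this failure mode comes from files damaged while being pasted as base64 chunks, one way to catch it before `docker cp` is to compare a checksum taken on the source machine with the checksum of the decoded file. The sketch below assumes GNU coreutils (`base64`, `sha256sum`) on the jump host; `decode_and_verify` and its three arguments are hypothetical names chosen for illustration, not part of the original procedure. A decoded file that fails the check is deleted so it cannot be copied into the container by mistake.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode a base64 upload and keep the result only if
# its SHA-256 matches the checksum computed on the source machine.
#   $1 = file holding the pasted base64 text
#   $2 = output path for the decoded file
#   $3 = expected SHA-256 hex digest
decode_and_verify() {
  local b64_file="$1" out_file="$2" expected_sha="$3"
  if ! base64 -d "$b64_file" > "$out_file"; then
    echo "base64 decode failed for $b64_file" >&2
    rm -f "$out_file"
    return 1
  fi
  local actual_sha
  actual_sha="$(sha256sum "$out_file" | cut -d' ' -f1)"
  if [ "$actual_sha" != "$expected_sha" ]; then
    echo "checksum mismatch for $out_file" >&2
    rm -f "$out_file"
    return 1
  fi
}

# Example use (the digest placeholder must come from the source machine):
# decode_and_verify /tmp/auth_qrcode.b64 /tmp/auth_qrcode.py "<sha256 from source>"
```
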
### 1.3 F3: Rolling Back a 500 from the create Endpoint

**Cause**: `_render_qrcode_png` in `qrcode_service.py` raises an exception (`qrcode` not installed properly / PIL package missing).

**Detection**:

```bash
# Get the response
curl -k -X POST https://itsupport.servyou.com.cn/api/auth_qrcode/create -v 2>&1 | tail -20

# Check the backend logs for a traceback
docker logs wecom_it_backend --tail 50 2>&1 | grep -A 20 "Traceback"
```

**Rollback commands**: Same as §1.2 (overwrite files + restart). If it still returns 500, escalate to §1.1 (swap the image).

### 1.4 F4: Rolling Back a Missing qrcode_png_base64 Field

**Cause**: After `docker cp`, the files inside the container were **not really overwritten** (a typical bind mount / overlay fs pitfall).

**Detection**:

```bash
curl -k -X POST https://itsupport.servyou.com.cn/api/auth_qrcode/create | python -m json.tool
# Check whether the data field contains "qrcode_png_base64"
# Missing → the files were not really overwritten
```

**Rollback commands** (force overwrite):

```bash
# 1) Confirm where the bind-mounted files live on the host
docker inspect wecom_it_backend --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' | grep app
# Output: /opt/wecom-it-desk/backend -> /app

# 2) Edit the host path directly (the only way that reliably takes effect with a bind mount)
ls -la /opt/wecom-it-desk/backend/app/api/auth_qrcode.py /opt/wecom-it-desk/backend/app/services/qrcode_service.py

# 3) If the new files did not take effect, rm first and then cp
rm -f /opt/wecom-it-desk/backend/app/api/auth_qrcode.py
rm -f /opt/wecom-it-desk/backend/app/services/qrcode_service.py
cp /tmp/auth_qrcode.py /opt/wecom-it-desk/backend/app/api/
cp /tmp/qrcode_service.py /opt/wecom-it-desk/backend/app/services/

# 4) The container must be restarted: the running Python process keeps the modules
#    it already imported and does not pick up changed files (unless auto-reload is on)
docker restart wecom_it_backend

# 5) Verify
sleep 5
curl -k -X POST https://itsupport.servyou.com.cn/api/auth_qrcode/create | python -m json.tool | grep qrcode_png_base64
```

### 1.5 F5: Rolling Back a 502/504

**Cause**: nginx resolves to the old backend container, or the container network is down.

**Detection**:

```bash
# 1) Check whether the backend container exists
docker ps | grep wecom_it_backend

# 2) Test the backend directly from inside the nginx container
docker exec wecom_it_nginx wget -qO- --timeout=3 http://wecom_it_backend:8000/api/ready
# Expected: {"status":"ready",...}
# Connection refused → network is reachable but nothing is listening on 8000 (backend process is down)
# Timeout            → the network itself is unreachable
```

**Rollback commands** (rebuild the full chain):

```bash
# 1) Stop the backend
docker stop wecom_it_backend

# 2) Remove the container
docker rm wecom_it_backend

# 3) Start from the rollback image (full arguments)
docker run -d --name wecom_it_backend \
  --restart=always --network wecom_it_network \
  <full original arguments> \
  wecom-it-desk-backend:v0.7.0-backup-pre-qrfix

# 4) Reload nginx (so the upstream is refreshed)
docker exec wecom_it_nginx nginx -s reload

# 5) Verify
sleep 10
curl -k https://itsupport.servyou.com.cn/api/ready
```

### 1.6 F6: Rolling Back a Port Conflict

**Cause**: The old container was not fully removed / 8000 is held by another process.

**Detection**:

```bash
docker logs wecom_it_backend --tail 50 2>&1 | grep -i "address already in use"
# or
ss -tlnp | grep 8000
```

**Rollback commands**:

```bash
# 1) See who holds 8000
ss -tlnp | grep ':8000'

# 2) It is usually a zombie container; remove it
docker ps -a | grep ":8000"         # may not show it directly
docker rm -f wecom_it_backend       # force-remove the current container

# 3) Start again
docker run -d --name wecom_it_backend <full original arguments> wecom-it-desk-backend:v0.7.0-backup-pre-qrfix
```

### 1.7 F7: Rolling Back an Image ID Mix-Up

**Cause**: `docker run` was given no tag, so it defaulted to `latest`, which may not be the expected image.

**Detection**:

```bash
docker images --format '{{.Repository}}:{{.Tag}} {{.ID}} {{.CreatedSince}}' | grep wecom-it-desk-backend
# You should see 3 entries:
#   wecom-it-desk-backend:v0.7.0-backup-pre-qrfix  (used for rollback)
#   wecom-it-desk-backend:latest                   (may equal the one above, or may equal patched)
#   wecom-it-desk-backend:patched                  (trial hotfix build, if present)
```

**Rollback commands** (pin the image explicitly):

```bash
# Get the exact ID of the rollback image
ROLLBACK_IMAGE=$(docker images -q wecom-it-desk-backend:v0.7.0-backup-pre-qrfix)
echo "Rollback image ID: $ROLLBACK_IMAGE"

# Remove the old container
docker stop wecom_it_backend && docker rm wecom_it_backend

# Start from the **exact ID** (so an overwritten tag cannot interfere)
docker run -d --name wecom_it_backend <full original arguments> "$ROLLBACK_IMAGE"
```

---

## 2. Health Check Command Cheat Sheet

### 2.1 Container Layer

```bash
# Status (is it healthy?)
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}' | grep wecom_it_backend
# Expected: wecom_it_backend   Up X minutes (healthy)   wecom-it-desk-backend:v0.7.0-backup-pre-qrfix

# Detailed healthcheck log
docker inspect wecom_it_backend --format '{{json .State.Health}}' | python -m json.tool
```

### 2.2 Process Layer

```bash
# Is the Python process running?
docker exec wecom_it_backend ps aux | grep -E "uvicorn|gunicorn" | grep -v grep
# Expected: 1 line with the uvicorn process

# Port listening
docker exec wecom_it_backend ss -tlnp | grep 8000
# Expected: LISTEN 0 128 0.0.0.0:8000 ...
```

### 2.3 Endpoint Layer

```bash
# Readiness endpoint (served by /api/ready)
curl -k https://itsupport.servyou.com.cn/api/ready
# Expected: {"code":200,"data":{"status":"ready","checks":{...}}}

# Health endpoint
curl -k https://itsupport.servyou.com.cn/api/health
# Expected: {"status":"ok"}
```

### 2.4 Business Layer (create Endpoint)

```bash
# Standard create call
curl -k -X POST https://itsupport.servyou.com.cn/api/auth_qrcode/create \
  -H 'Content-Type: application/json' | python -m json.tool
```

Expected response (200, and `data` **contains** `qrcode_png_base64`):

```json
{
  "code": 200,
  "message": "ok",
  "data": {
    "ticket": "AbCdEf123456...",
    "qrcode_url": "https://open.weixin.qq.com/connect/oauth2/authorize?...",
    "qrcode_png_base64": "iVBORw0KGgoAAAANSUhEUgAA...(very long base64 string)...",
    "expires_in": 120,
    "expires_at": "2026-06-22T10:30:45.123456"
  }
}
```

**Key criteria**:
- HTTP 200 + `qrcode_png_base64` longer than 100 characters = hotfix works ✅
- HTTP 200 + field missing = §1.4, files not overwritten
- HTTP 500 = §1.3
- HTTP 502/504 = §1.5

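
The criteria above can be turned into a small lookup so that the status code and body of one `curl` call point straight at the right subsection. This is only a sketch under stated assumptions: `classify_create_response` is a hypothetical function name, it checks merely that the field name appears in the body (not the 100-character length rule), and any status code the list does not mention is reported as `unknown`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code and response body to the
# subsection of this runbook that applies ("ok" means the hotfix works).
#   $1 = HTTP status code, $2 = response body
classify_create_response() {
  local status="$1" body="$2"
  case "$status" in
    200)
      if printf '%s' "$body" | grep -q '"qrcode_png_base64"'; then
        echo "ok"
      else
        echo "1.4"
      fi
      ;;
    500)     echo "1.3" ;;
    502|504) echo "1.5" ;;
    *)       echo "unknown" ;;
  esac
}

# Example use on the jump host (left commented out because it needs the network):
# code=$(curl -k -s -o /tmp/create.json -w '%{http_code}' -X POST \
#   https://itsupport.servyou.com.cn/api/auth_qrcode/create)
# classify_create_response "$code" "$(cat /tmp/create.json)"
```
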
---

## 3. Verifying the Hotfix Really Took Effect (5 Steps)

```bash
# Step 1: Compare file md5 sums (confirm these are the hotfix versions)
docker exec wecom_it_backend md5sum /app/app/api/auth_qrcode.py /app/app/services/qrcode_service.py
# Compare with the md5 of the 2 files in /tmp/ on the host; they must match
md5sum /tmp/auth_qrcode.py /tmp/qrcode_service.py

# Step 2: Key code fragments are present
docker exec wecom_it_backend grep -n "_render_qrcode_png\|qrcode_png_base64" \
  /app/app/api/auth_qrcode.py /app/app/services/qrcode_service.py
# Expected: at least 3 matching lines (import / def / return)

# Step 3: qrcode is installed
# (importlib.metadata, Python 3.8+, reads the installed version without
#  relying on the module exposing a __version__ attribute)
docker exec wecom_it_backend python -c "from importlib.metadata import version; print(version('qrcode'))"
# Expected: 7.4.2

# Step 4: The create endpoint returns qrcode_png_base64
RESP=$(curl -k -s -X POST https://itsupport.servyou.com.cn/api/auth_qrcode/create)
echo "$RESP" | python -c "import json,sys; d=json.load(sys.stdin); print('has_field:', 'qrcode_png_base64' in d.get('data',{})); print('len:', len(d.get('data',{}).get('qrcode_png_base64','')))"
# Expected: has_field: True  len: 500~2000

# Step 5: Browser test (done manually by the user)
# Open https://itsupport.servyou.com.cn/itportal/
# You should see the QR code image (not a blank area)
```

**All 5 steps pass = the hotfix really works**. If any step fails, jump to the matching subsection of §1 and roll back.

---

## 4. Decision Tree: When to Roll Back vs. When to Fix

```
          ┌────────────────────────────────┐
          │ Hotfix installed, start        │
          │ verifying                      │
          │ (curl /api/auth_qrcode/create) │
          └───────────────┬────────────────┘
                          │
                          ▼
          ┌────────────────────────────────┐
          │ HTTP 200 + base64 field?       │
          └──────┬──────────────────┬──────┘
                 │                  │
                Yes                 No
                 │                  │
                 ▼                  ▼
       ┌───────────────────┐  ┌───────────────────┐
       │ ✅ Hotfix works   │  │ Check HTTP status │
       │ Run the 5 steps   │  └────┬─────────┬────┘
       │ in §3, then test  │       │         │
       │ in the browser    │      500     502/504
       └───────────────────┘       │         │
                                   ▼         ▼
                      ┌─────────────────┐ ┌───────────────┐
                      │ Read traceback  │ │ Container up? │
                      │ See §1.3        │ │ See §1.5      │
                      └────────┬────────┘ └───────┬───────┘
                               │                  │
                        ┌──────┴──────┐           │
                        ▼             ▼           │
                    Not fixed       Fixed         │
                  within 5 min        │           │
                        │             ▼           │
                        ▼      Keep verifying     │
                 §1.2 overwrite       │           │
                 files, then §3       └─────┬─────┘
                                            │
                                            ▼
                                ┌───────────────────────┐
                                │ Run all 5 steps of §3 │
                                └───┬───────────────┬───┘
                                    │               │
                             All pass (5/5)    Any failure
                                    │               │
                                    ▼               ▼
                              Browser test   §1.4 overwrite
                               /itportal/     (bind mount)
                                    │
                            ┌───────┴───────┐
                            ▼               ▼
                      QR code shown    Still blank
                            │               │
                            ▼               ▼
                       ✅ Success    Send screenshot
                                     to Claude, then
                                     go to §1.1
```

**Hard triggers for rolling back** (any one means roll back):

1. ❌ **Container health check fails 3 times in a row** (one check every 30 s, so not healthy for > 90 s)
2. ❌ **Other business endpoints go down** (sweep /api/ready, /api/health, and the other create endpoints)
3. ❌ **Fix attempts exceed 5 minutes without progress**
4. ❌ **Users report that the frontend page will not open / returns 500**

**When to keep fixing**:

- Container healthy + only the `create` endpoint returns 500 → try §1.2 (overwrite files); if not fixed within 5 minutes, go to §1.1
- Container healthy + `create` endpoint normal + no base64 field → §1.4 force overwrite (this is a file problem, not a code problem)
- Container not healthy + startup error → go straight to §1.1 and swap the image (do not waste time)

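
The first hard trigger (3 consecutive failed checks, 30 s apart) is easy to miscount by hand. The sketch below shows one way to automate the count, assuming bash; `retry_check` is a hypothetical helper name, and the choice of `/api/ready` as the probe in the usage comment is likewise an assumption rather than the container's actual healthcheck. The function returns success on the first passing attempt and failure only after every attempt has failed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a check command up to N times with a pause
# between attempts. Returns 0 on the first success, 1 after N failures.
#   $1 = number of attempts, $2 = seconds between attempts, rest = command
retry_check() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt follows.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example use on the jump host (3 attempts, 30 s apart):
# retry_check 3 30 curl -kfs -o /dev/null https://itsupport.servyou.com.cn/api/ready \
#   || echo "3 consecutive failures: roll back per §1.1"
```
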
---

## 5. Post-Rollback Cleanup (3 Steps)

Once the rollback has succeeded and business is back to normal, tidy up.

### 5.1 Restore the Image Tag

```bash
# 1) List the images currently present
docker images | grep wecom-it-desk-backend
# Expected:
# REPOSITORY              TAG                       IMAGE ID    CREATED
# wecom-it-desk-backend   v0.7.0-backup-pre-qrfix   abc123...   3 days ago
# wecom-it-desk-backend   patched                   def456...   10 minutes ago   (trial hotfix build)
# wecom-it-desk-backend   latest                    abc123...   3 days ago       (same ID as backup)

# 2) Point latest back at the rollback image
docker tag wecom-it-desk-backend:v0.7.0-backup-pre-qrfix wecom-it-desk-backend:latest
# Prevents the next pull of latest from fetching the wrong version

# 3) Give the trial hotfix image a separate tag (kept for later investigation)
docker tag wecom-it-desk-backend:patched wecom-it-desk-backend:hotfix-63-failed
# Keeps it from being overwritten by the next build
```

### 5.2 Remove Surplus Images (With Care)

```bash
# 1) Check disk usage first
docker system df

# 2) See which images are unused
docker images --filter "dangling=true"   # dangling images (<none>:<none>)
# Expected: any intermediate hotfix layers are listed here

# 3) Remove dangling images (safe)
docker image prune -f

# 4) Check whether any container still references the patched image
docker ps -a --filter "ancestor=wecom-it-desk-backend:patched" --format '{{.ID}} {{.Names}} {{.Status}}'
# Expected: 0 lines (after the rollback no container should be using patched)

# 5) Remove the patched image
docker rmi wecom-it-desk-backend:patched

# 6) Remove the failed-hotfix copy (optional; keeping it for 7 days first is recommended)
# docker rmi wecom-it-desk-backend:hotfix-63-failed

# 7) Look again
docker images | grep wecom-it-desk-backend
# Expected: only v0.7.0-backup-pre-qrfix + latest (same ID) remain,
# plus hotfix-63-failed if you kept it in step 6
```

### 5.3 Remove Temporary Files on the Host

```bash
# Delete the 2 files in /tmp/ that were uploaded via base64
rm -f /tmp/auth_qrcode.py /tmp/qrcode_service.py
rm -f /tmp/auth_qrcode.py.bak /tmp/qrcode_service.py.bak   # created during the rollback
ls -la /tmp/ | grep -E "(qrcode|auth_qrcode)"
# Expected: no output
```

---

## 6. One-Click Rollback Script (§1.1 Packaged)

If doing it by hand is too tedious, wrap the rollback flow in a script (run it directly on the jumpserver):

**File**: `/opt/wecom-it-desk/rollback-hotfix63.sh`

```bash
#!/bin/bash
# v0.7.0 hotfix #63 one-click rollback
# Usage: bash /opt/wecom-it-desk/rollback-hotfix63.sh

set -e  # exit immediately if any command fails

echo "===== hotfix #63 one-click rollback ====="

# 1) Stop + remove the current container
docker stop wecom_it_backend
docker rm wecom_it_backend

# 2) Start from the backup image
#    --env-file expects plain KEY=VALUE lines (see the deployment steps below);
#    add any -v mounts that production uses, as in §1.1
docker run -d \
  --name wecom_it_backend \
  --restart=always \
  --network wecom_it_network \
  --env-file /opt/wecom-it-desk/backend-run.env \
  wecom-it-desk-backend:v0.7.0-backup-pre-qrfix

# 3) Wait 5 seconds for the container to start
sleep 5

# 4) Health check
echo "===== Verify ====="
docker ps | grep wecom_it_backend
curl -kf https://itsupport.servyou.com.cn/api/ready && echo "READY OK" || echo "READY FAIL"

echo "===== Rollback complete ====="
```

**Deployment** (in the jumpserver terminal):

```bash
# 1) Create the file
cat > /opt/wecom-it-desk/rollback-hotfix63.sh << 'EOF'
# (the content above)
EOF

# 2) Make it executable
chmod +x /opt/wecom-it-desk/rollback-hotfix63.sh

# 3) Save the current backend container's environment as KEY=VALUE lines
#    (read by --env-file in the script; do this while the container still exists,
#     and do not hide errors here, or the script would start with an empty file)
docker inspect wecom_it_backend --format '{{range .Config.Env}}{{.}}{{"\n"}}{{end}}' \
  > /opt/wecom-it-desk/backend-run.env
chmod 600 /opt/wecom-it-desk/backend-run.env   # the file contains secrets

# 4) Run the rollback
bash /opt/wecom-it-desk/rollback-hotfix63.sh
```

> **Note**: Passing variables with `--env-file` / `-e` in `docker run` is more reliable than `export` lines in a script. **For production, store the full `docker run` command in `/opt/wecom-it-desk/backend-run.sh` and have the rollback script simply call `bash backend-run.sh`**. This is left as a follow-up optimization.

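
If the saved environment is handed to `docker run` through `--env-file`, as the note above suggests, each line should be a plain `KEY=VALUE` pair; a line that starts with `export` would not be read as intended. The sketch below is one way to check the file before running the rollback, assuming bash and `grep`; `env_file_ok` is a hypothetical helper name. It is deliberately stricter than Docker itself, which also accepts `#` comment lines and bare variable names, and it treats an empty file as a failure.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file is non-empty and every
# non-blank line looks like KEY=VALUE (no "export" prefix, no comments).
#   $1 = path to the env file
env_file_ok() {
  local file="$1"
  [ -s "$file" ] || return 1
  # Print lines that do NOT start with KEY=, then look for any non-blank
  # one among them; succeed only when none is found.
  ! grep -v -E '^[A-Za-z_][A-Za-z0-9_]*=' "$file" | grep -q '[^[:space:]]'
}

# Example use before running the rollback script:
# env_file_ok /opt/wecom-it-desk/backend-run.env || echo "fix backend-run.env first"
```
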
---

## 7. Post-Rollback Notification Checklist

Rollback done = business restored, but **3 things remain**:

1. **Update `CURRENT-FOCUS.md`**: under "最近搞定" (Recently Done), add the line `❌ v0.7.0 hotfix #63 failed and was rolled back to v0.7.0-backup-pre-qrfix; the QR code on the frontend /itportal/ still does not display; waiting for the next fix round`
2. **Record it in memory**: add `hotfix-63-rollback-2026-06-22.md` under `memory/`, stating clearly which step failed, which rollback command was used, and the conclusions of the review with Claude
3. **Paste the logs to Claude**: paste the output of `docker logs wecom_it_backend --tail 200` back, analyze the root cause, and prepare the next hotfix plan (v0.7.0.2-hotfix2)

---

## 8. Quick Reference (Stick It Next to the Screen)

| What I see | Run this |
|------------|----------|
| Container restarting | `docker logs wecom_it_backend --tail 30` to read the startup error → §1.1 |
| Container healthy but create returns 500 | §1.3 to get the traceback → §1.2 overwrite files |
| Container healthy + create 200 + no base64 | §1.4 force overwrite via the bind mount |
| 502/504 | §1.5 check network + container |
| 8000 in use | §1.6 |
| No idea what is going on | §1.1 one-click image swap (safest) |
| Not sure which image to roll back to | `docker images \| grep backup` |
| Do not know the full run command | `docker inspect wecom_it_backend --format '{{json .Config.Env}} {{json .HostConfig.Binds}} {{.HostConfig.NetworkMode}}'` |
| Want a one-click rollback | `bash /opt/wecom-it-desk/rollback-hotfix63.sh` |
| Verify the hotfix works | §3: all five steps pass = ✅ |
| Post-rollback cleanup | §5, three steps |

---

**End of document**. All commands are run as root in the jumpserver terminal, and every `docker exec` assumes the container is named `wecom_it_backend` (the actual production name, see `memory/container-names-wecom-it-backend.md`). If the container name has changed, first run `docker ps --format '{{.Names}}' | grep backend` to confirm.