
Health Check + Error Codes + Structured Logging: Audit and Improvement Plan

Audit date: 2026-06-15
Auditor: Claude (full-load batch run)
Related: Risk Tracking Sheet / Backend Architecture / Dockerfile Optimization and Image Audit


📌 1. Current State of the Health Check

1.1 Current implementation

Endpoint: backend/app/main.py:506

@app.get("/health", tags=["系统"])  # the tag "系统" means "System"
async def health_check():
    """Health check endpoint."""
    return {"status": "ok", "service": "wecom-it-smart-desk"}

Docker compose healthcheck:

healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health').read()"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

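1.2 Issues

| # | Issue | Severity | Impact |
|---|---|---|---|
| H-1 | /health does not verify the DB connection | 🟠 | The healthcheck still reports OK when the DB is down |
| H-2 | /health does not verify the Redis connection | 🟠 | The healthcheck still reports OK when Redis is down |
| H-3 | /health does not report version/build | 🟡 | Troubleshooting is harder |
| H-4 | /health always returns 200, with no degraded state | 🟡 | Hard to distinguish "online but degraded" |
| H-5 | No /ready vs. /live separation | 🟡 | Not Kubernetes-friendly |
| H-6 | Docker healthcheck switched to urllib (P1-3) | 🟢 | Already fixed |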

1.3 Improved version (full healthcheck)

# backend/app/api/health.py (new file)

import time
from datetime import datetime

import psutil
from fastapi import APIRouter
from fastapi.responses import JSONResponse
from sqlalchemy import text

from app.database import async_session_maker
from app.utils.token_manager import get_token_manager

router = APIRouter(tags=["系统"])  # "系统" = "System", the same tag as the existing endpoint
START_TIME = time.time()


@router.get("/health")
async def health_check():
    """Liveness probe: is the process alive?

    Used by: K8s livenessProbe / Docker healthcheck
    Returns: always 200 as long as the process has not crashed
    """
    return {
        "status": "ok",
        "service": "wecom-it-smart-desk",
        "uptime_seconds": int(time.time() - START_TIME),
    }


@router.get("/ready")
async def readiness_check():
    """Readiness probe: is the process ready to receive traffic?

    Used by: K8s readinessProbe / load balancers
    Verifies: actual connectivity to the DB and Redis
    """
    checks = {
        "database": False,
        "redis": False,
        "wecom_token": False,
    }

    # 1. Database check
    try:
        async with async_session_maker() as session:
            result = await session.execute(text("SELECT 1"))
            result.scalar()
        checks["database"] = True
    except Exception as e:
        checks["database_error"] = str(e)[:200]

    # 2. Redis check
    try:
        tm = get_token_manager()
        client = await tm.get_redis()
        await client.ping()
        checks["redis"] = True
    except Exception as e:
        checks["redis_error"] = str(e)[:200]

    # 3. WeCom token check (optional).
    # Note: as written, a failure here still counts toward all_ok and yields a 503.
    try:
        tm = get_token_manager()
        token = await tm.get_access_token()
        checks["wecom_token"] = bool(token)
    except Exception as e:
        checks["wecom_error"] = str(e)[:200]

    # Only the boolean flags count; the *_error entries are diagnostic text.
    all_ok = all(v for k, v in checks.items() if not k.endswith("_error"))
    status_code = 200 if all_ok else 503

    return JSONResponse(
        status_code=status_code,
        content={
            "status": "ready" if all_ok else "degraded",
            "service": "wecom-it-smart-desk",
            "uptime_seconds": int(time.time() - START_TIME),
            "checks": checks,
            "timestamp": datetime.now().isoformat(),
        }
    )


@router.get("/metrics")
async def metrics():
    """Lightweight metrics endpoint (JSON).

    Outputs: key process and system indicators
    Note: this returns JSON, not the Prometheus text exposition format, so
    Prometheus cannot scrape it directly without an exporter or a client library.
    """
    process = psutil.Process()

    return {
        "process": {
            # cpu_percent() without an interval returns 0.0 on its first call
            "cpu_percent": process.cpu_percent(),
            "memory_mb": process.memory_info().rss / 1024 / 1024,
            "threads": process.num_threads(),
            "uptime_seconds": int(time.time() - START_TIME),
        },
        "system": {
            "cpu_percent": psutil.cpu_percent(),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage('/').percent,
        },
    }


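@router.get("/version")
async def version():
    """Version information endpoint.

    Purpose: troubleshooting / deployment verification
    """
    import os
    import platform

    return {
        "service": "wecom-it-smart-desk",
        "version": os.getenv("APP_VERSION", "dev"),
        "git_sha": os.getenv("GIT_SHA", "unknown")[:8],
        "build_time": os.getenv("BUILD_TIME", "unknown"),
        # Reported at runtime instead of the hard-coded "3.12"
        "python": platform.python_version(),
    }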

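1.4 Docker Compose update

healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health').read()"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

# Advanced (optional, for the later K8s migration).
# Note: Docker Compose has no `readiness` key and supports a single healthcheck
# per service; the block below is illustrative and corresponds to a K8s readinessProbe.
readiness:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/ready').read()"]
  interval: 10s
  timeout: 5s
  retries: 3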
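
The readiness probe above is given a 5 s timeout, while each dependency check in 1.3 can block for as long as its driver allows. One way to keep the probe inside its budget is to bound every check with asyncio.wait_for and run the checks concurrently. The sketch below is an illustration only: run_check, run_all, and the three stand-in checks are hypothetical names, not part of the audited code.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Check = Callable[[], Awaitable[None]]


async def run_check(check: Check, timeout_s: float) -> Dict[str, object]:
    """Run one dependency check, turning hangs and errors into a result dict."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return {"ok": True}
    except asyncio.TimeoutError:
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    except Exception as exc:  # any other failure marks the dependency as down
        return {"ok": False, "error": str(exc)[:200]}


async def run_all(checks: Dict[str, Check], timeout_s: float = 1.0) -> Dict[str, Dict[str, object]]:
    """Run all checks concurrently so total latency is bounded by timeout_s."""
    names = list(checks)
    results = await asyncio.gather(*(run_check(checks[n], timeout_s) for n in names))
    return dict(zip(names, results))


# Stand-in checks for demonstration only (hypothetical).
async def fast_ok() -> None:
    await asyncio.sleep(0)


async def hangs() -> None:
    await asyncio.sleep(10)


async def fails() -> None:
    raise ConnectionError("connection refused")


demo = asyncio.run(
    run_all({"database": fast_ok, "redis": hangs, "wecom_token": fails}, timeout_s=0.05)
)
print(demo)
```

In a real endpoint the stand-ins would be replaced by the DB, Redis, and WeCom token checks, and the per-check timeout would be chosen to stay below the probe's own timeout.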

📌 2. Error Code System: Current State and Improvements

2.1 Current state (backend/app/utils/response.py)

Existing error codes (18):

  • 1000+ general (5): ERR_PARAMS / UNAUTHORIZED / NOT_FOUND / FORBIDDEN / INTERNAL
  • 2000+ WeCom (6): WECOM_TOKEN / SEND / DECRYPT / ENCRYPT / VERIFY / USER_INFO
  • 3000+ business (7): AGENT_OFFLINE / CONVERSATION_RESOLVED / CONVERSATION_NOT_FOUND / AGENT_NOT_FOUND / AGENT_BUSY / DUPLICATE_ASSIGN / GRAB_*

Response format:

{"code": 0, "data": {}, "message": "success"}

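2.2 Issues

| # | Issue | Severity | Resolution |
|---|---|---|---|
| E-1 | Error codes have no standard enum class (constants only) | 🟡 | ErrorCode Enum |
| E-2 | Errors return HTTP 200 with a non-zero code (contrary to REST convention) | 🟡 | Evaluate: 4xx/5xx are also an option; agree on it with the frontend |
| E-3 | No error tracking ID (correlation_id) | 🟠 | Add a trace_id field |
| E-4 | Error responses carry no documentation_url | 🟢 | Add one, linking to the docs |
| E-5 | No i18n (Chinese messages are hard-coded) | 🟡 | Internationalize error messages |
| E-6 | Frontend error handling is scattered (no unified interceptor) | 🟠 | Add an axios interceptor + an error code mapping table |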
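
For E-5, one possible shape for internationalized messages is a per-locale catalog keyed by error code, with a fallback to the default locale and then to a generic message. This is a minimal sketch under assumptions: the MESSAGES catalog, the locale tags, the localize helper, and the numeric codes used here are all hypothetical and chosen for illustration.

```python
from typing import Dict

# Hypothetical catalog: locale -> {error code -> message}.
MESSAGES: Dict[str, Dict[int, str]] = {
    "zh-CN": {1001: "参数错误", 1003: "资源不存在"},
    "en-US": {1001: "Invalid parameters", 1003: "Resource not found"},
}
DEFAULT_LOCALE = "zh-CN"  # assumption: Chinese stays the default


def localize(code: int, locale: str, fallback: str = "Operation failed") -> str:
    """Look up a message in the requested locale, then the default locale."""
    for candidate in (locale, DEFAULT_LOCALE):
        text = MESSAGES.get(candidate, {}).get(code)
        if text is not None:
            return text
    return fallback


print(localize(1001, "en-US"))
print(localize(1003, "fr-FR"))  # unknown locale falls back to DEFAULT_LOCALE
print(localize(9999, "en-US"))  # unknown code falls back to the generic text
```

Keeping the catalog keyed by numeric code means the backend can keep returning only the code, and the message lookup can live on whichever side owns the user's locale.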

2.3 Improved error code system

Create backend/app/utils/error_codes.py:

# =============================================================================
# Error code system: standard enum
# =============================================================================
# Conventions:
# - 0       = success
# - 1xxx    = general errors
# - 2xxx    = authentication / session
# - 3xxx    = WeCom API
# - 4xxx    = business: conversations
# - 5xxx    = business: agents
# - 6xxx    = business: configuration
# - 7xxx    = external system integrations
# - 9xxx    = fallback
# =============================================================================

from enum import Enum


class ErrorCode(int, Enum):
    """Unified error code enum."""

    # 0: success
    SUCCESS = 0

    # 1xxx: general errors
    PARAMS_INVALID = 1001       # invalid parameters
    UNAUTHORIZED = 1002         # not authenticated
    NOT_FOUND = 1003            # resource not found
    FORBIDDEN = 1004            # permission denied
    INTERNAL = 1005             # server error
    RATE_LIMITED = 1006         # rate limited
    SERVICE_UNAVAILABLE = 1007  # service unavailable
    TIMEOUT = 1008              # timeout

    # 2xxx: authentication
    AUTH_TOKEN_MISSING = 2001   # token missing
    AUTH_TOKEN_EXPIRED = 2002   # token expired
    AUTH_TOKEN_INVALID = 2003   # token invalid
    AUTH_OTP_REQUIRED = 2004    # OTP required
    AUTH_OTP_INVALID = 2005     # OTP incorrect
    AUTH_PASSWORD_WRONG = 2006  # wrong password
    AUTH_AGENT_DISABLED = 2007  # agent account disabled

    # 3xxx: WeCom API
    WECOM_TOKEN_FAIL = 3001     # failed to obtain the WeCom token
    WECOM_SEND_FAIL = 3002      # failed to send a WeCom message
    WECOM_DECRYPT_FAIL = 3003   # failed to decrypt a WeCom message
    WECOM_ENCRYPT_FAIL = 3004   # failed to encrypt a WeCom message
    WECOM_VERIFY_FAIL = 3005    # WeCom callback signature verification failed
    WECOM_USER_INFO_FAIL = 3006 # failed to fetch WeCom user info
    WECOM_API_ERROR = 3099      # generic WeCom API error

    # 4xxx: business, conversations
    CONV_NOT_FOUND = 4001       # conversation not found
    CONV_RESOLVED = 4002        # conversation already resolved
    CONV_NO_AGENT = 4003        # no agent available
    CONV_DUPLICATE_ASSIGN = 4004 # duplicate assignment
    CONV_GRAB_DENIED = 4005     # claiming the conversation failed

    # 5xxx: business, agents
    AGENT_NOT_FOUND = 5001      # agent not found
    AGENT_OFFLINE = 5002        # agent offline
    AGENT_BUSY = 5003           # agent at full capacity
    AGENT_GRAB_SELF = 5004      # cannot take over one's own conversation
    AGENT_GRAB_NOT_SERVING = 5005  # only conversations being served can be taken over

    # 6xxx: business, configuration
    CONFIG_NOT_FOUND = 6001     # configuration not found
    CONFIG_INVALID = 6002       # configuration value invalid

    # 7xxx: external integrations
    HUORONG_API_FAIL = 7001     # Huorong API
    LIANRUAN_API_FAIL = 7002    # Lianruan API
    ATRUST_API_FAIL = 7003      # aTrust API
    EHR_API_FAIL = 7004         # eHR API
    DIFY_API_FAIL = 7005        # Dify API

    # 9xxx: fallback
    UNKNOWN = 9999


# Error code -> HTTP status code (optional; codes not listed default to 200)
HTTP_STATUS_MAP = {
    ErrorCode.SUCCESS: 200,
    ErrorCode.PARAMS_INVALID: 422,
    ErrorCode.UNAUTHORIZED: 401,
    ErrorCode.NOT_FOUND: 404,
    ErrorCode.FORBIDDEN: 403,
    ErrorCode.INTERNAL: 500,
    ErrorCode.RATE_LIMITED: 429,
    ErrorCode.SERVICE_UNAVAILABLE: 503,
    ErrorCode.TIMEOUT: 504,

    ErrorCode.AUTH_TOKEN_MISSING: 401,
    ErrorCode.AUTH_TOKEN_EXPIRED: 401,
    ErrorCode.AUTH_TOKEN_INVALID: 401,
    ErrorCode.AUTH_OTP_REQUIRED: 401,
    ErrorCode.AUTH_OTP_INVALID: 401,
    ErrorCode.AUTH_PASSWORD_WRONG: 401,
    ErrorCode.AUTH_AGENT_DISABLED: 403,

    # Business errors default to 200 and are distinguished by `code`.
    # This can be tuned, e.g. 404 for 4xxx resource errors and 409 for 5xxx state errors.
}
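
The closing comment in the map suggests that whole ranges could share a status (404 for 4xxx, 409 for 5xxx). A minimal sketch of that lookup order, explicit entry first, then range default, then 200, might look as follows. DemoCode is a trimmed stand-in for ErrorCode, and EXPLICIT, RANGE_DEFAULTS, and http_status_for are hypothetical names introduced only for this illustration.

```python
from enum import IntEnum
from typing import Dict


class DemoCode(IntEnum):
    """Trimmed stand-in for ErrorCode, just enough to exercise the lookup."""
    SUCCESS = 0
    NOT_FOUND = 1003
    CONV_NOT_FOUND = 4001
    AGENT_BUSY = 5003
    DIFY_API_FAIL = 7005


# Explicit per-code entries win over range defaults.
EXPLICIT: Dict[DemoCode, int] = {DemoCode.SUCCESS: 200, DemoCode.NOT_FOUND: 404}

# Thousands digit -> status; the 4 -> 404 and 5 -> 409 pairs follow the map's comment.
RANGE_DEFAULTS: Dict[int, int] = {4: 404, 5: 409}


def http_status_for(code: DemoCode, default: int = 200) -> int:
    """Resolve an HTTP status: explicit entry, then range default, then `default`."""
    if code in EXPLICIT:
        return EXPLICIT[code]
    return RANGE_DEFAULTS.get(int(code) // 1000, default)


for c in DemoCode:
    print(c.name, http_status_for(c))
```

Whether business errors should leave HTTP 200 at all is the open question raised in E-2, so the range table is best treated as a switch that can stay empty until the frontend agrees.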

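Update response.py:

import logging
import uuid
from datetime import datetime
from typing import Any, Dict, Optional

from fastapi import Request
from fastapi.responses import JSONResponse

from app.utils.error_codes import ErrorCode, HTTP_STATUS_MAP

logger = logging.getLogger(__name__)


def error_response(
    code: ErrorCode,
    message: str,
    data: Any = None,
    trace_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an error response (now including trace_id)."""
    return {
        "code": int(code),
        "message": message,
        # Only replace a missing payload; falsy values such as [] or 0 are kept
        "data": data if data is not None else {},
        "trace_id": trace_id,
        "timestamp": datetime.now().isoformat(),
    }


class AppException(Exception):
    def __init__(
        self,
        code: ErrorCode,
        message: str,
        data: Any = None,
        http_status: Optional[int] = None,
    ):
        self.code = code
        self.message = message
        self.data = data
        # Codes without an explicit mapping fall back to HTTP 200
        self.http_status = http_status or HTTP_STATUS_MAP.get(code, 200)
        super().__init__(message)


async def app_exception_handler(request: Request, exc: AppException) -> JSONResponse:
    # Reuse the caller's X-Request-ID or generate a trace_id.
    # Note: if the request-ID middleware from section 3.2 is installed, reading
    # request_id_var instead would keep the log line and the response on one ID.
    trace_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())

    # Write the failure to the log
    logger.warning(
        f"[{trace_id}] {request.method} {request.url.path} "
        f"-> {exc.code.value} {exc.message}"
    )

    return JSONResponse(
        status_code=exc.http_status,
        content=error_response(exc.code, exc.message, exc.data, trace_id),
        headers={"X-Trace-ID": trace_id},
    )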

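2.4 Frontend error code mapping (axios interceptor)

Create frontend-admin/src/api/error-handler.ts (each frontend gets a similar file):

import { ElMessage } from 'element-plus'

// Error code -> user-facing message
const ERROR_MESSAGES: Record<number, string> = {
  1001: 'Invalid parameters, please check your input',
  1002: 'Session expired, please sign in again',
  1003: 'Resource not found',
  1004: 'Access denied',
  1005: 'Server error, please try again later',
  1006: 'Too many requests, please try again shortly',
  2001: 'Please sign in first',
  2002: 'Session expired',
  2003: 'Authentication failed',
  2004: 'Please enter the one-time code',
  2005: 'Incorrect one-time code',
  2006: 'Incorrect password',
  4001: 'Conversation not found',
  4002: 'Conversation has ended',
  5001: 'Agent not found',
  5002: 'Agent is offline',
  5003: 'Agent is at full capacity',
  9999: 'Unknown error',
}

export function handleError(code: number, message: string, traceId?: string) {
  const userMsg = ERROR_MESSAGES[code] || message || 'Operation failed'

  // Special handling for authentication failures
  if ([1002, 2001, 2002, 2003].includes(code)) {
    // Clear the token and redirect to the login page
    localStorage.removeItem('token')
    window.location.href = '/login'
  }

  ElMessage.error(userMsg)

  // Show the trace_id in development builds
  if (import.meta.env.DEV && traceId) {
    console.error(`[TraceID: ${traceId}] Code: ${code}, Message: ${message}`)
  }
}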

📌 3. Structured Logging

3.1 Current state

# backend/app/main.py
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s",
)

Problems:

  • 🟡 Plain-text format, hard to query or aggregate
  • 🟡 No trace_id
  • 🟡 No request/response records
  • 🟡 No structured fields (user / conversation / operation)

3.2 Improved version

Create backend/app/utils/logging_config.py:

import json
import logging
import sys
from contextvars import ContextVar
from typing import Optional

# Per-request context
request_id_var: ContextVar[Optional[str]] = ContextVar('request_id', default=None)
user_id_var: ContextVar[Optional[str]] = ContextVar('user_id', default=None)


class JSONFormatter(logging.Formatter):
    """JSON formatter, suitable for parsing by ELK / Loki / CloudWatch."""

    def format(self, record: logging.LogRecord) -> str:
        log_data = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }

        # Request context
        if request_id_var.get():
            log_data["request_id"] = request_id_var.get()
        if user_id_var.get():
            log_data["user_id"] = user_id_var.get()

        # Extra structured fields passed via extra={"extra_data": {...}}
        if hasattr(record, "extra_data"):
            log_data.update(record.extra_data)

        # Exception traceback
        if record.exc_info:
            log_data["exception"] = self.formatException(record.exc_info)

        # default=str keeps logging from failing on non-JSON-serializable values
        return json.dumps(log_data, ensure_ascii=False, default=str)


def setup_logging(level: str = "INFO", json_format: bool = True):
    """Configure logging."""
    root = logging.getLogger()
    root.setLevel(level)

    # Remove any handlers that are already installed
    for handler in root.handlers[:]:
        root.removeHandler(handler)
    
    handler = logging.StreamHandler(sys.stdout)
    if json_format:
        handler.setFormatter(JSONFormatter())
    else:
        handler.setFormatter(logging.Formatter(
            "[%(asctime)s] [%(levelname)s] [%(name)s] %(message)s"
        ))
    root.addHandler(handler)


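# Business logging helpers
logger = logging.getLogger(__name__)


def log_business(
    event: str,
    *,
    user_id: Optional[str] = None,
    conversation_id: Optional[str] = None,
    agent_id: Optional[str] = None,
    **kwargs
):
    """Record a structured business event."""
    fields = {
        "event": event,
        "user_id": user_id,
        "conversation_id": conversation_id,
        "agent_id": agent_id,
        **kwargs,
    }
    # Drop unset fields so a None user_id does not overwrite the context value
    extra_data = {k: v for k, v in fields.items() if v is not None}
    logger.info(f"business_event: {event}", extra={"extra_data": extra_data})


def log_security(
    event: str,
    *,
    user_id: Optional[str] = None,
    ip: Optional[str] = None,
    **kwargs
):
    """Record a security event (logged at WARNING so it is easy to audit)."""
    fields = {
        "event": event,
        "category": "security",
        "user_id": user_id,
        "ip": ip,
        **kwargs,
    }
    extra_data = {k: v for k, v in fields.items() if v is not None}
    logger.warning(f"security_event: {event}", extra={"extra_data": extra_data})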
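
To make the effect of the extra_data convention concrete, the sketch below logs one business event through a cut-down formatter and parses the resulting line back. It is an illustration under assumptions: MiniJSONFormatter is a reduced stand-in for JSONFormatter, and the logger name and the IDs "c-1" and "a-7" are hypothetical.

```python
import io
import json
import logging


class MiniJSONFormatter(logging.Formatter):
    """Cut-down formatter: same idea as JSONFormatter, with fewer fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed as extra={"extra_data": {...}}
        payload.update(getattr(record, "extra_data", {}))
        return json.dumps(payload, ensure_ascii=False, default=str)


# Capture output in memory instead of stdout so it can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(MiniJSONFormatter())

demo_logger = logging.getLogger("demo.business")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo isolated from the root logger
demo_logger.addHandler(handler)

demo_logger.info(
    "business_event: conversation_assigned",
    extra={"extra_data": {
        "event": "conversation_assigned",
        "conversation_id": "c-1",
        "agent_id": "a-7",
    }},
)

parsed = json.loads(buffer.getvalue())
print(parsed)
```

Each log line becomes a flat JSON object, so fields such as event or agent_id can be filtered on directly instead of being extracted from free text.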

Middleware: inject request_id

# backend/app/main.py
import logging
import time
import uuid

from fastapi import Request

from app.utils.logging_config import setup_logging, request_id_var, user_id_var

logger = logging.getLogger(__name__)


@app.middleware("http")
async def request_id_middleware(request: Request, call_next):
    # Reuse the caller's trace ID or create a new one
    trace_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id_var.set(trace_id)

    # Log the start of the request
    start = time.time()
    logger.info(
        f"request_start: {request.method} {request.url.path}",
        extra={"extra_data": {
            "method": request.method,
            "path": request.url.path,
            "client": request.client.host if request.client else "?",
        }}
    )

    response = await call_next(request)

    # Log the end of the request
    duration = time.time() - start
    logger.info(
        f"request_end: {response.status_code} in {duration:.3f}s",
        extra={"extra_data": {
            "method": request.method,
            "path": request.url.path,
            "status": response.status_code,
            "duration_ms": int(duration * 1000),
        }}
    )

    # Echo the ID back so clients can quote it in bug reports
    response.headers["X-Request-ID"] = trace_id
    return response
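
The middleware relies on ContextVar values being private to each request even when requests interleave on one event loop. The following standalone sketch shows that property with plain asyncio tasks. The handle and main functions and the IDs "req-a" and "req-b" are hypothetical stand-ins for real request handling.

```python
import asyncio
from contextvars import ContextVar
from typing import List, Optional

request_id_var: ContextVar[Optional[str]] = ContextVar("request_id", default=None)


async def handle(request_id: str, delay: float, seen: List[str]) -> None:
    """Simulate one request: set the ID, yield to other tasks, then read it back."""
    token = request_id_var.set(request_id)
    try:
        await asyncio.sleep(delay)  # other "requests" run during this pause
        seen.append(f"{request_id}:{request_id_var.get()}")
    finally:
        request_id_var.reset(token)  # restore the previous value


async def main() -> List[str]:
    seen: List[str] = []
    # gather() wraps each coroutine in a task, and each task gets its own context copy
    await asyncio.gather(handle("req-a", 0.02, seen), handle("req-b", 0.01, seen))
    return seen


seen = asyncio.run(main())
print(sorted(seen))
```

Because each task works on its own copy of the context, neither simulated request ever observes the other's ID, which is what lets the formatter read request_id_var without any locking.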

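3.3 Log aggregation options

| Option | Suited for | Integration cost |
|---|---|---|
| stdout + Docker logs | Small scale / troubleshooting | 🟢 0 |
| Loki + Promtail | Medium scale / log queries | 🟡 |
| ELK (Elasticsearch + Logstash + Kibana) | Large scale / full-text search | 🟠 |
| CloudWatch / Alibaba Cloud SLS | Public cloud | 🟡 Depends on the provider |

Short term: log to stdout, with Docker collecting logs into /var/log/wecom-it-desk/*.log; query with scripts and grep.
Mid term: Loki + Grafana (deployed on the local NAS).
Long term: ELK / cloud-native logging.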
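
For the short-term "scripts and grep" stage, JSON lines can be filtered by field rather than by substring. The sketch below is one possible helper, written under assumptions: filter_logs and the sample lines are hypothetical, and real input would come from the collected log files.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def filter_logs(lines: Iterable[str], **criteria: Any) -> Iterator[Dict[str, Any]]:
    """Yield parsed log entries whose fields equal every given criterion."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (e.g. legacy text logs)
        if not isinstance(entry, dict):
            continue
        if all(entry.get(key) == value for key, value in criteria.items()):
            yield entry


# Hypothetical sample, mixing JSON lines with one legacy plain-text line.
sample = [
    '{"level": "INFO", "request_id": "r-1", "message": "request_start: GET /ready"}',
    'plain text line emitted before JSON logging was enabled',
    '{"level": "WARNING", "request_id": "r-2", "message": "security_event: login_failed"}',
    '{"level": "INFO", "request_id": "r-1", "message": "request_end: 200 in 0.012s"}',
]

matches = list(filter_logs(sample, request_id="r-1"))
for entry in matches:
    print(entry["message"])
```

Filtering on request_id in this way reconstructs the full timeline of a single request, which is the main payoff of the trace ID before any log platform is deployed.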


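📌 4. Implementation Path

4.1 Immediately (this batch run)

  • Finish the audit report (this file)
  • backend/app/utils/error_codes.py (Enum)
  • backend/app/utils/logging_config.py (JSON formatter)
  • Update main.py to add the /ready, /metrics, and /version endpoints
  • Add the request_id middleware

4.2 Next week

  • Add api/error-handler.ts to all 4 frontends
  • Add axios interceptors to all 4 frontends (capturing trace_id)
  • Configure LOG_LEVEL=INFO and LOG_FORMAT=json in .env

4.3 This quarter

  • Deploy Loki + Promtail
  • Grafana dashboards (Loki data source)
  • Alerts on key business events (login failures / agents going offline)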

📌 5. Related Documents


This audit was produced by Claude's full-load batch run on 2026-06-15 and is pending review.