Running Deep Learning Models in 2GB of RAM? A Hands-On Deployment on a 99-Yuan Server

Item	Link
Live app	polyustar.github.io/sonox
Paper	Switch: A Semi-Supervised Learning Framework for Ultrasound Image Segmentation
Source code	github.com/jinggqu/Switch

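Background

This year I bought Alibaba Cloud's 99 yuan/year ECS instance (the promotion open to both new and existing customers) and wanted to deploy SonoX on it: a platform for ultrasound lesion segmentation and classification built on the Switch semi-supervised learning framework.

The server configuration: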

Item	Spec
Instance type	ecs.e-c1m1.large
CPU	2 vCPU (one physical core, hyper-threaded)
Memory	2 GB
Public bandwidth	3 Mbps (out) / 200 Mbps (in)
OS	Debian 13.5
Cost	99 yuan/year

The project's own resource requirements:

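Component	Size
Segmentation models (UNet)	4 datasets × (6 semi-supervised variants + 1 baseline) = 28 models, ~7MB each
Classification models	ViT-Base ~328MB + fusion head ~15MB + Radiomics ~6MB
Total model weights	~540MB
Backend code	A single ~770-line FastAPI file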

The core tension: ~540MB of model weights plus the PyTorch runtime have to fit into a server with 2GB of memory, with headroom left for inference.
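
As a quick sanity check on the ~540MB figure, the sizes from the table can simply be added up. This is only arithmetic on the approximate numbers listed above; the constant and function names are mine, not from the SonoX code.

```python
# Approximate sizes in MB, copied from the table above.
DATASETS = 4
MODELS_PER_DATASET = 6 + 1   # 6 semi-supervised variants + 1 baseline
UNET_MB = 7

CLASSIFIER_MB = {"vit_base": 328, "fusion_head": 15, "radiomics": 6}


def segmentation_model_count() -> int:
    """Number of UNet checkpoints across all datasets."""
    return DATASETS * MODELS_PER_DATASET


def total_weights_mb() -> int:
    """Approximate on-disk size of all segmentation and classification weights."""
    return segmentation_model_count() * UNET_MB + sum(CLASSIFIER_MB.values())
```

With these inputs the total comes to about 545MB, in line with the ~540MB quoted above, and that is before PyTorch itself is loaded.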

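Below are the pitfalls I hit while optimizing and the measures that actually worked.

Choosing Debian over Ubuntu

On the same machine, a base Ubuntu Server install already consumes ~400MB (snapd, networkd and so on), while a minimal Debian install needs only ~128MB. Both use apt and are largely binary-compatible, so the migration cost is close to zero.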

	Ubuntu	Debian
Base memory footprint	~400MB	~128MB
Snap daemon	Yes (~50MB)	No
Default services	Many	Few

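On a 2GB server, Debian frees up roughly 250MB or more compared with Ubuntu. For a memory-starved deep learning service, that is the difference between running and not running.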

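Disabling crashkernel to reclaim the "overhead" memory

Right after boot, free -h shows that only about 1.65GB of the nominal 2GB is actually usable. The missing ~350MB comes mainly from two sources: the kernel's own overhead and the crashkernel reservation. For a personal project, a crashed server can simply be rebooted, so kdump is not needed.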

# Show what kdump has reserved
sudo kdump-config show

# Edit /etc/default/grub and delete the crashkernel=... part
# Then regenerate the GRUB config and reboot
sudo update-grub
sudo reboot

After the reboot, available memory rises to ~1.8GB, recovering more than 100MB.
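
To compare the numbers before and after a change like this without reading free -h by eye, the same values can be pulled from /proc/meminfo. The following is a minimal sketch, assuming a Linux host; the helper names are hypothetical and not part of SonoX.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def kb_to_mib(kb: int) -> float:
    """Convert a kB value from /proc/meminfo to MiB."""
    return kb / 1024


def read_meminfo(path: str = "/proc/meminfo") -> dict[str, int]:
    """Read and parse the live memory statistics (Linux only)."""
    with open(path, encoding="ascii") as f:
        return parse_meminfo(f.read())
```

Calling kb_to_mib(read_meminfo()["MemTotal"]) before and after the reboot shows how much memory the crashkernel reservation was holding back.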

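Creating swap: the last-resort fuse

The optimizations above all shuffle things around within physical memory. But model inference is a spiky workload: quiet most of the time, yet a single classification request can suddenly take several hundred MB more. With 2GB of physical memory there is no margin for error; once usage goes over, the OOM killer steps in.

Add a 2GB swap file as a buffer: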

# Create a 2GB swap file
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Add it to /etc/fstab so it is enabled at boot
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Verify:

free -h
#               total        used        free      shared  buff/cache   available
# Mem:           1.6Gi       451Mi       627Mi       2.5Mi       685Mi       1.1Gi
# Swap:          2.0Gi          0B       2.0Gi

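That said, Alibaba Cloud's disk performance is pretty poor to begin with, so swap here is mostly for peace of mind: better than nothing.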

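Model loading strategy: lazy loading + release after use

Preloading all 28 segmentation models would blow up memory immediately. The core strategy:

Segmentation models: loaded on demand, per dataset. The 6 models for lymph-node are loaded only when a user requests that dataset, and after processing a config flag decides whether they are unloaded.

Classification models: the much larger ViT (328MB) is lazily loaded. It is loaded when the first classification request arrives and released right after use: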

def classify(image, mask):
    _ensure_cls_models()      # loads the models if they are not in memory yet
    try:
        ...  # run inference here
    finally:
        if not KEEP_CLS_MODELS_LOADED:
            unload_cls_models()  # release after use

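In practice, steady-state memory usage is about 500MB (with only one dataset's segmentation models resident) and the peak stays below 1.4GB, about 300-400MB less than preloading everything.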
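
The per-dataset policy for the segmentation models can be sketched the same way. This is not the SonoX implementation: LazyModelPool and its loader callback are hypothetical names, and the loader stands in for whatever actually reads the checkpoints from disk.

```python
import gc
from typing import Callable, List, Optional


class LazyModelPool:
    """Keep at most one dataset's models in memory at a time."""

    def __init__(self, loader: Callable[[str], List[object]], keep_loaded: bool = True):
        self._loader = loader            # hypothetical: returns the models for one dataset
        self._keep_loaded = keep_loaded  # plays the role of a KEEP_*_LOADED style flag
        self._dataset: Optional[str] = None
        self._models: List[object] = []

    def acquire(self, dataset: str) -> List[object]:
        """Return the models for `dataset`, loading them only if needed."""
        if dataset != self._dataset:
            self.release()               # evict the previous dataset first
            self._models = self._loader(dataset)
            self._dataset = dataset
        return self._models

    def release(self) -> None:
        """Drop all references so the memory can be reclaimed."""
        self._models = []
        self._dataset = None
        gc.collect()

    def run(self, dataset: str, fn: Callable[[List[object]], object]) -> object:
        """Run `fn` on the dataset's models, unloading afterwards if configured."""
        models = self.acquire(dataset)
        try:
            return fn(models)
        finally:
            if not self._keep_loaded:
                self.release()
```

With keep_loaded=True, repeated requests for the same dataset reuse the loaded models and only a switch to another dataset triggers a reload; with keep_loaded=False, every request pays the load cost but nothing stays resident in between.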

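CPU dynamic quantization: shrinking the 328MB ViT to ~85MB

Running ViT-Base inference on a CPU-only server is a luxury. PyTorch dynamic quantization compresses the nn.Linear weights in the model from float32 to int8, which cuts their memory footprint by about 75% and, on CPU, usually speeds inference up as well.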

import torch

def _maybe_quantize_dynamic(module, module_name):
    engine = "fbgemm"  # best choice on x86 CPUs
    torch.backends.quantized.engine = engine
    module = torch.quantization.quantize_dynamic(
        module, {torch.nn.Linear}, dtype=torch.qint8
    )
    return module

# Quantize all three models: ViT, RadiomicsNN, and ClassificationHead
cls_model_vit = _maybe_quantize_dynamic(cls_model_vit, "classification ViT")

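Combined with SONOX_CPU_THREADS=1 (limiting PyTorch's thread count to avoid contention), a single classification inference takes about 3-5 seconds on 2 vCPUs, and peak memory stays under control.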
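
The drop from 328MB to roughly 85MB follows from the byte widths: int8 takes 1 byte per weight where float32 takes 4, and only the nn.Linear weights are converted. Here is a back-of-the-envelope sketch; the linear_fraction parameter is my assumption about how much of the model sits in Linear layers, and the small per-layer scale and zero-point overhead is ignored.

```python
def quantized_size_mb(fp32_mb: float, linear_fraction: float) -> float:
    """Estimate model size after int8 dynamic quantization of Linear layers.

    fp32_mb:         model size with float32 weights
    linear_fraction: assumed share of the weights held in nn.Linear layers (0..1)
    """
    linear_mb = fp32_mb * linear_fraction
    other_mb = fp32_mb - linear_mb       # embeddings, LayerNorm, etc. stay float32
    return linear_mb / 4 + other_mb      # 4 bytes per weight become 1 byte


def reduction_percent(fp32_mb: float, linear_fraction: float) -> float:
    """Overall size reduction in percent."""
    return 100 * (1 - quantized_size_mb(fp32_mb, linear_fraction) / fp32_mb)
```

If nearly all of a transformer's weights are in Linear layers (say linear_fraction=0.99), quantized_size_mb(328, 0.99) lands at roughly 84MB, which is consistent with the ~85MB and ~75% figures above.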

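Disk cache: avoiding repeated inference

The SonoX API identifies an image by its original filename. If a user looks at the segmentation/classification result for the same image repeatedly, there is no need to rerun inference.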

from pathlib import Path

CACHE_DIR = Path("cache")

def _cache_path(dataset, filename):
    return CACHE_DIR / dataset / f"{filename}.json"

def _read_cache(dataset, filename):
    # On a hit, return the JSON directly and skip all model inference
    ...

def _write_cache(dataset, filename, data):
    # Write the result to the cache once inference finishes
    ...

async def api_segment(image, dataset, filename):
    if filename:
        cached = _read_cache(dataset, filename)
        if cached and "segment" in cached:
            return cached["segment"]  # return directly, no computation
    # ... otherwise run inference as usual and write the cache

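Cache files live in the cache/ directory as JSON, about 200KB per record (including the base64-encoded image overlay). They are cleared automatically when the service is restarted via run.sh.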
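
The two elided helpers could be filled in along the following lines. This is one possible sketch rather than the actual SonoX code: it takes the cache directory as a parameter, assumes dataset and filename have already been validated (no path separators), and writes through a temporary file so a reader never sees a half-written entry.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def cache_path(cache_dir: Path, dataset: str, filename: str) -> Path:
    """One JSON file per (dataset, filename) pair."""
    return cache_dir / dataset / f"{filename}.json"


def read_cache(cache_dir: Path, dataset: str, filename: str) -> Optional[dict]:
    """Return the cached record, or None on a miss or a corrupt entry."""
    path = cache_path(cache_dir, dataset, filename)
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # the caller falls back to running inference


def write_cache(cache_dir: Path, dataset: str, filename: str, data: dict) -> None:
    """Store a record atomically: write a temp file, then rename it into place."""
    path = cache_path(cache_dir, dataset, filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(data, f)
    os.replace(tmp, path)
```

Because the key is the filename alone, two different images uploaded under the same name would share a cache entry; hashing the image bytes into the key would avoid that at the cost of reading the whole upload first.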

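After splitting frontend and backend: adding HTTPS to the API

A frontend served from GitHub Pages (over HTTPS) has to call an HTTPS API. A plain-HTTP API triggers mixed-content blocking, and a base URL without a scheme gets resolved as a relative path: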

Wrong: https://jinggqu.github.io/SonoX/47.119.180.6/api/datasets
Right: https://47.119.180.6/api/datasets

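The approach is simple: the backend listens only on 127.0.0.1:8000, Nginx serves 80/443 to the outside, certbot handles the certificate, and the security group allows 80/443.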
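
The relative-path failure is ordinary URL resolution, and it can be reproduced with Python's urllib.parse. In this sketch normalize_api_base is a hypothetical helper showing one way a frontend build step might guard against a scheme-less value; it is not taken from the SonoX frontend.

```python
from urllib.parse import urljoin, urlparse

PAGE = "https://jinggqu.github.io/SonoX/"


def normalize_api_base(value: str) -> str:
    """Return an absolute https:// base URL without a trailing slash."""
    value = value.strip().rstrip("/")
    if not urlparse(value).scheme:
        value = "https://" + value   # assume HTTPS when no scheme is given
    return value


# A scheme-less value is treated as a path relative to the current page.
broken = urljoin(PAGE, "47.119.180.6/api/datasets")

# With the scheme present the URL is absolute, so the page URL is ignored.
fixed = urljoin(PAGE, normalize_api_base("47.119.180.6") + "/api/datasets")
```

Here broken reproduces the wrong URL shown above and fixed the correct one, which is the same resolution a browser performs when the page calls fetch with a scheme-less string.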

Backend and Nginx

The backend listens only on a local port:

cd ~/proj/SonoX
tmux new -d -s sonox-api \
  'export HOST=127.0.0.1 PORT=8000 SONOX_SERVE_FRONTEND=0 && ./run.sh'
curl http://127.0.0.1:8000/api/health

Install and configure Nginx:

apt update
apt install -y nginx

Create the config file /etc/nginx/sites-available/sonox-api:

server {
    listen 80;
    listen [::]:80;
    server_name 47.119.180.6;

    location /.well-known/acme-challenge/ { root /var/www/html; }
    location / { return 301 https://$host$request_uri; }
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name 47.119.180.6;

    ssl_certificate /etc/letsencrypt/live/47.119.180.6/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/47.119.180.6/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

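Enable the site and reload Nginx. On a first-time setup the certificate files referenced in the 443 block do not exist yet and nginx -t will fail, so leave that server block out until certbot has issued the certificate: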

ln -sf /etc/nginx/sites-available/sonox-api /etc/nginx/sites-enabled/sonox-api
nginx -t
systemctl reload nginx

Certificate and renewal

apt install -y python3 python3-venv libaugeas0
python3 -m venv /opt/certbot
/opt/certbot/bin/pip install --upgrade pip certbot
ln -sf /opt/certbot/bin/certbot /usr/bin/certbot

mkdir -p /var/www/html
certbot certonly --webroot -w /var/www/html -d 47.119.180.6
nginx -t
systemctl reload nginx

Certificate paths:

/etc/letsencrypt/live/47.119.180.6/fullchain.pem
/etc/letsencrypt/live/47.119.180.6/privkey.pem

Automatic renewal:

cat >/etc/cron.d/sonox-certbot-renew <<'EOF'
0 */12 * * * root certbot renew --quiet --deploy-hook 'systemctl reload nginx'
EOF

Verification and frontend configuration

Verify in this order:

curl http://127.0.0.1:8000/api/health
curl -k https://127.0.0.1/api/health
curl -I http://47.119.180.6/api/health
curl -vk https://47.119.180.6/api/health

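If the last step times out, the usual cause is that the Alibaba Cloud security group does not allow 443/tcp. If you keep the HTTP redirect, 80/tcp must be allowed as well.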

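On the GitHub Pages side, add the following repository environment variable. The first form is the safer one: the bare-IP form only works if the frontend adds the https:// scheme itself, otherwise it runs into the relative-path problem described earlier: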

SONOX_API_BASE_URL=https://47.119.180.6
# or
SONOX_API_BASE_URL=47.119.180.6

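Development environment tweaks

uv package index mirror

Pulling from PyPI on a server in mainland China is painfully slow. Configure the Alibaba Cloud mirror directly in pyproject.toml: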

[[tool.uv.index]]
name = "aliyun"
url = "https://mirrors.aliyun.com/pypi/simple/"
default = true  # important: makes this mirror the default index instead of PyPI

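Pulling GitHub code through SSH port forwarding

Running git clone against GitHub from a server in mainland China often hits TLS timeouts. The most reliable approach: SSH remote port forwarding plus a Git proxy. This assumes a proxy is already listening on 127.0.0.1:7890 on your local machine.

Local ~/.ssh/config: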

Host aliyun
  HostName <public IP>
  User root
  RemoteForward 7890 127.0.0.1:7890

On the server:

# Confirm the forwarded port is listening
ss -tlnp | grep 7890

# Route Git through the forwarded proxy
# (http.proxy also covers https:// remotes; Git has no separate https.proxy setting)
git config --global http.proxy http://127.0.0.1:7890
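
The ss check can also be scripted, which is handy before kicking off a long clone. Below is a minimal sketch using only the standard library; port_is_open is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False   # refused, unreachable, or timed out
```

On the server, port_is_open("127.0.0.1", 7890) should return True while the SSH session with RemoteForward is connected, and False once it drops.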

If forwarding fails, check that AllowTcpForwarding yes is set in /etc/ssh/sshd_config (and restart sshd after changing it):

AllowAgentForwarding yes
AllowTcpForwarding yes
GatewayPorts yes

An easier option: set up an SSH key and use git clone git@github.com:..., which bypasses the proxy entirely.

Persisting Git identity and credentials

git config --global user.name "xxx"
git config --global user.email "xxx"
git config --global credential.helper store

zsh + plugins

apt install -y build-essential vim curl wget git tmux zsh tree lsof
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions
git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting
# In .zshrc: plugins=(git zsh-autosuggestions zsh-syntax-highlighting)

Font size and color tweaks for the Codex extension in VSCode

Font size: the Codex extension's chat font is small by default. Adjust it in settings.json via chat.fontSize and chat.editor.fontSize:

"chat.fontSize": 16,
"chat.editor.fontSize": 14

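Font color: under some themes the Codex chat text is gray with poor contrast. The root cause is that the extension inherits theme colors through --vscode-* CSS variables. To track it down: Cmd/Ctrl + Shift + P → Developer: Open Webview Developer Tools → click the element-picker arrow and select the gray text → in the Computed panel, expand the color variable chain to find the --vscode-xxx-yyy variable it finally resolves to. Drop the --vscode- prefix and replace the first hyphen with a dot to get the key to override. In the end, the text color turned out to be controlled by foreground, and setting it to #000000 fixes it.

The full configuration: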

{
  "workbench.colorCustomizations": {
    "foreground": "#000000",
    "sideBar.foreground": "#000000",
    "descriptionForeground": "#000000",
    "chat.requestBackground": "#f0f0f0",
    "textPreformat.foreground": "#000000",
    "textBlockQuote.foreground": "#000000"
  }
}
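
The rule for turning a CSS variable name into a workbench.colorCustomizations key is mechanical, so it can be written down as a small function. This is a sketch of the rule as described above; css_var_to_setting is a hypothetical helper, not part of VSCode or the Codex extension.

```python
PREFIX = "--vscode-"


def css_var_to_setting(css_var: str) -> str:
    """Map a --vscode-* CSS variable to its colorCustomizations key.

    Strips the --vscode- prefix and turns only the first hyphen into a dot.
    """
    if not css_var.startswith(PREFIX):
        raise ValueError(f"not a VSCode theme variable: {css_var!r}")
    name = css_var[len(PREFIX):]
    return name.replace("-", ".", 1)
```

For example, --vscode-sideBar-foreground maps to sideBar.foreground, while --vscode-foreground has no further hyphen and stays foreground.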

Results

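Metric	Before	After
Available system memory	~1.65 GB	~1.80 GB
Base OS footprint	~400MB (Ubuntu)	~128MB (Debian)
Resident model memory	~500MB (full preload)	~150MB (on-demand loading)
Classification model memory	~350MB (FP32)	~100MB (INT8)
Memory usage at idle	Overflow	~80%
Repeated-request latency	3-5s	<10ms (cache hit)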

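In the end, a 99 yuan/year machine runs an ultrasound image analysis service with 28 segmentation models + 3 classification models, with memory to spare and acceptable response times.

The core idea: use memory sparingly (lazy loading), squeeze it (quantization), and don't repeat work (caching).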