This post is a write-up on model serving with TorchServe.
Model training (pretraining + fine-tuning) was covered in the previous post, so the training/evaluation steps are only sketched briefly here, and the focus is on the serving workflow.
Model Training and Evaluation
# Install and load libraries
!pip install datasets torchserve torch-model-archiver torch-workflow-archiver nvgpu
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import seaborn as sns
from torch.utils.data import DataLoader
from transformers import BatchEncoding, BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from sklearn.metrics import confusion_matrix
from datasets import load_dataset
from tqdm import tqdm
from typing import TypedDict
# Load the data and model
dataset = load_dataset("ag_news")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = torch.nn.CrossEntropyLoss()
class DatasetItem(TypedDict):
    text: str
    label: int  # AG News labels are integer class ids (0-3)

def preprocess_data(dataset_item: DatasetItem) -> dict[str, torch.Tensor]:
    return tokenizer(dataset_item["text"], truncation=True, padding="max_length", return_tensors="pt")

train_dataset = dataset["train"].select(range(1200)).map(preprocess_data, batched=True)
test_dataset = dataset["test"].select(range(800)).map(preprocess_data, batched=True)
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
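Before training, it can help to sanity-check one batch from the loader (a quick sketch; with padding="max_length" and no explicit max_length, the tokenizer pads to BERT's 512-token limit):
# Peek at one batch to confirm tensor shapes
batch = next(iter(train_loader))
print(batch["input_ids"].shape)       # expected: torch.Size([8, 512])
print(batch["attention_mask"].shape)  # expected: torch.Size([8, 512])
print(batch["label"].shape)           # expected: torch.Size([8])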
# Train the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
num_epochs = 3
losses: list[float] = []
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1}"):
        inputs = {key: batch[key].to(device) for key in batch}
        labels = inputs.pop("label")
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        losses.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    average_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}, Average Loss: {average_loss}")
# Evaluate the model
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating"):
        inputs = {key: batch[key].to(device) for key in batch}
        labels = inputs.pop("label")
        outputs = model(**inputs, labels=labels)
        logits = outputs.logits
        predicted_labels = torch.argmax(logits, dim=1)
        correct += (predicted_labels == labels).sum().item()
        total += labels.size(0)
accuracy = correct / total
print("")
print(f"Test Accuracy: {accuracy * 100:.2f}%")
all_predictions: list[int] = []
all_labels: list[int] = []
with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating"):
        inputs = {key: batch[key].to(device) for key in batch}
        labels = inputs.pop("label")
        outputs = model(**inputs)
        logits = outputs.logits
        predicted_labels = torch.argmax(logits, dim=1)
        all_predictions.extend(predicted_labels.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
conf_matrix = confusion_matrix(all_labels, all_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="g", cmap=sns.light_palette("#4285f4", as_cmap=True))
plt.xlabel("Predicted labels")
plt.ylabel("True labels")
plt.title("Confusion Matrix Heatmap")
plt.show()
Key results:
Epoch 1: 100%|██████████| 150/150 [01:43<00:00, 1.45it/s]
Epoch 1, Average Loss: 0.6668181281288464
Epoch 2: 100%|██████████| 150/150 [01:50<00:00, 1.35it/s]
Epoch 2, Average Loss: 0.32336516576508684
Epoch 3: 100%|██████████| 150/150 [01:50<00:00, 1.36it/s]
Epoch 3, Average Loss: 0.16716235954935352
Evaluating: 100%|██████████| 100/100 [00:21<00:00, 4.56it/s]
Test Accuracy: 85.00%

Model Serving
1. Save the model weights
model_save_path = "bert_news_classification_model.pth"
torch.save(model.state_dict(), model_save_path)
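As a quick sanity check before packaging, the saved state dict can be reloaded into a fresh model instance (a minimal sketch; map_location="cpu" just avoids requiring a GPU):
# Verify the checkpoint loads cleanly into a freshly initialized model
reloaded = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
reloaded.load_state_dict(torch.load(model_save_path, map_location="cpu"))
reloaded.eval()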
2. 핸들러 생성
- 모델 초기화 및 가중치 적용 & eval 모드로 전환
- 받은 데이터를 모델에 입력할 텐서로 변환
- 추론을 통해 logit을 얻고, softmax로 확률값으로 변환 후 최종 결과 반환
%%writefile model_handler.py
import json
import torch
from ts.context import Context
from ts.torch_handler.base_handler import BaseHandler
from transformers import BatchEncoding, BertTokenizer, BertForSequenceClassification
class ModelHandler(BaseHandler):
    def __init__(self):
        self.initialized = False
        self.tokenizer = None
        self.model = None

    def initialize(self, context: Context):
        # Rebuild the model, apply the fine-tuned weights, and switch to eval mode
        self.initialized = True
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)
        self.model.load_state_dict(torch.load("bert_news_classification_model.pth"))
        self.model.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
        self.model.eval()

    def preprocess(self, data: list[dict[str, bytearray]]) -> BatchEncoding:
        # Each request body is JSON of the form {"data": [<text>, ...]}; flatten all texts into one batch
        model_input_texts: list[str] = sum([json.loads(item.get("body").decode("utf-8"))["data"] for item in data], [])
        inputs = self.tokenizer(model_input_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        return inputs.to(device)

    def inference(self, input_batch: BatchEncoding) -> torch.Tensor:
        with torch.no_grad():
            outputs = self.model(**input_batch)
        return outputs.logits

    def postprocess(self, inference_output: torch.Tensor) -> list[dict[str, float]]:
        # Convert logits to probabilities; return the top label and its probability per input
        probabilities = torch.nn.functional.softmax(inference_output, dim=1)
        return [{"label": int(torch.argmax(prob)), "probability": float(prob.max())} for prob in probabilities]
3. Create the config file
If these addresses are not defined, TorchServe starts on its default ports (8080 for inference, 8081 for management, 8082 for metrics).
However, this TorchServe instance is running in Colab, and those ports can collide with ones the Colab runtime already uses, so the addresses are changed here.
(In a real serving environment this file may not be needed.)
%%writefile config.properties
inference_address=http://0.0.0.0:5000
management_address=http://0.0.0.0:5001
metrics_address=http://0.0.0.0:5002
4. Package the model
Bundle the files needed for serving (the checkpoint, handler, config, BERT vocab file, etc.) into a single .mar file.
# Download the BERT vocab file (artifact)
!wget https://raw.githubusercontent.com/microsoft/SDNet/master/bert_vocab_files/bert-base-uncased-vocab.txt \
-O bert-base-uncased-vocab.txt
# Output directory for the torch-model-archiver result.
# Packaging produces a .mar file here.
!mkdir -p model-store
!torch-model-archiver \
--model-name bert_news_classification \
--version 1.0 \
--serialized-file bert_news_classification_model.pth \
--handler ./model_handler.py \
--extra-files "bert-base-uncased-vocab.txt" \
--export-path model-store \
-f
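A quick check that the archive landed in the export path (the file name follows --model-name):
!ls model-store
# bert_news_classification.mar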
5. Start TorchServe and register the model
Start the TorchServe server with the packaged .mar file and register the model so it is ready for inference.
%%script bash --bg
# Run the server in the background
PYTHONPATH=/usr/lib/python3.10 torchserve \
--start \
--ncs \
--ts-config config.properties \
--model-store model-store \
--models bert_news_classification=bert_news_classification.mar \
--disable-token-auth
!curl -X GET localhost:5000/ping
# Result
{
"status": "Healthy"
}
The response shows the server is up and healthy.
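The management API (bound to port 5001 in config.properties above) can also confirm that the model was registered:
!curl -X GET localhost:5001/models
# The response should list bert_news_classification along with its .mar URL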
6. Evaluation
%%shell
# Create request files from external news articles to test the model on real data.
cat > request_sports.json <<EOF
{
"data": [
"Bleary-eyed from 16 hours on a Greyhound bus, he strolled into the stadium running on fumes. He’d barely slept in two days. The ride he was supposed to hitch from Charlotte to Indianapolis canceled at the last minute, and for a few nervy hours, Antonio Barnes started to have his doubts. The trip he’d waited 40 years for looked like it wasn’t going to happen.ADVERTISEMENTBut as he moved through the concourse at Lucas Oil Stadium an hour before the Colts faced the Raiders, it started to sink in. His pace quickened. His eyes widened. His voice picked up.“I got chills right now,” he said. “Chills.”Barnes, 57, is a lifer, a Colts fan since the Baltimore days. He wore No. 25 on his pee wee football team because that’s the number Nesby Glasgow wore on Sundays. He was a talent in his own right, too: one of his old coaches nicknamed him “Bird” because of his speed with the ball.Back then, he’d catch the city bus to Memorial Stadium, buy a bleacher ticket for $5 and watch Glasgow and Bert Jones, Curtis Dickey and Glenn Doughty. When he didn’t have any money, he’d find a hole in the fence and sneak in. After the game was over, he’d weasel his way onto the field and try to meet the players. “They were tall as trees,” he remembers.He remembers the last game he went to: Sept. 25, 1983, an overtime win over the Bears. Six months later the Colts would ditch Baltimore in the middle of the night, a sucker-punch some in the city never got over. But Barnes couldn’t quit them. When his entire family became Ravens fans, he refused. “The Colts are all I know,” he says.For years, when he couldn’t watch the games, he’d try the radio. And when that didn’t work, he’d follow the scroll at the bottom of a screen.“There were so many nights I’d just sit there in my cell, picturing what it’d be like to go to another game,” he says. “But you’re left with that thought that keeps running through your mind: I’m never getting out.”It’s hard to dream when you’re serving a life sentence for conspiracy to commit murder.It started with a handoff, a low-level dealer named Mickey Poole telling him to tuck a Ziploc full of heroin into his pocket and hide behind the Murphy towers. This was how young drug runners were groomed in Baltimore in the late 1970s. This was Barnes’ way in.ADVERTISEMENTHe was 12.Back then he idolized the Mickey Pooles of the world, the older kids who drove the shiny cars, wore the flashy jewelry, had the girls on their arms and made any working stiff punching a clock from 9 to 5 look like a fool. They owned the streets. Barnes wanted to own them, too.“In our world,” says his nephew Demon Brown, “the only successful people we saw were selling drugs and carrying guns.”So whenever Mickey would signal for a vial or two, Barnes would hurry over from his hiding spot with that Ziploc bag, out of breath because he’d been running so hard."
]
}
EOF
cat > request_business.json <<EOF
{
"data": [
"DETROIT – America maintained its love affair with pickup trucks in 2023 — but a top-selling vehicle from Toyota Motor nearly ruined their tailgate party.Sales of the Toyota RAV4 compact crossover came within 10,000 units of Stellantis’ Ram pickup truck last year, a near-No. 3 ranking that would have marked the first time since 2014 that a non-pickup claimed one of the top three U.S. sales podium positions.The RAV4 has rapidly closed the gap: In 2020, the vehicle undersold the Ram truck by more than 133,000 units. Last year, it lagged by just 9,983. Stellantis sold 444,926 Ram pickups last year, a 5% decline from 2022.“Trucks are always at the top because they’re bought by not only individuals, but also fleet buyers and we saw heavy fleet buying last year,” said Michelle Krebs, an executive analyst at Cox Automotive. “The RAV4 shows that people want affordable, smaller SUVs, and the fact that there’s also a hybrid version of that makes it popular with people.”"
]
}
EOF
cat > request_sci_tech.json <<EOF
{
"data": [
"OpenVoice comprises two AI models working together for text-to-speech conversion and voice tone cloning.The first model handles language style, accents, emotion, and other speech patterns. It was trained on 30,000 audio samples with varying emotions from English, Chinese, and Japanese speakers. The second “tone converter” model learned from over 300,000 samples encompassing 20,000 voices.By combining the universal speech model with a user-provided voice sample, OpenVoice can clone voices with very little data. This helps it generate cloned speech significantly faster than alternatives like Meta’s Voicebox.Californian startup OpenVoice comes from California-based startup MyShell, founded in 2023. With $5.6 million in early funding and over 400,000 users already, MyShell bills itself as a decentralised platform for creating and discovering AI apps. In addition to pioneering instant voice cloning, MyShell offers original text-based chatbot personalities, meme generators, user-created text RPGs, and more. Some content is locked behind a subscription fee. The company also charges bot creators to promote their bots on its platform.By open-sourcing its voice cloning capabilities through HuggingFace while monetising its broader app ecosystem, MyShell stands to increase users across both while advancing an open model of AI development."
]
}
EOF
# Sports article (label: 1)
!curl -X POST \
-H "Accept: application/json" \
-T "request_sports.json" \
http://localhost:5000/predictions/bert_news_classification
{
"label": 1,
"probability": 0.996630847454071
}
# Business article (label: 2)
!curl -X POST \
-H "Accept: application/json" \
-T "request_business.json" \
http://localhost:5000/predictions/bert_news_classification
{
"label": 2,
"probability": 0.9636610746383667
}
# Sci/Tech article (label: 3)
!curl -X POST \
-H "Accept: application/json" \
-T "request_sci_tech.json" \
http://localhost:5000/predictions/bert_news_classification
{
"label": 3,
"probability": 0.9847061038017273
}
Judging by the results, the model is classifying the articles well.
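The same requests can also be sent from Python rather than curl (a minimal sketch using the requests library; the file names are the ones created above):
import requests

url = "http://localhost:5000/predictions/bert_news_classification"
for name in ["request_sports.json", "request_business.json", "request_sci_tech.json"]:
    with open(name, "rb") as f:
        response = requests.post(url, data=f, headers={"Content-Type": "application/json"})
    print(name, response.json())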
7. External access via ngrok
ngrok is a service that attaches a public domain to a locally running daemon/server, making it easy to reach from outside.
Connect ngrok so the serving server hosted on port 5000 can be reached externally.
!pip install pyngrok
from pyngrok import ngrok
ngrok.set_auth_token("YOUR_TOKEN")
inference_tunnel = ngrok.connect("5000")
inference_tunnel
Output:
<NgrokTunnel: "https://simone-unrigorous-acceleratedly.ngrok-free.dev" -> "http://localhost:5000">
!curl -X POST \
-H "Accept: application/json" \
-T "request_sci_tech.json" \
https://simone-unrigorous-acceleratedly.ngrok-free.dev/predictions/bert_news_classification
{
"label": 3,
"probability": 0.9847061038017273
}
Using the ngrok service, an externally accessible HTTPS tunnel is created to local port 5000,
and external users can then reach the model inference API through the generated public domain.
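The same call works from Python as well (a sketch; public_url comes from the NgrokTunnel object created above):
import requests

public_url = inference_tunnel.public_url  # e.g. "https://....ngrok-free.dev"
with open("request_sci_tech.json", "rb") as f:
    print(requests.post(f"{public_url}/predictions/bert_news_classification", data=f).json())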