Monitoring: LangSmith vs Langfuse

Key Concepts

LLM call tracing, cost monitoring, and quality evaluation: the building blocks of a production RAG quality dashboard.

Main Content

LangSmith vs Langfuse

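Item	LangSmith	Langfuse
Hosting	Cloud (self-hosting on the Enterprise plan)	Cloud + self-hosted
Pricing	Free 5K traces/mo, paid from $39/mo	Free 50K observations/mo, paid from $59/mo
Integrations	LangChain native	OpenAI/Anthropic SDKs
Focus	Tracing, evaluation	Tracing, evaluation, experiments
Open source	❌	✅ MIT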

LangSmith Quick Start

import os

# Enable LangSmith tracing through environment variables.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_PROJECT"] = "my-rag-bot"

# Existing LangChain code runs unchanged and is traced automatically.
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-6")
result = llm.invoke("Hello")

# Every call can then be inspected at https://smith.langchain.com:
# - full request and response
# - token counts and cost, calculated automatically
# - latency breakdown
# - tree view of nested chains
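
The cost column in a tracing dashboard is derived from token counts. As a rough sketch of that arithmetic, assuming per-million-token pricing: the `Price` class, the `estimate_cost` helper, and the numbers in `EXAMPLE_PRICE` are hypothetical placeholders, not part of LangSmith and not a real price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical USD prices per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def estimate_cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Return the estimated USD cost of a single LLM call."""
    return (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


# Placeholder numbers chosen for easy arithmetic, not real pricing.
EXAMPLE_PRICE = Price(input_per_mtok=2.0, output_per_mtok=10.0)
```

Calling `estimate_cost(500_000, 100_000, EXAMPLE_PRICE)` adds 1.00 for input and 1.00 for output. Output tokens usually cost more than input tokens, which is why the two counts are tracked separately.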

Langfuse: Self-Hosting Supported

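# pip install langfuse
import os

# Set credentials before the SDK creates its client.
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"
# or the URL of your self-hosted instance

# langfuse.decorators is the v2 SDK import path;
# in the v3 SDK, observe is imported from the top-level langfuse package.
from langfuse.decorators import observe
from langfuse.openai import openai  # drop-in wrapper that traces OpenAI SDK calls


@observe()
def my_rag_chain(query: str) -> str:
    # ... LangChain or OpenAI SDK calls go here
    answer = "..."  # placeholder for the generated answer
    return answer


# Calling the decorated function creates a trace automatically
my_rag_chain("How do I get started with LangChain?")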

Evaluation: LLM-as-Judge

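from langchain_anthropic import ChatAnthropic
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. Prepare the dataset
dataset_name = "rag-qa-test"
client.create_dataset(dataset_name)

examples = [
    {"input": {"q": "What is LangChain?"}, "output": {"a": "An LLM application framework"}},
    {"input": {"q": "What is RAG?"}, "output": {"a": "Retrieval-Augmented Generation"}},
]
for ex in examples:
    client.create_example(
        inputs=ex["input"],
        outputs=ex["output"],
        dataset_name=dataset_name,
    )


# 2. Evaluator: ask a judge LLM whether the two answers mean the same thing
judge_llm = ChatAnthropic(model="claude-opus-4-6")


def correctness_evaluator(run, example):
    expected = example.outputs["a"]
    actual = run.outputs["a"]
    response = judge_llm.invoke(f"""
Do these two answers mean the same thing? Reply with exactly "yes" or "no".
A: {expected}
B: {actual}
""")
    # Accept only replies that begin with "yes"; a plain substring test
    # would also match replies such as "I cannot say yes".
    is_correct = response.content.strip().lower().startswith("yes")
    return {"key": "correctness", "score": int(is_correct)}


# 3. Run the evaluation. The target returns a dict with the key "a" so that
#    run.outputs["a"] exists in the evaluator. my_rag_chain is the traced
#    function from the Langfuse section and is assumed to return the answer text.
evaluate(
    lambda inputs: {"a": my_rag_chain(inputs["q"])},
    data=dataset_name,
    evaluators=[correctness_evaluator],
)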
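
Judge replies are free text, so the step that turns a reply into a score deserves some care. The following is a minimal sketch of a stricter parser plus a simple aggregate. `parse_verdict` and `accuracy` are hypothetical helpers, not part of LangSmith, and returning `None` for ambiguous replies is one possible policy among several.

```python
import re


def parse_verdict(text: str):
    """Map a judge reply to 1 (yes), 0 (no), or None if it is ambiguous."""
    # Only accept a reply whose first word is exactly "yes" or "no".
    match = re.match(r"\s*(yes|no)\b", text, flags=re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0


def accuracy(verdicts):
    """Share of 'yes' verdicts among the replies that could be parsed."""
    scored = [v for v in verdicts if v is not None]
    return sum(scored) / len(scored) if scored else 0.0
```

Replies that do not start with a clear verdict are dropped from the average instead of being counted as wrong, which keeps judge formatting failures from silently lowering the score. Tracking how many replies were dropped is a reasonable next step.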

Custom Metrics

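from langfuse.decorators import observe, langfuse_context


@observe()
def rag_with_metrics(query):
    # ... generate the answer
    response = "..."
    confidence = 0.85
    citations = 3

    # Attach custom metadata to the current observation
    langfuse_context.update_current_observation(
        metadata={
            "confidence": confidence,
            "citation_count": citations,
            "model": "claude-sonnet-4-6",
            "user_segment": "pro",
        },
    )

    # Record numeric scores through the scoring call;
    # update_current_observation has no scores argument in the v2 SDK.
    langfuse_context.score_current_observation(name="confidence", value=confidence)
    langfuse_context.score_current_observation(
        name="citation_quality",
        value=min(citations / 5, 1.0),  # cap at 1.0 when there are more than 5 citations
    )

    return response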
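
Scores are easier to compare across runs when they share a 0 to 1 range. Here is a small sketch of normalizing the values before they are recorded. `clamp01` and `build_scores` are hypothetical helper names, and the default of 5 for `max_citations` simply mirrors the divisor used in the listing above.

```python
def clamp01(value: float) -> float:
    """Limit a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def build_scores(confidence: float, citations: int, max_citations: int = 5) -> dict:
    """Return score name -> normalized value, ready to be recorded one by one."""
    return {
        "confidence": clamp01(confidence),
        "citation_quality": clamp01(citations / max_citations),
    }
```

Each entry of the returned dict can then be passed to the scoring call of whichever tracing SDK is in use.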

A/B Testing

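import random

from langchain_anthropic import ChatAnthropic
from langfuse.decorators import observe, langfuse_context

llm = ChatAnthropic(model="claude-sonnet-4-6")

PROMPTS = {
    "v1": "You are a helpful chatbot. Answer in a friendly way.",
    "v2": "You are an OHS learning mentor. Answer clearly, step by step.",
}


@observe(name="rag_ab_test")
def rag_with_ab(query):
    # 50:50 split
    variant = random.choice(["v1", "v2"])
    prompt = PROMPTS[variant]

    response = llm.invoke([
        {"role": "system", "content": prompt},
        {"role": "user", "content": query},
    ]).content

    # Record the variant so results can be filtered and compared per variant.
    langfuse_context.update_current_observation(metadata={"variant": variant})
    # Tags are a trace-level attribute, so set them on the trace.
    langfuse_context.update_current_trace(tags=[f"variant:{variant}"])

    return response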
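
Random assignment on every request means the same user can see both prompts within one session. A common alternative is sticky assignment based on a hash of a stable identifier, sketched here with the standard library only. `assign_variant`, its `salt` parameter, and the `user_id` argument are hypothetical and do not appear in the listing above.

```python
import hashlib


def assign_variant(user_id: str, variants=("v1", "v2"), salt: str = "rag_ab_test") -> str:
    """Deterministically map a user to one variant."""
    # Salting with the experiment name keeps assignments independent
    # across different experiments that share the same user IDs.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

The same `user_id` always maps to the same variant, and across many users the split should come out roughly even. The returned name can be used as the key into `PROMPTS`.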

Building a Cost Dashboard

# Generate a daily report automatically
from datetime import datetime, timedelta

from langsmith import Client

client = Client()


def daily_cost_report():
    # Covers runs started in the last 24 hours
    yesterday = datetime.now() - timedelta(days=1)
    runs = list(client.list_runs(
        project_name="my-rag-bot",
        start_time=yesterday,
        is_root=True,  # top-level runs only
    ))

    total_cost = sum(run.total_cost or 0 for run in runs)
    total_tokens = sum(run.total_tokens or 0 for run in runs)
    # Derive latency from the run timestamps; skip runs that have not finished.
    latencies_ms = [
        (run.end_time - run.start_time).total_seconds() * 1000
        for run in runs
        if run.end_time is not None
    ]
    avg_latency = sum(latencies_ms) / len(latencies_ms) if latencies_ms else 0

    return {
        "date": yesterday.date().isoformat(),
        "total_runs": len(runs),
        "total_cost": f"${total_cost:.2f}",
        "total_tokens": total_tokens,
        "avg_latency_ms": f"{avg_latency:.0f}",
    }


report = daily_cost_report()
print(report)
# The report can then be sent automatically via Slack or email
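
Average latency hides slow outliers, so a dashboard often reports a tail percentile as well. Here is a minimal sketch of that aggregation over plain dicts. `percentile` and `summarize` are hypothetical helpers, the `cost` and `latency_ms` keys are an assumed record shape (not the LangSmith run schema), and nearest-rank is one of several percentile definitions.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs):
    """Aggregate records that carry 'cost' and 'latency_ms' keys."""
    if not runs:
        return {"total_runs": 0, "total_cost": 0.0,
                "avg_latency_ms": 0.0, "p95_latency_ms": 0.0}
    latencies = [r["latency_ms"] for r in runs]
    return {
        "total_runs": len(runs),
        "total_cost": round(sum(r["cost"] for r in runs), 4),
        "avg_latency_ms": mean(latencies),
        "p95_latency_ms": percentile(latencies, 95),
    }
```

For ten runs with latencies of 100 ms through 1000 ms, the average is 550 ms while the 95th percentile is 1000 ms, which is the number a user stuck on a slow request actually experiences.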

Next Chapter

CH.8 "Fine-tuning vs RAG": when to use which.

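AI Prompts

🤖 How to ask AI well: prompts by model and strategy

Free tier

Gemini 2.5 Flash (free) + Claude Sonnet 4.6 (free) + Grok 4.1 (free)

Using free Langfuse (self-hosted) and a free LLM,
show me how to build a RAG monitoring system at zero cost.

Low-budget tier

Claude API + Cursor $20/mo + Make.com: about 100,000 to 300,000 KRW per month

With LangSmith at $39/mo and the Claude API,
show me automation patterns for production RAG monitoring
and evaluation.

Production tier

Claude Opus + CrewAI + LangGraph: 1,000,000+ KRW per month

Design an enterprise LLM observability architecture
that runs LangSmith Enterprise and Langfuse side by side
and integrates with an in-house BI stack.

Stack prompt

Stack comparison by stage: $0 → $20/mo → $100/mo

Create a stage-by-stage comparison:
$0 (self-hosted Langfuse) → $39/mo (LangSmith) → $500/mo (enterprise).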

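⭐ Key Takeaways

For Monitoring: LangSmith vs Langfuse, make sure you have these three points down

1. LangSmith is LangChain-friendly; Langfuse is open source and self-hostable. Both trace calls automatically.

2. Automate evaluation with LLM-as-Judge, and track business KPIs with custom metrics.

3. The next chapter, CH.8, covers fine-tuning vs RAG and when to use each.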