Whisper Subtitle Extraction in 3 Steps: A Guide to Generating YouTube Subtitles Automatically for Free

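When you want to turn the key points of a YouTube lecture into text, extracting subtitles is the fastest route. Taking notes on a one-hour conference talk, pulling the source text of an English lecture before translating it into Korean, or repurposing a podcast as a blog post: all of these start with speech-to-text conversion. With Whisper, which OpenAI released as open source, you can do all of this locally with no API costs.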

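I have used this method to transcribe more than 30 YouTube recordings from a weekly study group, and it is far faster than typing everything out by hand. This article covers the whole pipeline from start to finish: extracting audio with yt-dlp, generating subtitles with Whisper, and producing SRT files.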

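📑 Contents

Real-World Use Cases for Subtitle Extraction
Environment Setup: Installing Python, yt-dlp, Whisper, and ffmpeg
Step 1: Downloading YouTube Audio with yt-dlp
Step 2: Converting Speech to Text with Whisper
Step 3: Working with Output Formats (txt, srt, vtt)
Full Automation Pipeline Script
Practical Tips: GPU Acceleration, Batch Processing, Apple Silicon
Common Errors and Fixes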

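Real-World Use Cases for Subtitle Extraction

“Just pulling subtitles” sounds simple, but it comes up in many different contexts in real work. Related material is also covered in the Mac Python AI development environment setup guide.

Lecture archiving: convert in-house training videos uploaded to YouTube into text and keep them as searchable documents
Automatic meeting minutes: upload Zoom/Google Meet recordings to YouTube, extract the subtitles, then summarize
Content repurposing: turn podcasts or vlogs into blog posts
Translation preprocessing: extract the source text of a foreign-language video first and use it as input for translation
Accessibility: generate SRT files for videos that have no subtitles

YouTube has its own automatic captions, but Korean recognition is inconsistent and timestamp precision is poor. In my experience the Whisper large-v3 model gives noticeably better results for Korean.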

Environment Setup: Installing Python, yt-dlp, Whisper, and ffmpeg

The pipeline needs four tools. Python 3.9 or later is required, and the rest are installed with pip and your system package manager.

# 1. Check the Python version
python3 --version
# Python 3.11.7  (3.9 or later is fine)

# 2. Install ffmpeg (Whisper uses it internally)
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg -y

# Verify the installation
ffmpeg -version
# ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers

# 3. Install yt-dlp (an actively maintained fork of youtube-dl)
pip install yt-dlp

# Check the version
yt-dlp --version
# 2024.12.23

# 4. Install Whisper
pip install openai-whisper

# Confirm that it imports, dependencies included
python3 -c "import whisper; print(whisper.__version__)"
# 20240930

One thing to watch out for: pip install whisper installs a completely different package. You must install openai-whisper. I lost ten minutes to this the first time.

I recommend creating a virtual environment to keep things isolated:

# Create and activate a virtual environment
python3 -m venv whisper-env
source whisper-env/bin/activate  # Windows: whisper-env\Scripts\activate

# Install inside the virtual environment
pip install openai-whisper yt-dlp
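
Before moving on, it can help to confirm that all the tools are visible from the active environment. The following is a minimal sketch rather than part of the original setup: check_environment is a hypothetical helper name, and it only checks that the ffmpeg and yt-dlp executables are on PATH and that a module named whisper can be found. It cannot tell openai-whisper apart from the unrelated whisper package mentioned above.

```python
import importlib.util
import shutil


def check_environment(commands=("ffmpeg", "yt-dlp"), modules=("whisper",)):
    """Return a dict mapping each tool name to True if it was found."""
    status = {}
    for cmd in commands:
        # shutil.which returns None when the executable is not on PATH
        status[cmd] = shutil.which(cmd) is not None
    for mod in modules:
        # find_spec locates a top-level module without importing it
        status[mod] = importlib.util.find_spec(mod) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Run it inside the virtual environment; anything reported as MISSING needs to be installed or added to PATH before the later steps will work.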

Step 1: Downloading YouTube Audio with yt-dlp

To feed Whisper, you first need to extract just the audio track from the YouTube video. yt-dlp is well suited to this.

# Basic audio download (best-quality audio, converted to mp3)
yt-dlp -x --audio-format mp3 -o "%(title)s.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"

# Sample output:
# [youtube] Extracting URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
# [youtube] dQw4w9WgXcQ: Downloading webpage
# [youtube] dQw4w9WgXcQ: Downloading ios player API JSON
# [info] dQw4w9WgXcQ: Downloading 1 format(s): 251
# [download] Destination: Rick Astley - Never Gonna Give You Up.webm
# [download] 100% of 3.28MiB in 00:00:01 at 2.45MiB/s
# [ExtractAudio] Destination: Rick Astley - Never Gonna Give You Up.mp3
# Deleting original file Rick Astley - Never Gonna Give You Up.webm

# Whisper-friendly option (16 kHz mono WAV, the format Whisper converts every input to)
yt-dlp -x --audio-format wav --postprocessor-args "-ar 16000 -ac 1" \
  -o "audio/%(id)s.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"

# Download an entire playlist
yt-dlp -x --audio-format mp3 -o "playlist/%(playlist_index)03d-%(title)s.%(ext)s" \
  "https://www.youtube.com/playlist?list=PLAYLIST_ID"

# Download only a specific section (start-end, written as HH:MM:SS)
yt-dlp -x --audio-format mp3 \
  --download-sections "*00:05:00-00:15:00" \
  -o "clip.%(ext)s" "https://www.youtube.com/watch?v=VIDEO_ID"

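Downloading 16 kHz mono WAV leaves Whisper less conversion work at load time, which trims processing time slightly. The file is larger, though: 16 kHz 16-bit mono WAV comes to about 1.9 MB per minute, roughly twice the size of a 128 kbps mp3 (a 44.1 kHz stereo WAV would be more than ten times larger). If disk space is tight, mp3 is fine.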
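
To make the size trade-off concrete, here is a rough back-of-the-envelope calculation. It is only a sketch under stated assumptions: 16-bit PCM samples, a constant 128 kbps mp3 as the comparison point, and no container overhead. The function names are made up for illustration. Under these assumptions a tenfold gap applies to 44.1 kHz stereo WAV, while 16 kHz mono WAV is about twice the size of the mp3.

```python
def pcm_wav_bytes_per_minute(sample_rate_hz, channels, bits_per_sample=16):
    """Size of one minute of uncompressed PCM audio, ignoring the WAV header."""
    return sample_rate_hz * channels * (bits_per_sample // 8) * 60


def cbr_bytes_per_minute(bitrate_kbps):
    """Size of one minute of constant-bitrate audio, e.g. a 128 kbps mp3."""
    return bitrate_kbps * 1000 // 8 * 60


wav_16k_mono = pcm_wav_bytes_per_minute(16_000, 1)
wav_cd_stereo = pcm_wav_bytes_per_minute(44_100, 2)
mp3_128 = cbr_bytes_per_minute(128)

for label, size in [
    ("16 kHz mono WAV", wav_16k_mono),
    ("44.1 kHz stereo WAV", wav_cd_stereo),
    ("128 kbps mp3", mp3_128),
]:
    # Show absolute size and the ratio relative to the mp3 baseline
    print(f"{label}: {size / 1_000_000:.2f} MB/min ({size / mp3_128:.1f}x mp3)")
```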

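Step 2: Converting Speech to Text with Whisper

Once the audio file is ready, run Whisper. It can be used in two ways: the CLI and the Python API.

Comparison by Model Size

Model	Parameters	VRAM	English accuracy	Korean (subjective)	Time for a 10-minute video (CPU)
tiny	39M	~1GB	fair	poor	~1 min
base	74M	~1GB	decent	weak	~2 min
small	244M	~2GB	good	usable	~5 min
medium	769M	~5GB	very good	good	~15 min
large-v3	1550M	~10GB	best	very good	~40 min

For Korean video I recommend medium or larger. Up to small, particles and verb endings were garbled fairly often. If you have no GPU and just want a quick test, start with base.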
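
If you want to choose a model programmatically, the VRAM column above can be turned into a small lookup. This is only an illustrative sketch: largest_model_that_fits and the headroom_gb parameter are hypothetical, and the numbers are the approximate figures from the table, not measured values.

```python
# (model name, approximate VRAM in GB) from the table above, smallest first
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def largest_model_that_fits(vram_gb, headroom_gb=1.0):
    """Return the largest model whose VRAM estimate plus headroom fits, or None."""
    best = None
    for name, need_gb in MODEL_VRAM_GB:
        # Later entries are larger, so keep overwriting while they still fit
        if need_gb + headroom_gb <= vram_gb:
            best = name
    return best


print(largest_model_that_fits(8))
print(largest_model_that_fits(24))
```

The headroom leaves space for other processes on the GPU; set it to 0 if the card is dedicated to Whisper.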

Running from the CLI

# Basic run (the model is downloaded automatically on first use)
whisper audio.mp3 --model medium --language ko

# Sample output (no language detection message, because --language is given):
# [00:00.000 --> 00:04.320]  안녕하세요 오늘은 파이썬으로 웹 크롤링하는 방법을 알아보겠습니다
# [00:04.320 --> 00:08.640]  먼저 BeautifulSoup 라이브러리를 설치해야 합니다
# [00:08.640 --> 00:12.480]  터미널에서 pip install beautifulsoup4를 입력하세요
# ...

# Choose the output format (one of txt, vtt, srt, tsv, json, or all)
whisper audio.mp3 --model medium --language ko \
  --output_format all --output_dir ./subtitles/

# Files produced:
# subtitles/audio.txt   plain text
# subtitles/audio.srt   SRT subtitles (with timestamps)
# subtitles/audio.vtt   WebVTT subtitles
# subtitles/audio.tsv   tab-separated data
# subtitles/audio.json  detailed per-segment JSON

# Point to a specific model directory (if the model is already downloaded)
whisper audio.mp3 --model_dir ~/.cache/whisper/ --model large-v3

Fine-Grained Control with the Python API

import whisper
import json
from pathlib import Path

# Load the model (downloaded automatically on first run)
model = whisper.load_model("medium")

# Basic transcription
result = model.transcribe("audio.mp3", language="ko")

# Print the full text
print(result["text"])

# Inspect per-segment timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:.1f}s ~ {end:.1f}s] {text}")

# Sample output:
# [0.0s ~ 4.3s] 안녕하세요 오늘은 파이썬으로 웹 크롤링하는 방법을 알아보겠습니다
# [4.3s ~ 8.6s] 먼저 BeautifulSoup 라이브러리를 설치해야 합니다
# [8.6s ~ 12.5s] 터미널에서 pip install beautifulsoup4를 입력하세요

# Using advanced options
result = model.transcribe(
    "audio.mp3",
    language="ko",
    task="transcribe",        # change to "translate" to translate into English
    temperature=0,            # 0 means greedy decoding (more reproducible)
    word_timestamps=True,     # word-level timestamps
    condition_on_previous_text=True,  # use the previous text as context
    initial_prompt="이 영상은 파이썬 프로그래밍 강의입니다.",  # domain hint
)

# Save as JSON (explicit UTF-8 so Korean text is written correctly on any OS)
output_path = Path("result.json")
output_path.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
print(f"결과 저장 완료: {output_path}")

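The initial_prompt parameter is quite useful. Telling Whisper the context of the video up front noticeably improves recognition of technical terms. For a programming lecture, passing “이 영상은 파이썬 프로그래밍 강의입니다.” (“This video is a Python programming lecture.”) sharply reduces how often the word comes out as “파이선” or “파이든”.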
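
One way to keep such hints manageable is to assemble the prompt from a topic sentence plus a short glossary of terms you expect in the audio. The helper below is a hypothetical sketch, not something Whisper provides: build_initial_prompt only does string handling, and whether a given glossary actually improves recognition is something to check on your own recordings.

```python
def build_initial_prompt(topic_sentence, terms=()):
    """Join a topic sentence and a de-duplicated glossary into one prompt string."""
    unique_terms = []
    for term in terms:
        term = term.strip()
        # Skip blanks and repeats while preserving the original order
        if term and term not in unique_terms:
            unique_terms.append(term)

    prompt = topic_sentence.strip()
    if unique_terms:
        prompt += " " + ", ".join(unique_terms)
    return prompt


prompt = build_initial_prompt(
    "이 영상은 파이썬 프로그래밍 강의입니다.",
    ["BeautifulSoup", "pip", "BeautifulSoup", " "],
)
print(prompt)
```

The resulting string can be passed as initial_prompt=prompt in the transcribe() call shown earlier.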

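Step 3: Working with Output Formats (txt, srt, vtt)

Whisper can write several output formats. This section looks at the characteristics of each and how to post-process them.

SRT Format Structure

SRT is the most widely supported subtitle format. Video editors, YouTube uploads, and media players all accept it.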

def seconds_to_srt_time(seconds: float) -> str:
    """Convert seconds to an SRT timestamp (HH:MM:SS,mmm)."""
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def generate_srt(segments: list) -> str:
    """Convert a list of Whisper segments to an SRT string."""
    srt_lines = []
    for i, seg in enumerate(segments, start=1):
        start = seconds_to_srt_time(seg["start"])
        end = seconds_to_srt_time(seg["end"])
        text = seg["text"].strip()
        srt_lines.append(f"{i}\n{start} --> {end}\n{text}\n")
    return "\n".join(srt_lines)


# Usage example
import whisper

model = whisper.load_model("medium")
result = model.transcribe("lecture.mp3", language="ko")

srt_content = generate_srt(result["segments"])

# Save the SRT file (UTF-8 with BOM, for compatibility with some players)
with open("lecture.srt", "w", encoding="utf-8-sig") as f:
    f.write(srt_content)

print(f"SRT 파일 생성 완료: {len(result['segments'])}개 세그먼트")

# Example of the generated SRT file:
# 1
# 00:00:00,000 --> 00:00:04,320
# 안녕하세요 오늘은 파이썬으로 웹 크롤링하는 방법을 알아보겠습니다
#
# 2
# 00:00:04,320 --> 00:00:08,640
# 먼저 BeautifulSoup 라이브러리를 설치해야 합니다
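WebVTT Format

On the web, VTT is the standard for use with the <video> tag. It is almost the same as SRT, but it starts with a WEBVTT header and uses a period instead of a comma before the milliseconds.

def generate_vtt(segments: list) -> str:
    """Convert a list of Whisper segments to a WebVTT string."""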
    vtt_lines = ["WEBVTT\n"]
    for seg in segments:
        start = seconds_to_srt_time(seg["start"]).replace(",", ".")
        end = seconds_to_srt_time(seg["end"]).replace(",", ".")
        text = seg["text"].strip()
        vtt_lines.append(f"{start} --> {end}\n{text}\n")
    return "\n".join(vtt_lines)


# Save as VTT
vtt_content = generate_vtt(result["segments"])
with open("lecture.vtt", "w", encoding="utf-8") as f:
    f.write(vtt_content)

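Full Automation Pipeline Script

This section combines the steps so far into a single script. Give it a YouTube URL and it handles everything from audio download to subtitle file generation.

#!/usr/bin/env python3
"""
youtube_subtitle.py: pipeline for extracting subtitles from a YouTube video automatically
Usage: python youtube_subtitle.py URL [--model MODEL] [--lang LANG] [--output-dir DIR]
"""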

import argparse
import subprocess
import sys
from pathlib import Path

import whisper


def download_audio(url: str, output_dir: Path) -> Path:
    """Download YouTube audio with yt-dlp and return the path of the saved file."""
    output_dir.mkdir(parents=True, exist_ok=True)
    output_template = str(output_dir / "%(id)s.%(ext)s")

    cmd = [
        "yt-dlp",
        "-x",
        "--audio-format", "mp3",
        "--audio-quality", "0",
        "-o", output_template,
        "--no-playlist",        # single video only
        # Print the final path after post-processing. A plain "--print filename"
        # implies --simulate, so nothing would actually be downloaded.
        "--print", "after_move:filepath",
        url,
    ]

    print(f"[1/3] 오디오 다운로드 중: {url}")
    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        print(f"다운로드 실패:\n{result.stderr}", file=sys.stderr)
        sys.exit(1)

    # The last line of stdout is the path of the converted mp3
    lines = result.stdout.strip().splitlines()
    audio_path = Path(lines[-1]) if lines else None

    if audio_path is None or not audio_path.exists():
        print("다운로드된 오디오 파일을 찾을 수 없습니다.", file=sys.stderr)
        sys.exit(1)

    print(f"   다운로드 완료: {audio_path} ({audio_path.stat().st_size / 1024 / 1024:.1f}MB)")
    return audio_path


def transcribe_audio(audio_path: Path, model_name: str, language: str) -> dict:
    """Transcribe the audio to text with Whisper."""
    print(f"[2/3] Whisper 변환 중 (모델: {model_name}, 언어: {language})")
    print(f"   모델 로딩 중...")

    model = whisper.load_model(model_name)

    print(f"   변환 시작...")
    result = model.transcribe(
        str(audio_path),
        language=language,
        temperature=0,
        condition_on_previous_text=True,
        verbose=False,
    )

    seg_count = len(result["segments"])
    duration = result["segments"][-1]["end"] if seg_count > 0 else 0
    print(f"   변환 완료: {seg_count}개 세그먼트, 총 {duration:.0f}초")
    return result


def seconds_to_srt_time(seconds: float) -> str:
    """Seconds to SRT timestamp."""
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def save_outputs(result: dict, output_dir: Path, stem: str):
    """Save the transcription result as txt, srt, and vtt files."""
    output_dir.mkdir(parents=True, exist_ok=True)
    segments = result["segments"]

    # 1. TXT: plain text
    txt_path = output_dir / f"{stem}.txt"
    txt_path.write_text(result["text"].strip(), encoding="utf-8")

    # 2. SRT: subtitles with timestamps
    srt_lines = []
    for i, seg in enumerate(segments, 1):
        start = seconds_to_srt_time(seg["start"])
        end = seconds_to_srt_time(seg["end"])
        text = seg["text"].strip()
        srt_lines.append(f"{i}\n{start} --> {end}\n{text}\n")

    srt_path = output_dir / f"{stem}.srt"
    srt_path.write_text("\n".join(srt_lines), encoding="utf-8-sig")

    # 3. VTT (WebVTT)
    vtt_lines = ["WEBVTT\n"]
    for seg in segments:
        start = seconds_to_srt_time(seg["start"]).replace(",", ".")
        end = seconds_to_srt_time(seg["end"]).replace(",", ".")
        text = seg["text"].strip()
        vtt_lines.append(f"{start} --> {end}\n{text}\n")

    vtt_path = output_dir / f"{stem}.vtt"
    vtt_path.write_text("\n".join(vtt_lines), encoding="utf-8")

    print(f"[3/3] 파일 저장 완료:")
    print(f"   텍스트: {txt_path}")
    print(f"   SRT:    {srt_path}")
    print(f"   VTT:    {vtt_path}")


def main():
    parser = argparse.ArgumentParser(description="유튜브 영상 자동 자막 추출")
    parser.add_argument("url", help="유튜브 영상 URL")
    parser.add_argument("--model", default="medium", help="Whisper 모델 (tiny/base/small/medium/large-v3)")
    parser.add_argument("--lang", default="ko", help="언어 코드 (ko, en, ja, ...)")
    parser.add_argument("--output-dir", default="./output", help="출력 디렉토리")
    args = parser.parse_args()

    output_dir = Path(args.output_dir)

    # Step 1: download the audio
    audio_path = download_audio(args.url, output_dir / "audio")

    # Step 2: transcribe with Whisper
    result = transcribe_audio(audio_path, args.model, args.lang)

    # Step 3: save the output files
    save_outputs(result, output_dir / "subtitles", audio_path.stem)

    print("\n모든 작업이 완료되었습니다.")


if __name__ == "__main__":
    main()

Running it is simple:

# Basic run (Korean, medium model)
python youtube_subtitle.py "https://www.youtube.com/watch?v=VIDEO_ID"

# English video, large-v3 model
python youtube_subtitle.py "https://www.youtube.com/watch?v=VIDEO_ID" \
  --model large-v3 --lang en --output-dir ./english_subs

# Sample output:
# [1/3] 오디오 다운로드 중: https://www.youtube.com/watch?v=VIDEO_ID
#    다운로드 완료: output/audio/VIDEO_ID.mp3 (8.3MB)
# [2/3] Whisper 변환 중 (모델: medium, 언어: ko)
#    모델 로딩 중...
#    변환 시작...
#    변환 완료: 142개 세그먼트, 총 623초
# [3/3] 파일 저장 완료:
#    텍스트: output/subtitles/VIDEO_ID.txt
#    SRT:    output/subtitles/VIDEO_ID.srt
#    VTT:    output/subtitles/VIDEO_ID.vtt
#
# 모든 작업이 완료되었습니다.

Practical Tips: GPU Acceleration, Batch Processing, Apple Silicon

NVIDIA GPU Acceleration (CUDA)

With a GPU, processing gets dramatically faster. With the medium model the difference is roughly 5-10x compared with CPU.

# Install the CUDA build of PyTorch (replaces the existing CPU build)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Check whether CUDA is available
python3 -c "import torch; print(f'CUDA: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
# CUDA: True, Device: NVIDIA GeForce RTX 4090

# Select the GPU explicitly in Python
import whisper
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"사용 디바이스: {device}")

model = whisper.load_model("large-v3", device=device)
result = model.transcribe("audio.mp3", language="ko", fp16=(device == "cuda"))

# fp16=True only takes effect on a GPU, where it improves speed and memory use

Apple Silicon (M1/M2/M3) Optimization

Mac users may want to consider MLX Whisper. It runs on Apple's MLX framework and uses the Metal GPU.

# Install MLX Whisper
pip install mlx-whisper

# Usage (the API is almost identical)
python3 -c "
import mlx_whisper

result = mlx_whisper.transcribe(
    'audio.mp3',
    path_or_hf_repo='mlx-community/whisper-medium-mlx',
    language='ko',
)
print(result['text'][:200])
"

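On an M2 Pro, the medium model ran about 3x faster than on CPU. I have seen a benchmark suggesting that large-v3 approaches real-time speed on an M3 Max, but I have not verified this myself.

Batch Processing Multiple Videos

#!/usr/bin/env python3
"""batch_subtitle.py: process several YouTube URLs in one run"""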

import whisper
from pathlib import Path
from youtube_subtitle import download_audio, save_outputs

URLS = [
    "https://www.youtube.com/watch?v=VIDEO_1",
    "https://www.youtube.com/watch?v=VIDEO_2",
    "https://www.youtube.com/watch?v=VIDEO_3",
]

def batch_process(urls: list, model_name: str = "medium", language: str = "ko"):
    output_dir = Path("./batch_output")

    # Load the model only once (this is the key point)
    print(f"모델 로딩: {model_name}")
    model = whisper.load_model(model_name)

    succeeded = 0
    for i, url in enumerate(urls, 1):
        print(f"\n{'='*50}")
        print(f"[{i}/{len(urls)}] 처리 중: {url}")
        print(f"{'='*50}")

        try:
            audio_path = download_audio(url, output_dir / "audio")
            result = model.transcribe(
                str(audio_path),
                language=language,
                temperature=0,
                verbose=False,
            )
            save_outputs(result, output_dir / "subtitles", audio_path.stem)
            print(f"완료: {len(result['segments'])}개 세그먼트")
            succeeded += 1
        except (Exception, SystemExit) as e:
            # download_audio() calls sys.exit() on failure, which raises
            # SystemExit rather than Exception, so catch both to keep going
            print(f"실패: {e}")
            continue

    print(f"\n전체 배치 처리 완료: {succeeded}/{len(urls)}개 영상 성공")

if __name__ == "__main__":
    batch_process(URLS)

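The most important point in batch processing is loading the model once, outside the loop. The large-v3 model is about 3 GB, so reloading it for every video wastes around 30 seconds each time.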
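
Another saving for repeated batch runs is to skip videos that already have subtitles. The sketch below assumes the naming used by the scripts in this article, where output files are named after the YouTube video ID (for example VIDEO_1.srt). The helpers video_id_from_url and pending_urls are hypothetical, and the URL parsing only covers the common watch?v= and youtu.be forms.

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url):
    """Extract the video ID from a watch?v= or youtu.be URL, or return None."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/") or None
    ids = parse_qs(parsed.query).get("v")
    return ids[0] if ids else None


def pending_urls(urls, subtitles_dir):
    """Return the URLs that do not yet have an .srt file in subtitles_dir."""
    subtitles_dir = Path(subtitles_dir)
    todo = []
    for url in urls:
        video_id = video_id_from_url(url)
        # Keep URLs we cannot parse, so they are still attempted
        if video_id is None or not (subtitles_dir / f"{video_id}.srt").exists():
            todo.append(url)
    return todo
```

With this in place, the batch entry point could call batch_process(pending_urls(URLS, "./batch_output/subtitles")) so that a rerun after a failure only processes what is missing.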

Common Errors and Fixes

ffmpeg Not Found Error

# Error message
# RuntimeError: ffmpeg was not found but is required to load audio files

# Fix: check whether ffmpeg is installed
which ffmpeg
# No output means it is not installed

# macOS
brew install ffmpeg

# Ubuntu
sudo apt install ffmpeg

# If it still fails after installation, check PATH
echo $PATH
export PATH="/opt/homebrew/bin:$PATH"  # Apple Silicon Mac

CUDA out of memory

# Error message
# torch.cuda.OutOfMemoryError: CUDA out of memory.
# Tried to allocate 256.00 MiB (GPU 0; 8.00 GiB total capacity)

# Fix 1: use a smaller model
whisper audio.mp3 --model small --language ko

# Fix 2: fall back to CPU
whisper audio.mp3 --model medium --language ko --device cpu

# Fix 3: keep fp16 enabled (the default on GPU). Setting fp16=False switches
# to FP32, which does not reduce VRAM use and usually increases it.
python3 -c "
import whisper
model = whisper.load_model('medium', device='cuda')
result = model.transcribe('audio.mp3', language='ko', fp16=True)
"

yt-dlp HTTP 403 Error

# Error message
# ERROR: unable to download video data: HTTP Error 403: Forbidden

# Fix: update yt-dlp (YouTube changes its API frequently)
pip install -U yt-dlp

# If it still fails, use browser cookies
yt-dlp --cookies-from-browser chrome -x --audio-format mp3 "URL"

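When Korean Audio Is Recognized as English

# When automatic language detection gets it wrong
# Detected language: English  ← misdetected

# Fix: specify the language explicitly
whisper audio.mp3 --model medium --language ko

# In Python
result = model.transcribe("audio.mp3", language="ko")
# Passing the language parameter skips the detection step

This matters most for videos that mix Korean and English, such as code reviews. With --language ko, Whisper transcribes in Korean and keeps the English words mixed in. Without it, Whisper can fall into English mode and garble the Korean parts.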

Whisper Hallucination (Repeated Text)

In long quiet stretches or sections with only background music, Whisper sometimes generates the same sentence over and over.

# Hallucination filtering
def filter_hallucination(segments: list, max_repeat: int = 3) -> list:
    """Keep at most max_repeat identical segments in a row and drop the rest."""
    filtered = []
    repeat_count = 0
    prev_text = ""

    for seg in segments:
        text = seg["text"].strip()
        if text == prev_text:
            repeat_count += 1
            if repeat_count >= max_repeat:
                continue  # skip the repeated segment
        else:
            repeat_count = 0
        prev_text = text
        filtered.append(seg)

    removed = len(segments) - len(filtered)
    if removed > 0:
        print(f"할루시네이션 필터링: {removed}개 반복 세그먼트 제거")
    return filtered

# Apply it
result = model.transcribe("audio.mp3", language="ko")
clean_segments = filter_hallucination(result["segments"])

Setting condition_on_previous_text=False reduces this problem, but it breaks the link to earlier context, so the overall transcript may read less naturally. You have to weigh the trade-off case by case.
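
A complementary approach is to use the per-segment statistics that Whisper already returns. Each segment dict includes no_speech_prob and avg_logprob, so segments that look like silence and were decoded with low confidence can be dropped after the fact. The following is a sketch under assumptions: drop_probable_silence is a hypothetical helper, and the thresholds (0.6 and -1.0) are starting points to tune on your own audio, not recommended values.

```python
def drop_probable_silence(segments, no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Drop segments that look like silence AND were decoded with low confidence."""
    kept = []
    for seg in segments:
        looks_silent = seg.get("no_speech_prob", 0.0) > no_speech_threshold
        low_confidence = seg.get("avg_logprob", 0.0) < logprob_threshold
        # Require both signals so confident speech over noise is not removed
        if looks_silent and low_confidence:
            continue
        kept.append(seg)
    return kept


# Hand-written example segments, only to show the shape of the input
example = [
    {"text": "안녕하세요", "no_speech_prob": 0.02, "avg_logprob": -0.25},
    {"text": "감사합니다", "no_speech_prob": 0.91, "avg_logprob": -1.40},
    {"text": "다음 내용입니다", "no_speech_prob": 0.75, "avg_logprob": -0.30},
]
print([seg["text"] for seg in drop_probable_silence(example)])
```

This can be chained with the repeat filter above, for example by filtering probable silence first and then removing consecutive duplicates.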

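The combination of Whisper and yt-dlp is a very practical way to automate subtitle extraction. The code in this article, from installation through batch processing, is written so you can copy and use it as is, so I suggest testing with one short video first. Comparing medium and large-v3 on Korean audio yourself will give you a feel for which model to use by default. For details, see the OpenAI Whisper GitHub repository.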