
πŸš€ Getting Started

πŸ“‹ API Configuration

VideoLingo requires an LLM and, optionally, a TTS service. For the best quality, use claude-sonnet-4-6 or gpt-5.2 with Azure TTS. Alternatively, for a fully local setup with no API key needed, use Ollama for the LLM and Edge TTS for dubbing; in that case, set max_workers to 1 and summary_length to a low value such as 2000 in config.yaml.
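The local-setup overrides above might look like this in config.yaml (max_workers and summary_length are the key names given above; treat the exact nesting as an assumption and check your own config.yaml):

```yaml
# Fully local setup: Ollama LLM + Edge TTS.
# Key names are from the text above; exact nesting may differ in your config.yaml.
max_workers: 1        # local LLMs handle one request at a time more reliably
summary_length: 2000  # keep the summary context small for local models
```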

1. Get API_KEY for LLM:

| Recommended Model | Vendor | Quality | Cost-efficiency |
|---|---|---|---|
| claude-sonnet-4-6 | Anthropic | 🀩 | ⭐⭐⭐ |
| claude-opus-4-6 | Anthropic | πŸ† | ⭐⭐ |
| gpt-5.2 | OpenAI | 🀩 | ⭐⭐⭐ |
| gemini-3-flash | Google | πŸ˜ƒ | ⭐⭐⭐⭐⭐ |
| gemini-3.1-pro | Google | 🀩 | ⭐⭐⭐ |
| minimax-m2.5 | MiniMax | πŸ˜ƒ | ⭐⭐⭐⭐⭐ |
| kimi-k2.5 | Moonshot AI | πŸ˜ƒ | ⭐⭐⭐⭐ |
| deepseek-v3 | DeepSeek | πŸ₯³ | ⭐⭐⭐⭐ |
| qwen3-32b | Ollama (self-hosted) | πŸ˜ƒ | ♾️ Free |

Tip: Model pricing changes frequently. Check each vendor's website for current rates. models.dev offers cross-vendor price and capability comparison.

API proxy: If you cannot access overseas APIs directly, OpenRouter is recommended (it supports all the models above through a unified OpenAI-format API, pay-per-use with no monthly fee).

Note: Any OpenAI-format endpoint is supported, so you can try other models at your own risk. However, the pipeline involves multi-step reasoning chains and strict JSON output formats, so models smaller than ~30B are not recommended.
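An OpenAI-format endpoint is typically configured with a key, base URL, and model name. The field names below are illustrative, not confirmed; check the comments in your config.yaml for the real ones:

```yaml
# Illustrative only; field names may differ in your config.yaml
api:
  key: 'sk-xxx'                              # your vendor or OpenRouter key
  base_url: 'https://openrouter.ai/api/v1'   # any OpenAI-format endpoint
  model: 'claude-sonnet-4-6'                 # a model from the table above
```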

2. TTS API

VideoLingo integrates several TTS backends. Here's a comparison (skip this if you only need translation without dubbing):

| TTS Solution | Provider | Pros | Cons | Chinese Effect | Non-Chinese Effect |
|---|---|---|---|---|---|
| πŸ”Š Azure TTS ⭐ | 302AI | Natural effect | Limited emotions | 🀩 | πŸ˜ƒ |
| πŸŽ™οΈ OpenAI TTS | 302AI | Realistic emotions | Chinese sounds foreign | πŸ˜• | 🀩 |
| 🎀 Fish TTS | 302AI | Authentic native voices | Limited official models | 🀩 | πŸ˜‚ |
| πŸŽ™οΈ SiliconFlow FishTTS | SiliconFlow | Voice cloning | Unstable cloning effect | πŸ˜ƒ | πŸ˜ƒ |
| πŸ—£ Edge TTS | Local | Completely free | Average effect | 😐 | 😐 |
| πŸ—£οΈ GPT-SoVITS | Local | Best voice cloning | Chinese/English only; requires local inference; complex setup | πŸ† | 🚫 |

Want to use your own TTS? Modify core/all_tts_functions/custom_tts.py!

SiliconFlow FishTTS Tutorial

Currently supports 3 modes:

  1. preset: uses a fixed voice; you can preview voices on the Official Playground. The default is anna.
  2. clone (stable): corresponds to the fishtts API's custom mode; clones the voice from uploaded audio. VideoLingo automatically samples the first 10 seconds of the video as the reference, which gives better voice consistency.
  3. clone (dynamic): corresponds to the fishtts API's dynamic mode; uses each sentence as its own reference audio during TTS. The voice may be less consistent, but the per-sentence effect is better.

How to choose OpenAI voices?

The voice list is on the official website: alloy, echo, nova, etc. Modify openai_tts.voice in config.yaml.
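For example (the openai_tts.voice key is named above; the surrounding structure is assumed):

```yaml
# config.yaml: switch the OpenAI TTS voice
openai_tts:
  voice: 'nova'   # or alloy, echo, ... from the official voice list
```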

How to choose Azure voices?

Try the voices in the online demo. The voice code appears in the sample code on the right, e.g. zh-CN-XiaoxiaoMultilingualNeural.

How to choose Fish TTS voices?

Go to the official website to listen to and choose voices. The voice code is in the URL, e.g. Dingzhen is 54a5170264694bfc8e9ad98df7bd89c3. Popular voices are already included in config.yaml; to use other voices, modify the fish_tts.character_id_dict dictionary there.
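For example, adding the Dingzhen voice mentioned above (the character key name is an assumption; follow the existing entries in your config.yaml):

```yaml
fish_tts:
  character: 'Dingzhen'
  character_id_dict:
    'Dingzhen': '54a5170264694bfc8e9ad98df7bd89c3'  # code taken from the voice page URL
```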

GPT-SoVITS-v2 Tutorial
  1. Check the requirements and download the package from the official Yuque docs.

  2. Place GPT-SoVITS-v2-xxx and VideoLingo in the same directory. Note they should be parallel folders.

  3. Choose one of the following ways to configure the model:

    a. Self-trained model:

    • After training, tts_infer.yaml under GPT-SoVITS-v2-xxx\GPT_SoVITS\configs will have your model paths auto-filled. Copy it and rename it to your_preferred_english_character_name.yaml.
    • In the same directory as that yaml file, place a reference audio file named your_preferred_english_character_name_reference_audio_text.wav or .mp3, e.g. Huanyuv2_Hello, this is a test audio.wav.
    • In VideoLingo's sidebar, set GPT-SoVITS Character to your_preferred_english_character_name.

    b. Use pre-trained model:

    • Download my model from here, extract it, and overwrite the files into GPT-SoVITS-v2-xxx.
    • Set GPT-SoVITS Character to Huanyuv2.

    c. Use other trained models:

    • Place xxx.ckpt in GPT_weights_v2 folder and xxx.pth in SoVITS_weights_v2 folder.

    • Following method a, rename tts_infer.yaml and modify t2s_weights_path and vits_weights_path under custom to point to your models, e.g.:

      # Example config (the paths shown are the method-b model files):
      t2s_weights_path: GPT_weights_v2/Huanyu_v2-e10.ckpt
      version: v2
      vits_weights_path: SoVITS_weights_v2/Huanyu_v2_e10_s150.pth
    • Following method a, place reference audio in the same directory as the yaml file, named your_preferred_english_character_name_reference_audio_text.wav or .mp3, e.g. Huanyuv2_Hello, this is a test audio.wav. The program will auto-detect and use it.

    • ⚠️ Warning: Please use English for character_name to avoid errors. reference_audio_text can be in Chinese. Currently in beta, may produce errors.

    # Expected directory structure:
    .
    β”œβ”€β”€ VideoLingo
    β”‚   └── ...
    └── GPT-SoVITS-v2-xxx
        β”œβ”€β”€ GPT_SoVITS
        β”‚   └── configs
        β”‚       β”œβ”€β”€ tts_infer.yaml
        β”‚       β”œβ”€β”€ your_preferred_english_character_name.yaml
        β”‚       └── your_preferred_english_character_name_reference_audio_text.wav
        β”œβ”€β”€ GPT_weights_v2
        β”‚   └── [your GPT model file]
        └── SoVITS_weights_v2
            └── [your SoVITS model file]

After configuration, select Reference Audio Mode in the sidebar (see the Yuque docs for details). During dubbing, VideoLingo automatically starts the GPT-SoVITS inference API in a command-line window; you can close it manually after dubbing completes. Note that stability depends on the chosen base model.

πŸ› οΈ Quick Start

VideoLingo supports Windows, macOS and Linux systems, and can run on CPU or GPU.

Note: To use NVIDIA GPU acceleration on Windows, please complete the following steps first:

  1. Install CUDA Toolkit 12.6 or newer (12.8 / 12.9 / 13.x all work; the install script auto-adapts)
  2. Install cuDNN 9.3.0
  3. Add C:\Program Files\NVIDIA\CUDNN\v9.3\bin\12.6 to your system PATH
  4. Restart your computer

⚠️ Pitfall: The install script uses nvidia-smi to detect your driver's CUDA version and auto-selects the best PyTorch wheel (cu129 / cu128 / cu126). For RTX 50 series (Blackwell) GPUs, cu129 wheels with sm_120 kernels are selected automatically. Do NOT manually install cu130/cu131 PyTorch; doing so causes ctranslate2 to fail with cublas64_12.dll not found.
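The detection-and-selection policy described above can be sketched as follows (this illustrates the policy, not install.py's actual code):

```python
import re
import subprocess

def pick_torch_wheel(cuda_version: str) -> str:
    """Map a driver CUDA version (as reported by nvidia-smi) to a PyTorch
    wheel tag, capped at cu129 because ctranslate2 only ships CUDA 12
    binaries. Sketch of the policy described in the docs above."""
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    if major >= 13 or (major == 12 and minor >= 9):
        return "cu129"   # also covers RTX 50 series (Blackwell)
    if major == 12 and minor >= 8:
        return "cu128"
    if major == 12:
        return "cu126"
    return "cpu"         # driver too old for CUDA 12 wheels

def detect_cuda_version() -> "str | None":
    """Parse 'CUDA Version: 12.8' from nvidia-smi output, if available."""
    try:
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        return None
    m = re.search(r"CUDA Version:\s*(\d+\.\d+)", out)
    return m.group(1) if m else None
```

If detection fails (no NVIDIA driver), the sketch falls back to a CPU build, which matches the script's cross-platform behavior described above.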

Note: FFmpeg is required. Please install it via your system's package manager:

⚠️ Pitfall: Do NOT use conda-forge ffmpeg (it lacks the libmp3lame encoder). Use the system package manager to install a full build.
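To verify that your ffmpeg build includes the libmp3lame encoder, a quick check along these lines works (the helper names here are mine, not part of VideoLingo):

```python
import shutil
import subprocess

def has_encoder(encoders_output: str, name: str = "libmp3lame") -> bool:
    """Scan an `ffmpeg -encoders` listing for a given encoder name."""
    return any(name in line.split() for line in encoders_output.splitlines())

def check_ffmpeg() -> str:
    """Return a short diagnosis of the local ffmpeg build."""
    if shutil.which("ffmpeg") is None:
        return "ffmpeg not found on PATH"
    out = subprocess.run(["ffmpeg", "-encoders"],
                         capture_output=True, text=True).stdout
    if has_encoder(out):
        return "ok"
    return "ffmpeg present but lacks libmp3lame (conda-forge build?)"
```

Equivalently, `ffmpeg -encoders | grep libmp3lame` from a shell: no output means the incomplete build.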

Option A: Using uv (Recommended)

uv is a fast Python package manager that automatically downloads the correct Python version and creates an isolated environment, so there is no need to install Python or Anaconda yourself (~30 MB vs ~4 GB for Anaconda, and 10-100x faster package installs).

  1. Clone the project:

    git clone https://github.com/Huanshere/VideoLingo.git
    cd VideoLingo
  2. One-command setup (installs uv + Python 3.10 + all dependencies):

    python setup_env.py

    ⚠️ Install order matters: install.py (called automatically by setup_env.py) installs dependencies in the correct order: PyTorch first (locks CUDA version), then demucs with --no-deps (prevents torchaudio downgrade), then the rest. Do not rearrange manually.

  3. πŸŽ‰ Launch Streamlit app:

    .venv\Scripts\streamlit run st.py        # Windows
    .venv/bin/streamlit run st.py            # macOS / Linux

    Or double-click OneKeyStart_uv.bat on Windows.

  4. Set your API key in the sidebar of the web page that opens, and you're ready to go.
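The install-order rule from step 2 can be sketched as an ordered plan of pip commands (illustrative; install.py's real commands, index URLs, and requirement files may differ):

```python
import sys

def build_install_plan(wheel_tag: str = "cu126") -> "list[list[str]]":
    """Sketch of the ordered installs described above: torch first to lock
    the CUDA build, then demucs with --no-deps so its torchaudio<2.2 pin
    cannot downgrade anything, then everything else."""
    pip = [sys.executable, "-m", "pip", "install"]
    index = f"https://download.pytorch.org/whl/{wheel_tag}"
    return [
        pip + ["torch", "torchaudio", "--index-url", index],  # 1. lock CUDA build
        pip + ["demucs", "--no-deps"],                        # 2. no torchaudio downgrade
        pip + ["-r", "requirements.txt"],                     # 3. the rest
    ]
```

Running the three commands in any other order reintroduces exactly the downgrade problem the warning describes.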

Option B: Using Conda

⚠️ Not recommended. This method will not be maintained going forward. Please use uv (Option A) above.


Before installing VideoLingo, ensure you have installed Git and Anaconda.

  1. Clone the project:

    git clone https://github.com/Huanshere/VideoLingo.git
    cd VideoLingo
  2. Create and activate virtual environment (must be python=3.10.0):

    conda create -n videolingo python=3.10.0 -y
    conda activate videolingo

    ⚠️ Pitfall: Make sure pip is using the conda env's site-packages. On Windows, if the site-packages directory is not writable (e.g. under C:\ProgramData\anaconda3\), pip silently installs to the user directory instead. If this happens, run the terminal as administrator.

  3. Run installation script:

    python install.py

    ⚠️ Install order matters: install.py installs dependencies in the correct order: PyTorch first (locks CUDA version), then demucs with --no-deps (prevents torchaudio downgrade), then the rest. Do not rearrange manually.

  4. πŸŽ‰ Launch Streamlit app by running the command or double-clicking OneKeyStart.bat:

    streamlit run st.py
  5. Set your API key in the sidebar of the web page that opens, and you're ready to go.
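Regarding the pip pitfall in step 2: a quick way to confirm whether pip will target the env is to check that the default install location is under the interpreter's prefix and writable (a generic check, not part of VideoLingo):

```python
import os
import sys
import sysconfig

def pip_will_use_env() -> bool:
    """Heuristic for the pitfall above: pip silently falls back to the
    user site-packages when the env's site-packages is not writable.
    Check that the default install target (purelib) belongs to this
    interpreter's prefix and is writable."""
    purelib = sysconfig.get_paths()["purelib"]
    return purelib.startswith(sys.prefix) and os.access(purelib, os.W_OK)
```

Run it with the conda env's python; if it returns False, run the terminal as administrator (or pass --no-user to pip) so packages land in the env.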


  1. (Optional) More settings can be adjusted manually in config.yaml; watch the command-line output while the pipeline runs. To use custom terms, add them to custom_terms.xlsx before processing, e.g. Baguette | French bread | Not just any bread!.
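To illustrate how such terms could feed a translation step, here is a toy sketch. The function name and prompt wording are mine; VideoLingo's actual prompting differs:

```python
def build_glossary_prompt(terms: "list[tuple[str, str, str]]") -> str:
    """Render custom terms (source, translation, note) -- the three
    columns shown in the custom_terms.xlsx example above -- as a prompt
    block a translation step could prepend. Illustration only."""
    lines = ["Use these fixed translations:"]
    for src, dst, note in terms:
        lines.append(f"- {src} -> {dst} ({note})")
    return "\n".join(lines)
```

For example, the row above would yield the line `- Baguette -> French bread (Not just any bread!)`.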

Need help? Our AI Assistant is here to guide you through any issues!

🏭 Batch Mode (beta)

Document: English | Chinese

Note: This section is still in early development and may have limited functionality

🚨 Common Errors & Pitfalls

  1. 'All arrays must be of the same length' or 'KeyError' during translation:

    • Reason 1: Weaker models have poor JSON format compliance, causing response parsing errors.
    • Reason 2: The LLM may refuse to translate sensitive content. Solution: check the response and msg fields in output/gpt_log/error.json, then delete the output/gpt_log folder and retry.
  2. 'Retry Failed', 'SSL', 'Connection', 'Timeout': usually network issues. Solution: users in mainland China should switch network nodes and retry.

  3. local_files_only=True: the model download failed due to network issues; verify that your machine can reach huggingface.co.

  4. cublas64_12.dll not found: Installed CUDA 13.x and used cu130/cu131 PyTorch wheels. Solution: Must use cu129, cu128, or cu126 wheels (install.py handles this automatically via nvidia-smi detection) because ctranslate2 only supports CUDA 12. Re-run python install.py.

  5. Whisper model loading segfaults silently: ctranslate2 version mismatches cuDNN version. Solution: Ensure ctranslate2>=4.5.0 (supports cuDNN 9, which PyTorch 2.6+ ships with).

  6. RuntimeError: Weights only load failed: PyTorch β‰₯2.6 changed torch.load default behavior. Solution: Already fixed via monkey-patch in whisperX_local.py. If you see this, your code is not up to date.

  7. WhisperX transcription hangs in Streamlit (CPU/GPU idle): librosa.load() deadlocks in Streamlit's non-main thread. Solution: Already fixed by replacing with whisperx.audio.load_audio() (ffmpeg subprocess). If you see this, your code is not up to date.

  8. spacy Can't find model 'xx_core_web_md' (but pip says installed): pip installed the model to user directory instead of conda env. Solution: Run terminal as administrator, or manually install with conda env's python:

    python -m pip install xx-core-web-md --no-user --force-reinstall --no-deps
  9. torchaudio version drops to 1.x or 2.1.x after pip install: demucs's torchaudio<2.2 constraint causes the downgrade. Solution: never pip install demucs directly; always use --no-deps. install.py handles this correctly.


2026 Β© VideoLingo.