
πŸš€ Getting Started

πŸ“‹ API Configuration

This project requires Large Language Models and TTS. For best quality, use claude-3-5-sonnet-20240620 with Azure TTS. We recommend 302AI, which offers both LLM and TTS services with a single API key. You can also choose a fully local experience by using Ollama for the LLM and Edge TTS for dubbing, with no API key required (in this case, set max_workers to 1 and lower summary_length to 2000 in config.yaml).
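For reference, the fully local setup maps to these config.yaml entries (the key names come from the note above; everything else in config.yaml stays unchanged):

    # config.yaml — fully local setup (Ollama + Edge TTS)
    max_workers: 1        # keep a single worker when dubbing with Edge TTS
    summary_length: 2000  # shorter summaries keep prompts within local model limits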

1. Get an API_KEY for Large Language Models:

| Recommended Model | Recommended Provider | base_url | Price | Effect |
| --- | --- | --- | --- | --- |
| gemini-2.0-flash-exp | 302AI | https://api.302.ai | $0.3 / 1M tokens | πŸ₯³ |
| claude-3-5-sonnet-20240620 | 302AI | https://api.302.ai | $15 / 1M tokens | 🀩 |
| deepseek-coder | 302AI | https://api.302.ai | Β₯2 / 1M tokens | πŸ˜ƒ |
| qwen2.5-coder:32b | Ollama | http://localhost:11434 | Local | πŸ˜ƒ |

Note: Any provider with an OpenAI-compatible interface is supported, so you can experiment with different models. However, the pipeline involves multi-step reasoning chains and complex JSON formats, so models smaller than 30B are not recommended.
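These settings pair up in config.yaml roughly as follows (the exact key layout below is an assumption; confirm the names against your own config.yaml):

    # Assumed config.yaml layout — verify key names in your own file
    api:
      key: 'sk-xxx'                         # your provider's API key
      base_url: 'https://api.302.ai'        # from the table above
      model: 'claude-3-5-sonnet-20240620'   # recommended model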

2. TTS API

VideoLingo provides multiple TTS integration methods. Here's a comparison (skip this if you only need translation without dubbing):

| TTS Solution | Provider | Pros | Cons | Chinese Effect | Non-Chinese Effect |
| --- | --- | --- | --- | --- | --- |
| πŸ”Š Azure TTS ⭐ | 302AI | Natural effect | Limited emotions | 🀩 | πŸ˜ƒ |
| πŸŽ™οΈ OpenAI TTS | 302AI | Realistic emotions | Chinese sounds foreign | πŸ˜• | 🀩 |
| 🎀 Fish TTS | 302AI | Authentic native | Limited official models | 🀩 | πŸ˜‚ |
| πŸŽ™οΈ SiliconFlow FishTTS | SiliconFlow | Voice clone | Unstable cloning effect | πŸ˜ƒ | πŸ˜ƒ |
| πŸ—£ Edge TTS | Local | Completely free | Average effect | 😐 | 😐 |
| πŸ—£οΈ GPT-SoVITS | Local | Best voice cloning | Only supports Chinese/English; requires local inference; complex setup | πŸ† | 🚫 |

Want to use your own TTS API? Edit core/all_tts_functions/custom_tts.py!
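A minimal sketch of what such a hook might look like; the actual function name and signature in custom_tts.py may differ, so treat this as illustrative only:

    # Illustrative only — check core/all_tts_functions/custom_tts.py for the real signature
    import requests

    def custom_tts(text: str, save_path: str) -> None:
        """Call your own TTS endpoint and write the returned audio to save_path."""
        # Hypothetical endpoint and payload — replace with your provider's API
        resp = requests.post(
            "https://your-tts-provider.example/v1/speech",
            json={"text": text, "voice": "your-voice-id"},
            timeout=60,
        )
        resp.raise_for_status()
        with open(save_path, "wb") as f:
            f.write(resp.content)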

SiliconFlow FishTTS Tutorial

Currently supports 3 modes (see the config sketch after this list):

  1. preset: Uses a fixed voice, which you can preview on the Official Playground; the default is anna.
  2. clone(stable): Corresponds to the FishTTS API's custom mode. It clones the voice from uploaded audio, automatically sampling the first 10 seconds of the video, for better voice consistency.
  3. clone(dynamic): Corresponds to the FishTTS API's dynamic mode. Each sentence serves as its own reference audio during TTS; the voice may be inconsistent, but the effect is better.
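A sketch of how the mode might be selected in config.yaml, assuming the block is named sf_fish_tts (the key names and value strings below are assumptions; verify them in your file):

    # Assumed key names — verify against config.yaml
    sf_fish_tts:
      mode: 'preset'   # one of the three modes above; exact string values may differ
      voice: 'anna'    # used in preset mode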
How to choose OpenAI voices?

The voice list can be found on the official website, e.g. alloy, echo, nova. Modify openai_tts.voice in config.yaml.
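For example (openai_tts.voice is the path named above; the nesting shown is how it would look in YAML):

    openai_tts:
      voice: 'alloy'   # any voice from the official list, e.g. alloy, echo, nova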

How to choose Azure voices?

We recommend trying voices in the online demo. You can find the voice code in the code panel on the right, e.g. zh-CN-XiaoxiaoMultilingualNeural.
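A sketch, assuming the Azure block in config.yaml exposes a voice field (confirm the key name in your file):

    azure_tts:
      voice: 'zh-CN-XiaoxiaoMultilingualNeural'   # code copied from the online demo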

How to choose Fish TTS voices?

Go to the official website to listen to and choose voices. The voice code can be found in the URL, e.g. Dingzhen is 54a5170264694bfc8e9ad98df7bd89c3. Popular voices are already included in config.yaml. To use other voices, modify the fish_tts.character_id_dict dictionary in config.yaml.
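For instance, adding Dingzhen with the code above (fish_tts.character_id_dict is named in the docs; the character selector key shown is an assumption):

    fish_tts:
      character: 'Dingzhen'   # assumed selector key — check config.yaml
      character_id_dict:
        'Dingzhen': '54a5170264694bfc8e9ad98df7bd89c3'   # code taken from the voice page URL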

GPT-SoVITS-v2 Tutorial
  1. Check the requirements and download the package from the official Yuque docs.

  2. Place GPT-SoVITS-v2-xxx and VideoLingo in the same directory. Note they should be parallel folders.

  3. Choose one of the following ways to configure the model:

    a. Self-trained model:

    • After training, tts_infer.yaml under GPT-SoVITS-v2-xxx\GPT_SoVITS\configs will have your model path auto-filled. Copy it and rename it to your_preferred_english_character_name.yaml.
    • In the same directory as the yaml file, place reference audio named your_preferred_english_character_name_reference_audio_text.wav or .mp3, e.g. Huanyuv2_Hello, this is a test audio.wav
    • In VideoLingo's sidebar, set GPT-SoVITS Character to your_preferred_english_character_name.

    b. Use pre-trained model:

    • Download my model from here, extract it, and overwrite the contents into GPT-SoVITS-v2-xxx.
    • Set GPT-SoVITS Character to Huanyuv2.

    c. Use other trained models:

    • Place xxx.ckpt in GPT_weights_v2 folder and xxx.pth in SoVITS_weights_v2 folder.

    • Following method a, rename tts_infer.yaml and modify t2s_weights_path and vits_weights_path under custom to point to your models, e.g.:

      # Example using the pre-trained model from method b:
      t2s_weights_path: GPT_weights_v2/Huanyu_v2-e10.ckpt
      version: v2
      vits_weights_path: SoVITS_weights_v2/Huanyu_v2_e10_s150.pth
    • Following method a, place reference audio in the same directory as the yaml file, named your_preferred_english_character_name_reference_audio_text.wav or .mp3, e.g. Huanyuv2_Hello, this is a test audio.wav. The program will auto-detect and use it.

    • ⚠️ Warning: Please use English for character_name to avoid errors. reference_audio_text can be in Chinese. This feature is currently in beta and may produce errors.

    # Expected directory structure:
    .
    β”œβ”€β”€ VideoLingo
    β”‚   └── ...
    └── GPT-SoVITS-v2-xxx
        β”œβ”€β”€ GPT_SoVITS
        β”‚   └── configs
        β”‚       β”œβ”€β”€ tts_infer.yaml
        β”‚       β”œβ”€β”€ your_preferred_english_character_name.yaml
        β”‚       └── your_preferred_english_character_name_reference_audio_text.wav
        β”œβ”€β”€ GPT_weights_v2
        β”‚   └── [your GPT model file]
        └── SoVITS_weights_v2
            └── [your SoVITS model file]

After configuration, select Reference Audio Mode in the sidebar (see the Yuque docs for details). During dubbing, VideoLingo automatically opens the GPT-SoVITS inference API port in the command line; you can close it manually after completion. Note that stability depends on the chosen base model.

πŸ› οΈ Quick Start

VideoLingo supports Windows, macOS and Linux systems, and can run on CPU or GPU.

Note: To use NVIDIA GPU acceleration on Windows, please complete the following steps first:

  1. Install CUDA Toolkit 12.6
  2. Install CUDNN 9.3.0
  3. Add C:\Program Files\NVIDIA\CUDNN\v9.3\bin\12.6 to your system PATH
  4. Restart your computer

Note: FFmpeg is required. Please install it via your system's package manager, for example:
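    # Windows (Chocolatey)
    choco install ffmpeg
    # macOS (Homebrew)
    brew install ffmpeg
    # Linux (Debian/Ubuntu)
    sudo apt install ffmpeg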

Before installing VideoLingo, ensure you have installed Git and Anaconda.

  1. Clone the project:

    git clone https://github.com/Huanshere/VideoLingo.git
    cd VideoLingo
  2. Create and activate virtual environment (must be python=3.10.0):

    conda create -n videolingo python=3.10.0 -y
    conda activate videolingo
  3. Run installation script:

    python install.py
  4. πŸŽ‰ Launch Streamlit app:

    streamlit run st.py
  5. Set your API key in the sidebar of the web page that opens, then start using the app.


  6. (Optional) More settings can be adjusted manually in config.yaml; watch the command-line output while the app runs. To use custom terms, add them to custom_terms.xlsx before processing, e.g. Baguette | French bread | Not just any bread!.

🏭 Batch Mode (beta)

Documentation: English | Chinese

🚨 Common Errors

  1. 'All arrays must be of the same length' or 'Key Error' during translation:

    • Reason 1: Weaker models comply poorly with the required JSON format, causing response parsing errors.
    • Reason 2: The LLM may refuse to translate sensitive content. Solution: check the response and msg fields in output/gpt_log/error.json, then delete the output/gpt_log folder and retry.
  2. 'Retry Failed', 'SSL', 'Connection', 'Timeout': usually network issues. Solution: users in mainland China should switch network nodes and retry.

  3. local_files_only=True: the model failed to download due to network issues; verify that your network can reach huggingface.co.

