WhisperJAV
A subtitle generator for Japanese Adult Videos.
The idea:
Transformer-based ASR architectures like Whisper suffer significant performance degradation when applied to the spontaneous and noisy domain of JAV. This degradation is driven by specific acoustic and temporal characteristics that defy the statistical distributions of standard training data.
1. The Acoustic Profile
JAV audio is defined by "acoustic hell" and a low Signal-to-Noise Ratio (SNR), characterized by:
- Non-Verbal Vocalisations (NVVs): A high density of physiological sounds (heavy breathing, gasps, sighs) and "obscene sounds" that lack clear harmonic structure.
- Spectral Mimicry: These vocalizations often possess "curve-like spectrum features" that mimic the formants of fricative consonants or Japanese syllables (e.g., fu), acting as accidental adversarial examples that trick the model into recognizing words where none exist.
- Extreme Dynamics: Volatile shifts in audio intensity, ranging from faint whispers (sasayaki) to high-decibel screams, which confuse standard gain control and attention mechanisms.
- Linguistic Variance: The prevalence of theatrical onomatopoeia and Role Language (Yakuwarigo) containing exaggerated intonations and slang absent from standard corpora.
2. Temporal Drift and Hallucination
While standard ASR models are typically trained on short, curated clips, JAV content comprises long-form media often exceeding 120 minutes. Research indicates that processing such extended inputs causes contextual drift and error accumulation. Specifically, extended periods of "ambiguous audio" (silence or rhythmic breathing) cause the Transformer's attention mechanism to collapse, triggering repetitive hallucination loops where the model generates unrelated text to fill the acoustic void.
3. The Pre-processing Paradox & Fine-Tuning Risks
Standard audio engineering intuition—such as aggressive denoising or vocal separation—often fails in this domain. Because Whisper relies on specific log-Mel spectrogram features, generic normalization tools can inadvertently strip high-frequency transients essential for distinguishing consonants, resulting in "domain shift" and erroneous transcriptions. Consequently, audio processing requires a "surgical," multi-stage approach (like VAD clamping) rather than blanket filtering.
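To make the paradox concrete, the snippet below uses openai-whisper's own helpers to compare the model's log-Mel input before and after a hypothetical denoising pass (file names are placeholders); a sharp change in the upper Mel bands is a warning sign that the enhancer is stripping the consonant transients described above.
import numpy as np
import whisper

# Whisper "hears" a log-Mel spectrogram, not the raw waveform, so any filter is
# only as safe as its effect on these features.
raw = whisper.load_audio("scene_raw.wav")
cleaned = whisper.load_audio("scene_denoised.wav")

mel_raw = whisper.log_mel_spectrogram(whisper.pad_or_trim(raw))
mel_clean = whisper.log_mel_spectrogram(whisper.pad_or_trim(cleaned))

# Large differences in the top Mel bands suggest high-frequency detail was lost.
diff = (mel_raw - mel_clean).abs().numpy()
print("mean |delta|, top third of Mel bands:", float(np.mean(diff[54:])))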
Furthermore, while fine-tuning models on domain-specific data can be effective, it presents a high risk of overfitting. Due to the scarcity of high-quality, ethically sourced JAV datasets, fine-tuned models often become brittle, losing their generalization capabilities and leading to inconsistent "hit or miss" quality outputs.
WhisperJAV is an attempt to address the failure points above. The inference pipelines:
- Acoustic Filtering: Deploys scene-based segmentation and VAD clamping under the hypothesis that distinct scenes possess uniform acoustic characteristics, ensuring the model processes coherent audio environments rather than mixed streams [1-3].
- Linguistic Adaptation: Normalizes domain-specific terminology and preserves onomatopoeia, specifically correcting dialect-induced tokenization errors (e.g., in Kansai-ben) that standard BPE tokenizers fail to parse [4, 5].
- Defensive Decoding: Tunes log-probability thresholding and no_speech_threshold to systematically discard low-confidence outputs (hallucinations), while utilizing regex filters to clean non-lexical markers (e.g., (moans)) from the final subtitle track [6, 7].
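As a rough illustration of that defensive-decoding step, a post-filter over Whisper-style segments might look like the sketch below; the thresholds and regex are illustrative, not WhisperJAV's shipped values.
import re

LOGPROB_FLOOR = -1.0      # drop segments whose avg_logprob falls below this
NO_SPEECH_CEILING = 0.6   # drop segments Whisper flags as probable non-speech
NON_LEXICAL = re.compile(r"[（(][^）)]*[）)]")  # parenthesised markers like (moans)

def clean_segments(segments):
    # segments: dicts shaped like openai-whisper's result["segments"]
    kept = []
    for seg in segments:
        if seg["avg_logprob"] < LOGPROB_FLOOR:
            continue
        if seg["no_speech_prob"] > NO_SPEECH_CEILING:
            continue
        text = NON_LEXICAL.sub("", seg["text"]).strip()
        if text:
            kept.append({**seg, "text": text})
    return kept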
Quick Start
GUI (Recommended for most users)
whisperjav-gui
A window opens. Add your files, pick a mode, click Start.
Command Line
# Basic usage
whisperjav video.mp4
# Specify mode and sensitivity
whisperjav audio.mp3 --mode balanced --sensitivity aggressive
# Process a folder
whisperjav /path/to/media_folder --output-dir ./subtitles
Features
Processing Modes
| Mode | Backend | Scene Detection | VAD | Best For |
|---|---|---|---|---|
| faster | stable-ts (turbo) | No | No | Speed priority, clean audio |
| fast | stable-ts | Yes | No | General use, mixed quality |
| balanced | faster-whisper | Yes | Yes | Default. Noisy audio, dialogue-heavy |
| fidelity | OpenAI Whisper | Yes | Yes (Silero) | Maximum accuracy, slower |
| transformers | HuggingFace | Optional | Internal | Japanese-optimized model, customizable |
Sensitivity Settings
- Conservative: Higher thresholds, fewer hallucinations. Good for noisy content.
- Balanced: Default. Works for most content.
- Aggressive: Lower thresholds, catches more dialogue. Good for whisper/ASMR content.
Transformers Mode (New in v1.7)
Uses HuggingFace's kotoba-tech/kotoba-whisper-v2.2 model, which is optimized for Japanese conversational speech:
whisperjav video.mp4 --mode transformers
# Customize parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20
Transformers-specific options:
- --hf-model-id: Model (default: kotoba-tech/kotoba-whisper-v2.2)
- --hf-chunk-length: Seconds per chunk (default: 15)
- --hf-beam-size: Beam search width (default: 5)
- --hf-temperature: Sampling temperature (default: 0.0)
- --hf-scene: Scene detection method (none, auditok, silero, semantic)
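For orientation, those flags map roughly onto a plain HuggingFace ASR pipeline call like the sketch below; WhisperJAV wires this up internally, and loading kotoba-whisper-v2.2 itself may need extra options (device, dtype, trust_remote_code), so treat it as an approximation.
from transformers import pipeline

# Rough --hf-* flag equivalents on a stock HF ASR pipeline (illustrative only).
asr = pipeline(
    "automatic-speech-recognition",
    model="kotoba-tech/kotoba-whisper-v2.2",  # --hf-model-id
    chunk_length_s=15,                        # --hf-chunk-length
)
out = asr("scene.wav",                           # placeholder file name
          generate_kwargs={"num_beams": 5,       # --hf-beam-size
                           "temperature": 0.0})  # --hf-temperature
print(out["text"])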
Two-Pass Ensemble Mode (New in v1.7)
Runs your video through two different pipelines and merges results. Different models catch different things.
# Pass 1 with transformers, Pass 2 with balanced
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced
# Custom sensitivity per pass
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass1-sensitivity aggressive --pass2-pipeline fidelity
Merge strategies:
- smart_merge (default): Intelligent overlap detection
- pass1_primary / pass2_primary: Prioritize one pass, fill gaps from the other (sketched below)
- full_merge: Combine everything from both passes
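As a toy illustration of the gap-filling idea behind the pass1_primary / pass2_primary strategies (not the actual merge code), an overlap check over (start, end, text) cues could look like this:
# Keep every pass-1 cue; pull in a pass-2 cue only when it does not overlap
# anything already kept, i.e. it fills a silence pass 1 missed.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def merge_passes(pass1, pass2):
    merged = list(pass1)
    for cue in pass2:
        if not any(overlaps(cue, kept) for kept in merged):
            merged.append(cue)
    return sorted(merged, key=lambda cue: cue[0])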
Speech Enhancement Tools (New in v1.7.3)
Pre-processes audio scenes. When selected, enhancement runs per scene after scene detection. Note: use it only for surgical reasons; in general, any audio processing that alters the mel-spectrogram has the potential to introduce more artifacts and hallucinations.
# ClearVoice denoising (48kHz, best quality)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice
# ClearVoice with specific 16kHz model
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer clearvoice:FRCRN_SE_16K
# FFmpeg DSP filters (lightweight, always available)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer ffmpeg-dsp:loudnorm,denoise
# ZipEnhancer (lightweight SOTA)
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer zipenhancer
# BS-RoFormer vocal isolation
whisperjav video.mp4 --mode balanced --pass1-speech-enhancer bs-roformer
# Ensemble with different enhancers per pass
whisperjav video.mp4 --ensemble \
--pass1-pipeline balanced --pass1-speech-enhancer clearvoice \
--pass2-pipeline transformers --pass2-speech-enhancer none
Available backends:
| Backend | Description | Models/Options |
|---|---|---|
| none | No enhancement (default) | - |
| ffmpeg-dsp | FFmpeg audio filters | loudnorm, denoise, compress, highpass, lowpass, deess |
| clearvoice | ClearerVoice denoising | MossFormer2_SE_48K (default), FRCRN_SE_16K |
| zipenhancer | ZipEnhancer 16kHz | torch (GPU), onnx (CPU) |
| bs-roformer | Vocal isolation | vocals, other |
Syntax: --pass1-speech-enhancer <backend> or --pass1-speech-enhancer <backend>:<model>
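The backend:model spec is just a colon-separated pair; a minimal parser (a hypothetical helper, not the CLI's actual code) looks like:
# Split "clearvoice:FRCRN_SE_16K" into backend and optional model name.
def parse_enhancer_spec(spec: str):
    backend, _, model = spec.partition(":")
    return backend, (model or None)

print(parse_enhancer_spec("clearvoice:FRCRN_SE_16K"))  # ('clearvoice', 'FRCRN_SE_16K')
print(parse_enhancer_spec("zipenhancer"))              # ('zipenhancer', None)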
GUI Parameter Customization
The GUI has three tabs:
- Transcription Mode: Select pipeline, sensitivity, language
- Advanced Options: Model override, scene detection method, debug settings
- Two-Pass Ensemble: Configure both passes with full parameter customization via JSON editor
The Ensemble tab lets you customize beam size, temperature, VAD thresholds, and other ASR parameters without editing config files.
AI Translation
Generate subtitles and translate them in one step:
# Generate and translate
whisperjav video.mp4 --translate
# Or translate existing subtitles
whisperjav-translate -i subtitles.srt --provider deepseek
Supports DeepSeek (cheap), Gemini (free tier), Claude, GPT-4, and OpenRouter.
Resume Support: If translation is interrupted, just run the same command again. It automatically resumes from where it left off using the .subtrans project file.
What Makes It Work for JAV
Scene Detection
Splits audio at natural breaks instead of forcing fixed-length chunks. This prevents cutting off sentences mid-word.
Three methods are available:
- Auditok (default): Energy-based detection, fast and reliable
- Silero: Neural VAD-based detection, better for noisy audio
- Semantic (new in v1.7.4): Texture-based clustering using MFCC features, groups acoustically similar segments together
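To give a feel for what the Semantic method does, the sketch below groups segments by their MFCC "texture"; it is illustrative only (made-up cluster count, simplified features), not the shipped implementation.
import librosa
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Summarise each candidate segment by its mean MFCC vector, then cluster
# segments with similar acoustic texture so they get transcribed together.
def mfcc_signature(samples, sr):
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13)  # shape (13, frames)
    return mfcc.mean(axis=1)                                  # one 13-dim vector

def group_segments(segments, sr, n_groups=4):
    feats = np.stack([mfcc_signature(seg, sr) for seg in segments])
    labels = AgglomerativeClustering(n_clusters=n_groups).fit_predict(feats)
    return labels  # segments sharing a label are acoustically similar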
Voice Activity Detection (VAD)
Identifies when someone is actually speaking vs. background noise or music. Reduces false transcriptions during quiet moments.
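For reference, the Silero VAD that the fidelity pipeline uses (and that backs the silero scene-detection option) can be driven standalone as in the sketch below; WhisperJAV configures it internally, so the exact thresholds and call sites differ.
import torch

# Load Silero VAD from torch.hub and find speech regions in a 16 kHz mono file.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("scene.wav", sampling_rate=16000)   # placeholder file name
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech)  # e.g. [{'start': 12800, 'end': 48000}, ...] in samples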
Japanese Post-Processing
- Handles sentence-ending particles (ね, よ, わ, の)
- Preserves aizuchi (うん, はい, ええ)
- Recognizes dialect patterns (Kansai-ben, feminine/masculine speech)
- Filters out common Whisper hallucinations
Hallucination Removal
Whisper sometimes generates repeated text or phrases that weren't spoken. WhisperJAV detects and removes these patterns.
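A minimal sketch of one such check, collapsing a line that repeats across consecutive cues, is shown below; the real filters also use regex blacklists and other heuristics, so this is illustrative only.
# Allow a line to appear at most `max_repeats` times in a row; after that,
# treat further copies as a hallucination loop and drop them.
def drop_consecutive_repeats(cues, max_repeats=2):
    cleaned, streak = [], 0
    for cue in cues:               # cues are (start, end, text) tuples
        if cleaned and cue[2] == cleaned[-1][2]:
            streak += 1
            if streak >= max_repeats:
                continue
        else:
            streak = 0
        cleaned.append(cue)
    return cleaned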
Content-Specific Recommendations
| Content Type | Mode | Sensitivity | Notes |
|---|---|---|---|
| Drama / Dialogue Heavy | balanced | aggressive | Or try transformers mode |
| Group Scenes | faster | conservative | Speed matters, less precision needed |
| Amateur / Homemade | fast | conservative | Variable audio quality |
| ASMR / VR / Whisper | fidelity | aggressive | Maximum accuracy for quiet speech |
| Heavy Background Music | balanced | conservative | VAD helps filter music |
| Maximum Accuracy | ensemble | varies | Two-pass with different pipelines |
Installation
Windows Installer (Easiest)
Download and run: WhisperJAV-1.7.4-Windows-x86_64.exe
This installs everything you need including Python and dependencies.
Upgrading from Previous Installer Versions
If you installed v1.5.x or v1.6.x via the Windows installer:
- Download upgrade_whisperjav.bat
- Double-click to run
- Wait 1-2 minutes
This updates WhisperJAV without re-downloading PyTorch (~2.5GB) or your AI models (~3GB).
Install from Source
Requires Python 3.9-3.12, FFmpeg, and Git.
Recommended: Use the install scripts (they handle dependency conflicts automatically and auto-detect your GPU):
Windows
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
installer\install_windows.bat # Auto-detects GPU and CUDA version
installer\install_windows.bat --cpu-only # Force CPU only
installer\install_windows.bat --cuda118 # Force CUDA 11.8
installer\install_windows.bat --cuda124 # Force CUDA 12.4
installer\install_windows.bat --minimal # Minimal install (no speech enhancement)
installer\install_windows.bat --dev # Development/editable install
The script automatically:
- Detects your NVIDIA GPU and selects optimal CUDA version
- Falls back to CPU-only if no GPU found
- Checks for WebView2 runtime (required for GUI)
- Logs installation to install_log_windows.txt
- Retries failed downloads up to 3 times
Linux / macOS
# Install system dependencies first (Linux only)
# Debian/Ubuntu:
sudo apt-get install -y python3-dev build-essential ffmpeg libsndfile1
# Fedora/RHEL:
sudo dnf install python3-devel gcc ffmpeg libsndfile
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
chmod +x installer/install_linux.sh
./installer/install_linux.sh # Auto-detects GPU
./installer/install_linux.sh --cpu-only # Force CPU only
./installer/install_linux.sh --minimal # Minimal install
Cross-Platform Python Script
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
python install.py # Auto-detects GPU, defaults to CUDA 12.1
python install.py --cpu-only # CPU only
python install.py --cuda118 # CUDA 11.8
python install.py --cuda121 # CUDA 12.1
python install.py --cuda124 # CUDA 12.4
python install.py --minimal # Minimal install (no speech enhancement)
python install.py --dev # Development/editable install
Alternative: Manual pip install (may encounter dependency conflicts):
# Install PyTorch with GPU support first (NVIDIA example)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
# Then install WhisperJAV
pip install git+https://github.com/meizhong986/whisperjav.git@main
Platform Notes:
- Apple Silicon (M1/M2/M3/M4): Just pip install torch torchaudio - MPS acceleration works automatically
- AMD GPU (ROCm): Experimental. Use --mode balanced for best compatibility
- CPU only: Works but slow. Use
--accept-cpu-modeto skip the GPU warning - Linux server (no GPU): The install scripts auto-detect and switch to CPU-only
- Linux (Debian/Ubuntu): Install system dependencies first:
sudo apt-get install -y python3-dev build-essential ffmpeg libsndfile1
Prerequisites
- Python 3.9-3.12 (3.13+ not compatible with openai-whisper)
- FFmpeg in your system PATH
- GPU recommended: NVIDIA CUDA, Apple MPS, or AMD ROCm
- 8GB+ disk space for installation
Detailed Windows Prerequisites
NVIDIA GPU Setup
- Install latest NVIDIA drivers
- Install CUDA Toolkit matching your driver version
- Install cuDNN matching your CUDA version
FFmpeg
- Download from gyan.dev/ffmpeg/builds
- Extract to
C:\ffmpeg - Add
C:\ffmpeg\bin to your PATH
Python
Download from python.org. Check "Add Python to PATH" during installation.
CLI Reference
# Basic usage
whisperjav video.mp4
whisperjav video.mp4 --mode balanced --sensitivity aggressive
# All modes: faster, fast, balanced, fidelity, transformers
whisperjav video.mp4 --mode fidelity
# Transformers mode with custom parameters
whisperjav video.mp4 --mode transformers --hf-beam-size 5 --hf-chunk-length 20
# Two-pass ensemble
whisperjav video.mp4 --ensemble --pass1-pipeline transformers --pass2-pipeline balanced
whisperjav video.mp4 --ensemble --pass1-pipeline balanced --pass2-pipeline fidelity --merge-strategy smart_merge
# Output options
whisperjav video.mp4 --output-dir ./subtitles
whisperjav video.mp4 --subs-language english-direct
# Batch processing
whisperjav /path/to/folder --output-dir ./subtitles
whisperjav /path/to/folder --skip-existing # Resume interrupted batch (skip already processed)
# Debugging
whisperjav video.mp4 --debug --keep-temp
# Translation
whisperjav video.mp4 --translate --translate-provider deepseek
whisperjav-translate -i subtitles.srt --provider gemini
Run whisperjav --help for all options.
Troubleshooting
FFmpeg not found: Install FFmpeg and add it to your PATH.
Slow processing / GPU warning: Your PyTorch might be CPU-only. Reinstall with GPU support:
pip uninstall torch torchvision torchaudio
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
model.bin error in faster mode: Enable Windows Developer Mode or run as Administrator, then delete the cached model folder:
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\models--Systran--faster-whisper-large-v2"
Performance
Rough estimates for processing time per hour of video:
| Platform | Time |
|---|---|
| NVIDIA GPU (CUDA) | 5-10 minutes |
| Apple Silicon (MPS) | 8-15 minutes |
| AMD GPU (ROCm) | 10-20 minutes |
| CPU only | 30-60 minutes |
Contributing
Contributions welcome. See CONTRIBUTING.md for guidelines.
git clone https://github.com/meizhong986/whisperjav.git
cd whisperjav
pip install -e .[dev]
python -m pytest tests/
License
MIT License. See LICENSE file.
Citation and credits
- "Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio" (2025). arXiv:2501.11378.
- "Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down" (2025). arXiv:2505.12969.
- "PromptASR for Contextualized ASR with Controllable Style" (2024). arXiv:2309.07414.
- "In-Context Learning Boosts Speech Recognition" (2025). arXiv:2505.1
- Koenecke, A., et al. (2024). "Careless Whisper: Speech-to-Text Hallucination Harms." ACM FAccT 2024.
- Bain, M., et al. (2023). "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio." arXiv:2303.00747.
Acknowledgments
- OpenAI Whisper - The underlying ASR model
- stable-ts - Timestamp refinement
- faster-whisper - Optimized CTranslate2 inference
- HuggingFace Transformers - Transformers pipeline backend
- Kotoba-Whisper - Japanese-optimized Whisper model
- The testing community for feedback and bug reports
Disclaimer
This tool generates accessibility subtitles. Users are responsible for compliance with applicable laws regarding the content they process.