AutoKara
An experiment in automatic karaoke timing.
Installation
Requirements
- FFmpeg
- Python >= 3.8 (tested up to the latest 3.11 release)
Optional :
- MKVToolnix (at least the CLI utils) : required for the preprocessing scripts
All other Python modules can be installed directly through pip; see the next section.
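As a quick sanity check before installing, the requirements above can be verified with a small script. This is only a sketch: the function name check_requirements is made up for illustration, and it assumes that finding mkvextract on the PATH is a reasonable proxy for the MKVToolNix CLI utilities (the README does not say which of those tools the preprocessing scripts actually call).

```python
import shutil
import sys


def check_requirements():
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    return {
        # Autokara requires Python 3.8 or newer
        "python>=3.8": sys.version_info >= (3, 8),
        # FFmpeg must be reachable on the PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # Assumption: mkvextract stands in for the MKVToolNix CLI utilities,
        # which are only needed for the preprocessing scripts
        "mkvextract (optional)": shutil.which("mkvextract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```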
Install
Using a virtual environment is strongly recommended (but not mandatory if you know what you're doing) :
# create the virtual environment, do it once
python -m venv env
# activate the virtual environment
source env/bin/activate
# To exit the virtual environment
deactivate
The simplest way to install Autokara is through pip.
# Using HTTPS
pip install git+https://git.iiens.net/bakaclub/autokara.git
# Or SSH
pip install git+ssh://git@git.iiens.net/bakaclub/autokara.git
Or you can clone the repo and use pip install <repo_directory> if you prefer.
To use the custom phonetic mappings for Japanese Romaji and other non-English languages, you currently need to update the g2p database manually (within the venv) :
autokara-gen-lang
Configuration
Autokara comes with a default config file in autokara/default.conf.
If you want to tweak some values (enable CUDA, for example), you should add them to a new config file in your personal config directory : ~/.config/autokara/autokara.conf.
This new file has priority over the default one, which is used only as fallback.
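The fallback behaviour described above can be illustrated with Python's standard configparser module. This is a minimal sketch of the general "user file overrides defaults" pattern, not Autokara's actual loading code: it assumes an INI-style file, and the function name load_config as well as any section or key names you pass through it are hypothetical.

```python
import configparser
from pathlib import Path


def load_config(default_path, user_path):
    """Load defaults first, then let the user file override them.

    Assumption: both files are INI-style. The README does not document
    Autokara's real config format or how it is loaded.
    """
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, and values from
    # later files replace values from earlier ones
    parser.read([str(default_path), str(user_path)])
    return parser


if __name__ == "__main__":
    cfg = load_config(
        Path("autokara/default.conf"),
        Path.home() / ".config" / "autokara" / "autokara.conf",
    )
    for section in cfg.sections():
        print(section, dict(cfg[section]))
```

With this pattern, the user file only needs to contain the values that differ from the defaults; everything else falls back to the default file.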
Use
Autokara
To use Autokara, you need :
- A media file of the song (video, or pre-extracted vocals)
- An ASS file with the lyrics, split by syllable (you can use the Auto-Split in Aegisub, but doing it manually may yield better results)
To run AutoKara on an MKV video file and an ASS file containing the lyrics (the ASS file will be overwritten) :
autokara video.mkv lyrics.ass
To output to a different file (and keep the original) :
autokara video.mkv lyrics.ass -o output.ass
To execute AutoKara on a (pre-extracted) WAV (or OGG, MP3, ...) vocals file, pass the --vocals flag :
autokara vocals.wav lyrics.ass --vocals
To use a phonetic transcription optimized for a specific language, use --lang (or -l). The default is Japanese Romaji.
You can also set a separate language for uppercase words (the default is set in your config file) :
# Use french transcription
autokara video.mkv lyrics.ass --lang fr
# Use english transcription, but treat all uppercase words as french :
autokara video.mkv lyrics.ass --lang en --uppercase-lang fr
Available language options are :
jp : Japanese Romaji (base default)
en : English (uppercase default)
fr : French
fi : Finnish
da : Danish
Full help for all options is available with :
autokara -h
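When timing many songs, the commands above can be generated from a small wrapper. The sketch below only uses the options shown in this section (-o, --vocals, --lang, --uppercase-lang). The function names, the "same name, .ass extension" pairing rule and the .timed.ass output suffix are assumptions made for illustration, not Autokara conventions.

```python
import subprocess
from pathlib import Path


def build_command(media, lyrics, output=None, vocals=False,
                  lang=None, uppercase_lang=None):
    """Build an autokara command line from the options documented above."""
    cmd = ["autokara", str(media), str(lyrics)]
    if output is not None:
        cmd += ["-o", str(output)]  # keep the original ASS untouched
    if vocals:
        cmd.append("--vocals")  # media is a pre-extracted vocals file
    if lang:
        cmd += ["--lang", lang]
    if uppercase_lang:
        cmd += ["--uppercase-lang", uppercase_lang]
    return cmd


def time_folder(folder, lang=None, dry_run=True):
    """Pair each .mkv with a same-named .ass and time it.

    Assumption: results go to <name>.timed.ass next to the source files.
    With dry_run=True the commands are only returned, not executed.
    """
    commands = []
    for video in sorted(Path(folder).glob("*.mkv")):
        lyrics = video.with_suffix(".ass")
        if not lyrics.exists():
            continue  # skip videos without a lyrics file
        output = video.with_name(video.stem + ".timed.ass")
        cmd = build_command(video, lyrics, output=output, lang=lang)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling time_folder("songs/", lang="fr") returns the list of commands so they can be reviewed first; passing dry_run=False runs them one after another and stops at the first failure.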
Useful scripts
Manual preprocessing
Use autokara-preprocess if you want to manually preprocess video/lyrics in advance :
# Extract vocals from video :
autokara-preprocess --vocals video_file output_folder/
# Extract ASS file from a MKV containing a subtitle track :
autokara-preprocess --lyrics video_file output_file.ass
# Do both at once :
autokara-preprocess --full video_file output_folder/
Then you can use Autokara on the extracted files with the --vocals flag.
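The two-step workflow (preprocess first, then time the extracted vocals) could be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and because this README does not say what the extracted vocals file is called inside the output folder, its path is left as an explicit argument instead of being guessed.

```python
from pathlib import Path


def preprocess_commands(video, workdir):
    """Commands that extract the vocals and the lyrics track from a video."""
    workdir = Path(workdir)
    # Assumption: store the extracted lyrics as <workdir>/<video name>.ass
    lyrics = workdir / (Path(video).stem + ".ass")
    return [
        ["autokara-preprocess", "--vocals", str(video), f"{workdir}/"],
        ["autokara-preprocess", "--lyrics", str(video), str(lyrics)],
    ]


def timing_command(vocals_file, lyrics_file):
    """Command that times the lyrics against a pre-extracted vocals file.

    vocals_file must be supplied by the caller after inspecting the
    output folder, since its name is not documented here.
    """
    return ["autokara", str(vocals_file), str(lyrics_file), "--vocals"]


if __name__ == "__main__":
    for cmd in preprocess_commands("video.mkv", "output_folder"):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the paths have been checked.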
Sound and onsets plotting
A visualization tool, mainly intended for debugging or for the curious.
It does the same as autokara, but instead of writing to a file, it plots syllable onset times, the spectrogram, probability curves, and so on.
It does not work on video files, only on separated vocals audio files :
autokara-plot vocals.wav lyrics.ass
Caveats
- It is an AI model, so inaccuracies are still very much a possibility. You should always check the result and edit it if necessary.
- Overlapping voices, interwoven duets, background choirs, and the like are still a major difficulty. Until we find a way to separate multiple voices in a song, don't expect good results with more than one singer at a time.
Documentation and References
Extracting vocals from music
Syllable segmentation
Aligning lyrics to song (Jiawen Huang, Emmanouil Benetos, Sebastian Ewert, 2022)
Using CNNs on spectrogram images (Schlüter, Böck, 2014) :