🔴2. Model Training | RVC

Switch to Applio; it's better.

This guide is no longer being updated; creating a model with Applio is better: 2. Model Training | Applio

Please refer to 1. Dataset Creation before starting this.

Start the training:

To train a model locally, begin by following the tutorial: RVC V2

If you are training locally, your dataset should be in a folder somewhere with all the files.

Start by copying the directory path containing all the audio files. You can now launch the graphical interface and go to the "Train" tab at the top of the page.

In the first option, set the name of the model.

I typically keep the other parameters unchanged (generally, v2 is much faster during training and is preferred today).

You may need to reduce the number of threads if you're training locally. Mine was set to '20', and I had to change it to 4 to avoid a BSOD (Blue Screen of Death). If you want to be cautious, a value of 1 or 2 won't take much more time than a higher value since it only applies to the initial processing step.

In the first field, for 'Path to training folder,' paste the path of the dataset you copied. Then, click on 'process data.' Wait for the processing to be completed entirely (the text console will indicate 'end pre-process' when it's finished).

Now select the f0 (pitch extraction) method; you can read this for more information:

Set your value and click on "Feature extraction." Wait for the Colab text console to indicate that it has finished feature extraction with "all-feature-done," just like it did at the end of preprocessing.

Batch size refers to the amount of data processed at once (speed option, not quality). This depends on the GPU's VRAM. For example, on an RTX 2070 with 8GB of VRAM, you would use a batch size of 8.

On a Colab GPU, 20 is the value that people consider safest to avoid errors, but I've also heard that it's better to stick to a power of 2 (so 2, 4, 8, 16, 32). Therefore, I use 16 on Colab.
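For reference, here's a tiny sketch of that rule of thumb (my own helper, not part of RVC): pick the largest power of two that fits within your GPU's VRAM in gigabytes.

```python
# Rough heuristic (an assumption, not an official RVC rule): the largest
# power of two that does not exceed the GPU's VRAM in gigabytes.
def suggested_batch_size(vram_gb: int) -> int:
    batch = 1
    while batch * 2 <= vram_gb:
        batch *= 2
    return batch

print(suggested_batch_size(8))   # RTX 2070, 8 GB   -> 8
print(suggested_batch_size(16))  # Colab T4, ~16 GB -> 16
```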

DO NOT USE ONE-CLICK TRAINING; it's buggy. Always keep the "Save only the latest checkpoint file" option enabled to reduce disk usage. Refer to the other tips here as well if needed. Set a generous number of epochs to cover yourself, for example 600. Before you start, read the training section to understand how you will test the model while it trains and how to determine whether you are overfitting. Once you've trained your feature index (the second big button), you can click "Train model" to begin training, but before that, let's go over some important features:

Testing the model during training (this is important)

If enabled, this option saves the model as a small .pth file in the /weights/ path for each save frequency (e.g., Kendrick_e10, Kendrick_e20 for a save frequency of '10'). To get an accurate (early) preview, generate the feature index before training; of course, you should ensure you've completed the first two steps (data processing + feature extraction) before training the index. You can also generate the feature index afterward if you forgot to do it. Enabling this option allows you to test the model at every epoch iteration if needed or to use a previous iteration if you've overtrained.
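If you want to see which intermediate epochs are available to test, a small sketch like the following lists them (the /weights/ path and the "Kendrick" prefix are just the guide's example; adjust to your setup):

```python
# List the intermediate weights RVC saves in /weights/ so you can pick an
# epoch to test. "Kendrick" is only the example model name used above.
import re
from pathlib import Path

weights_dir = Path("weights")  # path inside your RVC install

for f in sorted(weights_dir.glob("Kendrick*_e*.pth")):
    match = re.search(r"_e(\d+)(?:_s(\d+))?", f.name)
    if match:
        print(f"{f.name} -> epoch {match.group(1)}")
```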

What number of epochs should I set? / How do I know if I am "overtraining"?

More information on TensorBoard is available in the tutorial created for it: 2. Optional: TensorBoard (RVC)

Use TensorBoard logs to identify when the model begins to overtrain. Go to the TensorBoard screen in Colab. Click on the "scalars" tab and look for g/total at the top. It should say g/total, with a "g," not d/total.
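If TensorBoard isn't already running, the usual way to start it in a Colab cell looks like this. The --logdir path is an assumption; point it at wherever your RVC logs actually live.

```python
# Standard TensorBoard notebook magics; the log directory is an assumption
# and should be replaced with your actual RVC logs folder.
%load_ext tensorboard
%tensorboard --logdir /content/Retrieval-based-Voice-Conversion-WebUI/logs
```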

The V2 option in the training tab reaches the optimal point much faster than V1.

Once you've found the ideal number of steps, perform basic calculations to determine the ideal number of epochs.

For example, let's say that 10,000 steps is where overfitting begins. Suppose you overtrained up to 20,000 steps, and your model is currently at 600 epochs. Since 600 epochs correspond to 20k steps, 10k/20k = 50%, and 50% of 600 is 300 epochs; so that's the ideal epoch value in this scenario.
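The same arithmetic as a quick sketch, if you want to plug in your own numbers:

```python
# Scale the current epoch count by the ratio of "good" steps to total steps,
# using the numbers from the example above.
current_epochs = 600
current_steps = 20_000   # how far training actually went
ideal_steps = 10_000     # where TensorBoard shows overfitting starting

ideal_epochs = round(current_epochs * ideal_steps / current_steps)
print(ideal_epochs)  # 300
```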

You can also find the timestamp of the best value in TensorBoard, then check the train.log file for the model in the /logs/ folder for the corresponding timestamp to pinpoint the exact epoch.

For v2 training, fewer epochs generally mean the model will be less accurate rather than necessarily sounding worse. However, if your dataset is not of very high quality or lacks a lot of data, you can experiment later and see which saved epoch strikes the best balance between accuracy and sound quality. In rarer cases, fewer epochs may sound better to your ears. It's a trial-and-error process to create a good model at this stage. If you want to be cautious, I would opt for a "slightly undertrained" model.

Keep the maximum-smoothing option on the left side disabled, and don't forget to press the refresh button to update the graph if necessary.

You can also search for your specific model by its name if necessary. If you notice early signs of overfitting and are confident, click the "Stop Training" button. You can now test your model at the latest trained epoch (e.g., Kendrickv2_e300_s69420 for 300 epochs). If you are satisfied with the result, rename the latest file in the /weights/ folder (in the Colab files panel) by removing the _e..._s... suffix (so Kendrickv2_e300_s69420 would become Kendrickv2.pth). If not, you can resume training from where you left off. I must emphasize that you should rename the file from the Colab files panel; you cannot rename it from Google Drive, otherwise the next step will not work.

Then, you can run this cell:

Your completed zip model will now be ready in /RVC_Backup/Finished/ as a .zip file, ready to be shared.
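If you're working locally instead of on the Colab, the equivalent of that cell is simply zipping the renamed .pth together with its feature index. A rough sketch, with placeholder paths (this is not the actual Colab cell):

```python
# Rough local stand-in for the Colab packaging cell (an assumption, not the
# real cell): zip the renamed model and its feature index for sharing.
import os
import zipfile

model_pth = "weights/Kendrickv2.pth"                      # placeholder path
index_file = "logs/Kendrickv2/added_index_example.index"  # placeholder path

with zipfile.ZipFile("Kendrickv2.zip", "w", zipfile.ZIP_DEFLATED) as z:
    z.write(model_pth, arcname=os.path.basename(model_pth))
    z.write(index_file, arcname=os.path.basename(index_file))
```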

Continuing the training of a model from where you left off:

During retraining, to resume from where you left off, use the exact same name (with the same capitalization) and the same sampling rate (40 kHz by default if unchanged). Use the same settings you had previously for batch size, version, etc., so they match.

Do not process the files again, and do not redo the feature extraction. In fact, avoid pressing "process data" or performing "pitch extraction" again because you don't want the software to reanalyze the pitch it has already done.

Keep only the two latest .pth files in the model's /logs/ folder, based on their modification date. If there are "G_23333" and "D_23333" files in your model's logs folder, they represent the last checkpoint, assuming you checked "Save only latest ckpt" (which I recommended earlier in this guide). If for some reason you didn't, delete all .pth files from the folder that are not the most recent ones to avoid inaccuracies.
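A small sketch of that cleanup, if you'd rather script it than delete files by hand (the logs path is a placeholder; point it at your model's folder):

```python
# Keep only the newest G_*.pth and D_*.pth checkpoints in the model's logs
# folder and delete the older ones. The folder path is a placeholder.
from pathlib import Path

log_dir = Path("logs/Kendrickv2")

for prefix in ("G_", "D_"):
    checkpoints = sorted(log_dir.glob(f"{prefix}*.pth"),
                         key=lambda p: p.stat().st_mtime)
    for old in checkpoints[:-1]:  # everything except the most recent
        print("deleting", old)
        old.unlink()
```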

You can now restart the training by clicking "Train model" with the same batch size and settings as before. If training starts from the beginning (epoch 1 instead of the last saved epoch before you stopped), immediately use CTRL+C (or the stop button if you are on Colab) to kill the GUI server, then try restarting the GUI.

Avoiding crash/feature extraction issues:

During local training, there's an issue where people run feature extraction with the maximum number of threads (the top-right option in the training tab) and encounter very long processing times or a blue-screen crash. I would set the thread count at the top right of the training tab to a maximum of 5, or to 2 for safety (the preprocessing step won't take much time either way). My default value was 20 (locally), and it doesn't work. The thread count isn't chosen automatically yet, but it only applies to a step that takes very little time, so it's better to be cautious and go with a low value.

Reverb/Echo Removal:

It's necessary to remove reverb/echo from the dataset to achieve the best results. Ideally, you should have as little reverb/echo as possible to begin with, as isolating reverb can obviously reduce voice quality. But if you must do it, you'll find MDX-Net Reverb HQ, which will export audio without reverb as the "No Other" option. Often, this may not be sufficient. If that didn't yield results (or not enough), you can try processing the voice output through the VR architecture models in UVR to remove any remaining echo and reverb using De-Echo-DeReverb. If that still doesn't suffice, you can use the regular De-Echo model on the output, which is the most aggressive echo removal model of all.

There is also a Colab for the VR Arch models if you don't want to, or can't, run UVR locally. I have no idea how to use it, so good luck. Even without a good GPU on your PC, UVR will still work locally in most cases, just quite slowly, if you're okay with that. And if you have a large dataset, be prepared to let it run overnight.

Noise Filtering to Remove Silence:

I like to apply a noise gate in Audacity to remove noise during "silent" periods of the audio.

Download Audacity

In general, a threshold of -40dB is a good setting for this.

Adobe Audition likely has more advanced tools to do this automatically (I'm not sure how to use them), but this is a good starting preset for people doing basic mixing in Audacity. If the sound cuts off in the middle of a phrase, redo it with a higher "Hold" time (in ms).
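If you'd rather script this step than do it in Audacity, a rough alternative (not a true noise gate, since it cuts out quiet passages instead of attenuating them) is to drop segments below -40 dBFS with pydub; file names here are placeholders.

```python
# Cut out passages quieter than -40 dBFS and glue the rest back together.
# This approximates the Audacity noise-gate cleanup for dataset purposes.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("dataset/raw_take.wav")   # placeholder file

chunks = split_on_silence(
    audio,
    min_silence_len=500,   # ms of quiet before a cut is made
    silence_thresh=-40,    # dBFS, matching the -40 dB suggestion above
    keep_silence=200,      # ms of padding kept around each chunk
)

cleaned = sum(chunks, AudioSegment.empty())
cleaned.export("dataset/cleaned_take.wav", format="wav")
```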

Isolation of background harmonies/vocal doubling:

In most cases, isolating them for dataset purposes is too difficult without it sounding of poor quality. But if you still want to give it a try, the best UVR models for this would be 5HP Karaoke (VR Architecture model) or Karaoke 2 (MDX-Net). The 6HP is supposed to be a more aggressive version of the 5HP, I believe? I'm not sure. Your mileage may vary, so try the other karaoke options if one doesn't work at all.

Do I need to split my audio file into pieces?

Technically, the answer is no, at least for RVC. You can have a single 10-minute file as the only file in your dataset, and RVC will correctly split it for you following the instructions in this guide, based on my testing. RVC cuts files into approximately 4-second pieces, so make sure your samples are at least 4 seconds long for consistency (or merge shorter samples into one long file). If you want to be cautious, you can divide your samples into one-minute intervals (the regular interval labeling feature in Audacity is very useful for this).

Due to a recently discovered issue, it appears that someone experienced difficulties with the incorrect processing of a 1-hour and 30-minute WAV audio file, which could potentially be an issue on their end. For very long datasets, it's possible to encounter problems if you don't split them. However, for recordings under 30 minutes, it doesn't seem to be an issue.
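If you'd rather split a long file by script than with Audacity's interval labels, a minimal pydub sketch (file names are placeholders) looks like this:

```python
# Split a long dataset recording into one-minute WAV pieces.
from pydub import AudioSegment

audio = AudioSegment.from_wav("dataset/long_recording.wav")  # placeholder
minute_ms = 60 * 1000

for i in range(0, len(audio), minute_ms):
    chunk = audio[i:i + minute_ms]
    chunk.export(f"dataset/split_{i // minute_ms:03d}.wav", format="wav")
```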

How much audio data do I really need for the dataset?

Not that much, actually. More is obviously better, but I don't see the point of training a model with more than an hour of data. You can get away with REALLY limited data-based models on RVC v2, but the less data you have, the more the AI has to "guess" how your voice is supposed to sound at certain pitches. A reasonable range of high-quality data would be 10 to 45 minutes.

Here's an example of my 10-second JID model rapping:

It sounds good because I gave it 10 seconds of rap as a dataset, right?

But the sound is much less accurate when trying to sing:

The recommendation from RVC developers is at least 10 minutes for high-quality models that can handle a variety of pitches and tones, but remember: Quality > Quantity. This is an example of a 5-minute model trained on high-quality clips. This is a model trained on 7 seconds of Minecraft villager sounds. Somehow, it works.

Download Other UVR Models:

In UVR, go to the wrench icon, then to the "Download Center" tab, where you can find and download any models mentioned in this guide that you don't have yet.
