Workflow - V2V Foley - Add sound to any silent video
Wan (silent original video)
LTX-2 added sound + extended length
**V2V "Foley" - Add sound to any silent video**
New updated version that should work better and more smoothly.
Easily add sound to any of your silent video clips, and optionally extend them as well...
Add sound FX, ambient background sound, music, or a voice-over narrator (and even dialogue, if you extend the video).
Updated version that has a few changes:
- it keeps the aspect ratio of the input video
- you can override the input size & set a max size (longest edge), while still keeping the aspect ratio
- it has a toggle switch so you can easily switch between using the workflow in Single-Pass or Two-Pass mode
- since the resize is done automagically, there should hopefully also be no more errors from pixel dimensions not matching what LTX wants
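The aspect-ratio-preserving resize above can be sketched as a small helper. The max edge of 1216 and the multiple of 32 are placeholder assumptions for illustration, not LTX's documented requirements:

```python
def fit_to_max_edge(width, height, max_edge=1216, multiple=32):
    """Scale (width, height) so the longest edge is at most max_edge,
    keeping the aspect ratio, then snap both sides down to a multiple
    the model accepts. Never upscales."""
    scale = min(1.0, max_edge / max(width, height))
    new_w = max(multiple, round(width * scale) // multiple * multiple)
    new_h = max(multiple, round(height * scale) // multiple * multiple)
    return new_w, new_h
```

For example, a 1920x1080 input with these assumed settings comes out 1216x672 — same longest edge as the cap, aspect ratio roughly preserved, both sides divisible by 32.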
How did you get this to work at all? If I give it any video, the output is completely unrelated to the input video... The workflow looks like a copy of the extend workflow; it looks like you uploaded the wrong thing. Also, I'd recommend not using the newer built-in math nodes, as they require a newer version of ComfyUI, which is unacceptable given the current failed state of their frontend.
Edit: it looks like some video files fail to work properly with the VHS video load node. I don't know why, but some others work.
recommend not using the newer built in math nodes... require a newer version of comfyui
The new Comfy updates are quite a big step... so I can see why some might delay that update.
I usually use KJNodes' "Simple Calculator", so I'll use that one instead... (it's the one I prefer anyway.)
edit: looks like some video files fail to work properly with vhs video load. idk why but some others work.
Maybe your input video is encoded in a format that the video loader node can't decode, and hence your input comes out blank.
If they come from TikTok or similar via some web-download site, they can often be quite "corrupted". Try any online video converter. Just search for, for example, "mov to mp4", and use it to convert mp4 to mp4 (that works even if it's a mov-to-mp4 online convert site).
See if that works
https://new.express.adobe.com/home/tools/convert-to-mp4
https://www.freeconvert.com/mov-to-mp4
https://cloudconvert.com/mov-to-mp4
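If you'd rather not upload the file anywhere, the same re-encode can be done locally with ffmpeg (assuming it is installed and on PATH). A sketch that just builds the command:

```python
import subprocess

def reencode_command(src, dst):
    """Build an ffmpeg command that re-encodes any input video to a
    plain H.264/AAC MP4 -- the local equivalent of the online
    converters above."""
    return [
        "ffmpeg", "-y",          # overwrite the output without asking
        "-i", src,               # input file (mov, mp4, webm, ...)
        "-c:v", "libx264",       # widely decodable video codec
        "-pix_fmt", "yuv420p",   # pixel format most loaders accept
        "-c:a", "aac",           # re-encode the audio track too
        dst,
    ]

# To actually run it (requires ffmpeg installed):
# subprocess.run(reencode_command("clip.mov", "clip.mp4"), check=True)
```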
edit: looks like some video files fail to work properly with vhs video load. idk why but some others work.
Strangely enough, I ran into the same issue, something I have never had before in years of using ComfyUI.
It happened when I downloaded a video from Civitai to use as an example for a workflow. That video is encoded in such a way that the video loader node outputs black frames.
I was curious whether there could be a fix for that, and there is: using the Load Video FFmpeg node from the same node pack as the other video loader, everything works fine ;-)
So I will definitely use the Load Video FFmpeg node in future workflows, since it's more robust.
Question - let's say I have several generated NSFW WAN 2.2 clips (5 seconds, 81 frames) with no audio, and I want to add audio to them. Would this workflow be able to do so, or do I need some sort of NSFW prompt enhancer for this? Sorry, I just haven't messed with LTX-2 or LTX-2.3 much.
Would the quality of the generated audio for an NSFW video be poor since I'm assuming the model probably wasn't trained on this?
LTX doesn't know much NSFW content, I think; it's probably not part of the training data ;-)
So it depends on what is generated in the Wan video... it will continue as well as it can, but some "concepts" might be out of scope ;-) to put it that way.
For that you might have to add some LoRAs (probably the same as with Wan: some things are within the training data, other things not so much).
It all depends on what it is... general nudity etc. I don't think is a problem for LTX to continue, but more explicit actions it probably doesn't have much knowledge of, so it might end up a distorted mess without LoRAs.
But give it a try... there are plenty of such LoRAs on Civitai.
Thank you for the quick response. Is the prompt enhancer in your workflow based on a censored text encoder or an abliterated one? Just trying to figure out if I need to find an uncensored one for my use case.
It depends on the Gemma model you have loaded into the main CLIP loader. The same Gemma is used to enhance the prompt.
So I'm a bit confused as to how to utilize this V2V workflow. I want to input an MP4 video that was generated in WAN 2.2 and to have audio added to it (without modifying the video in any way). What do I need to do to accomplish this? It seems this workflow is intended to extend my existing video...maybe I'm just not understanding how to use this workflow?
Just do as you say: add your Wan 2.2 video and generate. It will add sounds based on the visual input from the video and your prompt.
The extend part is optional.
I've been trying this workflow and I just can't get it to work the way I think it's supposed to. No matter what I do, it takes my input video and ends up generating an output video with some sort of audio (but not the audio I prompted for).
I'm really just trying to get it to take the input video (not modify it in any way except add audio that I'm prompting for) and generate an output of the exact same video with audio overlayed on top of it (not a newly created video with audio). Is that not the purpose of this workflow or what am I doing wrong here?
That's exactly how it should work, with one caveat: LTX is first and foremost a video model, not an audio generator.
So it will take the input video and add sounds to it... like background music or ambient sound, action sounds (say, footsteps and similar), and sound FX (explosions etc.) that fit what's present in your video.
Even a voice-over narrator.
But it might not be able to do all sorts of audio; it depends a bit on what you tried, perhaps.
I think perhaps I'm not phrasing my question/problem correctly. The problem that I'm trying to solve for is this: I already have a pristine WAN 2.2 generated 5 second clip. I do not want the actual video in this clip to be modified in any way/shape/form. I simply want to overlay audio on top of the clip I've already generated in WAN 2.2 (I don't want LTX to try and regenerate the video - it produces a very poor result).
Yes, that's exactly how the workflow should work. If it doesn't, maybe there is an error in the connection logic.
But it should do exactly as you say: keep the Wan video and only add audio (with an option to extend the video, but that is purely optional).
Will take a look as well.
I guess maybe I'm doing something wrong then...because for me it is generating a new video with audio (not retaining my original video). Just to confirm - I obviously need to prompt it to add audio to the existing video - correct? Or should I be leaving the prompt blank?
Yes, prompt for sure. This will guide the model to the audio you want to add... like, say, "loud footsteps as the creature walks, explosions in the background"... even a voice-over narrator.
I'll take a look; maybe something got tangled up, returning the LTX video instead of the source video... (it's been a lot of workflows and not enough coffee, so it could be an error).
With one caveat: LTX will listen to your prompt, so even if the original LTX video part is masked out, if you prompt explicitly enough the model might "decide" to change the video. But it shouldn't do that easily...
Actually, thinking of it, I'll make the end of the workflow have 2 video outputs: one where it's a direct bypass that keeps the original video and only adds the LTX audio (completely ignoring the LTX video output), and another where both come from LTX.
UPDATE:
From a test run, there might be minor changes to the video. Depending on the prompt etc., LTX might take a few "freedoms" to adapt to the prompt.
I'll update the workflow so there are 2 outputs: one LTX video + audio, and one original video + audio.
Updated workflow that has 3 video outputs (without making the workflow heavier for the PC; it's just simple cutting and gluing frames together at the end):
- the pure original untouched input video + LTX audio added
- the pure original untouched input video + LTX audio added (+ optional extended video, where only the extended frames are from LTX)
- the LTX re-created video and audio, which should be somewhat faithful to the input video (+ optional extended video, all by LTX)
Depending on input video and prompt, the 3 outputs might be extremely similar (or not... )
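The cut-and-glue logic behind the three outputs can be sketched like this, treating videos as plain frame lists. The names and the assumption that LTX's output covers the original clip's duration first (then any extension frames) are mine for illustration, not the workflow's actual node graph:

```python
def build_outputs(orig_frames, ltx_frames, ltx_audio):
    """Splice the three outputs: original frames with LTX audio,
    original frames plus only the newly generated tail, and the
    full LTX re-creation."""
    n = len(orig_frames)
    extension = ltx_frames[n:]  # frames beyond the original duration
    return {
        "original+audio": (orig_frames, ltx_audio),
        "original+extension": (orig_frames + extension, ltx_audio),
        "ltx_recreated": (ltx_frames, ltx_audio),
    }
```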
ORIGINAL VIDEO
vs
LTX-RECREATED VIDEO
The keen eye will notice minor differences in the first part of the video, for example the blonde's uniform not being exactly the same ;-)
But now you can also choose an output that keeps the exact input video, both with and without extended frames:
This is great! Thank you so much!
Hey RuneXX, I'm really enjoying using this workflow of yours. I find the results to be more than satisfactory. There is one minor issue, though: the audio cuts in and out (a popping sound) at least a couple of times, if not more, for every generated video.
I am using longer clips (between 20 and 45 seconds), and I've tried different samplers and schedulers, but no change. The other samplers do give horrible audio results compared to euler, though, so I wouldn't want to change the default sampler you set.
Are there any other settings you can think of that may help reduce or remove the problem? This may very well just be a limitation of LTX, on account of me going over 20 seconds. Thought I would ask anyway to get your thoughts on this. Thanks, pal.
Update
For anyone experiencing this, check your custom nodes
Sorry, there is nothing wrong with your workflow settings or the audio. Turns out I had a custom node installed called 'ComfyUI-AudioTools'.
I don't recall ever installing it; it may have been when I was trying out some voice cloning models a while back. I removed the folder, and now the audio is clear, with no more cutting out or light buzzing in the background. Sorry for the bother.
20-45 seconds might be more than the model can handle.
It's made for roughly 10-15 seconds (5 to 20).
Will try 40-ish seconds here and see if I can reproduce it. But ideally that would need a "longer video" implementation, where it generates 10-15 seconds, then another 10-15 seconds, etc., and glues them together at the end (with overlapping frames to keep consistency).
That being said, LTX does surprise sometimes, and I have generated longer videos in one go without issues ;-)
That said, the AudioTools node might also not have been helping. I tried a new node for this workflow that was intended to help level out the volume. Will take a look; maybe that node doesn't handle long audio well, or there's some error in it. (If so, I will use the WanVideoWrapper one instead, which works great, but since WanVideoWrapper is such a huge node base, I tried the tiny AudioTools instead.)
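The chunk-and-overlap idea for longer videos could be planned like this; the chunk and overlap sizes here are illustrative guesses, not LTX's actual limits:

```python
def chunk_plan(total_frames, chunk=241, overlap=24):
    """Plan (start, end) frame ranges for generating a long video in
    segments, each re-feeding the last frames of the previous segment
    as context so the joins stay consistent."""
    assert chunk > overlap, "chunk must be longer than the overlap"
    spans, start = [], 0
    while True:
        end = min(start + chunk, total_frames)
        spans.append((start, end))
        if end == total_frames:
            return spans
        start = end - overlap  # overlapping frames keep consistency
```

For a 500-frame target with these assumed sizes, this yields segments (0, 241), (217, 458), (434, 500); the glue step would then drop the duplicated overlap frames when concatenating.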
Appreciate you looking into this. As for myself, I'm getting great results with anything 45 seconds and under; I just have to add a lot to the prompt. The AudioTools node was no doubt messing up the audio for me. Your current workflow works fine without it.
I do find that if I extend the video more than 45 seconds, it falls apart. If I don't extend it at all through your workflow, I can go with even longer silent clips without it going all crazy on me.
"(with overlapping frames to keep consistency)"
Is that something I can do through a freeware program like Avidemux or VirtualDub, or is this something I would want to do with one of your workflows?
These 45-second+ silent videos I created are with Wan SVI. I then run them through your workflow to add audio. Is there a better way of doing this that I am not aware of? I want to avoid messing up/changing the voices.
The new ID LoRA takes way too long and uses way too much VRAM for anything over 10 seconds. I also find the voices never match, so that is one option that just doesn't seem to work for me. Also, I find your extend-video workflow really messes up the colors with each new segment. The latest version of SVI works great for avoiding this; just too bad it's only for Wan.
Anyway, I appreciate any suggestions you may have. Thanks a bunch.