[Guide] VR Deepfakes - 09/17/2023

dpfks

Administrator
Staff member
May 5, 2025
12
0
1
[MrDeepfakes guide written by Grrkin from 09/17/2023]

Note: This isn't a comprehensive guide at all and I'm definitely not an expert, but there isn't one about VR and I'd like to start it. Hopefully some discussion will happen and it'll be learning all around. I also have no life, so I have the time to edit and add to and maintain this guide as more people share and as I learn more myself.

Quick guide for vr and 3d terminology:

3d = Depth through whatever means, but not necessarily vr.
360 = monoscopic video ppl often use their phones without anything else to view, moving the phone around like a window into the otherworld.
180 SBS, or 180 OU = popular 3d formats, but usually NOT 360
360 3d SBS / OU = fully 3d 360 video, you need insane bitrates and resolutions for this and resolution wise it doesn't fully make sense to use with current hardware, except unless you want to.

SBS = Side by side, the form of 180 3d most are used to. This is preferred for porn, because you look up down a lot.

OU = Over under, splits the frame along the long direction. This is preferred for just watching movies if you can find it because ppl look left to right more often, maximum resolution

What I Have Learned About 3d SBS VR Deepfakes So Far:

1.jpeg

  • Barrel Distortion.
DFL does not handle barrel distortion well. It can figure out skewed faces and smushed faces, but the further from the center of the shot a face is, the more DFL struggles. I think the amount of distortion vectors also makes it harder, meaning if the face is barrel distorted AND distorted another way AND tilted AND too close to the camera, etc.

Not insurmountable, just frustrating.

2.jpeg

3.jpeg

Requires A Very Varied Data_Src Set.

In my experience you need a really comprehensive face set for VR. Everyone already knows this in general for deepfaking. Just more. I've learned that you need a lot of frames of the source face looking directly into the camera, which celebs don't do a whole lot of. A lot of shots of the source face staring into the camera. Like...a lot. How many? Lots.

  • Eye contact.
A big aspect of VR porn is a lot of eye contact and in 3d, close up, your source face set has to be more precise than you'd need for a regular deepfake. Two source faces staring at the camera that look exactly the same for example, if they were 3d, one could be staring directly into your soul and the other could be staring over your shoulder to the side and your brain registers that strongly in 3d. Finding more than the occasional clips of a celeb looking directly into the camera is advisable. Like mix Jodie Foster at the end of Contact with Milla Jovovich in The Fifth Element crying because a butterfly flapped it's wings and that's...actually that's literally exactly the kind of thing that would be perfect...

  • Expressions.
4.jpeg

I opened a 3d video in VLC which can handle vr stuff to screenshot kinda what this looks like in goggles, and just looked at the face. Looks ok-ish, but if you look at the full frame and compare faces you'll see that while it gets face direction and shape mostly, the expressions are different.

5.jpeg

I believe that in normal deepfaking using DeepFaceLabs, sometimes if you're lacking say shots of the source face from a super low angle looking up or something sometimes it'll work out fine, and sometimes you get The Last Woman from Doctor Who because it gets confused not having enough references.

So many variables in 3d vr stuff mean that you need a wider variety of angles/expressions in your source shots. It seems much more easy for the source face to "freeze" expressions I.e. they're talking but the mouth on the deepfake just freezes in one position until the angle changes, or their eyes. The more variety in the angle of the face combined with expression the better. I.e. Say you have the source face looking up and to the left and laughing and smiling. Cool, right? Not quite, now you need happy up to the left, sad up to the left, talking while happy and looking up to the left, face pointing up to the left but eyes looking back at you and talking... More angles of different mouths too. Everything normally required times many. Tbh this is probably the biggest obstacle for convincing VR deepfakes, moreso than software.

  • FUCKING RESOLUTION & Also File Size.
Not fucking resolution, fucking resolution. Resolution in DeepFaceLabs is fine, it's handling massive 4k 60fps files that could be prohibitive to those without good hardware and a petabyte of storage space. If you were to rip your DST frames in PNG because you roll like that, for a short scene only a few minutes long you could be looking at 100+ gigs of storage required.

  • Masking.
6.jpeg

Masking whole 3d SBS shots yourself in AE or DaVinci is probably not something normal humans will undertake. In the past I have made full face deepfakes and just merged them in DeepFaceLabs and that was that. I've been trying to work with whole face deepfakes and it's possible I just don't know wtf I'm doing, but if you merge using whole faces with vr, only one of the faces will have the "square halo" that comes with whole faces, and the other side will have a masked full face merge. And they'll swap back and forth on sides of the frames depending on wherever whatever catches, which is obviously un-ideal.

Random Tips I Don't Have A Place For:

  • If DFL is doing The Last Woman from Doctor Who thing while training, sometimes if you stop and turn off masked training (DFL2.0) for awhile, it'll start to register a face and slowly start doing a slightly better job, then you can turn face masking back on.

  • I also wonder if you had source faces that also had distortion if it'd do better with 3d vr stuff. I would rip the faces off a 3d movie except I can't think of a (real) 3d movie with any actresses I like in it.

  • Another thing on my to do list is to try separate models for faces in the center of the camera, and faces on the edges of the camera. I tried running two simultaneous models (old computer + current) up to about 200k iterations and I was sooooo disappointed to find out that if you cross train either of the models they basically pick it up so far that there's zero point, at least with just that many iterations.
My Evolving Workflow:

Caveat: This is waaaay extra. I'm trying this out to see if it's worth the effort.

Premiere > New Project > Load VR DST clip:

7.jpeg

Then roll the clip to center the face. I'm not going to explain roll/pitch/yaw because I hate it and only gunners and drone pilots understand the difference between pitch and yaw ANYWAYS. By that I of course mean that Premiere can handle vr editing but if you're using something that can't, you can google GoPro vr plugins - they have a suite they made freely available you can use with a variety of video software, which will give you basic vr video controls.

This is the clip loaded, having been rolled until the face is centered on the left side of the frame. Notice you can only center one side at a time, the other side will swing way off, because only one side can be centered at the same time:

8.jpeg

Duplicate that sequence, rename it "Right", and pull it up. Edit the values for the roll (or whatever correction you used) to the opposite, so 50 becomes -50 etc.

Now we're going to stick those sequences together back to back, but we're gonna crop it so the left sequence ONLY shows the left half of the frame, vice versa. So drag the left sequence to the new sequence button, and you'll have the left sequence nested into another sequence which you should name something like "rolled export". Edit the sequence settings and put in half the original width. Edit the horizontal position of the left side sequence so the left side of the frame is centered. Skip ahead to the right side and slide it over so the right side of the right side sequence is centered.

The goal is to have one video with both sides of the vr video one after the other, with faces centered, like so:

9.jpeg

Do your deepfakery, then merge, and process however you like, then open up the merged video in Premiere. The rest is just reversing the process, you could take your left side sequence for example, duplicate it, open up the left side duplicate sequence and drop the deepfaked left side video in and just move it to the left side of the frame, put it back where it belongs. Here's that frame above put back where it belongs in the full vr frame, still rolled though:

10.jpeg

Drag the left side duplicate sequence to a new sequence, giving you a new sequence containing the left side duplicate sequence. Then you can just cut and paste the roll effect from the opposite side to reverse the roll.

I.e. If the left side is rolled 50 degrees, that means the right side is rolled -50 degrees, so open the sequence in which you rolled the right side, copy that effect containing the roll, and past it into the new sequence containing your left side duplicate seque...this is confusing as fuck isn't it. Just reverse and combine that shit.

TL;DR: DO A THING AND DO THE OPPOSITE OF THE THING.

Addendum:

Please discuss and share, any info you have or any experiences trying is helpful!

Also, if you're here, let me know if you are a VR enthusiast or if you own a pair of goggles, I'm curious to know how niche it really is.

Should this get posted I'm gonna try and snag the first couple posts for later additions.

UPDATE MAY 2020

TL;DR:
3d Deepfaking is still in it's infancy, DFL puts weights on what it requires an eye to look like that make it difficult to get the proper flutterless stereoscopic appearance, just don't do shots too distorted or close up and you'll be fine.

Ignore the lack of masking, it takes so long to be able to see the effects of little changes that I still am not even bothering with it for the most part. Watch that clip, and pay attention to the tip of her nose, and her eyes, and compare why they feel different.

Additional VR Thoughts:

I've been playing with vr deepfakes more since I wrote the initial baby guide, and I'm going to share some things I learned. I've put VR deepfakes on hold for the time being because I think I've hit a hard wall, and I think I've identified what the biggest difficulty is.

If you watch that clip I linked, notice that while the nose and cheeks are pretty good looking, the eyes do not line up in shape or stereoscopic-ness and as such give you that weird tingly eyeball feeling when you're trying to focus on them, and it's hard to not just switch to your dominant eye because the 3d-ness isn't 3d-ing very, uh... copaceticly.

After more experimentation, of course DFL has trouble with distortion (see the first post/guide), but now I'm noticing it's rules for handling eyes are too strict to allow the eyes to construct themselves in a way such that they display in "3d" correctly. I feel like the inner workings of DFL has weighted aspects of the appearance of an eye which makes a lot of sense for general deepfaking when you don't want it the machine learning to Deep Dream the eyes, but not as much sense when you want to deepfake faces with eyes that might be distorted in ways that are outside the rules of what DFL requires an eye to look like.

Shots where the face is relatively straight on and not distorted give you ok results, but some angles and distances from the camera can cause either funky 3d effects, or the eyes to do a little "stereoscopic flutter". In general the further away the person from the camera, the easier to do and the less the faults will show, and "flat 3d" is less apparent.

Note: Pupil wobble. It's really noticeable in vr, and I think maybe some of the weird pupil wobble is caused by the specular highlights on the eye and when you're ripping faces and there are slightly blurry "ghosty" frames every now and then in which the eyes are slightly motion blurred and the pupils leave "specular trails", and it has a more distinct doubling look than normal motion blur. I swear that pupil doubling thing can happen independently from motion blur and idk what it is but that can happen in the middle of a long shot with a stable camera in bright daylight. Wonder if it happens to footage that was or wasn't originally interlaced... I notice it on Youtube interviews too though.

My uninformed suspicion is that DFL uses the visible eye white area to place the eyes on the head and the eye white + pupil + specular to determine where the eye is pointing, in it's machine learny-type way, and that frames with the doubling, "ghosty" look confuse it more than a blurry frame would, resulting in eyes that when you are observing them in vr, literally existing in a quantum state of probability where they are in one position and simultaneously in another position at the same time. Which is neat, but not desirable.

Examples:

This is your normal motion blur on the pupils:

8xPSnyL.jpg

9AmFD2M.jpg

And this is the pupil trail thing I was talking about. (Bad examples for the sake of illustration)

1TxHvd8.gif

I've been trying to clean up my facesets and prune viciously and when I try vr next I'm going to train a model from scratch using a lot of vr distorted faces and use only faces with perfect eyes, and see if that makes a difference. If not, then I'm going to wait for DFL to advance a bit more, or open up more control over how it handles details perhaps. Not so much handling details though as allowing parts of the face model to diverge more from what rules it learns.

I have slowed down on the VR front because of life and deepfaking things that don't take a week, I will continue to update when I learn things and add anything useful that anyone contributes and at some point theoretically compile everything into a for-real guide.

Workflow update: Now I just cut up the frame into it's component halves, and put them back to back as one really long clip, and face rip and deepfake it in one go that way. I'm focusing entirely on footage where faces are close enough to the center that they aren't TOO distorted at the moment. Tip: To preview your 3d clip before merging the whole thing, render enough frames for a test, then move all the frames from the first half out of your DST folder (and corresponding aligned folder pics) and then render the same amount of frames and put them together in video editing software. That way you don't have to render the whole thing to check how it's going!

Ps. I have been trying to figure out what causes the pupil wobble, it's like the cameras are getting hit with a Hitachi randomly and it happens even in shots with no movement, and in bright daylight when the shutter speed must have been very high. I had actually always thought it was regular motion blur.

Update July 2:

Was chatting and realized that this is pretty useful stuff to know:

3d = Depth through whatever means, but not necessarily vr at all.
360 = monoscopic video ppl often use their phones without anything else to view, moving the phone around like a window into the otherworld.
180 SBS, or 180 OU = popular 3d formats, but usually NOT 360
360 3d SBS / OU = fully 3d 360 video, you need insane bitrates and resolutions for this and resolution wise it doesn't fully make sense to use with current hardware, except unless you want to.

SBS = Side by side, the form of 180 3d most are used to. This is preferred for porn, because you look up down a lot.

OU = Over under, splits the frame along the long direction. This is preferred for movies if you can find it because ppl look left to right more often, maximum resolution

UPDATE JULY 14, 2020:

I freely admit I lifted this off the DeoVR website, a popular VR app that has great features, but also active developers who interact and share great data. These are some encoding guidelines I thought were pretty useful.

Video encoding basics


There are quite a few different video and audio encodings available but not all of them you can use to play high-resolution VR video. All VR-ready PCs have hardware video decoding acceleration and it is a good idea to make use of it, more about it below. Set “Key Frame Distance” to 1-2 seconds, this will ensure good compromise between file size and smooth seeking.

There are simple rules you can follow to make sure that your customers will be able to watch your videos:

  • The most supported video codec today is H.264/MPEG-4 AVC, any VR-compatible PC can play it. Use it if you want maximum compatibility, you can use the same video file to stream your videos directly in users browsers. Select AAC audio encoding with a bitrate from 128k to 384k depending on your case and MP4 container format. It has limitations:
    • Maximum resolution hardware decoders can handle is 4096x4096 @ 30 FPS or 4096x2048 @ 60 FPS, do not exceed these constraints to ensure smooth playback for your users.
    • It requires more bitrate (internet bandwidth) to get the same visual quality than modern video codecs listed below.
  • More advanced video codec is H.265/HEVC, it provides the same visual quality at a lower bitrate than H.264 and NVidia GeForce GTX 10 Sires video cards can decode up to 8192x8192 @ 30 FPS with a built-in hardware video decoder and some mobile Exynos SoCs. Use the same audio settings and container format as for H.264 video codec. Limitations:
    • Browsers do not support this codec this means that you can not use it for direct web-browser streaming.
    • Windows 7 doesn’t support hardware accelerated video decoding for this codec.
    • Takes much longer to encode with software video encoder compared to H.264
  • VP9 is an alternative video encoder from Google, it is very close to H.265 in terms of video quality for the same bitrate and it is open source software. Use Vorbis audio codec and WebM container format. Limitations:
    • All Apple products don’t officially support it.
    • @TODO: Need to test hardware decoding/Codecs on Win 7 and Win 10
If you want to ensure that all your users will be able to watch your videos, you should provide several streaming options to them. The absolute minimum would be:
  • Original H.265 video in the highest resolution and framerate for high-end users.
  • H.264 4k video option for the maximum compatibility.
UPDATE APRIL 29, 2023:

I took some liberty to update your thread for you Groggy - sincerely TMBDF 😅

Great tip from @sandwiches regarding improving rate of detection of faces in high resolution footage (mostly VR that tends to feature some very small faces at times).

"The S3FD extractor file has a line that shrinks the frame it is looking at to a maximum dimension of 640 pixels to make it faster to find faces. For a normal video, this is usually okay, but faces take up a smaller percentage of the overall frame height in a 180 degree VR video - it just struggles find anything. In the S3FDExtract.py file (in \_internal\DeepFaceLab\facelib), line 187 is "scale_to = 640 if..." - I change it to "scale_to = 1080". I've tried a few different sizes, and 1080 seems like that is a good balance of finding faces without finding much else. Doing this slows down extraction quite a bit - you might want to make a copy of the original file."