Pre

## Loss ##
loss = C1 * pixel_loss + C2 * depth_loss
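
A minimal sketch of the weighted sum above in plain Python, assuming both losses are already reduced to scalars. The defaults `c1=1.0, c2=0.25` mirror the weights used in the experiments in these notes. `balancing_weight` is a hypothetical helper (not part of the original code) that shows how a weight like 0.25 can be picked so both terms contribute equally. The magnitudes `0.5` and `2.0` are illustrative only, not measured values.

```python
def combined_loss(pixel_loss, depth_loss, c1=1.0, c2=0.25):
    """Weighted sum: loss = C1 * pixel_loss + C2 * depth_loss."""
    return c1 * pixel_loss + c2 * depth_loss


def balancing_weight(pixel_loss, depth_loss):
    """Hypothetical helper: C2 such that C2 * depth_loss == pixel_loss (C1 = 1)."""
    return pixel_loss / depth_loss


# Illustrative magnitudes only (assumption), not values from these runs.
pixel, depth = 0.5, 2.0
c2 = balancing_weight(pixel, depth)                  # depth term is 4x larger, so C2 = 1/4
total = combined_loss(pixel, depth, c1=1.0, c2=c2)   # both terms now contribute 0.5
```

With a framework such as PyTorch the same expression works unchanged on scalar loss tensors, since it only uses `*` and `+`.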

exp4 | Nope 💔

[C1, C2] = [1, 0.25], chosen so that both loss terms have roughly the same numerical value (i.e. the same impact on the total loss)

Train DT = Train, Val DT = Val, lr = 0.001

| Audio Model | Audio Input | Image Model | Image Input | Others |
| --- | --- | --- | --- | --- |
| Net1D | 1D | ResNet 18 | 2q mask Image | Batch Size = 4 |

Decoder: temporal upconv.

exp5 | Sigmoid is bad

[C1, C2] = [1, 0.25]

Train DT = Train, Val DT = Val, lr = 0.0001

| Audio Model | Audio Input | Image Model | Image Input | Others |
| --- | --- | --- | --- | --- |
| ResNet 18 | 1D | ResNet 18 | 2q mask Image | Batch Size = 4 |
| Not Freeze, None | | Freeze, mp3d | | |

exp6 | Sigmoid is bad

[C1, C2] = [1, 0.25]

Train DT = Train, Val DT = Val, lr = 0.0001