For the toy problem, we will reconstruct the toy image using its gradients and pixel intensities. Essentially, there are 3 objectives we are trying to fulfill:
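With s the original toy image and v the image we are solving for, these objectives can be written as:

$$\min_v \sum_{x,y} \big((v(x+1,y) - v(x,y)) - (s(x+1,y) - s(x,y))\big)^2$$

$$\min_v \sum_{x,y} \big((v(x,y+1) - v(x,y)) - (s(x,y+1) - s(x,y))\big)^2$$

$$\min_v \big(v(1,1) - s(1,1)\big)^2$$

In other words, the x-gradients of v should match those of s, the y-gradients should match, and the top-left pixel intensities should agree so the overall brightness is pinned down.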
In Poisson blending, the goal is to take a cutout of some source image and seamlessly paste it into a target image, blending the edges and keeping the gradients similar. To do this, we want to satisfy the constraint below. Let S be the cutout region of our source image (placed in the coordinate frame of the target image) and let N_i be the four neighbors below, above, right, and left of the current pixel i.
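This is the standard Poisson blending least-squares objective, with v the new pixel values we solve for inside S, s the source image, and t the target image:

$$\mathbf{v} = \arg\min_{\mathbf{v}} \sum_{i \in S,\; j \in N_i \cap S} \big((v_i - v_j) - (s_i - s_j)\big)^2 \;+\; \sum_{i \in S,\; j \in N_i,\; j \notin S} \big((v_i - t_j) - (s_i - s_j)\big)^2$$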
Algorithmically, we loop through the pixels and compute their values depending on which case they fall under. Pixels will either lie outside the mask S (in which case we simply keep the target value), lie inside S with all four neighbors also inside S (in which case we constrain the gradients to match the source), or lie inside S on its boundary with at least one neighbor outside S (in which case that neighbor's value comes directly from the target).
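A minimal sketch of this loop for a single color channel, assuming mask, source, and target are aligned arrays of the same shape and that the mask does not touch the image border; the sparse-matrix construction and the use of scipy's lsqr are assumptions about the implementation rather than a transcription of it.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def poisson_blend_channel(source, target, mask):
    """Solve the Poisson blending least squares for one color channel."""
    h, w = target.shape
    idx = lambda y, x: y * w + x            # flatten a 2D pixel into a variable index
    rows, cols, vals, b = [], [], [], []
    eq = 0
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                # outside S: pin the pixel to the target value
                rows.append(eq); cols.append(idx(y, x)); vals.append(1.0)
                b.append(target[y, x]); eq += 1
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                grad = source[y, x] - source[ny, nx]
                rows.append(eq); cols.append(idx(y, x)); vals.append(1.0)
                if mask[ny, nx]:
                    # neighbor also in S: (v_i - v_j) should match the source gradient
                    rows.append(eq); cols.append(idx(ny, nx)); vals.append(-1.0)
                    b.append(grad)
                else:
                    # boundary: the neighbor's value comes from the target
                    b.append(grad + target[ny, nx])
                eq += 1
    A = sp.csr_matrix((vals, (rows, cols)), shape=(eq, h * w))
    v = lsqr(A, np.array(b))[0]
    return v.reshape(h, w)
```

Pixels outside S are pinned to the target values here; equivalently, they could be copied in directly after solving only for the pixels inside S.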
Each blended image also requires an offset variable bg_ul that we use to decide where in the background the object will be placed. I placed the penguin at (700, 400), the fish at (200, 200), and the plane at (250, 350). I noticed that my blends performed better when the background color matched the object itself. I also used more precise masks with a smaller border around the outside, which resulted in darker outputs because less of the background was included in the masked region. Overall, though, the blends worked very well. Here are two more examples of Poisson blending:
For mixed gradients, we perform Poisson blending similarly but with a modified constraint equation:
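Using the same notation as above, the modified objective replaces the source gradient with a mixed gradient d_ij:

$$\mathbf{v} = \arg\min_{\mathbf{v}} \sum_{i \in S,\; j \in N_i \cap S} \big((v_i - v_j) - d_{ij}\big)^2 \;+\; \sum_{i \in S,\; j \in N_i,\; j \notin S} \big((v_i - t_j) - d_{ij}\big)^2$$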
In this case, we are mixing the gradients of the source and target images. To do this, we compare the absolute value of each source gradient against the absolute value of the corresponding target gradient and keep whichever signed gradient has the larger magnitude; in the above equation, d_ij is that larger gradient, as in the sketch following this paragraph. Below are our results using mixed gradients compared to regular Poisson blending. As you can see, the results tend to more closely match the background and hence are a little lighter.
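As a concrete illustration of the selection just described, for a single pair of neighboring pixels (variable names are illustrative):

```python
# keep whichever signed gradient has the larger magnitude (mixed gradients)
s_grad = source[i] - source[j]
t_grad = target[i] - target[j]
d_ij = s_grad if abs(s_grad) >= abs(t_grad) else t_grad
```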
Normally when reconstructing an image, we convert it from color to grayscale using cv2.cvtColor(image, cv2.COLOR_BGR2GRAY). However, for images such as color blindness tests, this grayscale conversion destroys the contrast that carries the image's information.
To resolve this, we can try converting the image to the HSV space:
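A minimal sketch of that conversion with OpenCV, assuming image is a BGR array loaded with cv2.imread:

```python
import cv2

hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
value = hsv[:, :, 2]  # the V (value) channel keeps the intensity contrast
```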
By using the V channel, which preserves the intensities and information of the original image while serving as its grayscale representation, we're able to reconstruct the image properly with an MSE of 4.882742657242126e-13.
Using a series of rectified images of a chess board captured over a regularly spaced camera grid, I can refocus the image at different depths by shifting and averaging. Simply averaging all the images in the grid produces blurriness in the front and sharpness in the back, because nearby objects shift position between views far more than faraway objects do. However, we're able to choose where the image is sharp by shifting the images appropriately with np.roll before averaging. By using different depth_factor values, we're able to choose how close to the camera the sharpness is.
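A minimal sketch of this shift-and-average refocusing, assuming the sub-aperture images and their (u, v) grid positions are already loaded into arrays; the sign convention and the scaling applied to depth_factor are assumptions.

```python
import numpy as np

def refocus(images, positions, depth_factor):
    """Shift every sub-aperture image toward the grid center, then average.

    images:    (N, H, W, 3) array of lightfield views
    positions: (N, 2) array of each view's (u, v) position on the camera grid
    """
    center = positions.mean(axis=0)
    out = np.zeros(images[0].shape, dtype=np.float64)
    for img, pos in zip(images, positions):
        du, dv = depth_factor * (center - pos)
        out += np.roll(img, (int(round(du)), int(round(dv))), axis=(0, 1))
    return out / len(images)
```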
We're able to mimic variations in aperture by averaging over a different number of images. Modifying my refocus function to take an aperture parameter, I choose a smaller subset of images close to the center of the grid for a smaller aperture (which still focuses on the same point), and a larger subset for a larger aperture. The results are shown below.
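Building on the refocus sketch above, the aperture parameter might look like the following; measuring "closeness to the center" by Euclidean distance on the grid is an assumption.

```python
def refocus_with_aperture(images, positions, depth_factor, radius):
    """Keep only the views whose grid position lies within `radius` of the
    grid center, mimicking a smaller aperture, then refocus as before."""
    center = positions.mean(axis=0)
    keep = np.linalg.norm(positions - center, axis=1) <= radius
    return refocus(images[keep], positions[keep], depth_factor)
```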
I am fascinated by how, using only images taken over a regularly spaced grid, we are able to change the visualization of the object in ways analogous to settings on a camera, including focus depth and aperture. These refocusing techniques required nothing more than shifting and averaging the images, which demonstrates how powerful lightfield data is.
For interactive refocusing, we want to grant the user the ability to select an input point and make the pixel shifts that bring a patch around that point into focus. We select the point using plt.ginput. Fixing a reasonable aperture, we then optimize the shifts by searching over depth_factor values for the one that best focuses the selected patch. Each selected point is displayed in red and the refocused results are shown below.
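A sketch of that interaction built on the refocus function above; the search range for depth_factor and the Laplacian-variance focus measure are assumptions rather than the exact method used.

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

plt.imshow(refocus(images, positions, 0.0).astype(np.uint8))
(x, y), = plt.ginput(1)                      # user clicks the point to focus on
x, y = int(x), int(y)

best_depth, best_score = 0.0, -np.inf
for d in np.linspace(-0.5, 0.5, 41):         # candidate depth_factor values
    img = refocus(images, positions, d)
    patch = img[y - 40:y + 40, x - 40:x + 40].astype(np.float32)
    score = cv2.Laplacian(patch, cv2.CV_32F).var()   # sharper patch -> higher variance
    if score > best_score:
        best_depth, best_score = d, score

focused = refocus(images, positions, best_depth)
```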
Neural radiance fields (NeRF) map {x, y, z, θ, φ} to {r, g, b, σ}; in other words, they map 3D position and viewing direction to color and density. In 2D, however, we map pixel coordinates (u, v) to {r, g, b} values. Here is an image of a fox we will use to apply this 2D NeRF-style network to.
The architecture of my model is as follows:
First, I randomly sampled N = 1000 pixels for training, giving me a 1000 x 2 matrix of 2D coordinates as well as a 1000 x 3 matrix of their respective rgb colors. These are normalized and converted into tensors in the random_sample_pixels method. Then, I implemented sinusoidal positional encoding (PE) to expand the dimensionality of my randomly sampled input coordinates from 2D to a 42-dimensional vector. Finally, to build my MLP model, I chained together a series of layers with nonlinear activations in between, which includes 3 ReLU operations and then a sigmoid to clamp the values. I specify the input dimension as 42 (the output of PE) and the final output dimension as 3, because I am outputting the 3 rgb channels to get a valid pixel color.
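A sketch of that positional encoding and MLP in PyTorch; the hidden-layer width (256 here) and the exact frequency scaling are assumptions, since the write-up only fixes the 42-dimensional input and 3-dimensional output.

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal PE: (u, v) in [0, 1]^2 -> 2 + 2 * 2 * L dims (42 for L = 10)."""
    def __init__(self, L=10):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(L).float() * torch.pi)

    def forward(self, x):                          # x: (N, 2)
        angles = x[..., None] * self.freqs         # (N, 2, L)
        enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return torch.cat([x, enc.flatten(start_dim=1)], dim=-1)   # (N, 42)

class PixelMLP(nn.Module):
    """MLP mapping a 42-d encoded coordinate to an rgb color in [0, 1]."""
    def __init__(self, in_dim=42, hidden=256):     # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),    # sigmoid clamps to valid rgb
        )

    def forward(self, coords):
        return self.net(coords)
```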
Below is a visualization of the training process, showing the predicted image across iterations with L = 10 (the highest frequency level), an MSE loss function, the Adam optimizer with learning rate 1e-2, a batch size of 10,000, and 3000 training iterations.
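A minimal training-loop sketch using those hyperparameters and the modules above; sample_batch is a hypothetical stand-in for the write-up's random_sample_pixels method, drawing 10,000 normalized (u, v) coordinates and their rgb targets each iteration.

```python
model = PixelMLP()
encoder = PositionalEncoding(L=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(3000):
    # sample_batch (hypothetical) returns coords of shape (10000, 2) and
    # targets of shape (10000, 3), both as float tensors in [0, 1]
    coords, targets = sample_batch(image, batch_size=10_000)
    pred = model(encoder(coords))
    loss = loss_fn(pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```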
Here are the side-by-side results.
With these hyperparameters, I achieved a PSNR of 25.96.
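For reference, PSNR is computed from the reconstruction MSE; assuming pixel values normalized to [0, 1], this is

$$\text{PSNR} = 10 \cdot \log_{10}\!\left(\frac{1}{\text{MSE}}\right)$$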
One hyperparameter I modified was the maximum frequency L for positional encoding; instead of 10 I used L = 5. We can see the striations are not as thin and the overall picture is blurrier. The sharpness of the details decreased significantly because only lower-frequency components are used, which diminished the performance of my network. With this hyperparameter modification, I achieved a lower PSNR of 24.92.
I also modified the learning rate to see how this would affect my network, using a learning rate of 1e-3 instead of 1e-2. This yielded slightly better results with a PSNR of 26.12, and its training curve is also much more stable. The details are slightly sharper than with a learning rate of 1e-2; for example, the hairs on top of the fox are more defined and there is more contrast with 1e-3 than with 1e-2.
Here I run the model on two images of my own choosing with these hyperparameters: L = 10, MSE loss function, Adam optimizer with learning rate 1e-3, batch size 10,000, trained over 3000 iterations. The bunny reconstruction achieved a PSNR of 29.90 and the sunflower a PSNR of 23.43.
Website Template: w3.css