Building an end-to-end panorama pipeline: capture overlapping photos, detect and match features, solve homographies, and use multi-band blending to hide seams for clean stitched results.
I captured three pairs of photographs by rotating the camera horizontally about a fixed center of projection. Each pair exhibits noticeable projective distortion that we will later remove by aligning with a homography.
The first image set is of window frames that have strong horizontal and vertical lines which should help constrain the homography:
The second image set was taken on a deck showing trees and buildings:
The third image set is of an indoor wall mural of Keith Haring art:
I implemented computeH in code/main.py using the direct linear transform. Each correspondence contributes two rows to the linear system Ah = b (shown below), where the eight unknowns are the first two rows of the homography plus the first two entries of the third row. The final entry h33 is fixed at 1.0 to remove the global scale ambiguity. I keep the coordinate convention from the correspondence tool, treating a point as (x, y) = (column, row) in image space.
With more than four correspondences per pair, the system becomes overdetermined. I solve it via numpy.linalg.lstsq, reshape the result back into a 3×3 matrix, and renormalize so that H[2,2] = 1. After solving I verify consistency by projecting the source points forward and reporting the residuals.
For one correspondence (x, y) → (u, v), with the eight unknowns stacked as h = [h11 h12 h13 h21 h22 h23 h31 h32]ᵀ:

[ x  y  1  0  0  0  -u·x  -u·y ]       [ u ]
[ 0  0  0  x  y  1  -v·x  -v·y ] · h = [ v ]
The first row enforces that the x-coordinate from image 1 maps to the observed x' in image 2 when scaled by the third row of H. The second row for the same correspondence enforces the y-coordinate mapping. Stacking all rows yields an overdetermined system that least squares can solve reliably.
Example (window set, first correspondence):

[ 2378.06  1097.05  1.  0.       0.       0.  -2983059.34  -1376149.06 ]
[ 0.       0.       0.  2378.06  1097.05  1.  -2458595.16  -1134202.52 ]
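Under these conventions, the solver can be sketched as follows (compute_h is a hypothetical stand-in for the computeH in code/main.py):

```python
import numpy as np

def compute_h(src, dst):
    """Fit H (3x3, H[2,2] = 1) mapping src -> dst from N >= 4 point
    pairs given as (x, y) = (column, row). Hypothetical stand-in for
    the computeH described above."""
    A, b = [], []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])  # x-mapping row
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])  # y-mapping row
        b.extend([u, v])
    h, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    H = np.append(h, 1.0).reshape(3, 3)   # h33 fixed at 1 removes scale ambiguity
    return H / H[2, 2]                    # renormalize so H[2,2] = 1
```

With more than four pairs the stacked system is overdetermined and lstsq returns the least-squares fit, exactly as described above.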
H_window ≈
[[ 1.0462e+01  -3.6050e-01  -1.6359e+04]
 [ 2.7010e+00   6.7346e+00  -7.1132e+03]
 [ 2.4000e-03  -2.0000e-04   1.0000e+00]]
RMSE = 24.12 px, max error = 65.54 px
The higher residual stems from the repetitive structure: several correspondences lie along nearly parallel lines, so small annotation noise produces large reprojection drift. I plan to smooth these inconsistencies during blending.
H_trees ≈
[[ 2.5257e+00  -1.7600e-02  -4.8181e+03]
 [ 4.4800e-01   2.0730e+00  -1.3329e+03]
 [ 4.0000e-04  -0.0000e+00   1.0000e+00]]
RMSE = 6.62 px, max error = 14.42 px
The railing gives good horizontal constraints, while tree trunks and building edges anchor vertical structure. The reprojection errors are within a few pixels, which is good for seamless mosaics.
H_mural ≈
[[ 7.5390e+00   6.1020e-01  -1.1433e+04]
 [ 1.3934e+00   5.2645e+00  -3.9517e+03]
 [ 1.6000e-03   1.0000e-04   1.0000e+00]]
RMSE = 4.57 px, max error = 8.65 px
The mural set gives the tightest fit. High-contrast edges and good perspective cues let least squares converge to a low-error solution that will function as a good rectification reference.
Across all three datasets, the homographies satisfy the projective model and will be reused unchanged for warping, rectification, and mosaicing in the following sections.
I implemented both warpImageNearestNeighbor and warpImageBilinear following the inverse-warping recipe, so that every output pixel looks up its source in the original image and no holes appear. For each call I:

1. Push the source corners through H to predict the output bounding box.
2. Map every output pixel back through H⁻¹ and evaluate the chosen interpolant at the resulting source coordinates.

The nearest-neighbor variant rounds the floating-point coordinates and is therefore fast but prone to aliasing. The bilinear variant takes a weighted blend of the four surrounding pixels, which eliminates stair-stepping at the cost of extra computation.
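The inverse-warp loop vectorizes cleanly in NumPy. A sketch of the bilinear variant, using a hypothetical warp_image_bilinear that takes a float image, the forward homography, and the output canvas shape:

```python
import numpy as np

def warp_image_bilinear(img, H, out_shape):
    """Inverse warp: every output pixel is sent through H^-1 back into
    the source image and sampled with bilinear weights, so no holes
    appear. Hypothetical sketch; img is a float array (rows, cols[, C])
    and H maps source (x, y) coordinates to output coordinates."""
    rows, cols = out_shape[:2]
    ys, xs = np.mgrid[0:rows, 0:cols]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(rows * cols)])
    src = np.linalg.inv(H) @ pts                     # back-project pixels
    sx, sy = src[0] / src[2], src[1] / src[2]
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    fx, fy = sx - x0, sy - y0                        # fractional offsets
    valid = ((x0 >= 0) & (y0 >= 0)
             & (x0 < img.shape[1] - 1) & (y0 < img.shape[0] - 1))
    out = np.zeros((rows * cols,) + img.shape[2:], dtype=img.dtype)
    x0, y0, fx, fy = x0[valid], y0[valid], fx[valid], fy[valid]
    if img.ndim == 3:                                # broadcast over channels
        fx, fy = fx[:, None], fy[:, None]
    out[valid] = (img[y0, x0] * (1 - fx) * (1 - fy)        # four-neighbor blend
                  + img[y0, x0 + 1] * fx * (1 - fy)
                  + img[y0 + 1, x0] * (1 - fx) * fy
                  + img[y0 + 1, x0 + 1] * fx * fy)
    return out.reshape((rows, cols) + img.shape[2:])
```

Replacing the four-neighbor blend with a rounding of (sx, sy) gives the nearest-neighbor variant.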
Rectifying the wall artwork turns the projective skew into an axis-aligned poster. Bilinear interpolation carries over high-frequency strokes cleanly, while nearest neighbor leaves stair-step artifacts on diagonals and text.
The bilinear resample keeps light reflections coherent and retains thin structural edges without the jaggedness visible in the nearest-neighbor output.
In both examples the mask-driven crop removes empty borders, leaving only the valid warped content. Overall, nearest neighbor is a useful debugging baseline, but the bilinear variant is the better choice for mosaicing thanks to its higher visual fidelity.
I keep the left image in each pair as the reference canvas and warp the right image into that frame using the precomputed homography H_BA. The blending routine follows this sequence:

1. Project the right image's corners through H_BA to bound the shared canvas.
2. Inverse-warp the right image onto the canvas with H⁻¹ and bilinear interpolation; a companion warp of an all-ones mask records which canvas pixels received valid samples.
3. Feather the overlap with distance-based weights and take the weighted average of the two layers.

The output is a single blended mosaic per pair with clean seams and no hard cut lines. Below, each row shows the two inputs (left to right) followed by the resulting mosaic.
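The feathering step can be sketched as follows, with a hypothetical feather_blend and SciPy's distance transform standing in for my actual weighting code:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def feather_blend(img_a, mask_a, img_b, mask_b):
    """Feathered weighted average: each image is weighted by its
    distance transform (distance to the nearest invalid pixel), so
    contributions fade to zero at warp borders and the seam in the
    overlap disappears. Sketch with hypothetical names; assumes both
    layers already live on the same canvas with 0/1 validity masks."""
    w_a = distance_transform_edt(mask_a)
    w_b = distance_transform_edt(mask_b)
    total = w_a + w_b
    total[total == 0] = 1.0                  # avoid 0/0 outside both masks
    if img_a.ndim == 3:                      # broadcast weights over channels
        w_a, w_b, total = w_a[..., None], w_b[..., None], total[..., None]
    return (img_a * w_a + img_b * w_b) / total
```

Because the weights fall off smoothly toward each mask boundary, brightness differences between the two captures fade into each other rather than meeting at a hard cut line.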
The robust bounding step keeps the warped frame manageable even though the homography extrapolates far outside the window panes. Weighted averaging hides the brightness difference between the two captures.
The shared horizontal railing is a strong anchor while the feather mask smoothly blends sky gradients and foliage without ghosting.
This scene contains vivid colors and repetitive patterns. The distance-transform weights keep the seam inside the artwork nearly invisible.
All mosaics reuse the same manual correspondences from A.2 and rely on the warping/blending stack which generalizes to additional image sets and lays the groundwork for Part B’s automatic pipeline.
I start by loading each RGB frame with my existing helper, convert it to a float32 grayscale image in [0, 1], and then run the supplied harris.get_harris_corners implementation. If the dependency is missing I fall back to an OpenCV pipeline (Harris response → Gaussian blur → 3×3 dilation for local maxima), applying a 20 pixel border mask in both cases. The routine keeps the response value for every peak so that later stages can reason about corner strength.
To cull redundant detections I implement Adaptive Non-Maximal Suppression with the same robustness heuristic used in the MOPS paper. Points are first sorted by Harris score; each point's suppression radius is the distance to the nearest corner whose response, scaled by c_robust = 0.9, still exceeds the point's own (i.e., a corner roughly 10% stronger). I retain the top K = 500 radii (ties broken by score) and render two visualizations:
Even in highly textured regions the adaptive radius thresholds prevent clusters of overlapping keypoints, which significantly reduces descriptor redundancy and RANSAC iteration counts.
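The suppression-radius computation can be sketched with a brute-force O(N²) pairwise distance matrix (anms is a hypothetical name; the real code works on the Harris peaks and responses from the previous step):

```python
import numpy as np

def anms(coords, scores, k=500, c_robust=0.9):
    """Adaptive non-maximal suppression: each point's radius is the
    distance to the nearest corner whose response, scaled by c_robust,
    still beats its own; keep the k largest radii. Hypothetical O(N^2)
    sketch."""
    coords = np.asarray(coords, float)               # (N, 2) positions
    scores = np.asarray(scores, float)               # (N,) Harris responses
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    # suppressor[i, j]: corner j is sufficiently stronger than corner i
    suppressor = c_robust * scores[None, :] > scores[:, None]
    radii = np.sqrt(np.where(suppressor, d2, np.inf).min(axis=1))
    keep = np.argsort(-radii)[:k]                    # global max gets radius inf
    return coords[keep], radii[keep]
```

Sorting by radius rather than raw score is what spreads the surviving keypoints across the image instead of clustering them on the strongest texture.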
I follow the MOPS recipe as closely as possible while remaining axis-aligned. For every ANMS survivor I reflect-pad the grayscale image by 20 pixels so that a full 40×40 window is always available, then use cv2.getRectSubPix to sample that patch at sub-pixel precision. The patch is downsampled to an 8×8 descriptor with INTER_AREA filtering, flattened to 64 entries, mean-centered, and gain-normalized (divide by standard deviation plus 1e-6). Completely flat regions fall back to an all-zero descriptor so later stages never encounter NaNs.
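A NumPy-only sketch of the descriptor extraction (mops_descriptor is hypothetical; it pools over integer-centered windows, whereas the real code samples sub-pixel centers with cv2.getRectSubPix and blurs before downsampling):

```python
import numpy as np

def mops_descriptor(gray, x, y, window=40, out=8):
    """Axis-aligned MOPS-style descriptor around an integer center
    (x, y): crop a window x window patch, area-pool to out x out,
    flatten, then bias/gain-normalize. Hypothetical sketch; assumes
    the caller has already padded gray so the full window fits."""
    r = window // 2
    patch = gray[y - r:y + r, x - r:x + r]
    s = window // out
    small = patch.reshape(out, s, out, s).mean(axis=(1, 3))  # s x s area pooling
    vec = small.ravel() - small.mean()               # remove bias
    sd = vec.std()
    # flat regions fall back to all zeros so later stages never see NaNs
    return vec / (sd + 1e-6) if sd > 0 else np.zeros_like(vec)
```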
Running this gave 100 valid descriptors from the 100 ANMS points in mural1. Below I show 25 randomly sampled descriptors with a fixed seed for reproducibility.
The montage shows that descriptors capture blurred structure around corners (e.g., bold outlines and floor tiles) while suppressing low-frequency bias. Many patches are illumination invariant due to the gain normalization.
I implement descriptor matching by forming the full pairwise squared-distance matrix between the normalized 8×8 patches from both images. The computation uses the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b. For every feature in image A I then identify the nearest and second-nearest neighbors in image B via argpartition, convert the squared distances to Euclidean distances, and form Lowe's ratio r = d1 / d2.
A match is accepted when r < τ, with τ = 0.70. Each caption below reports the match count and the median Lowe ratio for the accepted correspondences.
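The matcher described above can be sketched as follows (match_descriptors is a hypothetical name for my matching routine):

```python
import numpy as np

def match_descriptors(da, db, tau=0.70):
    """Accept a match when Lowe's ratio d1/d2 falls below tau.
    da: (Na, 64), db: (Nb, 64) normalized descriptors; returns an
    (M, 2) array of (index_a, index_b) pairs. Hypothetical sketch."""
    # squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (np.sum(da ** 2, axis=1)[:, None]
          + np.sum(db ** 2, axis=1)[None, :]
          - 2.0 * da @ db.T)
    d2 = np.maximum(d2, 0.0)                      # clamp tiny fp negatives
    two = np.argpartition(d2, 1, axis=1)[:, :2]   # two smallest, unordered
    rows = np.arange(len(da))
    cand = d2[rows[:, None], two]
    order = np.argsort(cand, axis=1)
    d_sorted = np.sqrt(np.take_along_axis(cand, order, axis=1))
    ratio = d_sorted[:, 0] / (d_sorted[:, 1] + 1e-12)
    best = two[rows, order[:, 0]]                 # nearest neighbor in B
    keep = ratio < tau
    return np.stack([rows[keep], best[keep]], axis=1)
```

Using argpartition keeps the per-feature cost at O(Nb) instead of a full sort, which matters once both images carry hundreds of descriptors.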
The automatically matched correspondences line up with the manual picks from Part A.
I now replace the manual correspondences with automatic matches and estimate each homography via a four-point RANSAC loop. Each iteration pulls four unique matches with a fixed RNG seed, rejects any degenerate hypothesis (duplicate points, rank-deficient sets, or triangle areas below 1e-3), fits H_BA using computeH, and scores every correspondence using the symmetric transfer error √(‖H·b − a‖² + ‖H⁻¹·a − b‖²). Inliers satisfy an error threshold of 3 px. The loop retains the best model by (inlier count, mean inlier error) and tightens the iteration budget with N = log(1 − 0.99) / log(1 − r⁴), where r is the best inlier ratio so far.
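The loop can be sketched as follows (ransac_h and fit_h are hypothetical stand-ins; this sketch folds in the final inlier refit and replaces the triangle-area degeneracy test with a simpler determinant check):

```python
import numpy as np

def fit_h(src, dst):
    """Least-squares DLT with h33 fixed to 1 (same convention as computeH)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A += [[x, y, 1, 0, 0, 0, -u * x, -u * y],
              [0, 0, 0, x, y, 1, -v * x, -v * y]]
        b += [u, v]
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def project(H, pts):
    """Apply H to (N, 2) points and dehomogenize."""
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_h(pa, pb, thresh=3.0, conf=0.99, max_iters=2000, seed=0):
    """4-point RANSAC with symmetric transfer error and the adaptive
    budget N = log(1 - conf) / log(1 - r^4). Returns (H, inlier_mask)
    with H mapping pb -> pa. Hypothetical sketch."""
    rng = np.random.default_rng(seed)
    n = len(pa)
    best_inl, best_H, needed, it = np.zeros(n, bool), None, max_iters, 0
    while it < min(needed, max_iters):
        it += 1
        idx = rng.choice(n, size=4, replace=False)
        H = fit_h(pb[idx], pa[idx])
        if abs(np.linalg.det(H)) < 1e-10:
            continue                          # degenerate sample
        err = np.sqrt(np.sum((project(H, pb) - pa) ** 2, axis=1)
                      + np.sum((project(np.linalg.inv(H), pa) - pb) ** 2, axis=1))
        inl = err < thresh
        if inl.sum() > best_inl.sum():
            best_inl, best_H = inl, H
            r = inl.mean()
            if 0.0 < r < 1.0:                 # tighten the iteration budget
                needed = np.log(1.0 - conf) / np.log(1.0 - r ** 4)
    if best_inl.sum() >= 4:
        best_H = fit_h(pb[best_inl], pa[best_inl])   # refit on all inliers
    return best_H, best_inl
```

Scoring with the symmetric error in both directions prevents the degenerate case where H fits forward projections well but its inverse drifts badly.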
Once the loop finishes I refit H with the inliers, recompute errors, and feed the result into the existing feathered mosaicing routine. Summary statistics for the three scenes are:
Scene   | Matches | Inliers | Inlier ratio | Mean error | Iterations
Mural   | 26      | 17      | 65.4%        | 0.90 px    | 23
Trees   | 65      | 36      | 55.4%        | 1.64 px    | 87
Window  | 69      | 47      | 68.1%        | 1.34 px    | 44
The three automatic mosaics are visually indistinguishable from their manual counterparts. Symmetric-transfer scoring, degeneracy checks, and adaptive stopping keep RANSAC efficient (tens of iterations) even with dozens of matches per scene.