Automatic Panorama Stitching and Seam Blending

Introduction

Building an end-to-end panorama pipeline: capture overlapping photos, detect and match features, solve homographies, and use feathered blending to hide seams for clean stitched results.

A.1 · Shoot the Pictures

I captured three pairs of photographs by rotating the camera horizontally about a fixed center of projection. Each pair exhibits noticeable projective distortion that we will later align via a homography.

The first image set is of window frames that have strong horizontal and vertical lines which should help constrain the homography:

Window alcove left view
Window · View A
Window alcove right view
Window · View B

The second image set was taken on a deck showing trees and buildings:

Redwood grove left view
Trees · View A
Redwood grove right view
Trees · View B

The third image set is of an indoor wall mural of Keith Haring art:

Mural left view
Mural · View A
Mural right view
Mural · View B

A.2 · Recover Homographies

I implemented computeH in code/main.py using the direct linear transform. Each correspondence contributes two rows to the linear system Ah = b (shown below), where the eight unknowns are the first two rows of the homography plus the first two entries of the third row. The final entry h33 is fixed at 1.0 to remove the global scale ambiguity. I keep the coordinate convention from the correspondence tool, treating a point as (x, y) = (column, row) in image space.

With more than four correspondences per pair, the system becomes overdetermined. I solve it via numpy.linalg.lstsq, reshape the result back into a 3×3 matrix, and renormalize so that H[2,2] = 1. After solving I verify consistency by projecting the source points forward and reporting the residuals.

Linear System Structure

[x  y  1  0  0  0  -u*x  -u*y] · [h11 h12 h13 h21 h22 h23 h31 h32]ᵀ = u
[0  0  0  x  y  1  -v*x  -v*y] · [h11 h12 h13 h21 h22 h23 h31 h32]ᵀ = v

The first row enforces that the x-coordinate from image 1 maps to the observed u in image 2 when scaled by the third row of H; the second row for the same correspondence enforces the y-to-v mapping. Stacking all rows yields an overdetermined system that least squares solves reliably.

Example (window set, first correspondence):
[2378.06 1097.05    1.      0.      0.      0.  -2983059.34 -1376149.06]
[   0.      0.      0.   2378.06 1097.05    1.  -2458595.16 -1134202.52]
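
Under these conventions, the least-squares setup can be sketched as follows (the name computeH comes from code/main.py; the body here is a minimal reconstruction, not the project's exact implementation):

```python
import numpy as np

def computeH(src, dst):
    """Fit a homography H (with H[2,2] = 1) from N >= 4 correspondences.

    src, dst: (N, 2) arrays of (x, y) = (column, row) points with dst ~ H @ src.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    # solve the overdetermined 2N x 8 system in the least-squares sense
    h, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    H = np.append(h, 1.0).reshape(3, 3)
    return H / H[2, 2]  # renormalize so H[2,2] = 1
```

With exact, noise-free correspondences the recovered matrix matches the generating homography to machine precision.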

Window

Window correspondences
26 correspondences
H_window ~
[[ 1.0462e+01 -3.6050e-01 -1.6359e+04]
 [ 2.7010e+00  6.7346e+00 -7.1132e+03]
 [ 2.4000e-03 -2.0000e-04  1.0000e+00]]
RMSE = 24.12 px -- max error = 65.54 px

The higher residual comes from the repetitive structure: several correspondences lie on nearly parallel lines, so small annotation noise creates large reprojection drift. I rely on blending to smooth these inconsistencies.

Trees

Trees correspondences
30 correspondences
H_trees ~
[[ 2.5257e+00 -1.7600e-02 -4.8181e+03]
 [ 4.4800e-01  2.0730e+00 -1.3329e+03]
 [ 4.0000e-04 -0.0000e+00  1.0000e+00]]
RMSE = 6.62 px -- max error = 14.42 px

The railing gives good horizontal constraints, while tree trunks and building edges anchor vertical structure. The reprojection errors stay within a few pixels, which is tight enough for seamless mosaicing.

Mural

Mural correspondences
32 correspondences
H_mural ~
[[ 7.5390e+00  6.1020e-01 -1.1433e+04]
 [ 1.3934e+00  5.2645e+00 -3.9517e+03]
 [ 1.6000e-03  1.0000e-04  1.0000e+00]]
RMSE = 4.57 px -- max error = 8.65 px

The mural set gives the tightest fit. High-contrast edges and good perspective cues let least squares converge to a low-error solution that will function as a good rectification reference.

Across all three datasets, the homographies satisfy the projective model and will be reused unchanged for warping, rectification, and mosaicing in the following sections.

A.3 · Warp the Images

I implemented both warpImageNearestNeighbor and warpImageBilinear following the inverse-warping recipe: every pixel in the output looks up its source in the original image, which avoids holes. For each call I:

  • Project the four source corners through H to predict the output bounding box.
  • Create a dense grid over that box, transform every location back with H⁻¹, and sample the source image with the chosen interpolant.
  • Warp an all-ones mask to track which output pixels received valid samples and crop to the tightest rectangle containing data.

The nearest-neighbor variant rounds the floating-point coordinates and is therefore fast but prone to aliasing. The bilinear version uses a four-neighbor weighted blend, which smooths out stair-stepping at the cost of extra computation.
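
A minimal numpy sketch of the bilinear inverse warp (canvas planning and cropping are omitted; `out_shape` is assumed to come from the corner-projection step, and the function name is illustrative):

```python
import numpy as np

def warp_image_bilinear(img, H, out_shape):
    """Inverse-warp img into an output canvas of shape (rows, cols).

    Every output pixel is mapped back through H^-1 and filled with a
    four-neighbor weighted blend. Returns the warped image and a mask of
    pixels that received valid samples.
    """
    h_out, w_out = out_shape
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # (x, y, 1)
    src = np.linalg.inv(H) @ grid
    sx, sy = src[0] / src[2], src[1] / src[2]                    # dehomogenize

    h, w = img.shape[:2]
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    valid = (x0 >= 0) & (x0 < w - 1) & (y0 >= 0) & (y0 < h - 1)
    x0c, y0c = np.clip(x0, 0, w - 2), np.clip(y0, 0, h - 2)
    fx, fy = (sx - x0c)[:, None], (sy - y0c)[:, None]            # fractional offsets

    img = img.reshape(h, w, -1).astype(np.float64)
    top = img[y0c, x0c] * (1 - fx) + img[y0c, x0c + 1] * fx
    bot = img[y0c + 1, x0c] * (1 - fx) + img[y0c + 1, x0c + 1] * fx
    out = top * (1 - fy) + bot * fy
    out[~valid] = 0
    return out.reshape(h_out, w_out, -1), valid.reshape(h_out, w_out)
```

With the identity homography, interior pixels come back unchanged, which makes a handy sanity check while debugging.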

Wall Tarp Rectification

Wall source image
Wall rectified with nearest neighbor
Nearest neighbor: text strokes are jagged from hard rounding.
Wall rectified with bilinear
Bilinear: lines are crisp and lettering regains straight edges.

Rectifying the wall artwork turns the projective skew into an axis-aligned poster. Bilinear interpolation carries over high-frequency strokes cleanly, while nearest neighbor leaves stair-step artifacts on diagonals and text.

Boat Photograph Rectification

Boat source image
Boat rectified with nearest neighbor
Nearest neighbor: ripples and rails show aliasing and staircases.
Boat rectified with bilinear
Bilinear: smooth gradients and straight rails with minimal artifacts.

The bilinear resample keeps light reflections coherent and retains thin structural edges without the jaggedness visible in the nearest-neighbor output.

In both examples the mask-driven crop removes empty borders, leaving only the valid warped content. Overall, nearest neighbor is useful as a debugging baseline, but the bilinear variant is the better choice for mosaicing thanks to its visual fidelity.

A.4 · Blend the Images into a Mosaic

I keep the left image in each pair as the reference canvas and warp the right image into that frame using the precomputed homography H_BA. The blending routine follows this sequence:

  • Robust canvas planning: I project a dense grid of source pixels through the homography and discard outliers with a MAD filter. The surviving footprint defines a tight bounding box so I never allocate the enormous canvases that the raw corner projection can suggest.
  • Inverse-warp with bilinear sampling: Every pixel in the warped image is filled by walking backward with H⁻¹ and bilinear interpolation. A companion warp of an all-ones mask records which canvas pixels received valid samples.
  • Distance-transform alphas: I assign the reference image an alpha mask derived from a distance transform, with full weight in the interior tapering to zero at the border. The warped image receives the same processing on its footprint. These soft masks produce a smooth, feathered transition in overlap regions.
  • Gain compensation & cropping: A per-channel least-squares gain estimated on the overlapping pixels removes exposure mismatches. I crop to the tightest bounding box where the accumulated alpha exceeds a small threshold.

The output is a single blended mosaic per pair with clean seams and no hard cut lines. Below, each row shows the two inputs (left-to-right) followed by the resulting mosaic.

Window

Window view A
Window · View A
Window view B
Window · View B
Window mosaic
Feathered mosaic — sash columns align without visible seams.

The robust bounding step keeps the warped frame manageable even though the homography extrapolates far outside the window panes. Weighted averaging hides the brightness difference between the two captures.

Trees

Trees view A
Trees · View A
Trees view B
Trees · View B
Trees mosaic
Feathered mosaic — the deck railing stitches into a straight line.

The shared horizontal railing is a strong anchor while the feather mask smoothly blends sky gradients and foliage without ghosting.

Mural

Mural view A
Mural · View A
Mural view B
Mural · View B
Mural mosaic
Feathered mosaic — floor tiles and ceiling trim align cleanly.

This scene contains vivid colors and repetitive patterns. The distance-transform weights keep the seam inside the artwork nearly invisible.

All mosaics reuse the same manual correspondences from A.2 and rely on the warping/blending stack which generalizes to additional image sets and lays the groundwork for Part B’s automatic pipeline.

B.1 · Harris Corner Detection

I start by loading each RGB frame with my existing helper, convert it to a float32 grayscale image in [0, 1], and then run the supplied harris.get_harris_corners implementation. If the dependency is missing I fall back to an OpenCV pipeline (Harris response → Gaussian blur → 3×3 dilation for local maxima), applying a 20-pixel border mask in both cases. The routine keeps the response value for every peak so that later stages can reason about corner strength.

To cull redundant detections I implement Adaptive Non-Maximal Suppression with the robustness heuristic from the MOPS paper. Points are first sorted by Harris score; each point's suppression radius is the distance to the nearest corner j satisfying f_i < c_robust · f_j with c_robust = 0.9, i.e. a neighbor whose response is at least about 11% stronger. I retain the top K = 500 radii (ties broken by score) and render two visualizations:

Trees

Trees Harris corners
Over 22k raw Harris detections.
Trees ANMS corners
ANMS smooths the distribution and keeps strong responses on the trunks and skyline.

Even in highly textured regions the adaptive radius thresholds prevent clusters of overlapping keypoints, which significantly reduces descriptor redundancy and RANSAC iteration counts.
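
The ANMS pass described above can be sketched as a quadratic-time loop (function and parameter names are illustrative):

```python
import numpy as np

def anms(coords, scores, k=500, c_robust=0.9):
    """Adaptive non-maximal suppression (MOPS-style), O(n^2) sketch.

    coords: (N, 2) corner positions; scores: (N,) Harris responses.
    Keeps the k points with the largest suppression radii.
    """
    order = np.argsort(-scores)             # strongest first
    coords, scores = coords[order], scores[order]
    n = len(coords)
    radii = np.full(n, np.inf)              # global max keeps infinite radius
    for i in range(n):
        # suppressors of i: their response, scaled by c_robust, still beats score[i]
        stronger = scores[:i] * c_robust > scores[i]
        if stronger.any():
            d2 = np.sum((coords[:i][stronger] - coords[i]) ** 2, axis=1)
            radii[i] = np.sqrt(d2.min())
    keep = np.argsort(-radii)[:k]
    return coords[keep], radii[keep]
```

Sorting by score first means each point only needs to check earlier (stronger) points for suppressors.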

B.2 · Feature Descriptor Extraction

I follow the MOPS recipe as closely as possible while remaining axis-aligned. For every ANMS survivor I reflect-pad the grayscale image by 20 pixels so that a full 40×40 window is always available, then use cv2.getRectSubPix to sample that patch at sub-pixel precision. The patch is downsampled to an 8×8 descriptor with INTER_AREA filtering, flattened to 64 entries, mean-centered, and gain-normalized (divide by standard deviation plus 1e-6). Completely flat regions fall back to an all-zero descriptor so later stages never encounter NaNs.

  • Inputs: float32 grayscale in [0, 1] and ANMS coordinates (x, y).
  • Safety margin: 20 px reflect padding guarantees valid support.
  • Sampling: extract the 40×40 window at sub-pixel precision via getRectSubPix, then blur-downsample to 8×8.
  • Normalization: subtract mean, divide by stddev; zero out near-constant patches.

Running this gave 100 valid descriptors from the 100 ANMS points in mural1. Below I show 25 randomly sampled descriptors with a fixed seed for reproducibility.

Descriptor sample overlay
Overlay of the 25 sampled descriptor centers on the mural image.
Descriptor montage
Each 8×8 descriptor after mean/std normalization, upscaled for visibility.

The montage shows that descriptors capture blurred structure around corners (e.g., bold outlines and floor tiles) while suppressing low-frequency bias. Many patches are illumination invariant due to the gain normalization.

B.3 · Feature Matching

I implement descriptor matching by forming the full pairwise squared-distance matrix between the normalized 8×8 patches from both images. The computation uses the identity ‖a−b‖² = ‖a‖² + ‖b‖² − 2a·b. For every feature in image A I then identify the nearest and second-nearest neighbors in image B via argpartition, convert the squared distances to Euclidean distances, and form Lowe's ratio r = d₁ / d₂.

  • Reject candidates whenever image B offers fewer than two descriptors or when the second-best distance underflows.
  • Keep matches satisfying r < τ with τ = 0.70.
  • Apply a mutual check: B’s own nearest neighbor back into A must equal the proposing feature, eliminating most collisions.
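
The rules above can be sketched as a minimal matcher (names are illustrative):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, tau=0.70):
    """Lowe-ratio + mutual-nearest matching.

    desc_a: (Na, D), desc_b: (Nb, D). Returns a list of (i, j) index pairs.
    """
    if len(desc_b) < 2:
        return []                                  # need a second-best distance
    # squared distances via the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    dist2 = (np.sum(desc_a ** 2, axis=1)[:, None]
             + np.sum(desc_b ** 2, axis=1)[None, :]
             - 2.0 * desc_a @ desc_b.T)
    dist2 = np.maximum(dist2, 0.0)                 # clamp negative round-off
    nn = np.argpartition(dist2, 1, axis=1)[:, :2]  # two smallest per row, unordered
    b_best = np.argmin(dist2, axis=0)              # B's nearest neighbor back in A
    matches = []
    for i, (j0, j1) in enumerate(nn):
        if dist2[i, j0] > dist2[i, j1]:
            j0, j1 = j1, j0                        # order best / second-best
        d1, d2 = np.sqrt(dist2[i, j0]), np.sqrt(dist2[i, j1])
        if d2 < 1e-12:
            continue                               # second-best underflow
        if d1 / d2 < tau and b_best[j0] == i:      # ratio test + mutual check
            matches.append((i, j0))
    return matches
```

The mutual check discards a match whenever B's best candidate already belongs to a different feature in A, which removes most many-to-one collisions.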

Each caption below reports the match count and the median Lowe ratio for the accepted correspondences.

Window

Window feature matches
46 matches (median r = 0.52)

Trees

Trees feature matches
39 matches (median r = 0.55)

Mural

Mural feature matches
26 matches (median r = 0.48)

The automatically matched correspondences line up with the manual picks from Part A.

B.4 · RANSAC for Robust Homography

I now replace the manual correspondences with automatic matches and estimate each homography via a four-point RANSAC loop. Each iteration pulls four unique matches with a fixed RNG seed, rejects any degenerate hypothesis (duplicate points, rank-deficient sets, or triangle areas below 1e-3), fits H_BA using computeH, and scores every correspondence using the symmetric transfer error √(‖H·b − a‖² + ‖H⁻¹·a − b‖²). Inliers satisfy an error threshold of 3 px. The loop retains the best model by (inlier count, mean inlier error) and tightens the iteration budget with N = log(1 − 0.99) / log(1 − r⁴), where r is the best inlier ratio so far.
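
A simplified sketch of the loop (the degeneracy checks are reduced to exception handling here; computeH is the least-squares fitter from A.2, and the point convention assumed is dst ~ H @ src):

```python
import numpy as np

def ransac_homography(src, dst, computeH, thresh=3.0, p=0.99, max_iters=2000):
    """Four-point RANSAC with symmetric transfer error and adaptive stopping."""
    rng = np.random.default_rng(0)             # fixed seed, as in the text
    n = len(src)

    def project(H, pts):
        q = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
        return q[:, :2] / q[:, 2:3]            # dehomogenize

    best, budget, i = None, max_iters, 0
    while i < budget:
        i += 1
        idx = rng.choice(n, 4, replace=False)  # minimal 4-point sample
        try:
            H = computeH(src[idx], dst[idx])
            err = np.sqrt(np.sum((project(H, src) - dst) ** 2, axis=1)
                          + np.sum((project(np.linalg.inv(H), dst) - src) ** 2, axis=1))
        except np.linalg.LinAlgError:
            continue                           # singular hypothesis, resample
        inliers = err < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
            r = inliers.sum() / n              # best inlier ratio so far
            if 0 < r < 1:                      # adaptive iteration budget
                budget = min(budget,
                             int(np.ceil(np.log(1 - p) / np.log(1 - r ** 4))))
    # final refit on all inliers of the best model
    return computeH(src[best], dst[best]), best
```

On synthetic data with known gross outliers the loop recovers the generating homography and flags the clean correspondences as inliers.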

Once the loop finishes I refit H with the inliers, recompute errors, and feed the result into the existing feathered mosaicing routine. Summary statistics for the three scenes are:

Scene      Matches  Inliers  Inlier Ratio  Mean Error  Iterations
Mural          26       17        65.4%        0.90 px        23
Trees          65       36        55.4%        1.64 px        87
Window         69       47        68.1%        1.34 px        44

Mural

Mural automatic matches
All descriptor matches (τ = 0.70)
Mural RANSAC inliers
RANSAC keeps 17 correspondences
Automatic mural mosaic
Automatic mosaic
Mural manual vs automatic mosaic
Manual vs automatic

Trees

Trees automatic matches
Matches align
Trees RANSAC inliers
RANSAC removes sky outliers and keeps 36 inliers
Automatic trees mosaic
Automatic mosaic
Trees manual vs automatic mosaic
Manual vs automatic. Gain compensation equalizes exposure so both halves match tonally.

Window

Window automatic matches
Dense matches form on each window sash
Window RANSAC inliers
47 inliers survive RANSAC
Automatic window mosaic
Automatic mosaic
Window manual vs automatic mosaic
Manual vs automatic

The three automatic mosaics are visually indistinguishable from their manual counterparts. Symmetric-transfer scoring, degeneracy checks, and adaptive stopping keep RANSAC efficient (tens of iterations) even with dozens of matches per scene.