Matching Massive Images: A Tiled Pipeline for LightGlue and RoMa

In my last post, we covered turning pairwise matches into a Bundler .out file. But that assumes you already have high-quality matches—which isn’t always a given.

When working with large-scale historical imagery (in my case, archival aerial imagery of Antarctica), standard tools like SIFT often fail against textureless ice and film degradation. The modern solution is to use deep learning models like SuperPoint and LightGlue. However, feeding massive, high-res scans directly into your GPU is often not feasible—you have to bridge the gap between image size and match quality.

This post walks through two different methods I built for getting reliable matches out of large images. It’s code-heavy, with extensive examples. A full notebook will be available soon.

The Basics

Before diving into the code, let’s establish some terminology.

Points
In photogrammetry, “points” refer to specific image locations of interest for matching. There are two main types:

  • Keypoints (KPs): Distinct 2D pixel coordinates identified in a single image—like the sharp corner of a building.
  • Tie Points (TPs): Matched keypoint pairs between two different images, also called “matches” or “correspondences.” These represent the same physical location in the real world seen from two different viewpoints, and they are ultimately what photogrammetry needs.

Matchers
Matchers are algorithms or neural networks that take keypoints and their descriptors from two images and find correspondences between them. They analyze the descriptors to determine which keypoints from Image A likely correspond to which keypoints in Image B, based on similarity and geometric consistency.

  • SIFT: The classic workhorse of computer vision. Unlike the modern tools on this list, SIFT is not a neural network—it relies on traditional gradient math to find edges and corners. While incredibly reliable and still the default in many commercial photogrammetry applications, it can struggle with extreme lighting changes, heavy shadows, or severe perspective shifts where deep-learning methods excel.

  • SuperPoint & LightGlue: The modern gold standard for sparse matching. SuperPoint acts as the “scout”—a CNN that extracts highly distinct keypoints and their unique descriptors. LightGlue (the successor to SuperGlue) is the “cartographer,” matching those descriptors using a fast transformer architecture.

  • RoMa: Where SuperPoint/LightGlue focus on specific, isolated points (sparse matching), RoMa analyzes the entire image patch to estimate pixel-to-pixel warps (dense matching). Because it considers the whole area holistically, it is highly robust against extreme viewpoint shifts, drastic lighting changes, and mostly textureless surfaces.

Homography & Geometric Filtering
Both approaches rely heavily on computing a homography—a mathematical mapping that describes how the flat surface of Image 1 relates to Image 2. We use this both to predict where tiles should overlap and as a final filter to remove false positive matches (outliers).

Throughout these pipelines, I use USAC_MAGSAC instead of classic RANSAC. MAGSAC++ is threshold-free and marginalizes over noise levels, making it significantly more robust against the noise inherent in high-res imagery.

import cv2

# A standard robust homography calculation used throughout this pipeline
h_matrix, inlier_mask = cv2.findHomography(
    pts1, pts2,
    method=cv2.USAC_MAGSAC,
    ransacReprojThreshold=5.0,
    confidence=0.9999,
    maxIters=10000,
)

The Two Approaches

I ended up building not one but two different pipelines. Approach 1 is the more classic “extract & match” pipeline. Approach 2 grew from the need to handle more extreme geometric changes and achieve subpixel accuracy, combining two different matchers.

Both have distinct advantages depending on the scenario:

                      Approach 1: Two Stages                         Approach 2: One Stage
  Speed               Faster                                         Slower (RoMa is computationally heavy)
  Reusability         High (keypoints can be reused across pairs)    Low (keypoints computed on-the-fly)
  Subpixel Accuracy   No*                                            Yes
  Complexity          Higher (separate extraction and matching)      Lower (directly finds tie points)
  Robustness          Moderate (depends on LightGlue’s performance)  High (RoMa handles extreme cases well)
  Match Volume        Moderate                                       High

*See Tips & Tricks for a post-processing workaround.


Approach 1: The Two-Stage “Extract & Match” Pipeline

This is the classic approach: find all interesting points in Image A, find all interesting points in Image B, then compare the two sets. The key advantage is reusability—if you match the same image multiple times (e.g. Image A → B, then Image A → C), you only extract keypoints from Image A once and reuse those descriptors for every subsequent pair.

Finding Keypoints

Phase 1: Preparation & Tiling
The image is converted to 2D grayscale and checked against the minimum size requirements for the neural network. To handle large images safely, it is divided into a grid of overlapping square tiles. Because we don’t yet know which region of the image will overlap with the second image, we tile the entire thing. Figure 1 shows an example. Tiles are designed to overlap by around 20% to ensure keypoints near tile edges are never missed. Where tiles don’t fit perfectly, those at the right and bottom edges extend further to ensure full coverage.

Aerial image divided into a 5x7 grid of 35 overlapping square tiles, each approximately 2000x2000 pixels, covering the full image
Figure 1: 35 tiles with equal sizes of around 2000×2000 pixels.
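For reference, the overlapping grid can be sketched roughly like this (tile size and overlap fraction are illustrative defaults; the real `tile_image` carries more options):

```python
import numpy as np

def tile_image(image, tile_size=2000, overlap=0.2):
    """Divide an image into overlapping square tiles.

    Returns a list of (x0, y0, x1, y1) boxes. Tiles at the right and
    bottom edges are shifted inward so every tile stays fully inside
    the image while still covering it completely.
    """
    h, w = image.shape[:2]
    step = int(tile_size * (1 - overlap))  # stride between tile origins
    boxes = []
    for yy in range(0, max(h - tile_size, 0) + step, step):
        for xx in range(0, max(w - tile_size, 0) + step, step):
            # Shift edge tiles back so they end exactly at the border
            x0 = min(xx, max(w - tile_size, 0))
            y0 = min(yy, max(h - tile_size, 0))
            box = (x0, y0, min(x0 + tile_size, w), min(y0 + tile_size, h))
            if box not in boxes:  # shifted edge tiles can collide
                boxes.append(box)
    return boxes
```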

Phase 2: Extraction
The script loops through every tile and feeds it into SuperPoint. The model outputs keypoint coordinates, confidence scores, and descriptors. The local tile coordinates are then shifted back to reflect their true position on the full-resolution image.

Phase 3: Global Merge & Cleanup
Results from all tiles are concatenated into three master lists (keypoints, scores, descriptors). Because tiles overlap, the same physical feature may be detected twice. A KDTree searches for points within a 2-pixel radius of each other and removes the duplicate with the lower confidence score.

import numpy as np
from scipy.spatial import cKDTree

# 1. Sort by score descending to prioritize high-confidence points
sort_idx = np.argsort(scores_global)[::-1]
kpts_sorted = kpts_global[sort_idx]
scores_sorted = scores_global[sort_idx]
descs_sorted = descs_global[sort_idx]

# 2. Build KDTree for efficient radius searches
tree = cKDTree(kpts_sorted)
suppression_radius = 2.0

# query_pairs finds all pairs (i, j) within the radius where i < j.
# Because we sorted by score, 'i' is always the better point.
# We can safely drop 'j' every time.
pairs = tree.query_pairs(suppression_radius)

drop_mask = np.zeros(len(kpts_sorted), dtype=bool)
for i, j in pairs:
    drop_mask[j] = True

final_kpts = kpts_sorted[~drop_mask]
final_scores = scores_sorted[~drop_mask]
final_descs = descs_sorted[~drop_mask]

If the total number of extracted points exceeds the allowed maximum, the script sorts by confidence and keeps only the top N.

Extracted SuperPoint keypoints visualised as dots scattered across the full-resolution image after merging and deduplicating all tile results
Figure 2: 5000 keypoints extracted from each image after global merge and deduplication.

Matching Keypoints

Phase 1: Global Baseline Matching
The script takes a subset of the highest-scoring keypoints from both images and runs a fast global match to establish a rough baseline for how the two images relate to one another. The larger image is designated the “base” and the smaller the “other.” If too few matches are found, the script can either stop or proceed with a more lenient threshold, depending on how you want to handle edge cases.

Phase 2: Targeted Cropping
Rather than matching everything exhaustively (O(N²) complexity), the script uses the baseline matches to calculate where each base tile lands in the second image, then draws a bounding box around that region. Only that specific area is passed to the heavy-duty matcher, which both speeds up processing and reduces false positives by narrowing the search space.

# 1. Identify which baseline matches fall within the current tile
# tile_bounds = (min_x, max_x, min_y, max_y)
mask = (
    (initial_matches['pts1'][:, 0] >= tile_bounds[0]) &
    (initial_matches['pts1'][:, 0] <  tile_bounds[1]) &
    (initial_matches['pts1'][:, 1] >= tile_bounds[2]) &
    (initial_matches['pts1'][:, 1] <  tile_bounds[3])
)

# 2. Grab the corresponding locations in Image 2
pts_in_image2 = initial_matches['pts2'][mask]

if len(pts_in_image2) == 0:
    return None  # No baseline matches to guide us

# 3. Calculate the bounding box in Image 2
min_x, max_x = pts_in_image2[:, 0].min(), pts_in_image2[:, 0].max()
min_y, max_y = pts_in_image2[:, 1].min(), pts_in_image2[:, 1].max()

# 4. Add padding so we don't miss keypoints near the edges
width_pad  = (max_x - min_x) * padding
height_pad = (max_y - min_y) * padding

x_range = (min_x - width_pad, max_x + width_pad)
y_range = (min_y - height_pad, max_y + height_pad)

Figure 3 shows the tiles produced in this phase. The tiles in the base image are uniformly spaced, but in the second image they follow the distribution of the baseline tie points and cluster around regions of interest. The algorithm isn’t perfect—tile 8 is slightly misplaced due to an outlier in the baseline matches. One mitigation is to compute the bounding boxes from only the matches inside the 95th percentile, discarding spatial outliers first.

Second image showing 10 variable-size tiles projected from the base image tiles using baseline matches, concentrating on textured regions and ignoring flat featureless areas
Figure 3: 10 variable-size tiles in the second image. Note how the tiling focuses on textured regions and ignores featureless areas.
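The percentile mitigation mentioned above can be sketched as a small helper (the function name and exact percentile bounds are illustrative):

```python
import numpy as np

def robust_bbox(points, lower=2.5, upper=97.5):
    """Bounding box from the central 95% of points, ignoring outliers.

    points: (N, 2) array of (x, y) coordinates in the second image.
    Returns (x_lo, x_hi, y_lo, y_hi).
    """
    x_lo, x_hi = np.percentile(points[:, 0], [lower, upper])
    y_lo, y_hi = np.percentile(points[:, 1], [lower, upper])
    return x_lo, x_hi, y_lo, y_hi
```

A single stray baseline match can no longer inflate the box, at the small risk of clipping a few legitimate matches at the extremes—which the tile padding then compensates for.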

Phase 3: Batched Inference & Merging
The paired tile regions are grouped into batches and sent to the GPU for inference. LightGlue analyzes each targeted area to find dense, high-quality matches. If the GPU runs out of memory, the script catches the error and falls back to processing tiles one at a time. All matches from across the batches are then merged with the initial baseline matches.

Phase 4: Deduplication & Geometric Verification
A strict one-to-one mapping is enforced: if a point in Image 1 is accidentally matched to multiple points in Image 2, only the highest-confidence match survives. A spatial cleanup then removes any remaining near-duplicate matches.

import numpy as np
from scipy.spatial import cKDTree

# Sort by descending certainty so the best match always wins
order      = np.argsort(-certainty)
tie_points = tie_points[order]
certainty  = certainty[order]
kpts1      = tie_points[:, :2]

# Pre-compute all neighbour lists in one parallelized call
tree           = cKDTree(kpts1)
neighbor_lists = tree.query_ball_point(kpts1, r=radius, workers=-1)

kept = np.ones(len(tie_points), dtype=bool)
for i, neighbors in enumerate(neighbor_lists):
    if not kept[i]:
        continue
    for nb in neighbors:
        if nb > i:  # nb has lower certainty → suppress it
            kept[nb] = False

tie_points = tie_points[kept]
certainty  = certainty[kept]

# Simplified for brevity — see full code for complete arguments
h, inlier_mask = cv2.findHomography(...)

if h is None or inlier_mask is None:
    tie_points = np.empty((0, 4))
    certainty  = np.empty((0,))
else:
    inliers    = inlier_mask.ravel().astype(bool)
    tie_points = tie_points[inliers]
    certainty  = certainty[inliers]

Finally, a strict geometric verification using the Fundamental Matrix is run across the full merged set to remove any remaining outliers.

Result:

776 tie-points drawn as coloured lines connecting matched locations between two overlapping aerial images, with high average confidence of 0.897
Figure 4: 776 tie-points with an average confidence of 0.897.

Approach 2: The One-Stage “Direct Tie-Point” Detector

This approach skips the independent extraction phase entirely. Instead, it pairs specific image regions up front and extracts tie points directly, using LightGlue as a fast gatekeeper and RoMa for dense, subpixel-accurate matching.

Phase 1: Initial Matchability Check
A fast LightGlue match is run at reduced resolution (around 1024px on the longest side) to check whether the images are matchable at all. If the number of inliers falls below a threshold, the script can stop early or proceed with a more lenient threshold, depending on your preference for edge cases.

One additional challenge with aerial imagery is that images can have significant relative rotation — a plane doesn’t always fly the same heading. To handle this, the pipeline first tests four candidate rotations (0°, 90°, 180°, 270°) at low resolution using LightGlue, picks the combination that yields the most geometric inliers, and applies that rotation before any tiling begins.
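The rotation search can be sketched as follows, with `count_inliers` standing in for the low-res LightGlue match plus geometric verification (not shown here):

```python
import numpy as np

def best_rotation(img1, img2, count_inliers):
    """Test the four 90° rotations of img2 against img1.

    count_inliers(a, b) is assumed to run the low-resolution LightGlue
    match and return the number of geometric inliers.
    Returns the winning rotation index k and the rotated image.
    """
    best_k, best_score = 0, -1
    for k in range(4):  # 0°, 90°, 180°, 270°
        candidate = np.rot90(img2, k)
        score = count_inliers(img1, candidate)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, np.rot90(img2, best_k)
```

In the real pipeline, the matched coordinates must of course be rotated back into the original image frame afterwards.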

Phase 2: Targeted Tile Projection
The low-res matches are scaled back up to full-resolution coordinates and used to compute a global homography. Image 1 is then divided into overlapping square tiles, and for each tile, the homography is used to predict exactly where that tile appears in Image 2—avoiding the need to exhaustively check every possible tile combination.

# Simplified for brevity — see full code for complete arguments
h_full, _ = cv2.findHomography(...)

tiles_base = tile_image(base_image)
half = tile_size / 2.0  # tiles in the other image are centred on the projection

pairs = []
for box in tiles_base:
    x0_base, y0_base, x1_base, y1_base = box
    cx = (x0_base + x1_base) / 2.0
    cy = (y0_base + y1_base) / 2.0

    # Project the tile centre through H_full into the other image
    pt   = np.array([[[cx, cy]]], dtype=np.float64)
    proj = cv2.perspectiveTransform(pt, h_full)[0, 0]
    px, py = float(proj[0]), float(proj[1])

    # Skip if the projected centre falls outside the other image
    if not (0 <= px < w_other and 0 <= py < h_other):
        continue

    x0_other = int(np.clip(round(px - half), 0, max(0, w_other - tile_size)))
    y0_other = int(np.clip(round(py - half), 0, max(0, h_other - tile_size)))
    x1_other = min(x0_other + tile_size, w_other)
    y1_other = min(y0_other + tile_size, h_other)

    pairs.append((box, (x0_other, y0_other, x1_other, y1_other)))

Two aerial images overlaid with 96 small overlapping tiles projected using the coarse homography, providing denser coverage required by the memory-intensive RoMa matcher
Figure 5: RoMa’s higher memory demands require smaller tiles, resulting in 96 tiles covering the full overlap area.

Phase 3: Dense Tile Matching
For each tile pair, SuperPoint and LightGlue run first. If they find enough seed matches, the pair is passed to RoMa for dense matching.

Phase 4: Per-Tile Cleanup
RoMa matches are filtered using a local geometric check, then spatially subsampled to prevent matches from clustering in one region. Coordinates are shifted from local tile space back to global image coordinates.
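The spatial subsampling step might look like a simple grid bucket filter (cell size and per-cell cap are illustrative):

```python
import numpy as np

def spatial_subsample(points, certainty, cell=50, per_cell=3):
    """Keep at most `per_cell` highest-certainty matches per
    `cell`×`cell` pixel bucket, so matches don't cluster in one region.

    points: (N, 2) array of tile-local coordinates.
    Returns a boolean keep-mask.
    """
    order = np.argsort(-certainty)  # best matches claim their cell first
    counts = {}
    keep = np.zeros(len(points), dtype=bool)
    for i in order:
        key = (int(points[i, 0] // cell), int(points[i, 1] // cell))
        if counts.get(key, 0) < per_cell:
            counts[key] = counts.get(key, 0) + 1
            keep[i] = True
    return keep
```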

Phase 5: Global Merge & Deduplication
All tile results are concatenated into a single list. Overlapping tiles will have produced duplicate matches, which are resolved using a KDTree—keeping the higher-confidence match wherever two points fall within the deduplication radius.

Phase 6: Final Geometric Verification
A strict Fundamental Matrix check across the full merged set removes any remaining outliers.

Result:

5000 dense tie-points connecting two aerial images produced by RoMa, showing significantly denser coverage compared to the LightGlue-only result
Figure 6: 5000 tie-points with an average confidence of 1.0.

Approach 2 finds far more matches (5000 vs 776), but more isn’t always better—Approach 1 already describes the image relationship well enough for most photogrammetry workflows. It’s also worth noting that RoMa’s confidence scores carry less comparative meaning than LightGlue’s. RoMa uses importance-weighted sampling — it draws more matches from high-certainty regions — so scores naturally cluster near 1.0 and don’t discriminate between good and great matches the way LightGlue’s scores do.


Tips & Tricks

Overlapping tiles
Always build overlap into your tiling grid. When you slice an image into rigid squares, you risk cutting straight through important geometric features that happen to lie on a boundary. Forcing tiles to overlap (typically 10–20%) ensures every feature is fully preserved in at least one tile, which meaningfully increases match quality along the seams.

OOM handling
LightGlue and RoMa are memory-hungry, and their VRAM usage spikes unpredictably depending on the visual complexity of each tile pair. To prevent a single dense tile from crashing a multi-hour run, wrap your inference calls in a targeted try-except block.

import torch

try:
    run_matching_on_gpu()

except torch.cuda.OutOfMemoryError as e:
    print(f"OOM caught: {e}")
    torch.cuda.empty_cache()

    # Fall back to CPU, lower resolution, or reduced batch size
    print("Falling back to reduced resolution...")
    return None

# Any other exception propagates normally — only OOM is handled here.

Subpixel accuracy for SuperPoint/LightGlue
SuperPoint and LightGlue are inherently limited to pixel-level precision, but subpixel refinement can be applied as a post-processing step. One option I’ve found effective is keypt2subpx, which plugs in after the matching step with minimal overhead. It improves results noticeably but is not a full substitute for a natively subpixel method like RoMa.


Conclusion

Matching high-resolution aerial imagery is a balancing act between computational efficiency and geometric precision. Modern matchers like LightGlue and RoMa have transformed what’s possible on difficult textures and lighting conditions, but they need a thoughtful tiling strategy around them to operate at scale without running out of memory or losing accuracy.

The two approaches here offer different paths to that goal. Approach 1 is efficient and reusable, well-suited to standard batch workflows. Approach 2 is the heavy-duty alternative for cases where subpixel accuracy and high match density are non-negotiable.

Room for improvement
Both pipelines are still evolving. Areas I’m actively looking to refine:

  • Rotation invariance: The current tile projection works best for images with limited relative rotation. When images are significantly rotated, the projected bounding boxes inflate well beyond the true overlap region, wasting computation and reducing matching precision.
  • Planar scene assumption: The homography-based tile projection assumes a flat scene. For imagery with significant elevation variation—mountainous terrain, for instance—this breaks down. A local affine model per tile or depth-guided warping would be more robust.

In practice, both pipelines have held up on some of the most hostile imagery I’ve encountered: near-featureless Antarctic ice fields captured on decades-old film, with all the grain, fading, and geometric distortion that comes with it. If they work there, they should serve you well on more forgiving material.

If you’re exploring this space, I also highly recommend DEEP-IMAGE-MATCHING, which incorporates several of the strategies discussed here and is actively maintained.

Cite

@misc{dahle2026matching,
  author = {Dahle, Felix},
  title  = {{Matching Massive Images: A Tiled Pipeline for LightGlue and RoMa}},
  year   = {2026},
  month  = apr,
  url    = {https://fdahle.de/blog/find_matches/},
  note   = {Blog post}
}