Imagine playing a new, slightly altered version of the game GeoGuessr. You’re faced with a photo of an average U.S. house, maybe two floors with a front lawn in a cul-de-sac and an American flag flying proudly out front. But there’s nothing particularly distinctive about this home, nothing to tell you the state it’s in or where the owners are from.
You have two tools at your disposal: your brain, and 44,416 low-resolution, bird’s-eye-view photos of random places across the United States and their associated location data. Could you match the house to an aerial image and locate it correctly?
I definitely couldn’t, but a new machine learning model likely could. The software, created by researchers at China University of Petroleum (East China), searches a database of remote-sensing photos with associated location information to match the streetside image—of a home, a commercial building, or anything else that can be photographed from a road—to an aerial image in the database. Other systems can do the same, but this one is pocket-size by comparison and highly accurate.
At its best (when given a picture with a 180-degree field of view), it succeeds up to 97 percent of the time in the first stage of narrowing down the location. That’s better than, or within two percentage points of, every other model available for comparison. Even under less-than-ideal conditions, it performs better than many competitors. When pinpointing an exact location, it’s correct 82 percent of the time, within three percentage points of the other models.
But this model is novel for its speed and memory savings. It is at least twice as fast as similar ones and uses less than a third the memory they require, according to the researchers. The combination makes it valuable for applications in navigation systems and the defense industry.
“We train the AI to ignore the superficial differences in perspective and focus on extracting the same ‘key landmarks’ from both views, converting them into a simple, shared language,” explains Peng Ren, who develops machine learning and signal processing algorithms at China University of Petroleum (East China).
The software relies on a method called deep cross-view hashing. Rather than comparing a street-view picture pixel by pixel with every single image in the giant bird’s-eye-view database, this method uses hashing: transforming a collection of data—in this case, street-level and aerial photos—into a string of numbers unique to that data.
To do that, the China University of Petroleum research group employs a type of deep learning model called a vision transformer, which splits images into small units and finds patterns among the pieces. The model may find in a photo what it’s been trained to identify as a tall building, a circular fountain, or a roundabout, and then encode its findings into number strings. ChatGPT is based on a similar architecture, but finds patterns in text instead of images. (The “T” in “GPT” stands for “transformer.”)
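To make the idea concrete, here’s a minimal Python sketch of the patchify-and-hash step. It is not the authors’ model: a fixed random projection stands in for the trained vision transformer, and the patch size and 64-bit code length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, BITS = 16, 64  # illustrative patch size and code length

# A fixed random projection stands in for the trained vision
# transformer; the real system learns this mapping from image data.
PROJ = rng.standard_normal((PATCH * PATCH * 3, BITS))

def hash_image(image):
    """Split an RGB image into patches, pool them into one feature
    vector, project it, and keep only the signs: a compact binary code."""
    h, w, c = image.shape
    patches = (image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
                    .swapaxes(1, 2)
                    .reshape(-1, PATCH * PATCH * c))  # one row per patch
    feature = patches.mean(axis=0)                    # crude global pooling
    return (feature @ PROJ > 0).astype(np.uint8)      # binary fingerprint

street_photo = rng.random((224, 224, 3))  # stand-in for a street-level photo
print(hash_image(street_photo)[:8])       # first few bits of the code
```

Matching then reduces to comparing these short codes rather than raw pixels, which is where the speed and memory savings come from.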
The number that represents each picture is like a fingerprint, says Hongdong Li, who studies computer vision at the Australian National University. The number code captures unique features from each image that allow the geolocation process to quickly narrow down possible matches.
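In code, comparing two such fingerprints amounts to counting the bits where they disagree, known as the Hamming distance. A toy example with made-up 16-bit codes (real codes are much longer):

```python
import numpy as np

# Two made-up 16-bit fingerprints; real codes are much longer.
query_code  = np.array([1,0,1,1,0,0,1,0,1,1,0,0,1,0,1,1], dtype=np.uint8)
aerial_code = np.array([1,0,1,0,0,0,1,0,1,1,0,1,1,0,1,1], dtype=np.uint8)

# Hamming distance: the number of bit positions where the codes differ.
distance = np.count_nonzero(query_code != aerial_code)
print(distance)  # 2 -> the images share most key features; a likely match
```

Because each comparison is just a handful of bit operations, a single query code can be checked against tens of thousands of database codes almost instantly.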
In the new system, the code associated with a given ground-level photo gets compared with those of all the aerial images in the database (for testing, the team used satellite images of the United States and Australia), yielding the five closest candidate matches. The coordinates of those candidates are then averaged using a technique that weights locations closer to one another more heavily, reducing the impact of outliers, and out pops an estimated location for the street-view image.
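The article doesn’t give the exact weighting formula, but the final step might look something like the following sketch, in which each of the five candidate locations is weighted by how close it sits to the other four, so a lone outlier barely shifts the estimate:

```python
import numpy as np

def estimate_location(candidates):
    """Average candidate coordinates, weighting points that sit close
    to the other candidates more heavily so spatial outliers count
    less. (Illustrative scheme; the paper's may differ.)"""
    pts = np.asarray(candidates)                    # shape (5, 2): lat, lon
    diffs = pts[:, None, :] - pts[None, :, :]       # pairwise offsets
    dists = np.linalg.norm(diffs, axis=-1)          # pairwise distances
    mean_dist = dists.sum(axis=1) / (len(pts) - 1)  # avg distance to others
    weights = 1.0 / (mean_dist + 1e-9)              # near the cluster => heavy
    weights /= weights.sum()
    return weights @ pts                            # weighted mean (lat, lon)

# Five hypothetical top matches: four clustered, one far-off outlier.
top5 = [(38.91, -77.03), (38.92, -77.04), (38.90, -77.02),
        (38.93, -77.05), (41.50, -81.60)]
print(estimate_location(top5))  # lands near the tight cluster of four
```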
The new mechanism for geolocation was published last month in IEEE Transactions on Geoscience and Remote Sensing.
Fast and memory efficient
“Though not a completely new paradigm,” this paper “represents a clear advance within the field,” Li says. Because this problem has been tackled before, some experts, like Washington University in St. Louis computer scientist Nathan Jacobs, are not as excited. “I don’t think that this is a particularly groundbreaking paper,” he says.
But Li disagrees with Jacobs—he thinks this approach is innovative in its use of hashing to make finding image matches faster and more memory efficient than conventional techniques. It uses just 35 megabytes, while the next-smallest model Ren’s team examined requires 104 megabytes, about three times as much space.
The method is also more than twice as fast as the next-fastest one, the researchers claim. When matching street-level images against a dataset of aerial photography of the United States, the runner-up took around 0.005 seconds per match, while the Petroleum group’s model found a location in around 0.0013 seconds, almost four times faster.
“As a result, our method is more efficient than conventional image geolocalization techniques,” says Ren, and Li confirms that these claims are credible. Hashing “is a well-established route to speed and compactness, and the reported results align with theoretical expectations,” Li says.
Though these efficiencies seem promising, more work is required to ensure this method will work at scale, Li says. The group did not fully study realistic challenges like seasonal variation or clouds blocking the image, which could impact the robustness of the geolocation matching. Down the line, this limitation can be overcome by introducing images from more distributed locations, Ren says.
Still, long-term applications (beyond a super advanced GeoGuessr) are worth considering now, experts say.
There are some trivial uses for efficient image geolocation, such as automatically geotagging old family photos, says Jacobs. But on the more serious side, navigation systems could also exploit a geolocation method like this one. If GPS fails in a self-driving car, another way to quickly and precisely determine location could be useful, Jacobs says. Li also suggests it could play a role in emergency response within the next five years.
There may also be applications in defense systems. Finder, a 2011 project from the Office of the Director of National Intelligence, aimed to help intelligence analysts learn as much as they could about photos without metadata using reference data from sources including overhead images, a goal that could be accomplished with models similar to this new geolocation method.
Jacobs puts the defense application into context: If a government agency sent a photo of a terrorist training camp without metadata, how can the site be geolocated quickly and efficiently? Deep cross-view hashing might be of some help.