Window Shopping Google ARCore: Concepts

I thought Google’s ARCore SDK offered interesting capabilities for robots. So even though the SDK team is explicitly not considering robotics applications, I wanted to take a look.

The obvious starting point is ARCore’s “Fundamental Concepts” document. Here we can confirm that the theory of operation is consistent with an application of Structure from Motion algorithms. Out of all the possible types of information that can be extracted via SfM, a subset is exposed to applications using the ARCore SDK.

Under “Environmental Understanding” we see the foundation supporting AR applications: an understanding of the phone’s position in the world, and of surfaces that AR objects can interact with. ARCore picks out horizontal surfaces (tables, floor) upon which an AR object can be placed, or vertical surfaces (walls) upon which AR images can be hung like a picture. All other features build on top of this basic foundation, which also feels useful for robotics: most robots only navigate on horizontal surfaces, and try to avoid vertical walls. Knowing where those surfaces are relative to the robot’s current position in the world would help collision detection.
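To make the horizontal-versus-vertical idea concrete, here is a minimal sketch of how a robot might sort detected surfaces into "drive on it" and "avoid it" buckets from a plane's normal vector. Everything here is an assumption for illustration: the function name, the 10 degree tolerance, and a world frame whose +Y axis points up. It is not the ARCore API, which reports plane types on its own.

```python
import math

def classify_surface(normal, up=(0.0, 1.0, 0.0), tolerance_deg=10.0):
    """Label a plane 'horizontal', 'vertical', or 'other' from its normal.

    Hypothetical helper: assumes `up` is a unit vector opposite to gravity.
    """
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("normal must be non-zero")
    # Cosine of the angle between the plane normal and the up direction.
    cos_angle = sum(n * u for n, u in zip(normal, up)) / length
    cos_angle = max(-1.0, min(1.0, cos_angle))
    # abs() treats floor-like and ceiling-like planes the same way.
    angle_deg = math.degrees(math.acos(abs(cos_angle)))
    if angle_deg <= tolerance_deg:
        return "horizontal"  # normal points (nearly) straight up or down
    if angle_deg >= 90.0 - tolerance_deg:
        return "vertical"    # normal is (nearly) parallel to the ground
    return "other"           # ramps and other tilted surfaces
```

A floor with normal (0, 1, 0) comes back as horizontal, a wall with normal (1, 0, 0) as vertical, and a 45 degree ramp as other.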

The depth map is the new feature that caught my attention in the first place. It is used for object occlusion. There is also light estimation, helping to shade objects to fit in with their surroundings. Both of these allow a more realistic rendering of a virtual object in real space. The depth map has an obvious application in collision detection and avoidance, where it should be more useful than merely detecting vertical wall surfaces. Light estimation isn’t obviously useful for a robot, but maybe interesting ideas will pop up later.
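As a rough sketch of the collision avoidance idea, a robot could look at the middle of a depth map and stop when anything there is too close. This assumes the depth map has already been converted to a plain grid of distances in meters, with 0 marking pixels that have no estimate. The function names and the 0.5 m threshold are hypothetical, and this is not how ARCore itself delivers depth data.

```python
def nearest_obstacle_m(depth_m, invalid=0.0):
    """Return the smallest valid depth in the central third of the frame.

    depth_m is a list of rows of distances in meters (an assumed format).
    Returns None when the central region has no valid readings.
    """
    rows, cols = len(depth_m), len(depth_m[0])
    r0, r1 = rows // 3, rows - rows // 3
    c0, c1 = cols // 3, cols - cols // 3
    valid = [d for row in depth_m[r0:r1] for d in row[c0:c1] if d > invalid]
    return min(valid) if valid else None

def should_stop(depth_m, stop_distance_m=0.5):
    """True when something in the robot's path is closer than the threshold."""
    nearest = nearest_obstacle_m(depth_m)
    return nearest is not None and nearest < stop_distance_m
```

Only looking at the central region is a deliberate simplification: it stands in for "the direction the robot is driving" and ignores obstacles off to the sides.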

In order for users to interact with AR objects, the SDK includes the ability to map the user’s touch coordinate in 2D space into the corresponding location in 3D space. I have a vague feeling it might be useful for a robot to know where a particular point in view is in 3D space, but again no immediate application comes to mind.
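The geometry behind mapping a 2D point to a 3D location can be sketched as casting a ray from the camera through the pixel and intersecting it with a known surface. The sketch below assumes an ideal pinhole camera with made-up intrinsics (fx, fy, cx, cy), works entirely in the camera's own coordinate frame (X right, Y down, Z forward), and describes the surface as a plane dot(n, p) = offset. These names are my own for illustration and are not ARCore calls.

```python
def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Direction of the ray through pixel (u, v) for an ideal pinhole camera."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)

def intersect_plane(ray_dir, plane_normal, plane_offset):
    """Intersect a ray from the camera origin with the plane dot(n, p) = offset.

    Returns the 3D hit point, or None if the ray is parallel to the plane
    or the plane lies behind the camera.
    """
    denom = sum(n * d for n, d in zip(plane_normal, ray_dir))
    if abs(denom) < 1e-9:
        return None  # ray never reaches the plane
    t = plane_offset / denom
    if t <= 0.0:
        return None  # intersection is behind the camera
    return tuple(t * d for d in ray_dir)

# Example: a floor 1 m below the camera, which is Y = 1 in this Y-down frame.
ray = pixel_to_ray(320, 490, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
hit = intersect_plane(ray, plane_normal=(0.0, 1.0, 0.0), plane_offset=1.0)
```

With these example numbers the pixel below the image center lands on the floor 2 m ahead of the camera, while a pixel at the exact image center looks parallel to the floor and produces no hit.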

ARCore also offers “Augmented Images” that can overlay 3D objects on top of 2D markers. One example offered: “for instance, they could point their phone’s camera at a movie poster and have a character pop out and enact a scene.” I don’t see this as a useful capability in a robotics application.

But as interesting as these capabilities are, they are focused on a static snapshot of a single point in time. Things get even more interesting once we are on the move and correlate data across multiple points in space or, even more exciting, multiple devices.

Robotic Applications for “Structure From Motion” and ARCore

I was interested in exploring whether I could adapt the capabilities of augmented reality on mobile devices to an entirely different problem domain: robot sensing. First I had to do a little study to verify that it (or more specifically, the Structure from Motion algorithms underneath) isn’t fundamentally incompatible with robots in some way. Once I gained some confidence that I’m not barking up the wrong tree, a quick search online using keywords like “ROS SfM” returned several resources for applying SfM to robotics, including several built on OpenCV. A fairly consistent theme is that such calculations are very computationally intensive. I found that curious, because it is inconsistent with the fact that they run on cell phone CPUs for ARCore and ARKit. A side trip explored whether these calculations were assisted by specialized hardware like the “AI Neural Coprocessor” that phone manufacturers like to tout on their spec sheets, but I decided that was unlikely for two reasons. (1) If deep learning algorithms are at play here, I should be able to find something about doing this fast on the Google AIY Vision kit, Google Coral dev board, or NVIDIA Jetson, but I came up empty-handed. (2) ARCore can run on some fairly low-frills mid-range phones like my Moto X4.

Finding a way to do SfM on a cell phone class processor would be useful, because that means we can potentially put it on a Raspberry Pi, the darling of hobbyist robotics. Even better if I can leverage neural net hardware like those listed above, but that’s not required. So far my searches have come up empty, but something might turn up later.

Turning focus back to ARCore, a search for previous work applying ARCore to robotics returned a few hits. The first hit is the most discouraging: ARCore for Robotics is explicitly not a goal for Google, and the issue was closed without resolution.

But that didn’t prevent a few people from trying:

  • An Indoor Navigation Robot Using Augmented Reality by Corotan, et al. is a paper on doing exactly this. Unfortunately, it’s locked behind the IEEE paywall. The Semantic Scholar page at least lets me see the figures and tables, where I can see a few tantalizing details that just make me want to find this paper even more.
  • Indoor Navigation Using AR Technology (PDF) by Kumar et al. is not about robot navigation but human navigation, making it less applicable to my interest. Their project used ARCore to implement an indoor navigation aid, but it required the environment to be known and already scanned into a 3D point cloud. It mentions the Corotan paper above as part of its “Literature Survey”. Sadly, none of the other papers in that section were specific to ARCore.
  • Localization of a Robotic Platform using ARCore (PDF) sounded great but, when I brought it up, I was disappointed to find it was a school project assignment and not results.

I wish I could bring up that first paper, because I think it would be informative. But even without that guide, I can start looking over the ARCore SDK itself.

Augmented Reality Built On “Structure From Motion”

When learning about a new piece of technology in a domain I don’t know much about, I like to do a little background research to understand the fundamentals. This is not just for idle curiosity: understanding theoretical constraints could save a lot of grief down the line if that knowledge spares me from trying to do something that looked reasonable at the time but is actually fundamentally impossible. (Don’t laugh, this has happened more than once.)

For the current generation of augmented reality technology that can run on cell phones and tablets, the fundamental area of research is “Structure from Motion”. Motion is right in the name, and that key component explains how a depth map can be calculated from just a 2D camera image. A cell phone does not have a distance sensor like Kinect’s infrared projector/camera combination, but it does have motion sensors. Phones and tablets started out with only a crude low resolution accelerometer for detecting orientation, but that’s no longer the case thanks to rapid advancements in mobile electronics. Recent devices have high resolution, high speed sensors that integrate accelerometer, gyroscope, and compass across X, Y, and Z axes. These 9-DOF sensors (3 types of data * 3 axes = 9 Degrees of Freedom) allow the phone to accurately detect motion. And given motion data, an algorithm can correlate movement against the camera video feed to extract parallax motion. That then feeds into code which builds a digital representation of the structure of the phone’s physical surroundings.
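To illustrate why parallax carries depth information, here is a minimal sketch of the simplest possible case. It assumes the camera slid purely sideways by a known distance with no rotation, so a feature's shift between the two images (its disparity) relates to depth by depth = focal length * baseline / disparity. Real SfM has to handle arbitrary motion and noisy sensors, so treat this as the textbook special case and not what ARCore actually does. The function name and the numbers are hypothetical.

```python
def depth_from_parallax(focal_px, baseline_m, disparity_px):
    """Depth in meters of a feature seen from two camera positions.

    focal_px:     focal length in pixels (assumed known from calibration)
    baseline_m:   sideways distance the camera moved, e.g. from motion sensors
    disparity_px: how far the feature shifted between the two images
    """
    if disparity_px <= 0:
        raise ValueError("feature must shift by a positive number of pixels")
    return focal_px * baseline_m / disparity_px

# Camera moved 0.25 m sideways with a 500 pixel focal length.
near = depth_from_parallax(500.0, 0.25, 50.0)  # shifted a lot: close by
far = depth_from_parallax(500.0, 0.25, 25.0)   # shifted less: farther away
```

The feature that shifts twice as much between frames works out to half the distance, which is the same effect as nearby fence posts sweeping past a car window faster than distant hills. With zero baseline there is no disparity to measure.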

This method of operation would also explain why such technology could not replace a Kinect sensor, which is designed to sit on the fireplace mantel and watch game players jump around in the living room. Because the Kinect sensor bar does not move, there is no motion from which to calculate structure, making SfM useless for such tasks. This educational side quest has thus accomplished the “understand what’s fundamentally impossible” item I mentioned earlier.

But mounted on a mobile robot moving around in its environment? That should have no fundamental incompatibilities with SfM, and might be applicable.