
Will Foundational Models save robotics?

Investors certainly seem to think so. A lot of money is being spent on building robotic Foundational Models (with $500m shared between Field AI and Genesis AI alone), and the promise of a universal, fully generalist, zero-shot Physical AI model seems as appealing as the promise of universal Humanoids did just a few months ago. But is this pursuit of the ‘eat all’ generalist Robotic Foundational Model effective, economical, or even practically achievable?

Foundational Models from the likes of Field AI or Genesis promise to be more than slightly better than the old pipelines at perception, planning and generalisation, especially in unstructured, real-world environments. It sounds great in theory, but:

  • The datasets that need to be built to effectively underpin those vast models are probably two orders of magnitude larger than the ones underpinning today’s largest LLMs (since what is needed is detailed, 3D, action-rich data, not one-dimensional text). This data is not available on the internet – it has to be collected through teleoperation or created in simulation (some call it a 100,000-year challenge).
  • Models of that size will require untold amounts of compute to run, even though most tasks in Physical AI are relatively specialised, covering a narrow field of expertise. Some now argue that smaller models can be better at executing agentic tasks (the argument is made for LLMs, but the issues of generative AI are squared in 3D space), and agentic tasks are not dissimilar to what agentic robots do. According to NVIDIA’s latest research, SLMs > LLMs.
  • Some experienced developers, like Boston Dynamics, appear to be investing their time in Large Behavioural Models rather than Foundational Models. These combine some of the more general world knowledge with specific task knowledge, using a foundational policy to cover things like movement and manipulation, with further specific training for individual tasks (a minimal sketch of this pattern follows the list). This suggests that even the Foundational Models that both Toyota Research Institute and Boston Dynamics have access to hit generalisation limits early and still require a lot of task-specific training to work effectively.
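
To make that hybrid pattern concrete, here is a minimal sketch, assuming a PyTorch-style setup in which a large pre-trained policy backbone is frozen and only a small task-specific head is fine-tuned on demonstration data. All class names, layer sizes and training data below are illustrative assumptions, not a description of Boston Dynamics’ or Toyota Research Institute’s actual models.

```python
# Hedged sketch: a generic pre-trained "foundation" policy backbone shared
# across tasks, with a small behavioural head fine-tuned per task or site.
# Every module name, dimension and tensor here is an illustrative assumption.
import torch
import torch.nn as nn

class FoundationBackbone(nn.Module):
    """Stands in for a large pre-trained visuomotor policy (kept frozen here)."""
    def __init__(self, obs_dim=512, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim), nn.ReLU(),
        )

    def forward(self, obs):
        return self.encoder(obs)

class TaskHead(nn.Module):
    """Small task-specific head trained per deployment."""
    def __init__(self, feat_dim=256, action_dim=7):
        super().__init__()
        self.policy = nn.Linear(feat_dim, action_dim)

    def forward(self, features):
        return self.policy(features)

# The pre-trained backbone is frozen; only the cheap task head is fine-tuned.
backbone = FoundationBackbone()
for p in backbone.parameters():
    p.requires_grad = False

head = TaskHead()
optimiser = torch.optim.Adam(head.parameters(), lr=1e-3)

# One illustrative fine-tuning step on a batch of (observation, action) demos.
obs = torch.randn(32, 512)            # placeholder teleoperation observations
expert_actions = torch.randn(32, 7)   # placeholder demonstrated actions
optimiser.zero_grad()
pred = head(backbone(obs))
loss = nn.functional.mse_loss(pred, expert_actions)
loss.backward()
optimiser.step()
```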

Is multimodal generalisation associated with Large Foundation Models a false promise?

Let’s look at the NVIDIA paper again, which states that Agents are inherently specialised and task-specific, which is vastly different from how LLMs are deployed. As such, LLMs can be both too generic (not specialist enough to execute tasks effectively) and too expensive to run for Agentic tasks. Embodied Agents literally add another dimension to this problem, increasing the challenge by an order of magnitude.

If we consider the tasks we’d want robots to execute, whether in manufacturing, food processing, or construction, they appear to be infinitely specialised and vary not just between jobs but from one location to another – almost no two assembly lines (and definitely no two construction sites) are the same. Can robot deployment be successfully generalised across these settings via Foundational Models, or will it end up deferring to Small Behavioural Models designed for each application?

Another potential risk associated with the development of ‘eat all’ Foundational Models is that the costs come first and compound quickly, especially given the vast amount of data needed and the accelerating cost of data acquisition, with diminishing returns from additional data setting in early (meaning your marginal useful data becomes exponentially more expensive to obtain). At that point the overweight data trawling nets may start pulling the data fishing boats under. Can the value of Foundational Models be demonstrated early enough, through successful deployment, to justify the constant need for further investment?
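
To see why the marginal cost escalates, here is a toy sketch under an assumed power-law relationship between capability and hours of collected data; the exponent is purely illustrative, not a measured figure for any real robot dataset.

```python
# Hedged toy illustration of diminishing returns on data collection.
# Assumes (purely for illustration) that capability scales as a power law
# in hours of teleoperation data: capability ~ hours ** ALPHA.
ALPHA = 0.3  # assumed scaling exponent, not taken from any measured dataset

def hours_for(capability: float) -> float:
    """Invert the assumed power law: data hours needed to reach a capability level."""
    return capability ** (1 / ALPHA)

previous = 0.0
for capability in (2, 4, 8, 16):
    total = hours_for(capability)
    print(f"{capability:>2}x capability: {total:>9,.0f} hours total, "
          f"{total - previous:>9,.0f} additional")
    previous = total
# Under this assumed exponent, each doubling of capability costs roughly ten
# times more data -- the 'marginal useful data' gets rapidly more expensive.
```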

The strategy behind the development of Foundational Models appears to match the one Microsoft deployed in the early PC era – build software with the broadest application and domination will follow. But whereas users could (and did) endlessly customise Excel spreadsheets and Word documents to their needs, will the same be true for robotic applications? Will the infinite variety of deployment-specific use cases lead to the failure of the generalist approach? Or will it lead to some sort of hybrid, with Foundational Models playing a part in pre-training and specialist Small Behavioural Models used for fine-tuning and deployment?

Top image created with an image designed by Freepik.
