Solving Vision Tasks with Simple Photoreceptors Instead of Cameras (2024)

Table of Contents
1 Introduction
2 Related Work
3 The Photoreceptor Sensor: Computational Model and Design Space
  3.1 Computational Model of a Photoreceptor
  3.2 Design of Visual Sensors
4 Simple Photoreceptors are Effective Visual Sensors
  4.1 Experimental Setting
  4.2 Photoreceptors Achieve Performance Close to a Camera
5 Visual Sensors Design Optimization
  5.1 Design is Important for the Effectiveness of Photoreceptors
  5.2 Computational Design via Joint Optimization
  5.3 Design Optimization Experiments
  5.4 Intuitive Designs
  5.5 Do designs transfer between tasks?
  5.6 Evaluation in the Real World
6 Discussion and Limitations
References
Appendix Overview
Appendix A Can photoreceptors extract information about the world state?
Appendix B Analysis Experiments and Ablations
  B.1 Does the task affect the computationally obtained design?
  B.2 Design optimisation method (computational design) uses available sensors well
  B.3 Photoreceptor-based agents can do Target Detection
  B.4 Comparing importance of the different design space variables
  B.5 Distorting the computational design to analyse the success of design optimisation
Appendix C Design Visualizations
  C.1 Computational vs Random Design Visualisations
  C.2 Intuitive Design Visualisations
Appendix D Additional Results for Continuous Control Tasks using the Grids of 4x4 Photoreceptors
Appendix E Experimental Details
  E.1 PointGoal Navigation Setting
  E.2 Target Navigation Setting
  E.3 Continuous Control in DeepMind Control Suite
  E.4 Design Optimization
  E.5 Network Architecture
Appendix F Human Study for the Intuitive Designs
  F.1 Visual Navigation Tasks
  F.2 Continuous Control Tasks from the DMC Suite

Andrei Atanov1*  Jiawei Fu1*  Rishubh Singh1*  Isabella Yu1,2  Andrew Spielberg3  Amir Zamir1
1Swiss Federal Institute of Technology Lausanne (EPFL)
2Massachusetts Institute of Technology    3Harvard University
https://visual-morphology.epfl.ch/

Abstract

A de facto standard in solving computer vision problems is to use a common high-resolution camera and choose its placement on an agent (i.e., position and orientation) based on human intuition. On the other hand, extremely simple and well-designed visual sensors found throughout nature allow many organisms to perform diverse, complex behaviors. In this work, motivated by these examples, we raise the following questions:

  1. How effective are simple visual sensors in solving vision tasks?

  2. What role does their design play in their effectiveness?

We explore simple sensors with resolutions as low as one-by-one pixel, representing a single photoreceptor. First, we demonstrate that just a few photoreceptors can be enough to solve many tasks, such as visual navigation and continuous control, reasonably well, with performance comparable to that of a high-resolution camera. Second, we show that the design of these simple visual sensors plays a crucial role in their ability to provide useful information and successfully solve these tasks. To find a well-performing design, we present a computational design optimization algorithm and evaluate its effectiveness across different tasks and domains, showing promising results. Finally, we perform a human survey to evaluate the effectiveness of intuitive designs devised manually by humans, showing that the computationally found design is among the best designs in most cases.

1 Introduction

Visual sensors provide necessary information about the surrounding world to enable visual perception and problem-solving. A wide variety of visual sensors are found throughout nature [27, 3, 9], ranging from complex, lens-based eyes that perceive fine-grained signals to extremely simple ones, consisting of only a few photoreceptors that simply capture unfocused light from many directions to create a low-dimensional signal.

[Figure 1]

In addition to the sensor’s type, its position and orientation are also important to its effectiveness. Strategic placement of even the simplest sensors can enable complex behaviors such as obstacle avoidance, detection of coarse landmarks, and even some forms of predator avoidance [27]. For example, in dragonflies, an upward-facing visual acute zone, i.e., an area with high photoreceptor density, is hypothesized to allow more efficient prey detection by positioning it against the sky instead of a cluttered foliage background. Similarly, some species of surface-feeding fish have eyes with horizontal acute zones that, by exploiting the refractive index of water, allow them to see prey both above and below the surface even while entirely submerged. This wide variety of designs is believed to emerge as evolutionary adaptations to an animal’s specific morphology and the ecology in which it lives [27].

In computer vision, on the other hand, the design of visual sensors is mostly represented by one side of the spectrum, namely, complex camera sensors. Moreover, most effort is spent on algorithmic improvements, leaving sensor design to human intuition.

This paper explores the other side of the spectrum and employs extremely simple visual sensors. In particular, we choose visual sensors with a resolution as low as one pixel, representing a single photoreceptor. One can intuitively think of this as a camera with a 1×1 resolution. We demonstrate that just a few well-designed, i.e., strategically placed and oriented, photoreceptors provide sufficient information to solve many active vision tasks such as visual navigation, continuous control, and locomotion with a performance much higher than that of a blind agent and close to that of a complex camera sensor (see Fig. 1).

Similar to findings in nature, when using simple photoreceptors, we find that designing the sensors’ placement, orientation, and field of view (FoV) is essential in achieving optimal performance, and an uninformed (random) design can result in a performance close to that of a blind agent without access to any visual signal. To find well-performing designs, we present a computational design optimization method that optimizes sensors’ design for a given agent, environment, and task at hand. We demonstrate promising results of its effectiveness in improving initial random designs in a variety of domains, allowing us to achieve performance similar to that of the camera sensor. Finally, to estimate whether humans can devise performant designs, we conduct a human survey to collect human intuitive designs and find that the computational design is among the best designs in most cases.

2 Related Work

Camera Design Optimization. This line of work aims at optimizing camera parameters such as lens configuration [53, 47, 20, 2, 5], camera placement [40, 24, 19], and others [68, 37, 60, 59] to improve downstream performance. Most of these approaches consider static downstream tasks such as image restoration or depth estimation with a differentiable loss function, which, combined with a differentiable renderer [63, 59, 47], enables using gradient optimization methods for design optimization. In this work, we focus on active vision tasks, where the downstream performance is defined by a non-differentiable reward function. Most similar to our work, [6] learns an active camera that rotates during the episode but has a design space limited to turning along a single axis. In contrast to these works, we co-learn both the active vision task and the design of extremely simple photoreceptor visual sensors.

Alternative Visual Sensors. In addition to RGB camera sensors, prior work in robotics and visual sensing has also made use of time-of-flight sensors [28], LIDARs, and event cameras. These sensors usually produce high-resolution images and have been used in robust 3D mapping and navigation [34, 39, 14, 30] and obstacle avoidance at high speeds [13, 70]. In contrast to these sensors, the simple photoreceptors we explore only provide sparse signals and are much smaller in size compared to other sensors.

Computational Design of Robot Morphologies. Since the idea of computationally designing a robot body for a given task is reminiscent of the evolution of organisms, it is not surprising that evolutionary algorithms were prominent early candidates for design, beginning with co-design of form, actuators, and/or controllers [46, 18, 8, 7]. Such methods were even used to computationally design robots built from organic matter [25], including those with the life-like ability to (physically) reproduce [26]. More efficient co-optimization algorithms emerged [62, 49, 15, 43, 35, 51, 33] leveraging differentiable simulation [52]. Yet despite their efficiency, these direct optimization methods converge to a single local minimum and are not robust against a wide variety of conditions. Learning-based approaches have been used to co-design over learned controllers and geometric forms [42, 67] as well as wholesale shape and topology [74, 69, 72, 64]. Learning-based approaches for sensing have been sparser, but have natural value in designing agents that are robust against a wide range of environmental stimuli. Sampling-based methods have been used in the design of static infrastructure [41] and soft robots [50], but to date, the role of vision-based sensing remains mostly unexplored in robot design.

ML-Based Discovery. Recently, in many fields, machine learning-based search methods have proved useful for discovering new optimal designs, e.g., novel drugs [21, 38], catalysts [75, 76], or novel materials [55]. These methods usually rely on and benefit from large amounts of data to train a model of the underlying process and predict the desired properties of novel designs. In contrast, in our case of designing visual sensors, there is no such dataset readily available. We, therefore, rely on exploring the design space using simulation to provide us with the performance of different designs.

[Figure 2]

Zero-Order Optimization. Zero-order optimization methods aim to optimize an unknown function that can only be evaluated at proposed points and has no other available information, such as gradients [36, 48, 17, 16, 66]. Our design optimization problem for visual sensors, similar to other design optimization problems, falls into this category, as no gradients of performance w.r.t. the design are available. Most similar to our work is [73], which also uses a joint design-control optimization method, applying it to the robot’s morphology design. In contrast, we apply it to the problem of designing visual sensors and consider more challenging tasks of visual navigation using scans of real-world buildings.

3 The Photoreceptor Sensor: Computational Model and Design Space

This section defines a photoreceptor sensor (PR), an example of a simple visual sensor explored in this work. First, we present the employed computational model of a PR. Then, we describe the considered design space of these visual sensors. Finally, since no prior work has explored the usage and design of these sensors, we describe the three design types explored in this work: random, computational via design optimization, and intuitive.

3.1 Computational Model of a Photoreceptor

We define a photoreceptor (PR) as a visual sensor located at a specific point in space that integrates all incoming light from a specific field of view. Unlike complex, high-resolution lens-based sensors, a PR does not produce a high-resolution image but only provides a low-dimensional signal of the average light intensity (for each color channel). An analogy from nature is a single photoreceptor placed into a pigment tube, allowing light only from a given direction and field of view. In practice, such a sensor can be realized with a photodiode [12] by restricting its field of view using a casing that prevents it from receiving the light from the entire scene. The lensless compound eye realized by Kogos et al. [23] is analogous to a grid of PR sensors introduced below. Note that our framework, including the design optimization method in Sec. 5, is agnostic to this particular choice of a simple visual sensor and can work with any other sensor that can be implemented in a simulator (e.g., see an optimized camera design in Tab. 1).

Computationally, we implement this definition of the PR sensor as an averaging of the pixel values of a pinhole camera image. This approach allows us to model PRs in any simulator that provides a rendering engine without additional implementation costs. To read the signal $x$ of a PR with given extrinsic (position and orientation) and intrinsic (field of view) parameters, we spawn a camera with the same parameters and render its corresponding image view $I \in \mathbb{R}^{3 \times H \times W}$. Then, we average its signal along the spatial dimensions to get the final value $x^{c} = \frac{1}{HW} \sum_{i,j} I^{c}_{ij}$, $x \in \mathbb{R}^{3}$, where $c \in \{1, 2, 3\}$ indexes the color channel and $(i, j)$ the spatial pixel coordinates. In addition to a single PR, we also consider simple sensors using a grid of PRs of size $B \times B$ for low $B$ ($\leq 8$ in our experiments) that share the same position but have different adjacent fields of view. We implement such a grid by splitting the image into $B^{2}$ patches and averaging each of them spatially. See Fig. 2-Center for the visualization. A single grid sensor enables the extraction of useful information (e.g., direction of motion) and makes a useful building block for a simple visual system. Moreover, using such a grid instead of $B^{2}$ independent photoreceptors results in a significant computational improvement when training in simulation, as it requires rendering only a single image instead of $B^{2}$ images.
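
As a concrete illustration, the patch-averaging model above amounts to a few lines of array manipulation. Below is a minimal NumPy sketch (function and variable names are ours, for illustration only), assuming the simulator has already rendered an image for a camera spawned with the sensor’s parameters:

```python
import numpy as np

def photoreceptor_grid_signal(image: np.ndarray, B: int = 4) -> np.ndarray:
    """Average a rendered RGB image into a BxB grid of photoreceptor signals.

    image: float array of shape (3, H, W) rendered by a pinhole camera placed
           with the sensor's position, orientation, and field of view.
    Returns an array of shape (B, B, 3): one averaged RGB value per PR.
    B = 1 recovers a single photoreceptor (a "1x1 camera").
    """
    C, H, W = image.shape
    assert H % B == 0 and W % B == 0, "image must split evenly into BxB patches"
    # Split both spatial dimensions into B blocks and average within each patch.
    patches = image.reshape(C, B, H // B, B, W // B)
    return patches.mean(axis=(2, 4)).transpose(1, 2, 0)
```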

3.2 Design of Visual Sensors

Design Space. We associate each sensor with its 7-dimensional design parameter vector $\theta_i = [\mathbf{x}_i, \mathbf{y}_i, \mathbf{z}_i, \mathrm{yaw}_i, \mathrm{pitch}_i, \mathrm{roll}_i, \mathrm{fov}_i]$, where $(\mathbf{x}_i, \mathbf{y}_i, \mathbf{z}_i) \in \mathbb{R}^{3}$ is the position in space, which we constrain to be on the agent’s body, $(\mathrm{yaw}_i, \mathrm{pitch}_i, \mathrm{roll}_i) \in [0, 2\pi]^{3}$ is the orientation, and $\mathrm{fov}_i \in [0, 180]$ is the field of view. See Fig. 2-Left for the visualization. In our experiments, we use $K \in \{2, 4, 8\}$ sensors and explore sensors represented by a single PR (a $1 \times 1$ grid) and grids of PRs of sizes $4 \times 4$ and $8 \times 8$. This results in a total of $KB^{2}$ PRs (ranging from 2 to 256 in our experiments), with the visual observation represented as $\{x_{kj}\}_{k=1,j=1}^{K,B^{2}}$. We also explore different designs for a camera sensor, in which case we have a single camera sensor and change its vector of parameters $\theta$.
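
To make this design space concrete, the random design described in the list below simply samples each $\theta_i$ uniformly from the space. A minimal sketch, assuming hypothetical per-agent bounds that keep the sampled positions on the agent’s body:

```python
import numpy as np

def sample_random_design(K: int, body_low, body_high, rng=np.random) -> np.ndarray:
    """Sample K design vectors theta_i = [x, y, z, yaw, pitch, roll, fov]."""
    pos = rng.uniform(body_low, body_high, size=(K, 3))  # position on the body
    rot = rng.uniform(0.0, 2 * np.pi, size=(K, 3))       # yaw, pitch, roll
    fov = rng.uniform(0.0, 180.0, size=(K, 1))           # field of view (degrees)
    return np.concatenate([pos, rot, fov], axis=1)       # shape (K, 7)
```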

Design Types. Choosing a well-performing visual sensor design plays a crucial role in the performance of the final system (e.g., a navigation agent, see Fig. 7). In this work, we explore the following three approaches to instantiating the design of a visual sensor:

  • Random design corresponds to sampling θ randomly from the design space. It sets a baseline for a computational design method, which should result in more performant designs (in cases where such designs exist in the design space for a given task).

  • Computational design tailors the sensor parameters to a specific vision task, agent morphology, and environment, optimizing the corresponding performance of the agent. We introduce the employed computational design optimization method in Sec. 5.

  • Intuitive design corresponds to a design devised intuitively by a human. Since there is no obvious choice for this design, we perform a human survey, asking participants to devise a design that would lead to the best performance on a given task, agent, and environment. We describe the design of the survey in Appendix F and discuss our findings on the effectiveness of human intuition in comparison to computational design in Sec. 5.4.

4 Simple Photoreceptors are Effective Visual Sensors

In this section, we demonstrate that simple photoreceptors can be effective visual sensors for solving different active vision tasks. Specifically, we show that in most cases, an agent equipped with (well-designed) PR sensors significantly outperforms a blind agent without access to any visual signal and achieves performance close to that of an agent with a high-resolution camera sensor.

4.1 Experimental Setting

We perform our experiments with the following active vision tasks. First, we consider two visual navigation tasks using the Habitat [54] simulator with 3D scans of real apartments from the Matterport3D [4] dataset. Our second set of tasks consists of continuous control tasks from the DeepMind Control (DMC) Suite [57], which we attempt to solve solely from the vision signal. Below, we provide a brief description of the experimental setting. For more detailed information, please refer to the Supplementary Material.

Reinforcement learning background. We consider solving the active vision tasks as decision-making processes using reinforcement learning in partially observable Markov decision processes (POMDPs). At a state $s_t$, the agent receives an observation $o_t$ that does not precisely determine the underlying state $s_t$. The agent then applies an action $a_t$, transitions to the next state $s_{t+1}$, and receives a reward $r_t$. Let $\tau$ be the trajectory rollout provided to the agent by iterating the steps above, i.e., $\tau_t = (o_t, a_t, r_t, o_{t+1}, \cdots)$. Assume the agent computes the action $a_t$ with a control policy $\pi$, i.e., $a_t \sim \pi(\cdot \,|\, o_t)$. We find the optimal control policy $\pi^{\star}$ by optimizing the expected return $\mathbb{E}_{\tau_t \sim \pi}[R(\tau_t)]$, where the return is defined as $R(\tau_t) = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}$ and $\gamma$ is the discount factor. In our experiments, we use Proximal Policy Optimization (PPO) [44] to optimize the control policy $\pi$.
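
To make the return definition concrete, here is a minimal sketch of computing $R(\tau)$ for a finite recorded rollout (variable names are ours):

```python
def discounted_return(rewards, gamma: float = 0.99) -> float:
    """Compute R(tau) = sum_i gamma^i * r_{t+i} for one recorded rollout."""
    R = 0.0
    # Accumulate backwards so each step folds in its discounted future return.
    for r in reversed(rewards):
        R = r + gamma * R
    return R
```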

Navigation in Habitat. We train navigation agents for PointGoalNav and TargetNav tasks in 3D replicas of real houses from the Matterport3D dataset [4] using the Habitat simulator [54]. In PointGoalNav, the agent is randomly spawned in an environment and needs to navigate to a target coordinate. The agent observes an egocentric RGB view and its current position (coordinate) and orientation through the GPS+Compass sensor. We measure the performance of an agent using SPL (Success weighted by Path Length) [1], which quantifies the performance relative to the optimal trajectory. In TargetNav, the agent is also equipped with the egocentric RGB sensor and the GPS+Compass sensor. In contrast to PointGoalNav, the agent does not receive a target coordinate but is asked to navigate to a green sphere that is randomly placed in the house at a height of 1.5m from the floor. This task, therefore, requires exploration and target identification by design. We therefore measure the performance of an agent using the success rate, i.e., whether or not it finds the target sphere.

Continuous Control in DMC. We train continuous control agents using the MuJoCo simulator [58] on the DeepMind Control (DMC) benchmark [57]. DMC provides a variety of continuous control tasks, including reaching, manipulation, locomotion, etc. In our context, we focus on learning a control policy that receives only visual information from either photoreceptors or a camera. We use the following six tasks: Reacher:Hard, Walker:Stand, Walker:Walk, Walker:Run, Finger:Spin, and Finger:Turn Easy (see Sec. E.3 and the original DMC video (https://www.youtube.com/watch?v=rAai4QzcYbs) for a more detailed description of the tasks).

[Figure 3]

Architecture. Fig. 2-Right illustrates our control policy architecture. We model the policy $\pi_w(a_{t+1} \,|\, o_t)$ using a simple three-layer Transformer [61] backbone that encodes the current observation $o_t$ from a visual sensor (camera or PRs, plus GPS+Compass for navigation) and a small MLP that predicts the next action $a_{t+1}$. For the PR-based policy, we use $\{[p_{kj}, \theta_k, e_j]\}_{kj}$ as input tokens, where $\theta_k$ is the design vector of the $k^{\text{th}}$ grid sensor and $e_j$ is the trainable positional embedding of the $j^{\text{th}}$ PR in the grid. For the camera-based policy, similar to ViT [11], we split the input image into 16×16 patches $\{x_{ij}\}$, flatten them, and add positional embeddings and the design vector $\theta$ to construct the final input tokens for the encoder: $\{[\overline{x}_{ij}, \theta, e_{ij}]\}_{ij}$. Since we use a single camera, one could omit the design embedding $\theta$, but we keep it for consistency and because we use it in the design optimization method described in Sec. 5.2.
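
A minimal sketch of how the PR input tokens might be assembled (PyTorch; the class and tensor names are ours, and details may differ from the actual implementation described in the appendix):

```python
import torch
import torch.nn as nn

class PRTokenizer(nn.Module):
    """Build input tokens [p_kj, theta_k, e_j] for K grid sensors of B*B PRs each."""

    def __init__(self, K: int, B: int, d_model: int = 64):
        super().__init__()
        # e_j: trainable positional embedding of the j-th PR, shared across grids.
        self.pos_embed = nn.Parameter(torch.randn(B * B, d_model))
        # Each token concatenates the PR signal (3), design vector (7), and e_j.
        self.proj = nn.Linear(3 + 7 + d_model, d_model)

    def forward(self, pr_signals: torch.Tensor, designs: torch.Tensor) -> torch.Tensor:
        # pr_signals: (K, B*B, 3) averaged RGB values; designs: (K, 7) vectors theta_k.
        K, J, _ = pr_signals.shape
        theta = designs.unsqueeze(1).expand(K, J, 7)        # theta_k for every PR in grid k
        e = self.pos_embed.unsqueeze(0).expand(K, J, -1)    # e_j for the j-th PR
        tokens = torch.cat([pr_signals, theta, e], dim=-1)  # [p_kj, theta_k, e_j]
        return self.proj(tokens.reshape(K * J, -1))         # (K*B*B, d_model) tokens
```

The resulting tokens would then be fed to the three-layer Transformer encoder, followed by the action-prediction MLP.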

[Figure 4]

Control baselines. To estimate the effectiveness of the PR sensor, we use the following two baselines.

  • The Intelligent blind agent does not receive any visual signal and shows what performance can be achieved just by utilizing the structure of the problem and environment. It does not receive any input in DMC and receives only GPS+Compass in both navigation tasks.

  • The Camera agent receives a high-resolution image signal from a camera sensor. This baseline provides a comparison to a de facto standard for solving (active) vision tasks. For the navigation tasks, we use a resolution of 128×128 and the computationally found design for the camera sensor, as we found it to perform better than the default intuitive design from the Habitat simulator in both tasks (see Tab. 1 and Fig. 9). For DMC tasks, we choose the best performance between the default 3rd-person view camera with a resolution of 84×84 (a standard choice in the literature [71, 29]) and an egocentric camera with an intuitive design (e.g., a forward-looking camera on top of the torso for the walker agent), which we also find to perform better in some cases. We also use a convolutional architecture similar to [56] for a fair comparison, as we find it to perform better.

[Figure 5]
[Figure 6]

4.2 Photoreceptors Achieve Performance Close to a Camera

Visual Navigation. Fig. 3 shows that for both PointGoalNav and TargetNav tasks, even a simple visual sensor consisting of two 4×4 PR grids (32 PRs) provides useful information, allowing the corresponding agent to significantly outperform an intelligent blind agent without a visual signal. The PR agents match the performance of the camera agent using the same shallow 3-layer Transformer encoder as for PRs, while having a visual signal bandwidth of only ≈1% of that of the 128×128 camera sensor. When compared, for a fair comparison, to the camera agent using the ResNet-50 encoder, a default choice in the literature (“gold standard”), PR agents still perform reasonably well. Fig. 4 shows visualizations of the best-performing photoreceptor designs for both PointGoalNav and TargetNav tasks.

Fig. 5 further demonstrates exemplar trajectories for each type of agent on each task. In the PointGoalNav task, where the target position is known and the main challenge is to navigate to it efficiently, we find that the PR agent follows more optimal trajectories (as measured by SPL) and effectively uses the visual signal to avoid collisions, similar to the camera agent. In the TargetNav task, on the other hand, the target’s position is unknown and can only be identified visually, so it is important to explore a scene efficiently. Fig. 5-right shows that both the PR and camera agents explore the scene much more efficiently, achieving a much higher success rate compared to the blind agent.

[Figure 7]

Continuous Control in DMC. Fig. 6 demonstrates that for most of the considered tasks, an agent with a few 1×1 photoreceptors significantly outperforms the blind agent and performs close to the camera agent. Adding more sensors or increasing the grid sizes further improves the performance in most cases. However, we find that having a higher-resolution signal leads to a performance drop in some cases. We conjecture that the main reason is a suboptimal design. Indeed, as we find in Sec. 5, the design optimization algorithm fails to find an optimal design in some cases.

We note that, in general, the camera agent can achieve higher performance by utilizing, for example, more specialized and tuned reinforcement learning algorithms [71], data augmentation techniques [29], and/or training for more steps [56]. In contrast, we employ the standard PPO [44] algorithm with a relatively small number of steps compared to [56] (5×). Note, however, that PR agents achieve reasonably good performance even when compared to the maximum possible reward of 1000. In addition, we find that, for example, longer training of the K=4, 1×1 PR design leads to an improved reward of 930 compared to the original 605, suggesting that the performance of PR agents can be improved further by specializing the learning algorithm or training longer.

5 Visual Sensors Design Optimization

Photoreceptors can be effective visual sensors, as we showed in the previous section. However, how does one design such a visual sensor? Where should one place each PR, and in which direction should they point to provide the most useful information for a given task, environment, and agent morphology? In this section, we first show that the design choice is essential to achieving good performance. We then introduce a computational design optimization method that optimizes the design for a given agent, task, and environment and shows promising results in improving initial designs across multiple tasks. Finally, we perform a human survey to provide a baseline for an intuitive design, finding that the computational design is among the best designs.

5.1 Design is Important for the Effectiveness of Photoreceptors

How does the design of photoreceptors influence the final performance of a control policy? Fig. 7 shows that the performance of a poor design can drop drastically compared to the best design for the corresponding task. We find that some designs result in performance similar to that of a blind agent, suggesting that the visual signal does not provide any useful information. These results signify the importance of a design optimization algorithm to find well-performing designs automatically.

5.2 Computational Design via Joint Optimization

The design $\theta$ of the visual sensor(s), either PRs or a camera, defines what observation the agent receives at each step, i.e., $o_t \triangleq o_t(\theta)$, and thus which design-specific control policy $\pi_w \triangleq \pi_w^{\theta}$, with what performance, will be learned. To find the best design $\theta^{*}$, one would need to find the design that leads to training the best-performing design-specific control policy, resulting in the bi-level optimization problem $\max_{\theta} \max_{w} \mathbb{E}_{\tau \sim \pi_w} R(\tau)$. However, training the design-specific control policy $\pi_w^{\theta}$ in an inner loop for every new design would make this process prohibitively expensive.

Similar to [73], we, instead, cast this problem as joint optimization and amortize the costs of training multiple design-specific policies by training a single “generalist” policy that implements control for different designs.

[Figure 8]

We achieve this by conditioning the policy on the current design vector: $\pi_w(a_{t+1} \,|\, o_t, \theta)$. In practice, we use the same design-specific architecture described in Sec. 4.1, as it already receives the corresponding design vector $\theta$ as part of the input. To optimize the design, we define a design policy $\pi_{\phi}(\theta)$ and optimize its parameters jointly with the parameters of the control policy:

$\phi^{*}, w^{*} = \arg\max_{\phi, w} \; \mathbb{E}_{\tilde{\theta} \sim \pi_{\phi}(\theta)} \, \mathbb{E}_{\tau \sim \pi_{w}(a_{t} \,|\, o_{t}, \tilde{\theta})} \, R(\tau). \qquad (1)$

In practice, this implies extending the original decision process with an additional design action step $a_0 \triangleq \theta$ at the beginning of each episode. We set the corresponding observation and reward to zero, $o_0 = \mathbf{0}$, $r_0 = 0$. One, however, can use a design-dependent reward $r_0$ to favor specific designs, e.g., low-cost ones. We then use the same PPO algorithm, updating the control policy using the control actions $a_{1:T}$ and the design policy using the design action $a_0$ from each rollout $\tau$. See Fig. 8 for the visualization, and refer to Sec. E.4 for further implementation details.
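
A minimal sketch of this extended episode, assuming hypothetical env, design_policy, and control_policy objects (the actual training additionally runs PPO updates on both policies from the collected rollouts):

```python
import numpy as np

def rollout_with_design_action(env, design_policy, control_policy, T: int):
    """One episode of the extended decision process: a design action, then control."""
    trajectory = []
    theta = design_policy.sample()                # design action a_0 = theta
    trajectory.append((np.zeros(1), theta, 0.0))  # o_0 = 0, r_0 = 0 (design step)
    obs = env.reset(design=theta)                 # sensors instantiated with design theta
    for t in range(T):
        action = control_policy.act(obs, theta)   # generalist policy conditioned on theta
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    # PPO then updates the control policy from a_{1:T} and the design policy from a_0.
    return trajectory
```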

Design Policy. We model the design policy as a Gaussian distribution over the design parameters, $\pi_{\phi}(\theta) = \mathcal{N}(\theta \,|\, \mu, \mathrm{diag}(\sigma))$, where $\phi = (\mu, \sigma)$ and $\mu, \sigma \in \mathbb{R}^{K \times 7}$. After training, we use $\theta^{*} = \mu$ as the final optimal design. Note that while the distribution models each sensor independently, the final designs of the individual sensors are informed of each other by virtue of being optimized together.
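
A sketch of the design policy under this Gaussian parameterization (a hypothetical class matching the design_policy object used in the rollout sketch above; in the actual method, $\mu$ and $\sigma$ are updated with PPO):

```python
import numpy as np

class GaussianDesignPolicy:
    """pi_phi(theta) = N(theta | mu, diag(sigma)) over K sensors x 7 parameters."""

    def __init__(self, K: int, init_sigma: float = 0.1):
        self.mu = np.zeros((K, 7))                # learned design means
        self.sigma = np.full((K, 7), init_sigma)  # initial std controls "locality"

    def sample(self) -> np.ndarray:
        # Draw a design action a_0 = theta for the next episode.
        return self.mu + self.sigma * np.random.randn(*self.mu.shape)

    def final_design(self) -> np.ndarray:
        # After training, the optimal design is theta* = mu.
        return self.mu
```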

Generalist Control Policy. Training the generalist policy allows, in principle, amortizing the costs of training design-specific policies through “knowledge” and parameter sharing. Modeling and training such a policy that implements control for all possible designs can still have high memory and computing costs. Note, however, that Eq. 1 only needs the policy that implements control for (likely) samples $\tilde{\theta}$ from the current design policy to provide it with a local direction for improvement. Thus, we only need to train a local generalist policy and control the locality via the variance of the design policy, which in the limit of low variance allows approximating such a control policy with a linear dependency on the design [32, 31]. In practice, we initialize the variance $\sigma$ to allow training a local generalist policy that performs similarly to design-specific policies and, thus, provides a good signal for a design update.

5.3 Design Optimization Experiments

[Figure 9]

In this section, we demonstrate the effectiveness of the proposed design optimization method. We show that it can improve the performance of the initial design for both photoreceptor and camera sensors.

Design optimization improves the performance upon the initial design guess. Fig. 9 shows that in the majority of cases, the design optimization method improves the initial random design. For DMC, we run the design optimization from two random initializations for each setting (task and $K$) and find that while it improves their performance, the best-performing computational design depends on the initialization (e.g., reaching rewards of 375 and 562 for the two initializations on the Walker:Walk task with $K=2$). This suggests that the optimization landscape might contain multiple local optima, and improving the exploration abilities of the design optimization method is an important research direction for finding the best-performing designs.

Table 1: Comparison of camera designs on the navigation tasks.

Camera Design     PointGoalNav (SPL)    TargetNav (Success Rate)
Intuitive         0.447                 0.363
Computational     0.518                 0.405
Blind Agent       0.445                 0.119

Design optimization improves the “default” intuitive camera design. We also apply the design optimization to explore whether we can improve the intuitive camera design used by default in the Habitat AI [54] simulator. Tab. 1 shows that the agent using a computationally designed camera outperforms the one using the default intuitive design in both navigation tasks.

5.4 Intuitive Designs

[Figure 10]

In this section, we explore the effectiveness of human intuition in engineering well-performing designs of simple photoreceptor sensors. Since there is no single obvious way to design such visual sensors, we conducted a human survey to collect intuitive designs. We ask participants to choose the design parameters of the photoreceptors within our defined design space for a given morphology and task. We collect eight designs for the TargetNav visual navigation task and six designs for the Walker agent in DMC, evaluating the latter on the Walk and Stand tasks. We provide a more detailed description of the survey setting in Appendix F.

Fig. 10 shows that, at its best, human intuition can provide well-performing designs, and the computational design is among the best designs (or the best one) in most cases. We also find a high variance in the performance of different intuitive designs in all settings, signifying the importance of a computational approach to visual sensor design.

5.5 Do designs transfer between tasks?

Optimizing a design for a given agent and task and deploying it can be a time-consuming process. One would wish to have a visual sensor design that can be optimized and deployed once and recycled for different downstream applications of the same robot without needing to repeat the process. We compare the performance of the different designs we collected in this work (random, intuitive, and optimized) on two pairs of tasks for the same agent morphology. Fig. 11 shows a general trend suggesting that one can optimize the design for one task and recycle it for another. However, some designs can underperform when transferred, especially for the Walker agent. This means that, to find a transferable design during design optimization on one task, some form of regularization needs to be included in the objective, in addition to task performance alone, to avoid such cases.

[Figure 11]

5.6 Evaluation in the Real World

To evaluate generalization and ensure that the strong performance of photoreceptors is not confined to simulators, we conducted the target navigation experiment (without access to the GPS+Compass sensor) in a real-world setting.

We deployed a control policy using 64 PRs (less than 1% of the camera resolution) on a real robot, as shown in Fig. 12. It demonstrates impressive performance, successfully navigating to the target ball in an unknown room with no real-world training, relying solely on the low-resolution visual signal. The results can be seen at https://visual-morphology.epfl.ch/#real-world.

[Figure 12]

6 Discussion and Limitations

In this work, we aim to demonstrate that even extremely simple visual sensors like photoreceptors can be effective in solving vision tasks that require an understanding of the surrounding world and the self (proprioception). This shows that, similar to numerous examples in nature, a system with certain simplicities can exhibit intelligent and complex behaviors. It also suggests an interesting research direction alongside the trends focused on training larger models on ever-increasing amounts of data and complex sensors.

We demonstrate that design optimization of simple visual sensors is important to achieve performance similar to that of a more complex camera sensor. We, therefore, approach this problem computationally and suggest a design optimization method that is able to improve the initial design and find well-performing designs. Below, we discuss some limitations of our work.

Scope of the Scenarios, Vision Tasks, and Agents. We instantiated a first attempt in this area and focused on active vision tasks, primarily around locomotion, and on typical robotic agents. Exploring other visual tasks would be useful to better understand the limits and applicability of simple visual sensors. Similarly, the most useful scenario for the narrative we provided may not necessarily be typical tasks and typical robots in typical environments, but rather less usual ones, e.g., a perceptual micro-robot that is injected into the body to perform a medical task.

Additional Constraints and Regularized Design Optimization. In this work, we primarily focused on the performance of the PR sensor to demonstrate its effectiveness. However, other aspects, such as the number of sensors, production costs, power consumption, or physical size constraints, matter and are important to include in the optimization objective. For instance, by the “square–cube law,” disregarding weight distribution constraints would falsely suggest that the body of an animal can grow to a size that is practically impossible.

Design Space Parametrization for Complex Robot Morphologies. The robot morphologies considered in this work primarily consist of primitive shapes such as boxes and cylinders, which makes it relatively simple to parametrize the design space so that it is constrained to the robot’s body. However, many real-world robots, e.g., soft robots, might have more complex shapes. Developing a general way to parametrize more complex robot surfaces is a direction toward making design optimization methods more easily applicable.

Local Design Optimization. Our design optimization method is able to improve upon the initialization and find a well-performing design. However, starting from different initializations may be important, as it is a local optimization method that can get stuck in local optima. Incorporating methodologies from global search methods may be useful to achieve better overall performance.

References

  • Anderson et al. [2018] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On Evaluation of Embodied Navigation Agents, 2018. arXiv:1807.06757 [cs].
  • Baek et al. [2021] Seung-Hwan Baek, Hayato Ikoma, Daniel S. Jeon, Yuqi Li, Wolfgang Heidrich, Gordon Wetzstein, and Min H. Kim. Single-shot Hyperspectral-Depth Imaging with Learned Diffractive Optics. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2631–2640, Montreal, QC, Canada, 2021. IEEE.
  • Banks et al. [2015] Martin S. Banks, William W. Sprague, Jürgen Schmoll, Jared A. Q. Parnell, and Gordon D. Love. Why do animal eyes have pupils of different shapes? Science Advances, 1(7):e1500391, 2015.
  • Chang et al. [2017] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D Data in Indoor Environments, 2017. arXiv:1709.06158 [cs].
  • Chang and Wetzstein [2019] Julie Chang and Gordon Wetzstein. Deep Optics for Monocular Depth Estimation and 3D Object Detection, 2019. arXiv:1904.08601 [cs, eess].
  • Chen et al. [2022] Peihao Chen, Dongyu Ji, Kunyang Lin, Weiwen Hu, Wenbing Huang, Thomas H. Li, Mingkui Tan, and Chuang Gan. Learning Active Camera for Multi-Object Navigation, 2022. arXiv:2210.07505 [cs].
  • Cheney et al. [2014a] Nicholas Cheney, Jeff Clune, and Hod Lipson. Evolved electrophysiological soft robots. In Artificial Life Conference Proceedings, pages 222–229. MIT Press, 2014a.
  • Cheney et al. [2014b] Nick Cheney, Robert MacCurdy, Jeff Clune, and Hod Lipson. Unshackling evolution: evolving soft robots with multiple materials and a powerful generative encoding. ACM SIGEVOlution, 7(1):11–23, 2014b.
  • Cronin et al. [2014] Thomas W. Cronin, Sönke Johnsen, N. Justin Marshall, and Eric J. Warrant. Visual Ecology. Princeton University Press, student edition, 2014.
  • Datta et al. [2021] Samyak Datta, Oleksandr Maksymets, Judy Hoffman, Stefan Lee, Dhruv Batra, and Devi Parikh. Integrating egocentric localization for more realistic point-goal navigation agents. In Conference on Robot Learning, pages 313–328. PMLR, 2021.
  • Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021. arXiv:2010.11929 [cs].
  • Emmons [1967] R. B. Emmons. Avalanche-Photodiode Frequency Response. Journal of Applied Physics, 38(9):3705–3714, 1967.
  • Falanga et al. [2020] Davide Falanga, Kevin Kleber, and Davide Scaramuzza. Dynamic obstacle avoidance for quadrotors with event cameras. Science Robotics, 5(40):eaaz9712, 2020.
  • Francis et al. [2015] Sobers Lourdu Xavier Francis, Sreenatha G. Anavatti, Matthew Garratt, and Hyunbgo Shim. A ToF-Camera as a 3D Vision Sensor for Autonomous Mobile Robotics. International Journal of Advanced Robotic Systems, 12(11):156, 2015.
  • Ha et al. [2018] Sehoon Ha, Stelian Coros, Alexander Alspach, Joohyung Kim, and Katsu Yamane. Computational co-optimization of design parameters and motion trajectories for robotic systems. The International Journal of Robotics Research, 37(13-14):1521–1536, 2018.
  • Hansen [2006] Nikolaus Hansen. The CMA Evolution Strategy: A Comparing Review. In Towards a New Evolutionary Computation: Advances in the Estimation of Distribution Algorithms, pages 75–102. Springer, Berlin, Heidelberg, 2006.
  • Hansen and Ostermeier [2001] Nikolaus Hansen and Andreas Ostermeier. Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation, 9(2):159–195, 2001.
  • Hiller and Lipson [2011] Jonathan Hiller and Hod Lipson. Automatic design and manufacture of soft robots. IEEE Transactions on Robotics, 28(2):457–466, 2011.
  • Hou et al. [2023] Yunzhong Hou, Xingjian Leng, Tom Gedeon, and Liang Zheng. Optimizing Camera Configurations for Multi-View Pedestrian Detection, 2023. arXiv:2312.02144 [cs].
  • Ikoma et al. [2021] Hayato Ikoma, Cindy M. Nguyen, Christopher A. Metzler, Yifan Peng, and Gordon Wetzstein. Depth from Defocus with Learned Optics for Imaging and Occlusion-aware Depth Estimation. In 2021 IEEE International Conference on Computational Photography (ICCP), pages 1–12, Haifa, Israel, 2021. IEEE.
  • Jumper et al. [2021] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
  • Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization, 2017. arXiv:1412.6980.
  • Kogos et al. [2020] Leonard C. Kogos, Yunzhe Li, Jianing Liu, Yuyu Li, Lei Tian, and Roberto Paiella. Plasmonic ommatidia for lensless compound-eye vision. Nature Communications, 2020.
  • Krause et al. [2008] Andreas Krause, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 134(6):516–526, 2008.
  • Kriegman et al. [2020] Sam Kriegman, Douglas Blackiston, Michael Levin, and Josh Bongard. A scalable pipeline for designing reconfigurable organisms. Proceedings of the National Academy of Sciences, 117(4):1853–1859, 2020.
  • Kriegman et al. [2021] Sam Kriegman, Douglas Blackiston, Michael Levin, and Josh Bongard. Kinematic self-replication in reconfigurable organisms. Proceedings of the National Academy of Sciences, 118(49):e2112672118, 2021.
  • Land and Nilsson [2012] Michael F. Land and Dan-Eric Nilsson. Animal Eyes. Oxford University Press, Oxford, New York, second edition, 2012.
  • Lange and Seitz [2000] Robert Lange and Peter Seitz. Seeing distances – a fast time-of-flight 3D camera. Sensor Review, 20(3):212–217, 2000.
  • Laskin et al. [2020] Michael Laskin, Kimin Lee, Adam Stooke, Lerrel Pinto, Pieter Abbeel, and Aravind Srinivas. Reinforcement Learning with Augmented Data, 2020. arXiv:2004.14990 [cs, stat].
  • Liu et al. [2021] Zhijian Liu, Alexander Amini, Sibo Zhu, Sertac Karaman, Song Han, and Daniela L. Rus. Efficient and Robust LiDAR-Based End-to-End Navigation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13247–13254, 2021.
  • Lorraine and Duvenaud [2018] Jonathan Lorraine and David Duvenaud. Stochastic Hyperparameter Optimization through Hypernetworks, 2018. arXiv:1802.09419.
  • MacKay et al. [2019] Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions, 2019.
  • Matthews et al. [2023] David Matthews, Andrew Spielberg, Daniela Rus, Sam Kriegman, and Josh Bongard. Efficient automatic design of robots. Proceedings of the National Academy of Sciences, 120(41):e2305180120, 2023.
  • May et al. [2009] Stefan May, David Droeschel, Dirk Holz, Stefan Fuchs, Ezio Malis, Andreas Nüchter, and Joachim Hertzberg. Three-dimensional mapping with time-of-flight cameras. Journal of Field Robotics, 26(11-12):934–965, 2009.
  • Megaro et al. [2017] Vittorio Megaro, Espen Knoop, Andrew Spielberg, David I. W. Levin, Wojciech Matusik, Markus Gross, Bernhard Thomaszewski, and Moritz Bächer. Designing cable-driven actuation networks for kinematic chains and trees. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 1–10, 2017.
  • Močkus [1975] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference: Novosibirsk, July 1–7, 1974, pages 400–404. Springer, 1975.
  • Olague and Mohr [2002] Gustavo Olague and Roger Mohr. Optimal camera placement for accurate reconstruction. Pattern Recognition, 35(4):927–944, 2002.
  • Popova et al. [2018] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. Science Advances, 4(7):eaap7885, 2018.
  • Prusak et al. [2008] A. Prusak, O. Melnychuk, H. Roth, I. Schiller, and R. Koch. Pose estimation and map building with a Time-Of-Flight-camera for robot navigation. International Journal of Intelligent Systems Technologies and Applications, 5(3-4):355–364, 2008.
  • Sanket et al. [2020] Nitin J. Sanket, Chahat Deep Singh, Varun Asthana, Cornelia Fermüller, and Yiannis Aloimonos. MorphEyes: Variable Baseline Stereo For Quadrotor Navigation, 2020. arXiv:2011.03077 [cs].
  • Schaff et al. [2017] Charles Schaff, David Yunis, Ayan Chakrabarti, and Matthew R. Walter. Jointly optimizing placement and inference for beacon-based localization. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6609–6616. IEEE, 2017.
  • Schaff et al. [2019] Charles Schaff, David Yunis, Ayan Chakrabarti, and Matthew R. Walter. Jointly learning to construct and control agents using deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 9798–9805. IEEE, 2019.
  • Schaff et al. [2022] Charles Schaff, Audrey Sedal, and Matthew R. Walter. Soft robots learn to crawl: Jointly optimizing design and control with sim-to-real transfer. arXiv preprint arXiv:2202.04575, 2022.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, 2017. arXiv:1707.06347.
  • Schulman et al. [2018] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation, 2018. arXiv:1506.02438 [cs].
  • Sims [1994] Karl Sims. Evolving 3D morphology and behavior by competition. Artificial Life, 1(4):353–372, 1994.
  • Sitzmann et al. [2018] Vincent Sitzmann, Steven Diamond, Yifan Peng, Xiong Dun, Stephen Boyd, Wolfgang Heidrich, Felix Heide, and Gordon Wetzstein. End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging. ACM Transactions on Graphics, 37(4):114:1–114:13, 2018.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 25, 2012.
  • Spielberg et al. [2017] Andrew Spielberg, Brandon Araki, Cynthia Sung, Russ Tedrake, and Daniela Rus. Functional co-optimization of articulated robots. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 5035–5042. IEEE, 2017.
  • Spielberg et al. [2021] Andrew Spielberg, Alexander Amini, Lillian Chin, Wojciech Matusik, and Daniela Rus. Co-learning of task and sensor placement for soft robotics. IEEE Robotics and Automation Letters, 6(2):1208–1215, 2021.
  • Spielberg et al. [2023a] Andrew Spielberg, Tao Du, Yuanming Hu, Daniela Rus, and Wojciech Matusik. Advanced soft robot modeling in ChainQueen. Robotica, 41(1):74–104, 2023a.
  • Spielberg et al. [2023b] Andrew Spielberg, Fangcheng Zhong, Konstantinos Rematas, Krishna Murthy Jatavallabhula, Cengiz Oztireli, Tzu-Mao Li, and Derek Nowrouzezahrai. Differentiable visual computing for inverse problems and machine learning. Nature Machine Intelligence, 5(11):1189–1199, 2023b.
  • Sun et al. [2021] Qilin Sun, Congli Wang, Qiang Fu, Xiong Dun, and Wolfgang Heidrich. End-to-end complex lens design with differentiable ray tracing. ACM Transactions on Graphics, 40(4):1–13, 2021.
  • Szot et al. [2021] Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training Home Assistants to Rearrange their Habitat. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Szymanski et al. [2023] Nathan J. Szymanski, Bernardus Rendy, Yuxing Fei, Rishi E. Kumar, Tanjin He, David Milsted, Matthew J. McDermott, Max Gallant, Ekin Dogus Cubuk, Amil Merchant, Haegyeom Kim, Anubhav Jain, Christopher J. Bartel, Kristin Persson, Yan Zeng, and Gerbrand Ceder. An autonomous laboratory for the accelerated synthesis of novel materials. Nature, 624(7990):86–91, 2023.
  • Tassa et al. [2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind Control Suite, 2018. arXiv:1801.00690 [cs].
  • Tassa et al. [2020] Yuval Tassa, Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Piotr Trochim, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, and Nicolas Heess. dm_control: Software and Tasks for Continuous Control. Software Impacts, 6:100022, 2020. arXiv:2006.12983 [cs].
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
  • Tseng et al. [2021] Ethan Tseng, Ali Mosleh, Fahim Mannan, Karl St-Arnaud, Avinash Sharma, Yifan Peng, Alexander Braun, Derek Nowrouzezahrai, Jean-François Lalonde, and Felix Heide. Differentiable Compound Optics and Processing Pipeline Optimization for End-to-end Camera Design. ACM Transactions on Graphics, 40(2):1–19, 2021.
  • Vargas etal. [2021]Edwin Vargas, Julien N.P. Martel, Gordon Wetzstein, and Henry Arguello.Time-Multiplexed Coded Aperture Imaging: Learned Coded Aperture and Pixel Exposures for Compressive Imaging Systems, 2021.Issue: arXiv:2104.02820 arXiv:2104.02820 [cs, eess].
  • Vaswani etal. [2023]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN. Gomez, Lukasz Kaiser, and Illia Polosukhin.Attention Is All You Need, 2023.arXiv:1706.03762 [cs].
  • Wampler and Popović [2009]Kevin Wampler and Zoran Popović.Optimal gait and form for animal locomotion.ACM Transactions on Graphics (TOG), 28(3):1–8, 2009.
  • Wang etal. [2022]Congli Wang, Ni Chen, and Wolfgang Heidrich.dO: A Differentiable Engine for Deep Lens Design of Computational Imaging Systems.IEEE Transactions on Computational Imaging, 8:905–916, 2022.Conference Name: IEEE Transactions on Computational Imaging.
  • Wang etal. [2024]Tsun-HsuanJohnson Wang, Juntian Zheng, Pingchuan Ma, Yilun Du, Byungchul Kim, Andrew Spielberg, Josh Tenenbaum, Chuang Gan, and Daniela Rus.Diffusebot: Breeding soft robots with physics-augmented generative diffusion models.Advances in Neural Information Processing Systems, 36, 2024.
  • Wijmans etal. [2020]Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra.DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames, 2020.arXiv:1911.00357 [cs].
  • Williams [1992]RonaldJ. Williams.Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning.In Reinforcement Learning, pages 5–32. Springer US, Boston, MA, 1992.
  • Won and Lee [2019]Jungdam Won and Jehee Lee.Learning body shape variation in physics-based characters.ACM Transactions on Graphics (TOG), 38(6):1–12, 2019.
  • Wu etal. [2019]Yicheng Wu, Vivek Boominathan, Huaijin Chen, Aswin Sankaranarayanan, and Ashok Veeraraghavan.PhaseCam3D — Learning Phase Masks for Passive Single View Depth Estimation.In 2019 IEEE International Conference on Computational Photography (ICCP), pages 1–12, 2019.ISSN: 2472-7636.
  • Xu etal. [2021]Jie Xu, Andrew Spielberg, Allan Zhao, Daniela Rus, and Wojciech Matusik.Multi-objective graph heuristic search for terrestrial robot design.In 2021 IEEE international conference on robotics and automation (ICRA), pages 9863–9869. IEEE, 2021.
  • Xu etal. [2023]Jingao Xu, Danyang Li, Zheng Yang, Yishujie Zhao, Hao Cao, Yunhao Liu, and Longfei Shangguan.Taming Event Cameras with Bio-Inspired Architecture and Algorithm: A Case for Drone Obstacle Avoidance.In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, number55, pages 1–16. Association for Computing Machinery, New York, NY, USA, 2023.
  • Yarats etal. [2021]Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto.Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning, 2021.Issue: arXiv:2107.09645 arXiv:2107.09645 [cs].
  • Yuan etal. [2021]Ye Yuan, Yuda Song, Zhengyi Luo, Wen Sun, and Kris Kitani.Transform2act: Learning a transform-and-control policy for efficient agent design.arXiv preprint arXiv:2110.03659, 2021.
  • Yuan etal. [2022]Ye Yuan, Yuda Song, Zhengyi Luo, Wen Sun, and Kris Kitani.Transform2Act: Learning a Transform-and-Control Policy for Efficient Agent Design, 2022.Issue: arXiv:2110.03659 arXiv:2110.03659 [cs].
  • Zhao etal. [2020]Allan Zhao, Jie Xu, Mina Konaković-Luković, Josephine Hughes, Andrew Spielberg, Daniela Rus, and Wojciech Matusik.Robogrammar: graph grammar for terrain-optimized robot design.ACM Transactions on Graphics (TOG), 39(6):1–16, 2020.
  • Zitnick etal. [2020]CLawrence Zitnick, Lowik Chanussot, Abhishek Das, Siddharth Goyal, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Thibaut Lavril, Aini Palizhati, Morgane Riviere, and others.An introduction to electrocatalyst design using machine learning for renewable energy storage.arXiv preprint arXiv:2010.09435, 2020.
  • Zitnick etal. [2022]C.Lawrence Zitnick, Abhishek Das, Adeesh Kolluru, Janice Lan, Muhammed Shuaibi, Anuroop Sriram, Zachary Ulissi, and Brandon Wood.Spherical Channels for Modeling Atomic Interactions.2022.Publisher: [object Object] Version Number: 2.

Appendix Overview

The Appendix provides further discussions, details, and evaluations as outlined below:

  • In Appendix A, we study whether photoreceptor sensors allow extracting information about the state of the world and whether better-performing designs lead to more accurate world-state estimation.

  • Appendix B presents various analysis experiments: 1) we show that the photoreceptor agent can perform target detection, 2) we show the effectiveness of the design optimization method through various ablation experiments, and 3) we experimentally evaluate the importance of different design variables (such as height, pitch, etc.).

  • Appendix C provides additional visualizations of different designs, including random, intuitive, and computational designs and their corresponding performance.

  • Appendix D provides additional results of using grids of 4×4 photoreceptors for continuous control tasks in DMC.

  • In Appendix E, we provide a detailed description of our experimental settings, including the control policy training process and design optimization.

  • Appendix F provides details on the human study we conducted to collect intuitive designs from humans for both navigation and continuous control tasks.

Appendix A Can photoreceptors extract information about the world state?

In Sec. 4.2 of the main paper, we demonstrated that an agent equipped with only a few photoreceptors can perform well on active vision-based tasks. One would expect such a PR agent to be able to extract useful information about the state of the world using its visual sensors. In this section, we explore whether photoreceptors can extract information about the state of the world and of the agent itself, and whether better-performing designs extract state information more accurately.

[Figure 13]
[Figure 14]

We consider three tasks from the DMC Suite: Finger: Spin, Finger: Turn Easy, and Walker: Walk. For each task, we collect rollouts using the best-performing policy available. At each step, we record the default state information provided by the DMC benchmark; for the Walker: Walk task, for example, this includes the height of the body and the orientations and velocities of each body part. These state values are the default inputs used by state-based control algorithms and therefore provide a sufficient description of the world and agent states.

In addition to the state information, we collect visual sensory data for different designs (random, computational, and intuitive) achieving different reward values. Then, for each design, we regress the state values from the visual sensory data using the same backbone as for the policy network (trained from scratch). We use 80,000 timesteps for training and 20,000 for testing (test timesteps come from different episodes). For each state variable, we measure the coefficient of determination $R^2$ on the test set and average it over all state dimensions, representing the overall quality of state estimation from the photoreceptor sensors with the corresponding design.
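This probe can be summarized with a short sketch. The following is a minimal illustration, assuming precomputed arrays of sensor readings and state values; the linear least-squares regressor stands in for the policy-network backbone trained from scratch in the actual experiments, and all names and dimensions are ours.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative stand-in data: per-timestep sensor readings and DMC state values.
rng = np.random.default_rng(0)
sensors_train = rng.normal(size=(80_000, 12))   # e.g., K PR grids flattened to 12 values
states_train = rng.normal(size=(80_000, 24))    # e.g., 24 state dimensions
sensors_test = rng.normal(size=(20_000, 12))
states_test = rng.normal(size=(20_000, 24))

# A linear regressor stands in for the policy-network backbone used in the paper.
W, *_ = np.linalg.lstsq(sensors_train, states_train, rcond=None)
pred_test = sensors_test @ W

# Per-dimension R^2 on held-out episodes, averaged over all state dimensions;
# R^2 = 0 corresponds to always predicting each variable's mean.
per_dim_r2 = r2_score(states_test, pred_test, multioutput="raw_values")
print("mean R^2 over state dims:", per_dim_r2.mean())
```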

Fig. 13 shows the quality of the state estimation versus the reward for each design (we use 1×1 photoreceptors with $K \in \{2, 4\}$). First, we find that $R^2$ is greater than zero, which means that it is possible to extract more information about the state than its overall mean value (since $R^2 = 0$ corresponds to predicting the mean).

We also find evidence of a correlation between the quality of the state regression and the performance of the agent with the corresponding design. This suggests that state-regression quality can serve as a proxy objective for design optimization. This is useful because the proxy defines a supervised learning task, on which design optimization can be easier to perform than directly optimizing performance on the active reinforcement learning task.

Appendix B Analysis Experiments and Ablations

B.1 Does the task affect the computationally obtained design?

Fig. 14 shows that, even with the same initial random design, the proposed design optimization method converges to different designs for different tasks, namely PointGoalNav and TargetNav.

B.2 Design optimisation method (computational design) uses available sensors well

We run a design ablation to show that the proposed design optimization method optimizes the placement of sensors to maximize performance. We choose the simplest setting of K=2 grids of 4×4 photoreceptors in the PointGoalNav setting. From the computational design with K=2, we create two K=1 designs by picking one of the two sensor grids of the computational design. Fig. 15 compares the original computational design against the ablated designs: neither of the two sensor grids alone performs well, but together they boost performance significantly, showing that the design optimization utilizes the placement of the additional sensor grid effectively.

[Figure 15]

B.3 Photoreceptor-based agents can do Target Detection

In the TargetNav task, to confirm that photoreceptors can perform target detection, i.e., identify the green sphere and move towards it, we test the behavior and performance of the trained PR agent with a transparent sphere as the target instead of the green one. The two settings are compared in Fig. 16. Fig. 17 shows trajectory visualizations comparing the two settings in otherwise identical episodes. Initially, the PR agent follows the same trajectory in both cases. In the episode with the green target, the PR agent recognizes it and moves towards it, while in the episode with the transparent target, the agent does not see it (as expected) and continues searching. For a quantitative comparison, and to demonstrate that the PR agent succeeds by detecting the target and moving towards it rather than only by efficient exploration, we compare the agent's success rate in both target settings. Tab. 2 shows that the PR agent is indeed performing target detection, as reflected in the much higher success rate with the visible target.

[Figure 16]
[Figure 17]
[Figure 18]
Tab. 2: Success rate of the PR agent with a green vs. a transparent target sphere.

Setting                   Success Rate
With green sphere         0.314
With transparent sphere   0.132

B.4 Comparing importance of the different design space variables

We run additional experiments to compare the importance of the different design variables in the design space defined in the main manuscript, i.e., $[x_i, y_i, z_i, \mathrm{yaw}_i, \mathrm{pitch}_i, \mathrm{fov}_i]$. To measure the importance of a specific design variable, e.g., $x_i$, we start from the computational design and, for all PR grids, set all design variables except $x_i$ back to their initial (pre-optimization) values. This comparison between different design axes is shown in Fig. 18 for K=2 grids of 8×8 PRs in PointGoalNav; it shows that the height $y_i$ and the pitch $\mathrm{pitch}_i$ are the most important design variables. We also show visualizations of each of these design changes in Fig. 20.
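As a concrete illustration of this ablation, the sketch below keeps a single optimized variable and resets the rest; the arrays and the six-variable layout are illustrative stand-ins for the actual design representation, not the paper's code.

```python
import numpy as np

# Per-grid design variables, in the order listed above.
VARS = ["x", "y", "z", "yaw", "pitch", "fov"]

def keep_single_variable(computational, initial, keep):
    """Reset all design variables except `keep` to their pre-optimization values.

    computational, initial: (K, 6) arrays of per-grid design parameters.
    """
    design = initial.copy()
    idx = VARS.index(keep)
    design[:, idx] = computational[:, idx]  # only the kept variable stays optimized
    return design

K = 2
initial = np.zeros((K, len(VARS)))                              # pre-optimization values
computational = np.random.default_rng(0).normal(size=(K, len(VARS)))
height_only = keep_single_variable(computational, initial, keep="y")
```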

[Figure 19]

B.5 Distorting the computational design to analyse the success of design optimisation

To probe the design optimization landscape, we create designs by interpolating between the computational and initial designs using the following formula:

$$\theta_{\text{interpolated}} = (1 - \alpha)\,\theta_{\text{Computational}} + \alpha\,\theta_{\text{Initial}} \qquad (2)$$

We choose exponentially increasing distances from the computational design for the interpolation, i.e., $\alpha \in \{0.05, 0.1, 0.2, 0.4, 0.8\}$, and train control policies for the obtained designs. The performance obtained for each such design is shown in Fig. 19.
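A minimal sketch of Eq. (2) follows; the design vectors here are illustrative stand-ins for the actual design parameters.

```python
import numpy as np

# Interpolate between the computational and initial designs, per Eq. (2).
def interpolate_design(theta_comp, theta_init, alpha):
    return (1.0 - alpha) * theta_comp + alpha * theta_init

theta_comp = np.array([0.5, -0.2, 0.1])      # stand-in computational design
theta_init = np.zeros_like(theta_comp)       # stand-in initial design
designs = {a: interpolate_design(theta_comp, theta_init, a)
           for a in (0.05, 0.1, 0.2, 0.4, 0.8)}   # exponentially increasing distance
```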

[Figure 20]
[Figure 21]

Appendix C Design Visualizations

In this section, we provide additional visualizations of the designs obtained through computational optimization, the intuitive-design survey, and random sampling, together with their respective performance on the corresponding task.

C.1 Computational vs Random Design Visualisations

In Fig. 21 and Fig. 22, we show the initial random designs and the corresponding computational designs obtained using the proposed design optimization method for DeepMind Control; Fig. 23 shows the same for the PointGoalNav task. The figures also show the improved reward achieved by the computational designs.

[Figure 22]
[Figure 23]
[Figure 24]

C.2 Intuitive Design Visualisations

Fig. 24 shows visualizations of some of the intuitive designs collected using the survey described in Appendix F, together with their performance on the TargetNav task. It shows that the variance in the performance of intuitive designs is high as well.

[Figure 25]

Appendix D Additional Results for Continuous Control Tasks using the Grids of 4x4 Photoreceptors

In addition to the results in Fig. 4 for 1×1 photoreceptors, we explore whether using a grid of 4×4 photoreceptors further improves the performance of the PR agents. Fig. 25 presents the results on the four most difficult tasks (i.e., those where neither agent achieved performance close to the optimal reward of 1000).

[Figure 26]

Appendix E Experimental Details

E.1 PointGoal Navigation Setting

In PointGoalNav, the agent is randomly initialized in an environment and asked to navigate to a target point specified relative to the start state. The episode ends when the agent calls the stop action; the agent succeeds if it stops within a 0.2-meter radius of the target point.

Observation Space. The agent has access to an idealized GPS+Compass input that provides its current position and rotation relative to the starting state. It also receives the relative position of the target point. In addition, the agent observes egocentric RGB views through its photoreceptors.

Action Space. The agent can execute 4 actions: move_forward (0.25 m), turn_left (30°), turn_right (30°), and stop.

Reward. At every timestep $t$, the agent at state $s_t$ has geodesic (shortest-path) distance $d_t$ to the target. It applies an action $a_t$ and transitions to the next state $s_{t+1}$, whose geodesic distance to the goal is $d_{t+1}$. It receives a reward $r_t$ of the form

$$r_t = \begin{cases} 2.5 \cdot \text{Success} & \text{if } a_t \text{ is } \texttt{stop} \\ d_t - d_{t+1} - c^{\text{slack}} & \text{otherwise} \end{cases}$$

where $d_t - d_{t+1}$ is a dense reward for progressing towards the target position and $c^{\text{slack}} = 0.003$ is a slack penalty encouraging shorter episodes.
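The reward is simple enough to state directly in code; the following sketch mirrors the equation above, with function and variable names of our choosing.

```python
# Minimal sketch of the PointGoalNav reward; constants follow the text above.
C_SLACK = 0.003
SUCCESS_BONUS = 2.5

def step_reward(action, d_t, d_t_next, success):
    """d_t, d_t_next: geodesic distances to the goal before/after the action."""
    if action == "stop":
        return SUCCESS_BONUS * float(success)   # terminal reward on stop
    return d_t - d_t_next - C_SLACK             # dense progress reward minus slack
```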

Dataset. In PointGoalNav, we use the Matterport3D dataset [4] for training and testing. The training data (train split) contains 61 scenes with around 80k episodes per scene. The testing data (test split) contains 18 scenes unseen during training with 56 episodes per scene (1008 episodes in total).

Training Process. We train our navigation agents with Proximal Policy Optimization (PPO) [44], which optimizes the objective

$$L(\omega) = \mathbb{E}\left[\min\left(r_t(\omega)\hat{A}_t,\ \mathrm{clip}(r_t(\omega), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

where $\omega$ parameterizes the control policy, $r_t(\omega)$ is the probability ratio between the current policy and the rollout policy, and $\hat{A}_t$ is the estimate of the advantage function from Generalized Advantage Estimation (GAE) [45]. In practice, we adopt the training mechanism of Decentralized Distributed PPO (DD-PPO) [65] to accelerate training. We train the agent on 4 A100 or V100 GPUs, with each GPU running 30 parallel environments that collect 64 steps of experience (simulation steps) per environment. We call one such collection a rollout collection step ($\approx$7.6k simulation steps). With the collected rollouts, we perform 4 epochs of PPO updates with 1 mini-batch per epoch. We use the Adam [22] optimizer with an initial learning rate of $2.5 \times 10^{-4}$. We set the clipping parameter $\epsilon$ to 0.2, the discount factor $\gamma$ to 0.99, and the GAE hyperparameter $\lambda$ to 0.95. We train the agent for around 230 million (M) simulation steps.
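For concreteness, a minimal sketch of the clipped objective $L(\omega)$ above, written as a loss for a gradient-descent optimizer; the tensor names are ours.

```python
import torch

# Sketch of the clipped PPO objective; since L(omega) is maximized,
# we return its negation for use with a gradient-descent optimizer.
def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(log_probs - old_log_probs)                 # r_t(omega)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```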

Evaluation Process. We evaluate our agent with the Success weighted by Path Length (SPL) metric [1]. An episode counts as a success only when the agent takes the stop action within 0.2 meters of the goal position within 500 steps. The reported SPL is the average across all episodes in the test split.
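For reference, SPL as defined in [1] averages per-episode success weighted by path efficiency:

$$\mathrm{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i\,\frac{l_i}{\max(p_i,\, l_i)}$$

where $N$ is the number of episodes, $S_i$ is the binary success indicator, $l_i$ is the shortest-path distance from start to goal, and $p_i$ is the length of the path the agent actually took.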

E.2 Target Navigation Setting

In TargetNav, the agent is spawned at a random ground location in an environment and must navigate to a target green sphere placed randomly in the scene. The radius of the target sphere is 0.5 m. The agent succeeds in the episode if it enters a circle of 0.8-meter radius around the target sphere's center.

Observation Space. The agent receives its current position and rotation relative to the starting point and orientation from an idealized GPS+Compass sensor. In addition, the agent observes egocentric RGB views through its photoreceptors. Compared to PointGoalNav, the agent does not receive the target position. TargetNav thus focuses more on evaluating the agent's ability to explore the environment.

Action Space. The agent has access to 3 actions: move_forward (0.25 m), turn_left (30°), and turn_right (30°).

Dataset. In TargetNav, we construct our training and testing data from the PointGoalNav dataset in Matterport3D scenes [4]. We randomly sample 10 scenes from the train split of the PointGoalNav setting and use the same 18 test scenes as in the PointGoalNav test split. We use all episodes of the PointGoalNav dataset for the chosen scenes. For each episode, we add the green sphere at a height of 1.5 meters above the ground at the goal position.

Reward and Training Process. TargetNav uses the same reward design and training process as the PointGoalNav setting described in Sec. E.1.

Evaluation Process. In TargetNav, we evaluate our agent based on Success. An episode is successful only when the agent enters the circle of 0.8-meter radius around the target sphere's center within 1500 steps. We do not require the agent to call the stop action because we want to focus more on evaluating the PR agent's exploration ability using its onboard PRs. The reported Success is the average across all episodes in the testing data.

E.3 Continuous Control in DeepMind Control Suite

We use six continuous control tasks from the DeepMind Control Suite [57], drawn from the following three domains:

  • Reacher. We use the Hard difficulty level, which requires controlling the two-link actuator to reach the target ball with its tip.

  • Walker requires controlling a planar walker. The Stand task requires keeping the torso upright at some minimal height. The Walk and Run tasks additionally require a specific forward velocity.

  • Finger requires controlling a simple manipulator to manipulate an unactuated spinner. In the Spin task, the manipulator must spin the spinner at a specific angular velocity. In the Turn Easy task, one tip of the spinner must be aligned with a target position specified visually.

We refer the reader to the original work [57] for a more detailed description of the action spaces and reward definitions. For the Reacher: Hard task, we added another green target object inside the original one. We do this because the MuJoCo renderer does not render the target ball (or any object) when the camera is inside it, making it impossible to realize that the camera is inside the target ball. The smaller object inside the target gets rendered even when the camera is inside the target and thus provides a visual cue for success. For all tasks, following common practice, we repeat the same action twice (four times for Finger: Turn Easy).

Observation Space. The agent only receives egocentric views from its onboard photoreceptors. Since the visual observation does not provide full information about the state (e.g., velocities), we follow the standard practice of stacking three consecutive frames and using them as input to the control policy.
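A minimal sketch of this frame stacking is shown below; shapes and padding behavior are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Stack the last three frames along the channel axis (shapes illustrative:
# a 4x4 PR grid with RGB channels).
def stack_frames(history, n=3):
    """history: list of per-step PR readings, most recent last; pads at episode start."""
    padded = [history[0]] * max(0, n - len(history)) + history
    return np.concatenate(padded[-n:], axis=-1)   # channel-stacked observation

obs = stack_frames([np.zeros((4, 4, 3))])         # first step: frame repeated 3 times
assert obs.shape == (4, 4, 9)
```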

Training Process. To maintain consistency with the navigation experiments, we use the PPO [44] learning algorithm with the following hyperparameters. We use $\gamma = 0.99$ for reward discounting, GAE $\lambda = 0.95$, and $\epsilon = 0.2$ for the PPO clipping loss. We train the control policies using half of a V100 or A100 GPU. During training, we run 10 parallel environments, each collecting 10000 steps of experience per rollout; each rollout collection step is thus equivalent to 0.1M simulation steps. We split the collected rollouts into mini-batches of 1000 and perform 4 epochs of PPO updates. The specific number of training steps for each task is shown in Tab. 3. We use the Adam [22] optimizer with a learning rate of $10^{-4}$.

Tab. 3: Training duration for each task (design-specific training and design optimization).

Task                Design-specific training   Design optimization
Reacher: Hard       100                        150
Walker: Stand       200                        200
Walker: Walk        200                        600
Walker: Run         200                        600
Finger: Spin        200                        800
Finger: Turn Easy   300                        800

E.4 Design Optimization

Navigation Tasks. In PointGoalNav and TargetNav, we use a Gaussian distribution as the design policy $\pi_\phi(\theta) = \mathcal{N}(\theta \mid \mu, \mathrm{diag}(\sigma))$, where $\phi = (\mu, \sigma)$ with $\mu, \sigma \in \mathbb{R}^{K \times 7}$ being the mean and standard deviation, $\theta$ is the design parameter, and $K$ is the number of PRs. We initialize the mean to the zero vector, $\mu = \mathbf{0}^{K \times 7}$, and set the initial standard deviation to 0.2, i.e., $\sigma = 0.2 \cdot \mathbf{1}^{K \times 7}$. We separate the design optimization into two stages: a Frozen Stage and an Update Stage.
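A minimal sketch of this design policy, assuming the initialization above; the sampling routine and variable names are ours.

```python
import numpy as np

# Gaussian design policy: 7 design variables per PR grid, K grids.
K = 2
mu = np.zeros((K, 7))             # initial mean: zero vector
sigma = 0.2 * np.ones((K, 7))     # initial standard deviation: 0.2

def sample_design(rng):
    """Draw theta ~ N(mu, diag(sigma)) at the start of an episode."""
    return mu + sigma * rng.standard_normal((K, 7))

theta = sample_design(np.random.default_rng(0))
```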

Frozen Stage: In this phase, the design policy is "frozen", and we only train the control policy to act as a local generalist. At the beginning of each episode, the design parameter $\theta$ is sampled from the frozen design policy, $\theta \sim \pi_\phi(\cdot)$, thereby altering the robot design. As outlined in Section 5.2 of the main paper, the local generalist policy is optimized to manage control within a specific range of design parameters centered around the mean $\mu$. The scope of this range is determined by the standard deviation $\sigma$; a larger $\sigma$ allows the policy to handle a wider variety of design parameters, while a smaller $\sigma$ limits it to a narrower range. During this stage, the control policy is trained for 20k rollout collection steps (153M simulation steps).

Update Stage: During this phase, the design policy and the control policy are trained in alternation: the design policy is updated every 100 rollout collection steps (6M simulation steps), following each update phase of the control policy lasting 400 rollout collection steps (3.1M simulation steps).

When updating the design policy, we maintain the control policy in a frozen state. The objective is to align the design policy with the distribution of returns across the design parameter space. Instead of using returns as the primary objective, which tends to favor longer episodes due to accumulated rewards, we adopt SoftSPL (Soft Success Weighted by Path Length) [10] as the objective function. SoftSPL balances episode efficiency and success more effectively by considering the minimum distance achieved to the target, thus providing a denser and smoother reward landscape for optimizing the design policy.

Conversely, when updating the control policy, we freeze the design policy. This approach ensures that the control policy adapts to manage the agent within the local parameters defined by the updated design policy. Each rollout consists of 64 steps to facilitate more frequent updates of the policies.

This dual updating strategy allows for comprehensive refinement of both the design and control policies, ensuring robust performance across various task scenarios.
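The alternation can be summarized with a hypothetical sketch; the two routines below are stand-ins for the actual PPO control update and design-policy update, and the cycle count is illustrative.

```python
# Alternating Update Stage schedule: while one policy trains, the other is frozen.
def train_control(steps):
    print(f"train control policy for {steps} rollout collection steps (design frozen)")

def update_design(steps):
    print(f"update design policy for {steps} rollout collection steps (control frozen)")

for cycle in range(3):            # alternate until the step budget is exhausted
    train_control(steps=400)
    update_design(steps=100)
```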

DeepMind Control Suite. We use the same Gaussian design policy as in the navigation setting. The total number of simulation steps dedicated to design optimization for each task is detailed in Tab. 3. During the Update Stage, we use the return as the objective function for both the control and design policies. To adapt the control policy to the changing design policy, we update the design policy every rollout collection step (0.1M simulation steps), following each training session of the control policy lasting 8 rollout collection steps (0.8M simulation steps).

[Figure 27]

E.5 Network Architecture

Fig. 26 shows the detailed architecture of the transformer encoder used in both settings, i.e., navigation and DMC. Each photoreceptor token consists of the RGB triple, a position embedding based on the photoreceptor's position within its grid, and the design parameters of that grid. These tokens form the encoder input to the control policy π. In the PointGoalNav and TargetNav navigation tasks, the policy is a 2-layer LSTM, while in DeepMind Control (DMC), we stack the last 3 frames' encodings as input to the policy.
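The token assembly can be sketched as follows. This is a minimal illustration assuming our own dimensions (a learned per-cell position embedding and a 7-parameter design vector per grid); it is not the paper's exact architecture.

```python
import torch

# Each token = RGB triple + grid-position embedding + the grid's design parameters.
K, H, W, D_POS, D_DESIGN = 2, 4, 4, 16, 7
pos_embed = torch.nn.Embedding(H * W, D_POS)      # learned per-cell position embedding

def make_tokens(rgb, design):
    """rgb: (K, H*W, 3) PR readings; design: (K, D_DESIGN) per-grid design params."""
    pos = pos_embed.weight.unsqueeze(0).expand(K, -1, -1)     # (K, H*W, D_POS)
    des = design.unsqueeze(1).expand(-1, H * W, -1)           # (K, H*W, D_DESIGN)
    return torch.cat([rgb, pos, des], dim=-1)                 # transformer input tokens

tokens = make_tokens(torch.rand(K, H * W, 3), torch.zeros(K, D_DESIGN))
assert tokens.shape == (K, H * W, 3 + D_POS + D_DESIGN)
```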

Appendix F Human Study for the Intuitive Designs

F.1 Visual Navigation Tasks

To compare the performance of our design optimization algorithm with that of human engineers, we designed and conducted a survey. The survey asked participants to optimize the position, orientation, and field of view (FoV) of visual sensors on a robot, aiming to enable it to complete the TargetNav task as quickly as possible.

[Figure 28]
[Figure 29]
[Figure 30]

The survey consists of three levels, each containing three questions, for a total of nine questions. Each level provides participants with progressively more specific context about the robot's environment. Each question focuses on optimizing the parameters of one or more visual sensors. To aid visualization, every question includes an interactive 3D render of the robot and its sensors. Fig. 27 illustrates examples of different questions and levels in the survey. The target across all questions and levels is a green ball with a radius of 0.5 meters, positioned 1.5 meters above the ground. Questions in each level:

  1. Optimize the 3-dimensional position, pitch, yaw, and FoV of two 1×1 photoreceptors.

  2. Optimize the 3-dimensional position, pitch, yaw, and FoV of four 1×1 photoreceptors.

Environment context given in each of the three levels:

  1. Environment-independent context: The target is hovering 1.5 meters above the ground at a random location. Neither the ball nor the environment is shown in the 3D visualization viewport.

  2. Environment-dependent context: In addition to the context from the previous level, the robot is now rendered within a specific example environment: a true-to-scale mesh of a home environment as well as a mesh of the target 1.5 meters above the floor. The home environment is a simplified stand-in for a Matterport3D mesh. A visualization of this level's environment is given in Fig. 28.

  3. Change in photoreceptor resolution: In addition to the context from the previous two levels, the participant is informed that each photoreceptor's design will be used for a grid of 4×4 PRs instead of a single PR of resolution 1×1. This level is "optional"; if participants believe that this change will not affect their design from previous levels, they can choose to skip it.

Information about the robot given in all levels:

  1. At every step, the robot can take one of three possible actions: move forward 0.25 meters, turn left 30 degrees, or turn right 30 degrees.

  2. The robot is controlled by a reinforcement learning (RL) trained policy that uses the visual output from the robot's sensors to navigate to the target. The RL policy rewards the robot for navigating to the ball with the least distance traveled. The policy has memory of the robot's past actions and visual inputs through an LSTM. Familiarity with RL is not required to complete the survey.

  3. The robot's forward direction is along the positive Z axis.

  4. The robot has a height of 2.5 meters, which is also the maximum height for sensor placement.

F.2 Continuous Control Tasks from the DMC Suite

The DMC benchmark uses the MuJoCo simulator [58], which is challenging to deploy within a browser due to the specialized format used to define scenes and the agent's morphology. Therefore, we ask participants to sketch their placement design on a rendered image depicting the agent's morphology and the environment. Given a sketch, we implement the design inside the simulator, show it to the participant, and update it based on their feedback until convergence, i.e., until the participant agrees that the design corresponds to their intended placement.

Due to the demanding nature of this process, we collect designs for only two continuous control tasks within one domain, from six participants. Each participant provides one design used for both tasks, which share the Walker domain. We provide a description of these two tasks similar to that in the original paper introducing the DMC benchmark [57].
