Image/video models can transfer knowledge from Internet data to robot agents by generating goal images. But what happens when the generated goal images contain harmful visual artifacts? We present GHIL-Glue, a method to align image/video models with low-level policies.
ghil-glue.github.io