A wide range of communicative artifacts, perhaps the majority, involve the coordinated presentation of visual and linguistic information. We envisage computer systems that support access to information through rich representations of the interpretation of such multimodal presentations. This paper advocates organizing these representations in terms of coherence relations [2, 19], a fundamental construct from the theory of natural language discourse that is often invoked to explain the integrated interpretation of the diverse communicative actions in face-to-face conversation [9, 25, 35]. Coherence relations come in constrained classes, such as the Explanation, Narration, and Parallel relations, each of which establishes specific structural, logical, and intentional relationships among communicative actions. Representing these relationships can therefore provide a scaffold for organizing, disambiguating, and integrating the interpretation of communication across modalities. This paper uses a case study of instructions presented in text and pictures to motivate and describe an analysis of multimodal discourse interpretation in terms of coherence relations, and to sketch a roadmap for operationalizing the approach in computer systems.