Recent years have seen a significant increase in the number of sensors and the event-related data they produce, enabling better monitoring and understanding of real-world events and situations. Event-related data come not only from physical sensors (e.g., CCTV cameras, webcams) but also from social or microblogging platforms (e.g., Twitter). Given the widespread availability of sensors, we observe that sensors of different modalities often independently observe the same events. We argue that fusing multimodal data about an event can enable more accurate detection, localization, and detailed description of events of interest. However, multimodal data often involve noisy observations, varying information densities, and heterogeneous representations, which makes fusion a challenging task. In this paper, we propose a hybrid fusion approach that takes into account the spatial and semantic characteristics of sensor signals about events. To this end, we first adopt an image-based representation, called a Cmage, that expresses the state of particular visual concepts (e.g., "crowdedness", "people marching") for both physical and social sensor data. Based on this Cmage representation, we model sparse sensor information with a Gaussian process, fuse multimodal event signals using a Bayesian approach, and incorporate spatial relations between the physical-sensor and social observations. We demonstrate the effectiveness of our approach as a proof of concept on real-world data. Our early results show that the proposed approach can reliably reduce sensor-related noise, localize the event, improve event detection reliability, and add semantic context, so that the fused data provide a better picture of the observed events.
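The two modeling ingredients mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a standard Gaussian-process regression with an RBF kernel to interpolate sparse 1-D sensor readings, and a precision-weighted product of Gaussians as the Bayesian fusion step. All function names and parameters here are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, var=1.0):
    # Squared-exponential kernel between two 1-D location arrays.
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=0.1):
    # Standard GP regression: posterior mean and variance at query
    # locations, given sparse noisy observations (x_train, y_train).
    K = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    Ks = rbf_kernel(x_query, x_train)
    Kss = rbf_kernel(x_query, x_query)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    var = np.clip(np.diag(cov), 1e-9, None)  # guard tiny negatives
    return mean, var

def bayes_fuse(mu1, var1, mu2, var2):
    # Fuse two independent Gaussian estimates of the same quantity
    # (e.g., a physical-sensor signal and a social signal) by
    # precision weighting; the fused variance is always smaller.
    prec = 1.0 / var1 + 1.0 / var2
    return (mu1 / var1 + mu2 / var2) / prec, 1.0 / prec
```

In this setting, the GP fills in a dense spatial field from sparse sensor observations (with per-location uncertainty), and the fusion step down-weights whichever modality is noisier at each location.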