An expression can be approximated by a sequence of temporal segments called neutral, onset, apex, and offset. However, accurately detecting these temporal segments from facial features alone is difficult. Some researchers have attempted to segment expression phases with the help of body gesture analysis, but this approach suffers from the fact that the expression temporal phases observed in the face and gesture channels are not synchronized. Additionally, most previous work relies on facial key-point tracking or body tracking to extract motion information, which is unreliable in practice under illumination variations and occlusions. In this paper, we present a novel algorithm that overcomes these issues by introducing two simple and robust features to describe face and gesture information: the motion area feature and the neutral divergence feature. Neither feature depends on motion tracking, and both are inexpensive to compute. Moreover, unlike previous work, we integrate face and body gesture jointly when modeling the temporal dynamics, using a single sensory channel, which avoids the synchronization problem between the face and gesture channels. Extensive experimental results demonstrate the effectiveness of the proposed algorithm.