Sensing activities at the city scale using big data can enable applications that improve citizens' quality of life. While there are approaches to sense the urban heartbeat using sound, vision, radio frequency (RF), and other sensors, capturing changes at urban scale with such sensing modalities is challenging: the enormous volume of data they produce, together with the associated annotation and processing requirements, limits their usefulness. In this paper, we present a vision-to-language modeling approach to capture patterns and transitions that occur in New York City from March 2020 to August 2020. We apply the model to ∼1 million street images captured by dashcams over 6 months. We then use the resulting captions to train a language model based on Latent Dirichlet Allocation and compare models from different periods using probabilistic distance measures. We observe distribution shifts in the model that correlate well with social distancing policies and are corroborated by independent data sources, such as mobility traces. This language-based sensing introduces a new modality for capturing city dynamics with lower storage requirements and fewer privacy concerns.
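The pipeline sketched in the abstract (caption text → LDA topic model → probabilistic distance between periods) can be illustrated with a minimal example. This is a hedged sketch, not the paper's implementation: the toy captions are invented stand-ins for model-generated street-image captions, and the Jensen–Shannon distance is used as one plausible probabilistic distance measure, since the abstract does not name the specific one.

```python
# Sketch: fit an LDA model on captions from two periods and compare
# their mean document-topic distributions with a probabilistic
# distance (Jensen-Shannon distance, used here as an illustrative
# choice; the paper's exact measure may differ).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon

# Toy stand-ins for auto-generated street-image captions.
captions_march = [
    "crowded street with many pedestrians and cars",
    "people walking near shops and heavy traffic",
    "busy intersection with buses and taxis",
]
captions_april = [
    "empty street with closed shops",
    "deserted intersection with no traffic",
    "quiet road with few cars",
]

def topic_distribution(captions, vectorizer, lda):
    """Mean document-topic distribution over a set of captions."""
    X = vectorizer.transform(captions)
    return lda.transform(X).mean(axis=0)

# Shared vocabulary and a single LDA model fit on all captions,
# so the two periods' topic distributions are directly comparable.
vectorizer = CountVectorizer()
X_all = vectorizer.fit_transform(captions_march + captions_april)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_all)

p = topic_distribution(captions_march, vectorizer, lda)
q = topic_distribution(captions_april, vectorizer, lda)
print(jensenshannon(p, q))  # larger values indicate a bigger shift
```

On real data, one model per time window (rather than a single shared fit) and a distance between the learned topic-word distributions would be a natural alternative design.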