StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks

Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitri Metaxas

Research output: Chapter in Book/Report/Conference proceedingConference contribution

255 Citations (Scopus)

Abstract

Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing textto- image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256.256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.

Original languageEnglish (US)
Title of host publicationProceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages5908-5916
Number of pages9
ISBN (Electronic)9781538610329
DOIs
StatePublished - Dec 22 2017
Event16th IEEE International Conference on Computer Vision, ICCV 2017 - Venice, Italy
Duration: Oct 22 2017Oct 29 2017

Publication series

NameProceedings of the IEEE International Conference on Computer Vision
Volume2017-October
ISSN (Print)1550-5499

Other

Other16th IEEE International Conference on Computer Vision, ICCV 2017
CountryItaly
CityVenice
Period10/22/1710/29/17

Fingerprint

Image resolution
Image quality
Computer vision
Color
Defects
Experiments

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition

Cite this

Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. (2017). StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017 (pp. 5908-5916). [8237891] (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2017-October). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICCV.2017.629
Zhang, Han ; Xu, Tao ; Li, Hongsheng ; Zhang, Shaoting ; Wang, Xiaogang ; Huang, Xiaolei ; Metaxas, Dimitri. / StackGAN : Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 5908-5916 (Proceedings of the IEEE International Conference on Computer Vision).
@inproceedings{d13476626ad74266bb67d67d5082cd2a,
title = "StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks",
abstract = "Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing textto- image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256.256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.",
author = "Han Zhang and Tao Xu and Hongsheng Li and Shaoting Zhang and Xiaogang Wang and Xiaolei Huang and Dimitri Metaxas",
year = "2017",
month = "12",
day = "22",
doi = "10.1109/ICCV.2017.629",
language = "English (US)",
series = "Proceedings of the IEEE International Conference on Computer Vision",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "5908--5916",
booktitle = "Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017",
address = "United States",

}

Zhang, H, Xu, T, Li, H, Zhang, S, Wang, X, Huang, X & Metaxas, D 2017, StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. in Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017., 8237891, Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-October, Institute of Electrical and Electronics Engineers Inc., pp. 5908-5916, 16th IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, 10/22/17. https://doi.org/10.1109/ICCV.2017.629

StackGAN : Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. / Zhang, Han; Xu, Tao; Li, Hongsheng; Zhang, Shaoting; Wang, Xiaogang; Huang, Xiaolei; Metaxas, Dimitri.

Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017. Institute of Electrical and Electronics Engineers Inc., 2017. p. 5908-5916 8237891 (Proceedings of the IEEE International Conference on Computer Vision; Vol. 2017-October).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - StackGAN

T2 - Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks

AU - Zhang, Han

AU - Xu, Tao

AU - Li, Hongsheng

AU - Zhang, Shaoting

AU - Wang, Xiaogang

AU - Huang, Xiaolei

AU - Metaxas, Dimitri

PY - 2017/12/22

Y1 - 2017/12/22

N2 - Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing textto- image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256.256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.

AB - Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing textto- image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256.256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.

UR - http://www.scopus.com/inward/record.url?scp=85040306596&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85040306596&partnerID=8YFLogxK

U2 - 10.1109/ICCV.2017.629

DO - 10.1109/ICCV.2017.629

M3 - Conference contribution

AN - SCOPUS:85040306596

T3 - Proceedings of the IEEE International Conference on Computer Vision

SP - 5908

EP - 5916

BT - Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X et al. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017. Institute of Electrical and Electronics Engineers Inc. 2017. p. 5908-5916. 8237891. (Proceedings of the IEEE International Conference on Computer Vision). https://doi.org/10.1109/ICCV.2017.629