说明: Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a <weixin_39566101> 上传 | 大小:7mb
说明: We present a model that generates natural language descr iptions of images and their regions. Our approach leverages datasets of images and their sentence descr iptions to learn about the inter-modal correspondences between language and visual data. <weixin_39566101> 上传 | 大小:5mb