Automatically learning the semantics of images from visual features alone, in order to bridge the semantic gap, has been an area of active research. Recently, visual keywords extracted from images have been shown to provide a useful intermediate representation for image characterization and retrieval. A challenging problem is to find effective ways of extracting, representing and using the context of visual keywords for learning image semantics. In this paper, we present a number of kernel and spectral methods that our research group has developed for learning the semantics of images and that can be applied to a variety of image annotation, categorization and retrieval tasks. To capture the context of visual keywords, we propose two contextual kernels, called the spatial Markov kernel and the spatial mismatch kernel. The first kernel is defined based on Markov models, while the second is motivated by the concept of string kernels and is derived without the use of any generative model. The experimental results show that the context captured by our kernels is very effective for learning the semantics of images. Moreover, to learn a semantically compact (or high-level) vocabulary, we further propose a spectral embedding method to capture the local intrinsic geometric (i.e. manifold) structure of the original abundant visual keywords. This spectral method can also be applied to manifold learning on textual keywords for image annotation refinement. The experimental results show that our spectral methods lead to significant improvement in performance by capturing the manifold structure of visual or textual keywords. © 2011 Springer Science+Business Media B.V.