The human visual system excels at biasing the stereoscopic visual signals by the attention mechanisms. Traditional methods relying on the low-level features and depth relevant information for stereoscopic video saliency prediction have fundamental limitations. For example, it is cumbersome to model the interactions between multiple visual cues including spatial, temporal, and depth information as a result of the sophistication. In this paper, we argue that the high-level features are crucial and resort to the deep learning framework to learn the saliency map of stereoscopic videos. Driven by spatio-temporal coherence from consecutive frames, the model first imitates the mechanism of saliency by taking advantage of the 3D convolutional neural network. Subsequently, the saliency originated from the intrinsic depth is derived based on the correlations between left and right views in a data-driven manner. Finally, a Convolutional Long Short-Term Memory (Conv-LSTM) based fusion network is developed to model the instantaneous interactions between spatio-temporal and depth attributes, such that the ultimate stereoscopic saliency maps over time are produced. Moreover, we establish a new large-scale stereoscopic video saliency dataset (SVS) including 175 stereoscopic video sequences and their fixation density annotations, aiming to comprehensively study the intrinsic attributes for stereoscopic video saliency detection. Extensive experiments show that our proposed model can achieve superior performance compared to the state-of-the-art methods on the newly built dataset for stereoscopic videos.