Sequential recommendation aims to offer potentially interesting products to users by capturing their historical sequence of interacted items. Although it has facilitated extensive physical scenarios, sequential recommendation for multi-modal sequences has long been neglected. Multi-modal data that depicts a user's historical interactions exists ubiquitously, such as product pictures, textual descriptions, and interacted item sequences, providing semantic information from multiple perspectives that comprehensively describe a user's preferences. However, existing sequential recommendation methods either fail to directly handle multi-modality or suffer from high computational complexity. To address this, we propose a novel Multi-Modal Multi-Layer Perceptron (MMMLP) for maintaining multi-modal sequences for sequential recommendation. MMMLP is a purely MLP-based architecture that consists of three modules - the Feature Mixer Layer, Fusion Mixer Layer, and Prediction Layer - and has an edge on both efficacy and efficiency. Extensive experiments show that MMMLP achieves state-of-the-art performance with linear complexity. We also conduct ablating analysis to verify the contribution of each component. Furthermore, compatible experiments are devised, and the results show that the multi-modal representation learned by our proposed model generally benefits other recommendation models, emphasizing our model's ability to handle multi-modal information. We have made our code available online to ease reproducibility1. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.