How do AI chatbots integrate multimodal inputs like text, voice, and images?