#7508691509059799303
ikTok's business scenarios. It matches the performance of the best open-source model Qwen VL on public test sets, while significantly outperforming all other foundation models on TikTok's business test sets. In the future, we aim to continuously develop foundation models with efficient perception and reasoning capabilities, capable of handling multilingual and massive video content understanding algorithms to deliver a better content consumption experience for users.
Project Challenges:
Enhance the multimodal perception encoder: The current encoder uses a fixed frame rate. We need to explore more efficient adaptive frame rates while considering the integration of modalities such as audio and user behavior.
How to fuse multimodal perception and thinking capabilities to promote stronger comprehensive perception and cognitive abilities of the model.
Qualifications