| **MedDialog: Two large-scale medical dialogue datasets** X He, S Chen, Z Ju, X Dong, H Fang, S Wang, Y Yang, J Zeng, R Zhang, ... *arXiv preprint arXiv:2004.03329, 2020* | 370* | 2020 |
| **NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers** K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin, S Zhao, J Bian *arXiv preprint arXiv:2304.09116, 2023* | 363 | 2023 |
| **NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models** Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang, Y Liu, Y Leng, K Song, ... *arXiv preprint arXiv:2403.03100, 2024* | 341 | 2024 |
| **MusicBERT: Symbolic music understanding with large-scale pre-training** M Zeng, X Tan, R Wang, Z Ju, T Qin, TY Liu *arXiv preprint arXiv:2106.05630, 2021* | 220 | 2021 |
| **AUDIT: Audio editing by following instructions with latent diffusion models** Y Wang, Z Ju, X Tan, L He, Z Wu, J Bian *Advances in Neural Information Processing Systems 36, 71340-71357, 2023* | 114 | 2023 |
| **Kimi-Audio technical report** D Ding, Z Ju, Y Leng, S Liu, T Liu, Z Shang, K Shen, W Song, X Tan, ... *arXiv preprint arXiv:2504.18425, 2025* | 113 | 2025 |
| **PromptTTS 2: Describing and generating voices with text prompt** Y Leng, Z Guo, K Shen, X Tan, Z Ju, Y Liu, Y Liu, D Yang, L Zhang, ... *arXiv preprint arXiv:2309.02285, 2023* | 67 | 2023 |
| **On the generation of medical dialogs for COVID-19** M Zhou, Z Li, B Tan, G Zeng, W Yang, X He, Z Ju, S Chakravorty, S Chen, ... *Proceedings of the 59th Annual Meeting of the Association for Computational …, 2021* | 65* | 2021 |
| **TeleMelody: Lyric-to-melody generation with a template-based two-stage method** Z Ju, P Lu, X Tan, R Wang, C Zhang, S Wu, K Zhang, X Li, T Qin, TY Liu *arXiv preprint arXiv:2109.09617, 2021* | 57 | 2021 |
| **RALL-E: Robust codec language modeling with chain-of-thought prompting for text-to-speech synthesis** D Xin, X Tan, K Shen, Z Ju, D Yang, Y Wang, S Takamichi, H Saruwatari, ... *arXiv preprint arXiv:2404.03204, 2024* | 45 | 2024 |
| **FlashSpeech: Efficient zero-shot speech synthesis** Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun, J Pan, W Bian, S He, W Xue, ... *Proceedings of the 32nd ACM International Conference on Multimedia, 6998-7007, 2024* | 34 | 2024 |
| **MoonCast: High-quality zero-shot podcast generation** Z Ju, D Yang, J Yu, K Shen, Y Leng, Z Wang, X Tan, X Zhou, T Qin, X Li *arXiv preprint arXiv:2503.14345, 2025* | 17 | 2025 |
| **ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling** D Yang, S Liu, H Guo, J Zhao, Y Wang, H Wang, Z Ju, X Liu, X Chen, ... *arXiv preprint arXiv:2504.10344, 2025* | 11 | 2025 |
| **FreeAudio: Training-free timing planning for controllable long-form text-to-audio generation** Y Jiang, Z Chen, Z Ju, C Li, W Dou, J Zhu *Proceedings of the 33rd ACM International Conference on Multimedia, 9871-9880, 2025* | 5 | 2025 |
| **HeartMuLa: A Family of Open Sourced Music Foundation Models** D Yang, Y Xie, Y Yin, Z Wang, X Yi, G Zhu, X Weng, Z Xiong, Y Ma, ... *arXiv preprint arXiv:2601.10547, 2026* | | 2026 |
| **ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling** Y Jiang, Z Chen, Z Ju, Y Dai, W Dou, J Zhu *arXiv preprint arXiv:2510.08878, 2025* | | 2025 |