| Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research L Soldaini, R Kinney, A Bhagia, D Schwenk, D Atkinson, R Authur, ... ACL 2024, 2024 | 320 | 2024 |
| LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis Z Shen, R Zhang, M Dell, BCG Lee, J Carlson, W Li International Conference on Document Analysis and Recognition, 131-146, 2021 | 216 | 2021 |
| A Design Space for Intelligent and Interactive Writing Assistants M Lee, KI Gero, JJY Chung, SB Shum, V Raheja, H Shen, S Venugopalan, ... CHI 2024, 1-35, 2024 | 184* | 2024 |
| The Semantic Scholar Open Data Platform R Kinney, C Anastasiades, R Authur, I Beltagy, J Bragg, A Buraczynski, ... arXiv preprint arXiv:2301.10140, 2023 | 173 | 2023 |
| Multi-LexSum: Real-world summaries of civil rights lawsuits at multiple granularities Z Shen, K Lo, L Yu, N Dahlberg, M Schlanger, D Downey Advances in Neural Information Processing Systems 35, 13158-13173, 2022 | 83 | 2022 |
| A large dataset of historical Japanese documents with complex layouts Z Shen, K Zhang, M Dell Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2020 | 66 | 2020 |
| VILA: Improving structured content extraction from scientific PDFs using visual layout groups Z Shen, K Lo, LL Wang, B Kuehl, DS Weld, D Downey Transactions of the Association for Computational Linguistics 10, 376-392, 2022 | 63 | 2022 |
| Deep learning based framework for automatic damage detection in aircraft engine borescope inspection Z Shen, X Wan, F Ye, X Guan, S Liu 2019 International Conference on Computing, Networking and Communications …, 2019 | 60 | 2019 |
| Learning to Decode Collaboratively with Multiple Language Models SZ Shen, H Lang, B Wang, Y Kim, D Sontag ACL 2024, 2024 | 59 | 2024 |
| American Stories: A large-scale structured text dataset of historical US newspapers M Dell, J Carlson, T Bryan, E Silcock, A Arora, Z Shen, L D'Amico-Wong, ... Advances in Neural Information Processing Systems 36, 80744-80772, 2023 | 42 | 2023 |
| The Semantic Reader Project K Lo, JC Chang, A Head, J Bragg, AX Zhang, C Trier, C Anastasiades, ... Communications of the ACM, 2024 | 41* | 2024 |
| Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search D King*, Z Shen*, N Subramani, DS Weld, I Beltagy, D Downey arXiv preprint arXiv:2203.08436, 2022 | 40 | 2022 |
| PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents K Lo, Z Shen, B Newman, JZ Chang, R Authur, E Bransom, S Candra, ... EMNLP 2023: System Demonstrations (🏆 Best Paper Demo Award 🏆), 495-507, 2023 | 35 | 2023 |
| Beyond summarization: Designing AI support for real-world expository writing tasks Z Shen, T August, P Siangliulue, K Lo, J Bragg, J Hammerbacher, ... arXiv preprint arXiv:2304.02623, 2023 | 27 | 2023 |
| When one LLM drools, multi-LLM collaboration rules S Feng, W Ding, A Liu, Z Wang, W Shi, Y Wang, Z Shen, X Han, H Lang, ... arXiv preprint arXiv:2502.04506, 2025 | 26 | 2025 |
| SciRIFF: A resource to enhance language model instruction-following over scientific literature D Wadden, K Shi, J Morrison, A Naik, S Singh, N Barzilay, K Lo, T Hope, ... EMNLP 2025, 2024 | 25 | 2024 |
| A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models S Hegselmann, SZ Shen, F Gierse, M Agrawal, D Sontag, X Jiang CHIL 2024, 2024 | 25* | 2024 |
| PAWLS: PDF annotation with labels and structure M Neumann, Z Shen, S Skjonsberg arXiv preprint arXiv:2101.10281, 2021 | 25 | 2021 |
| Towards Verifiable Text Generation with Symbolic References LT Hennigen*, S Shen*, A Nrusimha, B Gapp, D Sontag, Y Kim COLM 2024, 2023 | 21 | 2023 |
| SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models YS Chuang, B Cohen-Wang, SZ Shen, Z Wu, H Xu, XV Lin, J Glass, SW Li, ... ICML 2025, 2025 | 20* | 2025 |