A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, materials science, biology, medicine, geoscience), covering different model sizes (from 100M to 100B parameters) and modalities (e.g., language, graph, vision, table, molecule, protein, genome, climate time series).
The repository is part of our survey paper A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery and will be continuously updated.
NOTE 1: To avoid ambiguity, when we talk about the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). Other numbers will be written explicitly.
NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint version (e.g., on arXiv or bioRxiv), its publication date follows the preprint server; otherwise, it follows the conference proceedings or journal.
NOTE 3: We appreciate contributions. If you have any suggested papers, feel free to reach out to yuz9@illinois.edu or submit a pull request. For format consistency, we will include a paper after (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.
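Most entries below point to Hugging Face checkpoints via their [Model (...)] links. As a minimal sketch of how such a checkpoint can typically be loaded (assuming the Hugging Face `transformers` library; the SciBERT ID `allenai/scibert_scivocab_uncased` is used purely as an example), encoder-style entries can be used as follows, while decoder-style entries would use `AutoModelForCausalLM` instead of `AutoModel`:

```python
# Minimal sketch: loading one of the listed checkpoints from Hugging Face.
# The checkpoint ID below (SciBERT) is only an example; substitute the ID
# from the corresponding [Model] link for other entries.
from transformers import AutoModel, AutoTokenizer

model_id = "allenai/scibert_scivocab_uncased"  # example: SciBERT (Base)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a sentence and inspect the contextual embeddings.
inputs = tokenizer(
    "Pre-trained language models accelerate scientific discovery.",
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, num_tokens, 768] for a Base-size encoder
```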
- General
- Mathematics
- Physics
- Chemistry and Materials Science
- Biology and Medicine
- Geography, Geology, and Environmental Science
## General

- (SciBERT) SciBERT: A Pretrained Language Model for Scientific Text
EMNLP 2019
[Paper] [GitHub] [Model (Base)] -
(SciGPT2) Explaining Relationships between Scientific Documents
ACL 2021
[Paper] [GitHub] [Model (117M)] -
(CATTS) TLDR: Extreme Summarization of Scientific Documents
EMNLP 2020 Findings
[Paper] [GitHub] [Model (406M)] -
(SciNewsBERT) SciClops: Detecting and Contextualizing Scientific Claims for Assisting Manual Fact-Checking
CIKM 2021
[Paper] [Model (Base)] -
(ScholarBERT) The Diminishing Returns of Masked Language Models to Science
ACL 2023 Findings
[Paper] [Model (Large)] [Model (770M)] -
(AcademicRoBERTa) A Japanese Masked Language Model for Academic Domain
COLING 2022 Workshop
[Paper] [GitHub] [Model (125M)] -
(Galactica) Galactica: A Large Language Model for Science
arXiv 2022
[Paper] [Model (125M)] [Model (1.3B)] [Model (6.7B)] [Model (30B)] [Model (120B)] -
(DARWIN) DARWIN Series: Domain Specific Large Language Models for Natural Science
arXiv 2023
[Paper] [GitHub] [Model (7B)] -
(FORGE) FORGE: Pre-training Open Foundation Models for Science
SC 2023
[Paper] [GitHub] [Model (1.4B, General)] [Model (1.4B, Biology/Medicine)] [Model (1.4B, Chemistry)] [Model (1.4B, Engineering)] [Model (1.4B, Materials Science)] [Model (1.4B, Physics)] [Model (1.4B, Social Science/Art)] [Model (13B, General)] [Model (22B, General)] -
(SciGLM) SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning
arXiv 2024
[Paper] [GitHub] [Model (6B)]
- (SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers
ACL 2020
[Paper] [GitHub] [Model (Base)] -
(OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services
KDD 2022
[Paper] [GitHub] -
(ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
NAACL 2022
[Paper] [GitHub] [Model (Base)] -
(SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings
EMNLP 2022
[Paper] [GitHub] [Model (Base)] -
(SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
EMNLP 2023
[Paper] [GitHub] [Model (113M)] -
(SciPatton) Patton: Language Model Pretraining on Text-Rich Networks
ACL 2023
[Paper] [GitHub] -
(SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding
EMNLP 2023 Findings
[Paper] [GitHub] [Model (138M)]
## Mathematics

- (GenBERT) Injecting Numerical Reasoning Skills into Language Models
ACL 2020
[Paper] [GitHub] -
(MathBERT) MathBERT: A Pre-trained Language Model for General NLP Tasks in Mathematics Education
arXiv 2021
[Paper] [GitHub] [Model (Base)] -
(MWP-BERT) MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving
NAACL 2022 Findings
[Paper] [GitHub] [Model (Base)] -
(BERT-TD) Seeking Patterns, Not just Memorizing Procedures: Contrastive Learning for Solving Math Word Problems
ACL 2022 Findings
[Paper] [GitHub] -
(GSM8K-GPT) Training Verifiers to Solve Math Word Problems
arXiv 2021
[Paper] [GitHub] -
(DeductReasoner) Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction
ACL 2022
[Paper] [GitHub] [Model (125M)] -
(NaturalProver) NaturalProver: Grounded Mathematical Proof Generation with Language Models
NeurIPS 2022
[Paper] [GitHub] -
(Minerva) Solving Quantitative Reasoning Problems with Language Models
NeurIPS 2022
[Paper] -
(Bhaskara) Lila: A Unified Benchmark for Mathematical Reasoning
EMNLP 2022
[Paper] [GitHub] [Model (2.7B)] -
(WizardMath) WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)] -
(MAmmoTH) MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
ICLR 2024
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] [Model (13B, LLaMA-2)] [Model (70B, LLaMA-2)] -
(MetaMath) MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
ICLR 2024
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] [Model (13B, LLaMA-2)] [Model (70B, LLaMA-2)] -
(ToRA) ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
ICLR 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (70B)] -
(MathCoder) MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
ICLR 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)] -
(Llemma) Llemma: An Open Language Model For Mathematics
ICLR 2024
[Paper] [GitHub] [Model (7B)] [Model (34B)] -
(OVM) OVM, Outcome-Supervised Value Models for Planning in Mathematical Reasoning
NAACL 2024 Findings
[Paper] [GitHub] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] -
(DeepSeekMath) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
arXiv 2024
[Paper] [GitHub] [Model (7B)] -
(InternLM-Math) InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning
arXiv 2024
[Paper] [GitHub] [Model (7B)] [Model (20B)] -
(OpenMath) OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
arXiv 2024
[Paper] [Model (7B, Mistral)] [Model (70B, LLaMA-2)] -
(Rho-Math) Rho-1: Not All Tokens Are What You Need
arXiv 2024
[Paper] [GitHub] [Model (1B)] [Model (7B)] -
(MAmmoTH2) MAmmoTH2: Scaling Instructions from the Web
arXiv 2024
[Paper] [GitHub] [Model (7B, Mistral)] [Model (8B, LLaMA-3)] [Model (8x7B, Mixtral)] -
(TheoremLlama) TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts
arXiv 2024
[Paper] [GitHub] [Model (8B)]
- (Inter-GPS) Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning
ACL 2021
[Paper] [GitHub] -
(Geoformer) UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression
EMNLP 2022
[Paper] [GitHub] -
(SCA-GPS) A Symbolic Character-Aware Model for Solving Geometry Problems
ACM MM 2023
[Paper] [GitHub] -
(UniMath-Flan-T5) UniMath: A Foundational and Multimodal Mathematical Reasoner
EMNLP 2023
[Paper] [GitHub] -
(G-LLaVA) G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)]
- (TAPAS) TAPAS: Weakly Supervised Table Parsing via Pre-training
ACL 2020
[Paper] [GitHub] [Model (Base)] [Model (Large)] -
(TaBERT) TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables
ACL 2020
[Paper] [GitHub] [Model (Base)] [Model (Large)] -
(GraPPa) GraPPa: Grammar-Augmented Pre-training for Table Semantic Parsing
ICLR 2021
[Paper] [GitHub] [Model (355M)] -
(TUTA) TUTA: Tree-Based Transformers for Generally Structured Table Pre-training
KDD 2021
[Paper] [GitHub] -
(RCI) Capturing Row and Column Semantics in Transformer Based Question Answering over Tables
NAACL 2021
[Paper] [GitHub] [Model (12M)] -
(TABBIE) TABBIE: Pretrained Representations of Tabular Data
NAACL 2021
[Paper] [GitHub] -
(TAPEX) TAPEX: Table Pre-training via Learning a Neural SQL Executor
ICLR 2022
[Paper] [GitHub] [Model (140M)] [Model (406M)] -
(FORTAP) FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining
ACL 2022
[Paper] [GitHub] -
(OmniTab) OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-Based Question Answering
NAACL 2022
[Paper] [GitHub] [Model (406M)] -
(ReasTAP) ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples
EMNLP 2022
[Paper] [GitHub] [Model (406M)] -
(Table-GPT) Table-GPT: Table-tuned GPT for Diverse Table Tasks
SIGMOD 2024
[Paper] -
(TableLlama) TableLlama: Towards Open Large Generalist Models for Tables
NAACL 2024
[Paper] [GitHub] [Model (7B)] -
(TableLLM) TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios
arXiv 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)]
## Physics

- (astroBERT) Building astroBERT, a Language Model for Astronomy & Astrophysics
arXiv 2021
[Paper] [Model (Base)] -
(AstroLLaMA) AstroLLaMA: Towards Specialized Foundation Models in Astronomy
AACL 2023 Workshop
[Paper] [Model (7B)] -
(AstroLLaMA-Chat) AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets
Research Notes of the AAS 2024
[Paper] [Model (7B)] -
(PhysBERT) PhysBERT: A Text Embedding Model for Physics Scientific Literature
arXiv 2024
[Paper] [Model (Base)]
## Chemistry and Materials Science

- (ChemBERT) Automated Chemical Reaction Extraction from Scientific Literature
Journal of Chemical Information and Modeling 2022
[Paper] [GitHub] [Model (Base)] -
(MatSciBERT) MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction
npj Computational Materials 2022
[Paper] [GitHub] [Model (Base)] -
(MatBERT) Quantifying the Advantage of Domain-Specific Pre-training on Named Entity Recognition Tasks in Materials Science
Patterns 2022
[Paper] [GitHub] -
(BatteryBERT) BatteryBERT: A Pretrained Language Model for Battery Database Enhancement
Journal of Chemical Information and Modeling 2022
[Paper] [GitHub] [Model (Base)] -
(MaterialsBERT) A General-Purpose Material Property Data Extraction Pipeline from Large Polymer Corpora using Natural Language Processing
npj Computational Materials 2023
[Paper] [Model (Base)] -
(Recycle-BERT) Recycle-BERT: Extracting Knowledge about Plastic Waste Recycling by Natural Language Processing
ACS Sustainable Chemistry & Engineering 2023
[Paper] [GitHub] -
(CatBERTa) Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models
ACS Catalysis 2023
[Paper] [GitHub] -
(LLM-Prop) LLM-Prop: Predicting Physical and Electronic Properties of Crystalline Solids from Their Text Descriptions
arXiv 2023
[Paper] [GitHub] -
(ChemDFM) ChemDFM: Dialogue Foundation Model for Chemistry
arXiv 2024
[Paper] [GitHub] [Model (13B)] -
(CrystalLLM) Fine-Tuned Language Models Generate Stable Inorganic Materials as Text
ICLR 2024
[Paper] [GitHub] -
(ChemLLM) ChemLLM: A Chemical Large Language Model
arXiv 2024
[Paper] [Model (7B)] -
(LlaSMol) LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset
COLM 2024
[Paper] [GitHub] [Model (6.7B, Galactica)] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]
- (Text2Mol) Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries
EMNLP 2021
[Paper] [GitHub] -
(KV-PLM) A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals
Nature Communications 2022
[Paper] [GitHub] [Model (Base)] -
(MolT5) Translation between Molecules and Natural Language
EMNLP 2022
[Paper] [GitHub] [Model (60M)] [Model (220M)] [Model (770M)] -
(MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language
arXiv 2022
[Paper] [GitHub] -
(MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-Based Retrieval and Editing
Nature Machine Intelligence 2023
[Paper] [GitHub] -
(Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling
ICML 2023
[Paper] [GitHub] [Model (60M)] [Model (220M)] -
(GIMLET) GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning
NeurIPS 2023
[Paper] [GitHub] [Model (60M)] -
(MolFM) MolFM: A Multimodal Molecular Foundation Model
arXiv 2023
[Paper] [GitHub] -
(MolCA) MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter
EMNLP 2023
[Paper] [GitHub] -
(MolLM) MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations
Bioinformatics 2024
[Paper] [GitHub] -
(InstructMol) InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery
arXiv 2023
[Paper] [GitHub] -
(3D-MoLM) Towards 3D Molecule-Text Interpretation in Language Models
ICLR 2024
[Paper] [GitHub]
- (GIT-Mol) GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
Computers in Biology and Medicine 2024
[Paper] [GitHub]
- (SMILES-BERT) SMILES-BERT: Large Scale Unsupervised Pre-training for Molecular Property Prediction
ACM BCB 2019
[Paper] [GitHub] -
(MAT) Molecule Attention Transformer
arXiv 2020
[Paper] [GitHub] -
(ChemBERTa) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction
arXiv 2020
[Paper] [GitHub] [Model (125M)] -
(MolBERT) Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks
arXiv 2020
[Paper] [GitHub] [Model (Base)] -
(rxnfp) Mapping the Space of Chemical Reactions using Attention-Based Neural Networks
Nature Machine Intelligence 2021
[Paper] [GitHub] [Model (Base)] -
(RXNMapper) Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions
Science Advances 2021
[Paper] [GitHub] -
(MoLFormer) Large-Scale Chemical Language Representations Capture Molecular Structure and Properties
Nature Machine Intelligence 2022
[Paper] [GitHub] [Model (47M)] -
(Chemformer) Chemformer: A Pre-trained Transformer for Computational Chemistry
Machine Learning: Science and Technology 2022
[Paper] [GitHub] [Model (45M)] [Model (230M)] -
(R-MAT) Relative Molecule Self-Attention Transformer
Journal of Cheminformatics 2024
[Paper] [GitHub] -
(MolGPT) MolGPT: Molecular Generation using a Transformer-Decoder Model
Journal of Chemical Information and Modeling 2022
[Paper] [GitHub] -
(T5Chem) Unified Deep Learning Model for Multitask Reaction Predictions with Explanation
Journal of Chemical Information and Modeling 2022
[Paper] [GitHub] -
(ChemGPT) Neural Scaling of Deep Chemical Models
Nature Machine Intelligence 2023
[Paper] [Model (4.7M)] [Model (19M)] [Model (1.2B)] -
(Uni-Mol) Uni-Mol: A Universal 3D Molecular Representation Learning Framework
ICLR 2023
[Paper] [GitHub] -
(TransPolymer) TransPolymer: A Transformer-Based Language Model for Polymer Property Predictions
npj Computational Materials 2023
[Paper] [GitHub] -
(polyBERT) polyBERT: A Chemical Language Model to Enable Fully Machine-Driven Ultrafast Polymer Informatics
Nature Communications 2023
[Paper] [GitHub] [Model (86M)] -
(MFBERT) Large-Scale Distributed Training of Transformers for Chemical Fingerprinting
Journal of Chemical Information and Modeling 2022
[Paper] [GitHub] -
(SPMM) Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model
Nature Communications 2024
[Paper] [GitHub] -
(BARTSmiles) BARTSmiles: Generative Masked Language Models for Molecular Representations
Journal of Chemical Information and Modeling 2024
[Paper] [GitHub] [Model (406M)] -
(MolGen) Domain-Agnostic Molecular Generation with Self-feedback
ICLR 2024
[Paper] [GitHub] [Model (406M, BART)] [Model (7B, LLaMA)] -
(SELFormer) SELFormer: Molecular Representation Learning via SELFIES Language Models
Machine Learning: Science and Technology 2023
[Paper] [GitHub] [Model (58M)] [Model (87M)] -
(PolyNC) PolyNC: A Natural and Chemical Language Model for the Prediction of Unified Polymer Properties
Chemical Science 2024
[Paper] [GitHub] [Model (220M)]
## Biology and Medicine

Acknowledgment: We referred to Wang et al.'s survey paper Pre-trained Language Models in Biomedical Domain: A Systematic Survey and He et al.'s survey paper Foundation Model for Advancing Healthcare: Challenges, Opportunities, and Future Directions when writing some parts of this section.

- (BioBERT) BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining
Bioinformatics 2020
[Paper] [GitHub] [Model (Base)] [Model (Large)] -
(BioELMo) Probing Biomedical Embeddings from Language Models
NAACL 2019 Workshop
[Paper] [GitHub] [Model (93M)] -
(ClinicalBERT, Alsentzer et al.) Publicly Available Clinical BERT Embeddings
NAACL 2019 Workshop
[Paper] [GitHub] [Model (Base)] -
(ClinicalBERT, Huang et al.) ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission
arXiv 2019
[Paper] [GitHub] [Model (Base)] -
(BlueBERT, f.k.a. NCBI-BERT) Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets
ACL 2019 Workshop
[Paper] [GitHub] [Model (Base)] [Model (Large)] -
(BEHRT) BEHRT: Transformer for Electronic Health Records
Scientific Reports 2020
[Paper] [GitHub] -
(EhrBERT) Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)–Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study
JMIR Medical Informatics 2019
[Paper] [GitHub] -
(Clinical XLNet) Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation
EMNLP 2020 Workshop
[Paper] [GitHub] -
(ouBioBERT) Pre-training Technique to Localize Medical BERT and Enhance Biomedical BERT
arXiv 2020
[Paper] [GitHub] [Model (Base)] -
(COVID-Twitter-BERT) COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter
Frontiers in Artificial Intelligence 2023
[Paper] [GitHub] [Model (Large)] -
(Med-BERT) Med-BERT: Pretrained Contextualized Embeddings on Large-Scale Structured Electronic Health Records for Disease Prediction
npj Digital Medicine 2021
[Paper] [GitHub] -
(Bio-ELECTRA) On the Effectiveness of Small, Discriminatively Pre-trained Language Representation Models for Biomedical Text Mining
EMNLP 2020 Workshop
[Paper] [GitHub] [Model (Base)] -
(BiomedBERT, f.k.a. PubMedBERT) Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing
ACM Transactions on Computing for Healthcare 2021
[Paper] [Model (Base)] [Model (Large)] -
(MCBERT) Conceptualized Representation Learning for Chinese Biomedical Text Mining
arXiv 2020
[Paper] [GitHub] [Model (Base)] -
(BRLTM) Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression
JBHI 2021
[Paper] [GitHub] -
(BioRedditBERT) COMETA: A Corpus for Medical Entity Linking in the Social Media
EMNLP 2020
[Paper] [GitHub] [Model (Base)] -
(BioMegatron) BioMegatron: Larger Biomedical Domain Language Model
EMNLP 2020
[Paper] [GitHub] [Model (345M)] -
(SapBERT) Self-Alignment Pretraining for Biomedical Entity Representations
NAACL 2021
[Paper] [GitHub] [Model (Base)] -
(ClinicalTransformer) Clinical Concept Extraction using Transformers
JAMIA 2020
[Paper] [GitHub] [Model (Base, BERT)] [Model (125M, RoBERTa)] [Model (12M, ALBERT)] [Model (Base, ELECTRA)] [Model (Base, XLNet)] [Model (149M, Longformer)] [Model (86M, DeBERTa)] -
(BioRoBERTa) Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art
EMNLP 2020 Workshop
[Paper] [GitHub] [Model (125M)] [Model (355M)] -
(RAD-BERT) Highly Accurate Classification of Chest Radiographic Reports using a Deep Learning Natural Language Model Pre-trained on 3.8 Million Text Reports
Bioinformatics 2020
[Paper] [GitHub] -
(BioMedBERT) BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR
COLING 2020
[Paper] [GitHub] -
(LBERT) LBERT: Lexically Aware Transformer-Based Bidirectional Encoder Representation Model for Learning Universal Bio-Entity Relations
Bioinformatics 2021
[Paper] [GitHub] -
(ELECTRAMed) ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP
arXiv 2021
[Paper] [GitHub] [Model (Base)] -
(KeBioLM) Improving Biomedical Pretrained Language Models with Knowledge
NAACL 2021 Workshop
[Paper] [GitHub] -
(SciFive) SciFive: A Text-to-Text Transformer Model for Biomedical Literature
arXiv 2021
[Paper] [GitHub] [Model (220M)] [Model (770M)] -
(BioALBERT) Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT
BMC Bioinformatics 2022
[Paper] [GitHub] [Model (12M)] [Model (18M)] -
(Clinical-Longformer) Clinical-Longformer and Clinical-BigBird: Transformers for Long Clinical Sequences
arXiv 2022
[Paper] [GitHub] [Model (149M, Longformer)] [Model (Base, BigBird)] -
(BioBART) BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model
ACL 2022 Workshop
[Paper] [GitHub] [Model (140M)] [Model (406M)] -
(BioGPT) BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
Briefings in Bioinformatics 2022
[Paper] [GitHub] [Model (355M)] [Model (1.5B)] -
(Med-PaLM) Large Language Models Encode Clinical Knowledge
Nature 2023
[Paper] -
(GatorTron) A Large Language Model for Electronic Health Records
npj Digital Medicine 2022
[Paper] [GitHub] [Model (345M)] [Model (3.9B)] [Model (8.9B)] -
(ChatDoctor) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) using Medical Domain Knowledge
Cureus 2023
[Paper] [GitHub] -
(DoctorGLM) DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task
arXiv 2023
[Paper] [GitHub] -
(BenTsao, f.k.a. HuaTuo) HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge
arXiv 2023
[Paper] [GitHub] -
(MedAlpaca) MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data
arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)] -
(PMC-LLaMA) PMC-LLaMA: Towards Building Open-source Language Models for Medicine
JAMIA 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)] -
(Med-PaLM 2) Towards Expert-Level Medical Question Answering with Large Language Models
arXiv 2023
[Paper] -
(HuatuoGPT) HuatuoGPT, towards Taming Language Model to Be a Doctor
EMNLP 2023 Findings
[Paper] [GitHub] [Model (7B)] [Model (13B)] -
(MedCPT) MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval
Bioinformatics 2023
[Paper] [GitHub] [Model (Base)] -
(Zhongjing) Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue
AAAI 2024
[Paper] [GitHub] [Model (13B)] -
(DISC-MedLLM) DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation
arXiv 2023
[Paper] [GitHub] [Model (13B)] -
(DRG-LLaMA) DRG-LLaMA: Tuning LLaMA Model to Predict Diagnosis-related Group for Hospitalized Patients
npj Digital Medicine 2024
[Paper] [GitHub] -
(Qilin-Med) Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model
arXiv 2023
[Paper] [GitHub] -
(AlpaCare) AlpaCare: Instruction-tuned Large Language Models for Medical Application
arXiv 2023
[Paper] [GitHub] [Model (7B, LLaMA)] [Model (7B, LLaMA-2)] [Model (13B, LLaMA)] [Model (13B, LLaMA-2)] -
(BianQue) BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT
arXiv 2023
[Paper] [GitHub] [Model (6B)] -
(HuatuoGPT-II) HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs
arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (34B)] -
(Taiyi) Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks
JAMIA 2024
[Paper] [GitHub] [Model (7B)] -
(MEDITRON) MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
arXiv 2023
[Paper] [GitHub] [Model (7B)] [Model (70B)] -
(PLLaMa) PLLaMa: An Open-source Large Language Model for Plant Science
arXiv 2024
[Paper] [GitHub] [Model (7B)] [Model (13B)] -
(BioMistral) BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
ACL 2024 Findings
[Paper] [Model (7B)] -
(Me-LLaMA) Me-LLaMA: Foundation Large Language Models for Medical Applications
arXiv 2024
[Paper] [GitHub] -
(BiMediX) BiMediX: Bilingual Medical Mixture of Experts LLM
arXiv 2024
[Paper] [GitHub] [Model (8x7B)] -
(MMedLM) Towards Building Multilingual Language Model for Medicine
arXiv 2024
[Paper] [GitHub] [Model (7B, InternLM)] [Model (1.8B, InternLM2)] [Model (7B, InternLM2)] [Model (8B, LLaMA-3)] -
(BioMedLM, f.k.a. PubMedGPT) BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text
arXiv 2024
[Paper] [GitHub] [Model (2.7B)] -
(Hippocrates) Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
arXiv 2024
[Paper] [Model (7B, LLaMA-2)] [Model (7B, Mistral)] -
(BMRetriever) BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers
arXiv 2024
[Paper] [GitHub] [Model (410M, Pythia)] [Model (1B, Pythia)] [Model (2B, Gemma)] [Model (7B, Mistral)] -
(Panacea) Panacea: A Foundation Model for Clinical Trial Search, Summarization, Design, and Recruitment
arXiv 2024
[Paper] [GitHub]
- (G-BERT) Pre-training of Graph Augmented Transformers for Medication Recommendation
IJCAI 2019
[Paper] [GitHub] -
(CODER) CODER: Knowledge Infused Cross-Lingual Medical Term Embedding for Term Normalization
JBI 2022
[Paper] [GitHub] [Model (Base)] -
(MoP) Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT
EMNLP 2021
[Paper] [GitHub] -
(BioLinkBERT) LinkBERT: Pretraining Language Models with Document Links
ACL 2022
[Paper] [GitHub] [Model (Base)] [Model (Large)] -
(DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining
NeurIPS 2022
[Paper] [GitHub] [Model (360M)]
- (ConVIRT) Contrastive Learning of Medical Visual Representations from Paired Images and Text
MLHC 2022
[Paper] [GitHub] -
(MMBERT) MMBERT: Multimodal BERT Pretraining for Improved Medical VQA
ISBI 2021
[Paper] [GitHub] -
(MedViLL) Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-training
JBHI 2022
[Paper] [GitHub] -
(GLoRIA) GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition
ICCV 2021
[Paper] [GitHub] -
(LoVT) Joint Learning of Localized Representations from Medical Images and Reports
ECCV 2022
[Paper] [GitHub] -
(BioViL) Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing
ECCV 2022
[Paper] [GitHub] -
(M3AE) Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-training
MICCAI 2022
[Paper] [GitHub] [Model] -
(ARL) Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
ACM MM 2022
[Paper] [GitHub] -
(CheXzero) Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning
Nature Biomedical Engineering 2022
[Paper] [GitHub] [Model] -
(MGCA) Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning
NeurIPS 2022
[Paper] [GitHub] [Model] -
(MedCLIP) MedCLIP: Contrastive Learning from Unpaired Medical Images and Text
EMNLP 2022
[Paper] [GitHub] -
(BioViL-T) Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
CVPR 2023
[Paper] [GitHub] [Model] -
(BiomedCLIP) BiomedCLIP: A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs
arXiv 2023
[Paper] [Model] -
(PMC-CLIP) PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
MICCAI 2023
[Paper] [GitHub] [Model] -
(Xplainer) Xplainer: From X-Ray Observations to Explainable Zero-Shot Diagnosis
MICCAI 2023
[Paper] [GitHub] -
(RGRG) Interactive and Explainable Region-Guided Radiology Report Generation
CVPR 2023
[Paper] [GitHub] [Model] -
(BiomedGPT) A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks
Nature Medicine 2024
[Paper] [GitHub] [Model (33M)] [Model (93M)] [Model (182M)] -
(Med-UniC) Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias
NeurIPS 2023
[Paper] [GitHub] -
(LLaVA-Med) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
NeurIPS 2023
[Paper] [GitHub] [Model (7B)] -
(MI-Zero) Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
CVPR 2023
[Paper] [GitHub] [Model] -
(XrayGPT) XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
ACL 2024 Workshop
[Paper] [GitHub] -
(MONET) Transparent Medical Image AI via an Image–Text Foundation Model Grounded in Medical Literature
Nature Medicine 2024
[Paper] [GitHub] -
(QuiltNet) Quilt-1M: One Million Image-Text Pairs for Histopathology
NeurIPS 2023
[Paper] [GitHub] [Model] -
(MUMC) Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering
MICCAI 2023
[Paper] [GitHub] -
(M-FLAG) M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization
MICCAI 2023
[Paper] [GitHub] -
(PRIOR) PRIOR: Prototype Representation Joint Learning from Medical Images and Reports
ICCV 2023
[Paper] [GitHub] -
(Med-PaLM M) Towards Generalist Biomedical AI
NEJM AI 2024
[Paper] [GitHub] -
(CITE) Text-Guided Foundation Model Adaptation for Pathological Image Classification
MICCAI 2023
[Paper] [GitHub] -
(Med-Flamingo) Med-Flamingo: A Multimodal Medical Few-shot Learner
ML4H 2023
[Paper] [GitHub] -
(RadFM) Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data
arXiv 2023
[Paper] [GitHub] [Model] -
(PLIP) A Visual–Language Foundation Model for Pathology Image Analysis using Medical Twitter
Nature Medicine 2023
[Paper] [GitHub] [Model] -
(MaCo) Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning
Nature Communications 2024
[Paper] [GitHub] -
(CXR-CLIP) CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training
MICCAI 2023
[Paper] [GitHub] -
(Qilin-Med-VL) Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare
arXiv 2023
[Paper] [GitHub] [Model] -
(BioCLIP) BioCLIP: A Vision Foundation Model for the Tree of Life
CVPR 2024
[Paper] [GitHub] [Model] -
(M3D) M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
arXiv 2024
[Paper] [GitHub] [Model] -
(Med-Gemini) Capabilities of Gemini Models in Medicine
arXiv 2024
[Paper] -
(Med-Gemini-2D/3D/Polygenic) Advancing Multimodal Medical Capabilities of Gemini
arXiv 2024
[Paper] -
(Mammo-CLIP) Mammo-CLIP: A Vision Language Foundation Model to Enhance Data Efficiency and Robustness in Mammography
MICCAI 2024
[Paper] [GitHub] [Model]
- (ProtTrans) ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning
TPAMI 2021
[Paper] [GitHub] [Model (420M, BERT)] [Model (224M, ALBERT)] [Model (409M, XLNet)] [Model (420M, ELECTRA)] [Model (3B, T5)] [Model (11B, T5)] -
(ESM-1b) Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences
PNAS 2021
[Paper] [GitHub] [Model (650M)] -
(MSA Transformer) MSA Transformer
ICML 2021
[Paper] [GitHub] -
(ESM-1v) Language Models Enable Zero-Shot Prediction of the Effects of Mutations on Protein Function
NeurIPS 2021
[Paper] [GitHub] [Model (650M)] -
(AminoBERT) Single-Sequence Protein Structure Prediction using a Language Model and Deep Learning
Nature Biotechnology 2022
[Paper] [GitHub] -
(ProteinBERT) ProteinBERT: A Universal Deep-Learning Model of Protein Sequence and Function
Bioinformatics 2022
[Paper] [GitHub] [Model (16M)] -
(ProtGPT2) ProtGPT2 is a Deep Unsupervised Language Model for Protein Design
Nature Communications 2022
[Paper] [Model (738M)] -
(ESM-IF1) Learning Inverse Folding from Millions of Predicted Structures
ICML 2022
[Paper] [GitHub] [Model (142M)] -
(ProGen) Large Language Models Generate Functional Protein Sequences across Diverse Families
Nature Biotechnology 2023
[Paper] [GitHub] [Model (1.6B)] -
(ProGen2) ProGen2: Exploring the Boundaries of Protein Language Models
Cell Systems 2023
[Paper] [GitHub] [Model (151M)] [Model (764M)] [Model (2.7B)] [Model (6.4B)] -
(ESM-2) Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model
Science 2023
[Paper] [GitHub] [Model (8M)] [Model (35M)] [Model (150M)] [Model (650M)] [Model (3B)] [Model (15B)] -
(Ankh) Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling
arXiv 2023
[Paper] [GitHub] [Model (450M)] [Model (1.1B)] -
(ProtST) ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
ICML 2023
[Paper] [GitHub] -
(LM-Design) Structure-informed Language Models Are Protein Designers
ICML 2023
[Paper] [GitHub] [Model (659M)] -
(ProteinDT) A Text-Guided Protein Design Framework
arXiv 2023
[Paper] [GitHub] -
(Prot2Text) Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers
AAAI 2024
[Paper] [GitHub] [Model (256M)] [Model (283M)] [Model (398M)] [Model (898M)] -
(BioMedGPT) BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine
arXiv 2023
[Paper] [GitHub] [Model (10B)] -
(SaProt) SaProt: Protein Language Modeling with Structure-Aware Vocabulary
ICLR 2024
[Paper] [GitHub] [Model (35M)] [Model (650M)] -
(BioT5) BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
EMNLP 2023
[Paper] [GitHub] [Model (220M)] -
(ProLLaMA) ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing
arXiv 2024
[Paper] [GitHub] [Model (7B)]
- (DNABERT) DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome
Bioinformatics 2021
[Paper] [GitHub] [Model (Base)] -
(GenSLMs) GenSLMs: Genome-Scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics
The International Journal of High Performance Computing Applications 2023
[Paper] [GitHub] -
(Nucleotide Transformer) The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
bioRxiv 2023
[Paper] [GitHub] [Model (50M)] [Model (100M)] [Model (250M)] [Model (500M)] -
(GENA-LM) GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences
bioRxiv 2023
[Paper] [GitHub] [Model (Base, BERT)] [Model (Large, BERT)] [Model (Base, BigBird)] -
(DNABERT-2) DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
ICLR 2024
[Paper] [GitHub] [Model (Base)] -
(HyenaDNA) HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
NeurIPS 2023
[Paper] [GitHub] [Model (0.4M)] [Model (3.3M)] [Model (6.6M)] -
(DNAGPT) DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks
arXiv 2023
[Paper] [GitHub] [Model (0.1B)] [Model (3B)]
- (RNABERT) Informative RNA-base Embedding for Functional RNA Structural Alignment and Clustering by Deep Representation Learning
NAR Genomics and Bioinformatics 2022
[Paper] [GitHub] -
(RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions
arXiv 2022
[Paper] [GitHub] -
(SpliceBERT) Self-Supervised Learning on Millions of Primary RNA Sequences from 72 Vertebrates Improves Sequence-Based RNA Splicing Prediction
Briefings in Bioinformatics 2024
[Paper] [GitHub] [Model (19.4M)] -
(RNA-MSM) Multiple Sequence-Alignment-Based RNA Language Model and its Application to Structural Inference
Nucleic Acids Research 2024
[Paper] [GitHub] -
(CodonBERT) CodonBERT: Large Language Models for mRNA Design and Optimization
bioRxiv 2023
[Paper] [GitHub] -
(UTR-LM) A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions
Nature Machine Intelligence 2024
[Paper] [GitHub]
- (scBERT) scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data
Nature Machine Intelligence 2022
[Paper] [GitHub] -
(scGPT) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics using Generative AI
Nature Methods 2024
[Paper] [GitHub] -
(scFoundation) Large Scale Foundation Model on Single-cell Transcriptomics
Nature Methods 2024
[Paper] [GitHub] [Model (100M)] -
(Geneformer) Transfer Learning Enables Predictions in Network Biology
Nature 2023
[Paper] [Model (10M)] [Model (40M)] -
(CellLM) Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning
arXiv 2023
[Paper] [GitHub] -
(CellPLM) CellPLM: Pre-training of Cell Language Model Beyond Single Cells
ICLR 2024
[Paper] [GitHub] [Model (82M)] -
(scMulan) scMulan: A Multitask Generative Pre-trained Language Model for Single-Cell Analysis
bioRxiv 2024
[Paper] [GitHub]
## Geography, Geology, and Environmental Science

- (ClimateBERT) ClimateBERT: A Pretrained Language Model for Climate-Related Text
arXiv 2021
[Paper] [GitHub] [Model (82M)] -
(SpaBERT) SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation
EMNLP 2022 Findings
[Paper] [GitHub] [Model (Base)] [Model (Large)] -
(MGeo) MGeo: Multi-Modal Geographic Pre-training Method
SIGIR 2023
[Paper] [GitHub] -
(K2) K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization
WSDM 2024
[Paper] [GitHub] [Model (7B)] -
(OceanGPT) OceanGPT: A Large Language Model for Ocean Science Tasks
ACL 2024
[Paper] [GitHub] [Model (7B)] -
(ClimateBERT-NetZero) ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction Targets
EMNLP 2023
[Paper] [Model (82M)] -
(GeoLM) GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding
EMNLP 2023
[Paper] [GitHub] -
(GeoGalactica) GeoGalactica: A Scientific Large Language Model in Geoscience
arXiv 2024
[Paper] [GitHub] [Model (30B)]
- (ERNIE-GeoL) ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps
KDD 2022
[Paper] -
(PK-Chat) PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue Model
arXiv 2023
[Paper] [GitHub]
- (UrbanCLIP) UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web
WWW 2024
[Paper] [GitHub]
- (FourCastNet) FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators
arXiv 2022
[Paper] [GitHub] -
(Pangu-Weather) Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks
Nature 2023
[Paper] [GitHub] -
(ClimaX) ClimaX: A Foundation Model for Weather and Climate
ICML 2023
[Paper] [GitHub] -
(FengWu) FengWu: Pushing the Skillful Global Medium-Range Weather Forecast beyond 10 Days Lead
arXiv 2023
[Paper] [GitHub] -
(W-MAE) W-MAE: Pre-trained Weather Model with Masked Autoencoder for Multi-Variable Weather Forecasting
arXiv 2023
[Paper] [GitHub] -
(FuXi) FuXi: A Cascade Machine Learning Forecasting System for 15-day Global Weather Forecast
npj Climate and Atmospheric Science 2023
[Paper] [GitHub]
If you find this repository useful, please cite the following paper:
@inproceedings{zhang2024comprehensive,
title={A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery},
author={Zhang, Yu and Chen, Xiusi and Jin, Bowen and Wang, Sheng and Ji, Shuiwang and Wang, Wei and Han, Jiawei},
booktitle={EMNLP'24},
pages={8783--8817},
year={2024}
}