Each entry below lists the model's family, pretraining architecture (encoder, decoder, or encoder/decoder), pretraining or fine-tuning task, extension over prior work, main application, date of first known publication, number of parameters, training corpus, license (where known), and lab.
ALBERT
Family: BERT | Architecture: Encoder | Task: MLM/NSP | Date: 09/2019 | Params: Base = 12M, Large = 18M, XLarge = 60M | Lab: Google
Extension: Compressed version of BERT using parameter sharing, which is much more efficient given the same number of parameters.
Application: Same as BERT.
Corpus: Same as BERT.
AlexaTM 20B
Family: Transformer | Architecture: Encoder/Decoder | Task: Denoising and prefix LM | Date: 08/2022 | Params: 20B | Lab: Amazon
Extension: Derived from BART, with layer norms located exactly at the beginning of each layer. The encoder is initialized with an internal 10B pre-trained encoder.
Application: Summarization, multilingual machine translation, and NLU tasks.
Corpus: Wikipedia and mC4 datasets in 12 languages.
Alpaca
Family: LLaMA | Architecture: Decoder | Task: LM | Date: 03/2023 | Params: 7B | Lab: Stanford
Extension: Alpaca is fine-tuned from a 7B LLaMA model.
Application: Evaluated on a variety of text generation and classification tasks.
Corpus: 52K instruction-following examples generated with the self-instruct mechanism from 175 human-written instruction-output pairs.
AlphaFold
Family: SE(3)-Transformer | Architecture: Encoder | Task: Protein folding prediction | Date: 07/2021 | Params: 21M | Lab: DeepMind
Extension: The original AlphaFold used a BERT-style transformer. The details of AlphaFold's transformer are not known, but it is believed to be an extension of the SE(3)-Transformer, a 3-D equivariant transformer.
Application: Protein folding.
Corpus: 170,000 proteins from a public repository of protein sequences and structures.
Anthropic Assistant
Family: GPT | Architecture: Decoder | Task: LM | Date: 12/2021 | Params: 10M to 52B | Lab: Anthropic
Extension: These models do not introduce novelties at the architecture/pretraining level; they are based on GPT-3 and instead focus on improving alignment through fine-tuning and prompting. Note that the Anthropic Assistant includes several models optimized for different tasks. The latest versions of this work focus on the benefits of RLHF.
Application: Different models with different applications, from general dialog to code assistant.
Corpus: 400B tokens from filtered Common Crawl and Books. They also create several dialogue preference datasets for the RLHF training.
BART
Family: BERT for the encoder, GPT for the decoder | Architecture: Encoder/Decoder | Task: DAE | Date: 10/2019 | Params: 10% more than BERT | Lab: Facebook
Extension: Can be seen as a generalization of BERT and GPT in that it combines ideas from both in the encoder and decoder.
Application: Mostly text generation, but also some text understanding tasks.
Corpus: Same as RoBERTa (160 GB of news, books, stories, and web text).
BERT
Family: BERT | Architecture: Encoder | Task: MLM/NSP | Date: 10/2018 | Params: Base = 110M, Large = 340M | Lab: Google
Application: General language understanding and question answering; many other language applications followed (a minimal usage sketch follows this entry).
Corpus: Toronto Book Corpus and Wikipedia (3.3B tokens).
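Most encoder models in this catalog expose BERT's fill-in-the-mask interface. The sketch below is a minimal illustration of MLM inference, assuming the Hugging Face `transformers` and `torch` packages and the public `bert-base-uncased` checkpoint; it is an illustration, not part of any model's official release.

```python
# A minimal sketch of masked-language-model inference, assuming the Hugging Face
# `transformers` and `torch` packages and the public "bert-base-uncased" checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = int(logits[0, mask_pos].argmax())
print(tokenizer.decode([predicted_id]))
```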
Big Bird
Architecture: Encoder and Encoder/Decoder (Big Bird is mostly a way to implement sparse attention, and it has been implemented in both encoder-only and encoder/decoder architectures) | Task: MLM | Date: 07/2020 | Params: Depends on the overall architecture | Lab: Google
Extension: Big Bird can extend other architectures such as BERT, Pegasus, or RoBERTa by using a sparse attention mechanism that eliminates the quadratic dependency on sequence length, making it more suitable for longer sequences (see the sketch below).
Application: Particularly well suited for longer sequences, not only in text but also e.g. in genomics.
Corpus: Books, CC-News, Stories, and Wikipedia.
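The sparse attention idea behind Big Bird can be illustrated with a toy attention mask that combines a sliding window, a few global tokens, and a handful of random links per row. The sketch below uses arbitrary sizes and is not the published Big Bird implementation; the point is that the number of allowed connections grows linearly with sequence length instead of quadratically.

```python
# A rough, illustrative Big Bird-style sparse attention mask (window + global
# + random connections). Sizes are arbitrary assumptions for the example.
import numpy as np

def bigbird_style_mask(seq_len: int = 16, window: int = 1,
                       n_global: int = 2, n_random: int = 2,
                       seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local sliding window around each position.
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
        # A few random connections per row.
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True
    # Global tokens attend to, and are attended by, every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

print(bigbird_style_mask().astype(int))
```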
BlenderBot 3
Family: GPT | Architecture: Decoder | Task: LM | Date: 08/2022 | Params: 175B | Lab: Facebook
Extension: BlenderBot 3 is based on a pre-trained OPT. It adds features needed for a dialog agent, such as long-term memory and the ability to search the internet. It is also fine-tuned for some specific tasks given human feedback on them.
Application: Same as GPT-3.
Corpus: 180B tokens = RoBERTa + The Pile + PushShift.io Reddit.
BLOOM
Family: GPT | Architecture: Decoder | Task: LM | Date: 07/2022 | Params: 176B | Lab: BigScience/Hugging Face
Extension: The main difference from GPT-3 is that it uses full attention instead of sparse attention.
Application: Same as GPT-3.
Corpus: 366B tokens (1.5 TB of text data), multilingual dataset.
ChatGPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2022 | Params: Same as GPT-3 | Lab: OpenAI
Extension: ChatGPT takes a GPT-3.5 (a.k.a. GPT-3 Davinci-003) pretrained model and uses RLHF to fine-tune it much as described in InstructGPT, but with slight differences in data collection. ChatGPT is also more than a model, since it includes extensions for memory store and retrieval similar to BlenderBot 3.
Application: Dialog agents.
Corpus: Same as GPT-3 + datasets generated for RLHF.
Chinchilla
Family: GPT | Architecture: Decoder | Task: LM | Date: 03/2022 | Params: 70B | Lab: DeepMind
Extension: Same as Gopher but with optimizations to reduce model size, and therefore training/inference time, with equal or superior performance.
Application: Same as Gopher/GPT-3.
Corpus: MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, GitHub, News, C4, and Wikipedia).
CLIP
Family: CLIP (also using ResNet, ViT, and a vanilla transformer for text) | Architecture: Encoder | Task: Predict which of the N × N possible (image, text) pairings across a batch actually occurred | Date: 02/2021 | Lab: OpenAI
Extension: Combines ResNet and ViT for the visual encoding with a transformer for the textual encoding (see the contrastive-loss sketch below).
Application: Image/object classification.
Corpus: WIT (WebImageText), 400 million (text, image) pairs.
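The N × N pairing objective can be made concrete with a short sketch of the symmetric contrastive loss over a batch of image and text embeddings. The encoders are stand-ins here (random tensors in the example) and the temperature value is an arbitrary assumption; this illustrates the objective, not the released CLIP code.

```python
# A minimal sketch of a CLIP-style contrastive objective over a batch of N
# (image, text) pairs. Real encoders are replaced by random embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize embeddings so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # N x N similarity matrix: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.t() / temperature
    # The matching pair for row i is column i.
    labels = torch.arange(logits.size(0))
    # Symmetric cross-entropy over rows (image->text) and columns (text->image).
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Example with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```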
CM3
Family: HTML | Architecture: Decoder | Task: Causally-masked LM | Date: 01/2022 | Params: 13B (largest) | Lab: Facebook
Extension: Somewhat similar to HTML in its use of structured training data, but it is a different architecture and uses causal masking.
Application: Multimodal language model with the ability to do structured prompting.
Corpus: CC-News, English Wikipedia.
CTRL
Architecture: Decoder | Date: 09/2019 | Params: 1.63B | Lab: Salesforce
Extension: The model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior.
Application: Controllable text generation.
Corpus: 140 GB of text including Wikipedia (En, De, Es, Fr), Project Gutenberg, 45 subreddits, OpenWebText2, Amazon Reviews, Europarl and UN data from WMT, question-answer pairs from ELI5, and the MRQA shared task, which includes the Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions.
DALL-E
Family: GPT | Architecture: Decoder | Task: Caption prediction | Date: 01/2021 | Params: 12B | Lab: OpenAI
Extension: A discrete variational auto-encoder (dVAE) is used to learn the visual codebook. The transformer is a variation of GPT-3.
Application: Text to image.
Corpus: 250 million text-image pairs from the internet.
DALL-E 2
Family: CLIP, GLIDE | Architecture: Encoder/Decoder | Task: Caption prediction | Date: 04/2022 | Params: 3.5B | Lab: OpenAI
Extension: Combines a CLIP encoder and a diffusion decoder similar to GLIDE.
Application: Text to image.
Corpus: Combination of the DALL-E and CLIP datasets.
DeBERTa
Family: BERT | Architecture: Encoder | Task: MLM | Date: 06/2020 | Params: 750M (xlarge) | Lab: Microsoft
Extension: Uses a separate positional embedding vector, independent from the content embedding, with disentangled attention matrices for contents and relative positions.
Application: Same as BERT.
Corpus: English Wikipedia, BookCorpus, OpenWebText, and Stories.
License: Open, MIT license
Decision Transformers
Family: GPT, "Control Transformers" (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Architecture: Decoder | Task: Next action prediction | Date: 06/2021 | Params: Same as GPT | Lab: Google/UC Berkeley/FAIR
Extension: Decision Transformers use a GPT architecture and extend it by encoding trajectories in a way that they can be learned by an auto-regressive task (see the sketch below).
Application: General RL (reinforcement learning) tasks.
Corpus: Different corpora for different experiments.
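The trajectory encoding mentioned above can be illustrated with a toy sketch that computes returns-to-go and interleaves (return-to-go, state, action) triples into a single sequence for an autoregressive model to consume. The shapes, the tagging scheme, and the absence of an embedding step are simplifying assumptions, not the published implementation.

```python
# An illustrative sketch of a Decision Transformer-style input sequence: each
# timestep contributes a (return-to-go, state, action) triple.
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    # Reverse cumulative sum: reward still obtainable from each timestep onward.
    return np.cumsum(rewards[::-1])[::-1]

def interleave(states: np.ndarray, actions: np.ndarray, rewards: np.ndarray):
    rtg = returns_to_go(rewards)
    tokens = []
    for t in range(len(rewards)):
        tokens.extend([("rtg", rtg[t]), ("state", states[t]), ("action", actions[t])])
    return tokens

# Toy trajectory with 3 timesteps.
seq = interleave(np.arange(3), np.array([0, 1, 0]), np.array([1.0, 0.0, 2.0]))
print(seq)
```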
DialoGPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2019 | Params: 1.5B | Lab: Microsoft
Extension: GPT-2 architecture trained on dialog data.
Application: Text generation in dialog settings.
Corpus: 140M Reddit conversations.
DistilBERT
Family: BERT | Architecture: Encoder | Task: MLM/NSP | Date: 10/2019 | Params: 66M | Lab: Hugging Face
Extension: Compressed version of BERT using distillation, which is much more efficient given the same number of parameters.
Application: Same as BERT.
Corpus: Same as BERT.
DQ-BART
Family: BART | Architecture: Encoder/Decoder | Task: DAE | Date: 03/2022 | Params: Up to 30x reduction in parameters compared to standard BART | Lab: Amazon
Extension: Adds quantization and distillation to a BART model to improve performance and reduce model size.
Application: Text generation and understanding.
Corpus: CNN/DM, XSUM, ELI5, WMT16 En-Ro (~1M tokens).
Dolly
Family: GPT | Architecture: Decoder | Task: Fine-tuned on Q&A pairs to follow human instructions | Date: 03/2023 | Params: 6B | Lab: Databricks, Inc.
Extension: Fine-tuned from GPT-J-6B (V1) and Pythia (V2).
Application: Similar to Alpaca.
Corpus: V1: instruction corpus same as Alpaca; V2: Databricks' own dataset.
License: Open
E5
Family: BERT | Architecture: Encoder | Task: Fine-tuned on semantic similarity using a contrastive loss | Date: 12/2022 | Params: 300M (large version) | Lab: Microsoft
Extension: Fine-tunes BERT-based models to create text string embeddings optimized for semantic relatedness.
Application: Text embeddings for semantic relatedness tasks such as text clustering or search retrieval.
Corpus: MS-MARCO, NQ, NLI.
License: Open, MIT
ELECTRA
Architecture: Encoder | Task: RTD (replaced token detection) | Date: 03/2020 | Params: Base = 110M, Large = 330M | Lab: Stanford/Google
Application: Same as BERT.
Corpus: Same as BERT, except that the Large version uses the same corpus as XLNet.
ERNIE
Family: BERT | Architecture: Encoder | Task: MLM | Date: 05/2019 | Params: 114M | Lab: Various Chinese institutions
Extension: Uses BERT for the encoder architecture, but stacks and aggregates two of them for text and entities. This architecture can be understood as BERT for text plus knowledge graphs.
Application: Knowledge-intensive tasks that might benefit from knowledge graphs or entities, such as entity recognition.
Corpus: English Wikipedia + Wikidata for entities (note that the model is initialized with the original BERT parameter values).
Flamingo
Family: Chinchilla | Architecture: Decoder | Task: Log-likelihood of text given some visual input | Date: 04/2022 | Params: 80B (largest) | Lab: DeepMind
Extension: Uses a frozen textual language model (like Chinchilla) conditioned on a visual representation, which is encoded from a Normalizer-Free ResNet.
Application: Text to image.
Corpus: MultiModal MassiveWeb (M3W): 185 million images and 182 GB of text, plus a number of text-image paired datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average).
Flan-T5
Family: T5 | Architecture: Encoder/Decoder | Task: Fine-tuned on instructions for zero-shot and few-shot tasks | Date: 11/2022 | Params: 11B (XXL) | Lab: Google
Extension: Flan-T5 is generated by "Flan finetuning" the T5 models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data.
Application: The primary use is to understand how to improve large language models with the right kind of instruction fine-tuning. The focus is research on zero-shot and in-context few-shot learning NLP tasks, such as reasoning and question answering; advancing fairness and safety research; and understanding limitations of current large language models.
Corpus: Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT.
License: Open, Apache-2.0
Flan-PaLM
Family: PaLM | Architecture: Decoder | Task: Fine-tuned on instructions for zero-shot and few-shot tasks | Date: 11/2022 | Params: 540B (largest) | Lab: Google
Extension: Flan-PaLM is generated by "Flan finetuning" the PaLM models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data.
Application: Same as Flan-T5. The goal is to show that Flan finetuning can improve even the largest Google LMs (+9.4% average improvement across tasks), with improvements on chain of thought, self-consistency, multilingual tasks, and arithmetic reasoning.
Corpus: Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT.
License: Limited
Galactica
Family: Transformer | Architecture: Decoder | Task: LM for the scientific domain | Date: 11/2022 | Params: 120B (huge) | Lab: Meta
Extension: Transformer-based architecture in a decoder-only setup with a few modifications. Data extensions include special tokens for working memory, citations, genetic data, and a few other biology-related tasks.
Application: The models are designed to perform scientific tasks, including but not limited to citation prediction, scientific QA, mathematical reasoning, summarization, document generation, molecular property prediction, and entity extraction.
Corpus: Trained on 106 billion tokens of open-access scientific text and data, including papers, textbooks, scientific websites, encyclopedias, reference material, knowledge bases, and more.
License: Limited, non-commercial CC BY-NC 4.0 license
Gato
Family: "Control Transformers" (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Architecture: Decoder | Task: MLM (where tokens are either text or agent actions) | Date: 05/2022 | Params: 1.2B | Lab: DeepMind
Extension: The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus position encodings that add spatial information when applicable.
Application: Gato presents a generalizable agent that can be used beyond text for tasks such as playing Atari or controlling a robot arm.
Corpus: 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot).
GLaM
Family: Transformer | Architecture: Decoder | Task: LM | Date: 12/2021 | Params: 1.2T across 64 experts, but only 96B activated for inference | Lab: Google
Extension: GLaM introduces a mixture of 64 experts to increase parameter count and generalization properties in an otherwise fairly standard decoder-only transformer architecture. Only two experts are activated at a time per token, which also makes the model more efficient in training and inference (see the routing sketch below).
Application: General language modeling.
Corpus: 1.6T tokens including web pages filtered by Wikipedia and books for quality.
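The top-2 expert routing described above can be sketched in a few lines: a router scores each token against all experts, the two highest-scoring experts process the token, and their outputs are mixed by the normalized gate weights. The layer sizes, the gating details, and the absence of load-balancing losses are assumptions for illustration, not the published GLaM implementation.

```python
# An illustrative top-2 mixture-of-experts layer of the kind GLaM describes.
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its two highest-scoring experts.
        gate_logits = self.router(x)
        weights, indices = gate_logits.topk(2, dim=-1)   # (tokens, 2)
        weights = weights.softmax(dim=-1)                # mix the two experts
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top2MoE()
mixed = moe(torch.randn(16, 64))   # 16 tokens, each processed by only 2 of 8 experts
```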
GLIDE
Family: Diffusion models | Architecture: Encoder | Task: Caption prediction | Date: 12/2021 | Params: 3.5B diffusion model (2.3B for visual encoding, 1.2B for textual) + a 1.5B model for upsampling | Lab: OpenAI
Extension: GLIDE can be seen as an extension of the ADM (Ablated Diffusion Model) by the same authors. However, ADM is not per se a transformer architecture, although it does resemble one in some of the configurations the authors use. Given that ADM is by the same authors and was quickly followed up by GLIDE, I think it is fair to consider GLIDE as the first of its kind.
Application: Text to image.
Corpus: Same as DALL-E.
GLM
Family: GLM (General Language Model) | Architecture: Encoder and decoder | Task: Autoregressive blank infilling | Date: 03/2022 | Params: 130B | Lab: Tsinghua
Extension: GLM has a bidirectional encoder and a unidirectional decoder in a unified model.
Application: A general language model pretrained with an autoregressive blank-filling objective that can be finetuned on various natural language understanding and generation tasks.
Corpus: Pile, GLM-130B Chinese corpora, P3, DeepStruct finetuning dataset.
License: Open, MIT license
Global Context ViT
Family: ViT | Architecture: Encoder | Task: Image classification | Date: 06/2022 | Params: 90M | Lab: NVidia
Extension: Hierarchical ViT architecture consisting of local and global self-attention modules.
Application: Image tasks (object detection, image classification, ...).
Corpus: ImageNet-1K and other task-dependent datasets.
Gopher
Family: GPT | Architecture: Decoder | Task: LM | Date: 12/2021 | Params: 280B | Lab: DeepMind
Extension: Same as GPT-2 but uses RMSNorm instead of LayerNorm and relative positional encoding rather than absolute.
Application: Mostly language modeling and NLU, but also extensible like GPT.
Corpus: MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, GitHub, News, C4, and Wikipedia).
GopherCite
Family: Gopher | Architecture: Decoder | Task: LM | Date: 03/2022 | Params: 280B | Lab: DeepMind
Extension: GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn not only whether a response is plausible but also whether it is supported.
Application: Dialog systems, Q&A, general language generation tasks.
Corpus: Same as Gopher, plus a specific dataset generated in the RLHP process.
GPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 06/2018 | Params: 117M | Lab: OpenAI
Application: Text generation, but adaptable to many other NLP tasks when fine-tuned.
Corpus: Unsupervised pretraining on the BookCorpus dataset; supervised finetuning on several task-specific datasets including SNLI, RACE, and Quora, among others.
GPT-2
Family: GPT | Architecture: Decoder | Task: LM | Date: 02/2019 | Params: 1.5B | Lab: OpenAI
Extension: Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, and context size increased from 512 to 1024 tokens).
Application: Text generation, but adaptable to many other NLP tasks when fine-tuned (see the generation sketch below).
Corpus: 8 million web pages (40 GB), roughly 10x GPT. The WebText dataset was created by crawling all links posted on Reddit with at least 3 karma.
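Decoder-only models like GPT-2 (and most of the GPT descendants in this catalog) are used through autoregressive sampling. A minimal sketch, assuming the Hugging Face `transformers` package and the public `gpt2` checkpoint; sampling parameters are arbitrary illustrative choices.

```python
# A minimal sketch of autoregressive text generation with the public GPT-2
# checkpoint, assuming the Hugging Face `transformers` library is installed.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The transformer architecture", return_tensors="pt")
# Sample up to 30 new tokens with nucleus (top-p) sampling.
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```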
GPT-3
Family: GPT | Architecture: Decoder | Task: LM | Date: 05/2020 | Params: 175B | Lab: OpenAI
Extension: Same as GPT-2, with the only addition being alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer.
Application: Initially text generation, but it has over time been used for a large range of applications in areas such as code generation, as well as image and audio generation.
Corpus: ~500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B).
GPT-3.5
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2022 | Params: 175B | Lab: OpenAI
Extension: The GPT-3.5 series includes a number of models, like Davinci-003, which are essentially versions of the InstructGPT model.
Application: Dialog and general language tasks; there is also a code-specific model.
Corpus: Same as InstructGPT.
GPT-J
Family: GPT | Architecture: Decoder | Task: LM | Date: 05/2021 | Params: 6B | Lab: EleutherAI
Extension: GPT-J 6B is a transformer model trained using Mesh Transformer JAX and the same tokenizer as GPT-2/3.
Application: Same as GPT-3.
Corpus: The Pile, a large-scale curated dataset created by EleutherAI.
License: Open, Apache-2.0
GPT-Neo
Family: GPT | Architecture: Decoder | Task: LM | Date: 03/2021 | Params: 1.5B, 2.7B (XL) | Lab: EleutherAI
Extension: Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens.
Application: Text generation, but adaptable to many other NLP tasks when fine-tuned.
Corpus: The Pile, an 840 GB open-source text dataset that combines 22 pre-existing datasets.
GPT-NeoX-20B
Family: GPT | Architecture: Decoder | Task: LM | Date: 04/2022 | Params: 20B | Lab: EleutherAI
Extension: Similar to GPT-3, with rotary encoders instead of positional embeddings, parallel attention and feed-forward layers, a different initialization, and all dense layers instead of alternating dense/sparse layers.
Application: Same as GPT-3.
Corpus: The Pile (22 data sources).
InstructGPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 01/2022 | Params: Same as GPT-3 | Lab: OpenAI
Extension: InstructGPT starts from a pretrained GPT-3 model and adds reward modeling through reinforcement learning after a supervised finetuning step.
Application: Knowledge-intensive dialog or language tasks.
Corpus: Same as GPT-3 for pretraining, but finetuned and optimized using labeler data and prompts.
InstructOR
Family: T5 | Architecture: Encoder/Decoder | Task: Fine-tuned with a wide variety of instruction-based text-to-text tasks | Date: 12/2022 | Params: 330M | Lab: University of Hong Kong, University of Washington, Meta AI
Extension: Fine-tunes T5 explicitly to optimize the encoder to produce a general-purpose text string embedding useful for many NLU tasks.
Application: Any NLU task requiring a single text string embedding. As of April 2023, InstructOR is the top-ranked system on the Massive Text Embedding Benchmark (MTEB).
Corpus: Finetuned on MEDI.
License: Open, Apache-2.0
HTML
Family: BART | Architecture: Encoder/Decoder | Task: DAE | Date: 07/2021 | Params: 400M | Lab: Facebook
Extension: As opposed to BART, it does not do sentence shuffling.
Application: General-purpose language model that allows structured HTML prompting.
Corpus: 23 TB of simplified HTML extracted from CommonCrawl.
Imagen
Family: T5, CLIP, Diffusion models | Architecture: T5 (or CLIP or BERT) as a frozen text encoder + a U-Net architecture for the cascaded text-to-image diffusion models | Task: Image/text pair prediction | Date: 06/2022 | Params: 2B | Lab: Google
Extension: Imagen adds a few extensions to the U-Net diffusion architecture (pooled embedding vector, cross-attention over text embeddings, and layer normalizations).
Application: Text to image.
Corpus: A combination of internal datasets, with ≈460M image-text pairs, and the publicly available LAION dataset, with ≈400M image-text pairs.
Jurassic-1
Family: GPT | Architecture: Decoder | Task: LM | Date: 09/2021 | Params: 178B (Jumbo), 7.5B (Large) | Lab: AI21
Extension: Very similar to GPT-3, but with far more parameters and improved training efficiency, mostly thanks to an improved tokenizer. It also uses a different depth-to-width ratio.
Application: Similar to GPT-3.
Corpus: 300B tokens (same as GPT-3).
LaMDA
Family: Transformer | Architecture: Decoder | Task: LM | Date: 01/2022 | Params: 137B | Lab: Google
Extension: LaMDA focuses on how to improve safety, quality, and groundedness using different fine-tuning strategies.
Application: General language modeling.
Corpus: 1.56T words from public dialog data and other public web documents.
LLaMA
Family: Transformer | Architecture: Decoder | Task: LM | Date: 02/2023 | Params: 65B | Lab: Meta
Extension: LLaMA uses a transformer architecture with several extensions: pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through an efficient implementation of causal multi-head attention, checkpointing to reduce the amount of activations recomputed during the backward pass, and model and sequence parallelism to reduce memory usage. It is trained on 1.4T BPE tokens after tokenization.
Application: Zero- and few-shot commonsense reasoning, question answering, code generation, and reading comprehension.
Corpus: English CommonCrawl + C4 + GitHub + Wikipedia + Gutenberg and Books3 + ArXiv + Stack Exchange.
License: Limited, non-commercial bespoke license
mBART
Family: BART | Architecture: Encoder/Decoder | Task: DAE | Date: 01/2020 | Params: Same as BART | Lab: Facebook
Application: Translation.
Corpus: CC25 corpus, which includes 25 monolingual corpora in different languages; the largest are English (300 GB) and Russian (280 GB).
Megatron
Family: GPT/BERT/T5 | Architecture: Encoder or decoder, depending on the base model | Task: Same as the base model | Date: 03/2020 | Params: 8.3B (GPT-like), 3.9B (BERT-like) | Lab: NVidia
Extension: Megatron is a family of models that extend previously known architectures (namely GPT-2 and BERT originally, but also T5 more recently) by introducing model parallelism primitives. In the case of BERT, the authors also replace the next sentence prediction head with sentence order prediction and use whole-word n-gram masking.
Application: Same as the base model.
Corpus: The original paper uses an aggregate dataset consisting of Wikipedia, CC-Stories, RealNews, and OpenWebText.
Minerva
Family: PaLM | Architecture: Decoder | Task: LM | Date: 06/2022 | Params: 540B | Lab: Google
Extension: Extends PaLM by fine-tuning it on mathematical datasets.
Application: Mathematical reasoning.
Corpus: Same as PaLM + a 118 GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats.
MT-NLG (Megatron-Turing NLG)
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2021 | Params: 530B | Lab: NVidia
Extension: Uses parallelization similar to Megatron to train an LM double the size of GPT-3.
Application: Language generation and others (similar to GPT-3).
Corpus: The Pile (800 GB dataset) + 2 Common Crawl snapshots.
OpenAssistant LLaMA
Family: LLaMA | Architecture: Decoder | Task: Supervised fine-tuning on crowdsourced conversation/assistant data | Date: 04/2023 | Params: 30B | Lab: Various open-source contributors
Application: Same as ChatGPT, but open source. Compared to alternatives, it uses human-generated conversation data.
Corpus: Conversations collected by volunteers, available at https://huggingface.co/datasets/OpenAssistant/oasst1
License: Limited, non-commercial bespoke license. There is also a version based on Pythia which is Apache licensed.
OPT
Family: GPT-3 | Architecture: Decoder | Task: LM | Date: 05/2022 | Params: 175B (and other smaller versions) | Lab: Facebook
Extension: Basically the same architecture as GPT-3, but with some training improvements introduced in Megatron-LM.
Application: Same as GPT-3.
Corpus: 180B tokens = RoBERTa + The Pile + PushShift.io Reddit.
PaLM
Family: Transformer | Architecture: Decoder | Task: LM | Date: 04/2022 | Params: 540B | Lab: Google
Extension: PaLM uses a typical decoder-only transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, shared input-output embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data.
Application: Language understanding and generation.
Corpus: 780B tokens from filtered webpages, books, Wikipedia, news articles, source code, and social media conversations. Code includes 24 programming languages.
Pegasus
Architecture: Encoder/Decoder | Task: DAE (more concretely GSG) and MLM | Date: 12/2019 | Params: Base = 223M, Large = 568M | Lab: UCL/Google
Extension: Extends the vanilla Transformer by using a different pretraining task (GSG: Gap Sentence Generation) that is better suited for summarization.
Application: Summarization.
Corpus: C4 (750 GB) + HugeNews (3.8 TB).
RoBERTa
Family: BERT | Architecture: Encoder | Task: MLM (dynamic) | Date: 07/2019 | Params: 356M | Lab: UW/Facebook
Extension: Extension of BERT with an optimized training procedure and more data.
Application: Same as BERT.
Corpus: Same as BERT + CC-News + OpenWebText + Stories (~33B tokens).
SeeKer
Family: GPT (but can extend any family) | Architecture: Encoder/decoder or decoder only, depending on the base model it extends | Date: 03/2022 | Params: Depends on the base model | Lab: Facebook
Extension: SeeKer is an extension that can be applied to any transformer architecture by introducing "search", "knowledge", and "response" modules that are introduced during pretraining.
Application: Same as the base models.
Corpus: Same as the base models.
Sparrow
Family: GPT | Architecture: Decoder | Task: LM | Date: 09/2022 | Params: 70B | Lab: DeepMind
Extension: Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning from Human Feedback). It also adds inline evidence, a la GopherCite.
Application: Dialog agents and general language generation applications like Q&A.
Corpus: Same as Chinchilla + interactive data gathered with human annotators during the RLHF process.
Stable Diffusion
Family: Diffusion | Architecture: Encoder/Decoder | Task: Caption prediction | Date: 12/2021 | Params: 890M (although there are different, smaller variants) | Lab: LMU Munich + Stability.ai + Eleuther.ai
Extension: Stable Diffusion is basically the latent diffusion model developed by LMU Munich researchers, plus some learnings on conditional diffusion from DALL-E and Imagen.
Application: Text to image.
Corpus: LAION-5B, a publicly available dataset derived from Common Crawl.
Swin Transformer
Family: ViT | Architecture: Encoder | Task: Same as ViT | Date: 03/2021 | Params: 29M-197M | Lab: Microsoft
Extension: Extends ViT by replacing the standard multi-head self-attention (MSA) module with a module based on shifted windows (Swin), allowing ViT-like architectures to generalize to higher-resolution images.
Application: Image tasks (object detection, image classification, ...).
Corpus: ImageNet and ImageNet-22k.
Switch
Family: T5 | Architecture: Encoder/Decoder | Task: DAE | Date: 01/2021 | Params: 1T | Lab: Google
Extension: The goal is to increase parameter count while keeping FLOPs constant by using efficient routing of a MoE (Mixture of Experts).
Application: General language tasks (e.g. question answering).
Corpus: Colossal Clean Crawled Corpus (C4).
T0
Family: T5 | Architecture: Encoder/Decoder | Task: Fine-tuned with natural language prompts | Date: 03/2022 | Params: 11B (largest) | Lab: BigScience
Extension: T0 stands for "T5 for Zero Shot", obtained by fine-tuning the T5 model on a multitask mixture covering many different NLP tasks. Compared with T0, T0p and T0pp were fine-tuned with more datasets. T0pp is recommended, as it leads (on average) to the best performance on a variety of NLP tasks.
Application: Perform zero-shot inference tasks by specifying the query in natural language, and the models will generate a prediction.
Corpus: T0 (multiple-choice QA, extractive QA, closed-book QA, structure-to-text, sentiment, summarization, topic classification, paraphrase identification); T0p (same as T0, with additional datasets from GPT-3's evaluation suite); T0pp (same as T0p, with additional datasets from SuperGLUE, excluding NLI sets).
License: Open, Apache-2.0
T5
Family: Transformer | Architecture: Encoder/Decoder | Task: DAE | Date: 10/2019 | Params: Up to 11B | Lab: Google
Extension: Same as the original Transformer with some additions, such as relative positional embeddings like Transformer XL.
Application: General language tasks including machine translation, question answering, abstractive summarization, and text classification.
Corpus: Colossal Clean Crawled Corpus (C4), a cleaned-up version of the Common Crawl dataset (750 GB).
Trajectory Transformers
Family: GPT, "Control Transformers" (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Architecture: Decoder | Task: Predict the most likely sequence | Date: 06/2021 | Params: Smaller architecture than GPT | Lab: UC Berkeley
Extension: Similarly to the Decision Transformers, the main extension introduced by Trajectory Transformers is a way to encode a trajectory (states, actions, rewards).
Application: General RL (reinforcement learning) tasks.
Corpus: D4RL dataset and other RL datasets, depending on the task at hand.
Transformer XL
Family: Transformer | Architecture: Decoder | Task: LM | Date: 01/2019 | Params: 151M | Lab: CMU/Google
Extension: Relative positional embeddings enable longer-context attention compared to the vanilla Transformer model.
Application: General language tasks.
Corpus: Different training datasets depending on the experiment, but the baseline is WikiText-103.
Turing-NLG
Family: GPT | Architecture: Decoder | Task: LM | Date: 02/2020 | Params: 17B originally, up to 530B more recently | Lab: Microsoft
Extension: Optimized version of GPT-2 with optimal hyperparameters and a software/hardware platform to improve training.
Application: Same as GPT-2/3.
Corpus: Highest-quality subset of The Pile + 2 CC snapshots (339B tokens).
UL2
Family: Transformer | Architecture: Encoder/Decoder | Task: Mixture-of-Denoisers, which combines diverse pre-training paradigms | Date: 05/2022 | Params: 20B | Lab: Google
Extension: UL2-20B (Unifying Language Learning) can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
Application: A unified framework for pre-training models that are universally effective across datasets and setups.
Corpus: 1 trillion tokens on C4.
License: Open, Apache-2.0
Vicuna
Family: LLaMA | Architecture: Decoder | Task: Human instructions | Date: 03/2023 | Params: 13B | Lab: UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI
Extension: LLaMA fine-tuned on user-shared conversations collected from ShareGPT.
Application: Same as ChatGPT.
Corpus: Conversations collected from ShareGPT.
License: Limited, non-commercial bespoke license
ViT
Family: BERT | Architecture: Encoder | Task: Image classification | Date: 10/2020 | Params: 86M (Base) to 632M (Huge) | Lab: Google
Extension: Extension of the BERT architecture to train on patches of images (see the sketch below).
Application: Image classification.
Corpus: From standard ImageNet up to JFT-300M (a large in-house dataset).
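The patch-based input described above can be sketched directly: the image is cut into fixed-size patches, each patch is flattened, and a linear projection turns it into a token embedding for the encoder. The patch size and embedding width below are arbitrary illustrative choices, not the published ViT configuration.

```python
# An illustrative sketch of ViT-style patch embedding: image -> patch tokens.
import torch
import torch.nn as nn

patch, channels, d_model = 16, 3, 64
image = torch.randn(1, channels, 224, 224)            # (batch, C, H, W)

# unfold extracts non-overlapping patch x patch blocks and flattens each one.
patches = nn.functional.unfold(image, kernel_size=patch, stride=patch)
patches = patches.transpose(1, 2)                      # (1, num_patches, C*patch*patch)

# A linear projection maps each flattened patch to a d_model-dimensional token.
project = nn.Linear(channels * patch * patch, d_model)
tokens = project(patches)                              # (1, 196, d_model)
print(tokens.shape)
```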
Wu Dao 2.0
Family: GLM (General Language Model) | Architecture: Decoder | Task: Autoregressive blank infilling | Date: 06/2021 | Params: 1.75T | Lab: Beijing Academy of Artificial Intelligence
Extension: Similar to GPT in that it uses a decoder/autoregressive architecture, but applies a different pretraining task proposed in the GLM family of models. Besides, Wu Dao uses a "Fast Mixture of Experts" approach to scale training to trillions of parameters.
Application: Language and multimodal (particularly image) tasks.
Corpus: N/A.
XLM-RoBERTa
Family: RoBERTa | Architecture: Encoder | Task: MLM (dynamic) | Date: 10/2019 | Params: Base = 270M, Large = 550M | Lab: Facebook
Extension: An extension of RoBERTa that introduces small parameter-tuning insights in the context of multilingual applications.
Application: Translation and other cross-lingual language tasks.
Corpus: Cleaned Common Crawl in 100 languages.
XLNet
Family: Transformer XL | Architecture: Decoder | Task: PLM | Date: 05/2019 | Params: Base = 117M, Large = 360M | Lab: CMU/Google
Extension: This model basically adapts the Transformer XL architecture to permutation-based language modeling.
Application: General language tasks.
Corpus: Same as BERT + Giga5 (16 GB of text) + aggressively filtered ClueWeb 2012-B (19 GB) and Common Crawl (110 GB).