Each entry below lists the model's family, pretraining architecture (encoder, decoder, or encoder/decoder), pretraining or fine-tuning task, extension over prior work, main application, date of first known publication, number of parameters, training corpus, license (where known), and lab.
ALBERT
Family: BERT | Architecture: Encoder | Task: MLM/NSP | Date: 09/2019 | Params: Base = 12M, Large = 18M, XLarge = 60M | Lab: Google
Extension: Compressed version of BERT using parameter sharing, which is much more efficient given the same number of parameters.
Application: Same as BERT.
Corpus: Same as BERT.
AlexaTM 20B
Family: Transformer | Architecture: Encoder/Decoder | Task: Denoising and prefix LM | Date: 08/2022 | Params: 20B | Lab: Amazon
Extension: Derived from BART, with layer norms located exactly at the beginning of each layer. The encoder is initialized with an internal 10B pre-trained encoder.
Application: Summarization, multilingual machine translation, and NLU tasks.
Corpus: Wikipedia and mC4 datasets in 12 languages.
Alpaca
Family: LLaMA | Architecture: Decoder | Task: LM | Date: 03/2023 | Params: 7B | Lab: Stanford
Extension: Alpaca is fine-tuned from a 7B LLaMA model.
Application: Evaluated on a variety of text generation and classification tasks.
Corpus: 52K instruction-following examples generated with the self-instruct mechanism from 175 human-written instruction-output pairs.
AlphaFold
Family: SE(3)-Transformer | Architecture: Encoder | Task: Protein folding prediction | Date: 07/2021 | Params: 21M | Lab: DeepMind
Extension: The original AlphaFold used a BERT-style transformer. The details of AlphaFold's transformer are not known, but it is believed to be an extension of the SE(3)-Transformer, a 3-D equivariant transformer.
Application: Protein folding.
Corpus: 170,000 proteins from a public repository of protein sequences and structures.
Anthropic Assistant
Family: GPT | Architecture: Decoder | Task: LM | Date: 12/2021 | Params: 10M to 52B | Lab: Anthropic
Extension: These models do not introduce novelties at the architecture/pretraining level; they are based on GPT-3 and instead focus on improving alignment through fine-tuning and prompting. Note that the Anthropic Assistant includes several models optimized for different tasks. The latest versions of this work focus on the benefits of RLHF.
Application: Different models with different applications, from general dialog to code assistant.
Corpus: 400B tokens from filtered Common Crawl and Books. They also create several dialogue preference datasets for the RLHF training.
BART
Family: BERT for the encoder, GPT for the decoder | Architecture: Encoder/Decoder | Task: DAE | Date: 10/2019 | Params: 10% more than BERT | Lab: Facebook
Extension: Can be seen as a generalization of BERT and GPT in that it combines ideas from both in the encoder and decoder.
Application: Mostly text generation, but also some text understanding tasks.
Corpus: Same as RoBERTa (160 GB of news, books, stories, and web text).
BERT
Family: BERT | Architecture: Encoder | Task: MLM/NSP | Date: 10/2018 | Params: Base = 110M, Large = 340M | Lab: Google
Application: General language understanding and question answering; many other language applications followed (a minimal usage sketch follows this entry).
Corpus: Toronto Book Corpus and Wikipedia (3.3B tokens).
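Most encoder models in this catalog expose BERT's fill-in-the-mask interface. The sketch below is a minimal illustration of MLM inference, assuming the Hugging Face `transformers` and `torch` packages and the public `bert-base-uncased` checkpoint; it is an illustration, not part of any model's official release.

```python
# A minimal sketch of masked-language-model inference, assuming the Hugging Face
# `transformers` and `torch` packages and the public "bert-base-uncased" checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = int(logits[0, mask_pos].argmax())
print(tokenizer.decode([predicted_id]))
```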
Big Bird
Architecture: Encoder and Encoder/Decoder (Big Bird is mostly a way to implement sparse attention, and it has been implemented in both encoder-only and encoder/decoder architectures) | Task: MLM | Date: 07/2020 | Params: Depends on the overall architecture | Lab: Google
Extension: Big Bird can extend other architectures such as BERT, Pegasus, or RoBERTa by using a sparse attention mechanism that eliminates the quadratic dependency on sequence length, making it more suitable for longer sequences (see the sketch below).
Application: Particularly well suited for longer sequences, not only in text but also e.g. in genomics.
Corpus: Books, CC-News, Stories, and Wikipedia.
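The sparse attention idea behind Big Bird can be illustrated with a toy attention mask that combines a sliding window, a few global tokens, and a handful of random links per row. The sketch below uses arbitrary sizes and is not the published Big Bird implementation; the point is that the number of allowed connections grows linearly with sequence length instead of quadratically.

```python
# A rough, illustrative Big Bird-style sparse attention mask (window + global
# + random connections). Sizes are arbitrary assumptions for the example.
import numpy as np

def bigbird_style_mask(seq_len: int = 16, window: int = 1,
                       n_global: int = 2, n_random: int = 2,
                       seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Local sliding window around each position.
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
        # A few random connections per row.
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True
    # Global tokens attend to, and are attended by, every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

print(bigbird_style_mask().astype(int))
```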
BlenderBot 3
Family: GPT | Architecture: Decoder | Task: LM | Date: 08/2022 | Params: 175B | Lab: Facebook
Extension: BlenderBot 3 is based on a pre-trained OPT. It adds features needed for a dialog agent, such as long-term memory and the ability to search the internet. It is also fine-tuned for some specific tasks given human feedback on them.
Application: Same as GPT-3.
Corpus: 180B tokens = RoBERTa + The Pile + PushShift.io Reddit.
BLOOM
Family: GPT | Architecture: Decoder | Task: LM | Date: 07/2022 | Params: 176B | Lab: BigScience/Hugging Face
Extension: The main difference from GPT-3 is that it uses full attention instead of sparse attention.
Application: Same as GPT-3.
Corpus: 366B tokens (1.5 TB of text data), multilingual dataset.
ChatGPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2022 | Params: Same as GPT-3 | Lab: OpenAI
Extension: ChatGPT takes a GPT-3.5 (a.k.a. GPT-3 Davinci-003) pretrained model and uses RLHF to fine-tune it much as described in InstructGPT, but with slight differences in data collection. ChatGPT is also more than a model, since it includes extensions for memory store and retrieval similar to BlenderBot 3.
Application: Dialog agents.
Corpus: Same as GPT-3 + datasets generated for RLHF.
Chinchilla
Family: GPT | Architecture: Decoder | Task: LM | Date: 03/2022 | Params: 70B | Lab: DeepMind
Extension: Same as Gopher but with optimizations to reduce model size, and therefore training/inference time, with equal or superior performance.
Application: Same as Gopher/GPT-3.
Corpus: MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, GitHub, News, C4, and Wikipedia).
CLIP
Family: CLIP (also using ResNet, ViT, and a vanilla transformer for text) | Architecture: Encoder | Task: Predict which of the N × N possible (image, text) pairings across a batch actually occurred | Date: 02/2021 | Lab: OpenAI
Extension: Combines ResNet and ViT for the visual encoding with a transformer for the textual encoding (see the contrastive-loss sketch below).
Application: Image/object classification.
Corpus: WIT (WebImageText), 400 million (text, image) pairs.
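The N × N pairing objective can be made concrete with a short sketch of the symmetric contrastive loss over a batch of image and text embeddings. The encoders are stand-ins here (random tensors in the example) and the temperature value is an arbitrary assumption; this illustrates the objective, not the released CLIP code.

```python
# A minimal sketch of a CLIP-style contrastive objective over a batch of N
# (image, text) pairs. Real encoders are replaced by random embeddings.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize embeddings so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # N x N similarity matrix: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.t() / temperature
    # The matching pair for row i is column i.
    labels = torch.arange(logits.size(0))
    # Symmetric cross-entropy over rows (image->text) and columns (text->image).
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Example with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```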
CM3
Family: HTML | Architecture: Decoder | Task: Causally-masked LM | Date: 01/2022 | Params: 13B (largest) | Lab: Facebook
Extension: Somewhat similar to HTML in its use of structured training data, but it is a different architecture and uses causal masking.
Application: Multimodal language model with the ability to do structured prompting.
Corpus: CC-News, English Wikipedia.
CTRL
Architecture: Decoder | Date: 09/2019 | Params: 1.63B | Lab: Salesforce
Extension: The model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior.
Application: Controllable text generation.
Corpus: 140 GB of text including Wikipedia (En, De, Es, Fr), Project Gutenberg, 45 subreddits, OpenWebText2, Amazon Reviews, Europarl and UN data from WMT, question-answer pairs from ELI5, and the MRQA shared task, which includes the Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions.
DALL-E
Family: GPT | Architecture: Decoder | Task: Caption prediction | Date: 01/2021 | Params: 12B | Lab: OpenAI
Extension: A discrete variational auto-encoder (dVAE) is used to learn the visual codebook. The transformer is a variation of GPT-3.
Application: Text to image.
Corpus: 250 million text-image pairs from the internet.
DALL-E 2
Family: CLIP, GLIDE | Architecture: Encoder/Decoder | Task: Caption prediction | Date: 04/2022 | Params: 3.5B | Lab: OpenAI
Extension: Combines a CLIP encoder and a diffusion decoder similar to GLIDE.
Application: Text to image.
Corpus: Combination of the DALL-E and CLIP datasets.
DeBERTa
Family: BERT | Architecture: Encoder | Task: MLM | Date: 06/2020 | Params: 750M (xlarge) | Lab: Microsoft
Extension: Uses a separate positional embedding vector, independent from the content embedding, with disentangled attention matrices for contents and relative positions.
Application: Same as BERT.
Corpus: English Wikipedia, BookCorpus, OpenWebText, and Stories.
License: Open, MIT license
Decision Transformers
Family: GPT, "Control Transformers" (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Architecture: Decoder | Task: Next action prediction | Date: 06/2021 | Params: Same as GPT | Lab: Google/UC Berkeley/FAIR
Extension: Decision Transformers use a GPT architecture and extend it by encoding trajectories in a way that they can be learned by an auto-regressive task (see the sketch below).
Application: General RL (reinforcement learning) tasks.
Corpus: Different corpora for different experiments.
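The trajectory encoding mentioned above can be illustrated with a toy sketch that computes returns-to-go and interleaves (return-to-go, state, action) triples into a single sequence for an autoregressive model to consume. The shapes, the tagging scheme, and the absence of an embedding step are simplifying assumptions, not the published implementation.

```python
# An illustrative sketch of a Decision Transformer-style input sequence: each
# timestep contributes a (return-to-go, state, action) triple.
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    # Reverse cumulative sum: reward still obtainable from each timestep onward.
    return np.cumsum(rewards[::-1])[::-1]

def interleave(states: np.ndarray, actions: np.ndarray, rewards: np.ndarray):
    rtg = returns_to_go(rewards)
    tokens = []
    for t in range(len(rewards)):
        tokens.extend([("rtg", rtg[t]), ("state", states[t]), ("action", actions[t])])
    return tokens

# Toy trajectory with 3 timesteps.
seq = interleave(np.arange(3), np.array([0, 1, 0]), np.array([1.0, 0.0, 2.0]))
print(seq)
```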
DialoGPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2019 | Params: 1.5B | Lab: Microsoft
Extension: GPT-2 architecture trained on dialog data.
Application: Text generation in dialog settings.
Corpus: 140M Reddit conversations.
DistilBERT
Family: BERT | Architecture: Encoder | Task: MLM/NSP | Date: 10/2019 | Params: 66M | Lab: Hugging Face
Extension: Compressed version of BERT using distillation, which is much more efficient given the same number of parameters.
Application: Same as BERT.
Corpus: Same as BERT.
DQ-BART
Family: BART | Architecture: Encoder/Decoder | Task: DAE | Date: 03/2022 | Params: Up to 30x reduction in parameters compared to standard BART | Lab: Amazon
Extension: Adds quantization and distillation to a BART model to improve performance and reduce model size.
Application: Text generation and understanding.
Corpus: CNN/DM, XSUM, ELI5, WMT16 En-Ro (~1M tokens).
Dolly
Family: GPT | Architecture: Decoder | Task: Fine-tuned on Q&A pairs to follow human instructions | Date: 03/2023 | Params: 6B | Lab: Databricks, Inc.
Extension: Fine-tuned from GPT-J-6B (V1) and Pythia (V2).
Application: Similar to Alpaca.
Corpus: V1: instruction corpus same as Alpaca; V2: Databricks' own dataset.
License: Open
E5
Family: BERT | Architecture: Encoder | Task: Fine-tuned on semantic similarity using a contrastive loss | Date: 12/2022 | Params: 300M (large version) | Lab: Microsoft
Extension: Fine-tunes BERT-based models to create text string embeddings optimized for semantic relatedness.
Application: Text embeddings for semantic relatedness tasks such as text clustering or search retrieval.
Corpus: MS-MARCO, NQ, NLI.
License: Open, MIT
ELECTRA
Architecture: Encoder | Task: RTD (replaced token detection) | Date: 03/2020 | Params: Base = 110M, Large = 330M | Lab: Stanford/Google
Application: Same as BERT.
Corpus: Same as BERT, except that the Large version uses the same corpus as XLNet.
ERNIE
Family: BERT | Architecture: Encoder | Task: MLM | Date: 05/2019 | Params: 114M | Lab: Various Chinese institutions
Extension: Uses BERT for the encoder architecture, but stacks and aggregates two of them for text and entities. This architecture can be understood as BERT for text plus knowledge graphs.
Application: Knowledge-intensive tasks that might benefit from knowledge graphs or entities, such as entity recognition.
Corpus: English Wikipedia + Wikidata for entities (note that the model is initialized with the original BERT parameter values).
Flamingo
Family: Chinchilla | Architecture: Decoder | Task: Log-likelihood of text given some visual input | Date: 04/2022 | Params: 80B (largest) | Lab: DeepMind
Extension: Uses a frozen textual language model (like Chinchilla) conditioned on a visual representation, which is encoded from a Normalizer-Free ResNet.
Application: Text to image.
Corpus: MultiModal MassiveWeb (M3W): 185 million images and 182 GB of text, plus a number of text-image paired datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average).
Flan-T5
Family: T5 | Architecture: Encoder/Decoder | Task: Fine-tuned on instructions for zero-shot and few-shot tasks | Date: 11/2022 | Params: 11B (XXL) | Lab: Google
Extension: Flan-T5 is generated by "Flan finetuning" the T5 models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data.
Application: The primary use is to understand how to improve large language models with the right kind of instruction fine-tuning. The focus is research on zero-shot and in-context few-shot learning NLP tasks, such as reasoning and question answering; advancing fairness and safety research; and understanding limitations of current large language models.
Corpus: Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT.
License: Open, Apache-2.0
Flan-PaLM
Family: PaLM | Architecture: Decoder | Task: Fine-tuned on instructions for zero-shot and few-shot tasks | Date: 11/2022 | Params: 540B (largest) | Lab: Google
Extension: Flan-PaLM is generated by "Flan finetuning" the PaLM models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data.
Application: Same as Flan-T5. The goal is to show that Flan finetuning can improve even the largest Google LMs (+9.4% average improvement across tasks), with improvements on chain of thought, self-consistency, multilingual tasks, and arithmetic reasoning.
Corpus: Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT.
License: Limited
Galactica
Family: Transformer | Architecture: Decoder | Task: LM for the scientific domain | Date: 11/2022 | Params: 120B (huge) | Lab: Meta
Extension: Transformer-based architecture in a decoder-only setup with a few modifications. Data extensions include special tokens for working memory, citations, genetic data, and a few other biology-related tasks.
Application: The models are designed to perform scientific tasks, including but not limited to citation prediction, scientific QA, mathematical reasoning, summarization, document generation, molecular property prediction, and entity extraction.
Corpus: Trained on 106 billion tokens of open-access scientific text and data, including papers, textbooks, scientific websites, encyclopedias, reference material, knowledge bases, and more.
License: Limited, non-commercial CC BY-NC 4.0 license
Gato
Family: "Control Transformers" (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Architecture: Decoder | Task: MLM (where tokens are either text or agent actions) | Date: 05/2022 | Params: 1.2B | Lab: DeepMind
Extension: The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus position encodings that add spatial information when applicable.
Application: Gato presents a generalizable agent that can be used beyond text for tasks such as playing Atari or controlling a robot arm.
Corpus: 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot).
GLaM
Family: Transformer | Architecture: Decoder | Task: LM | Date: 12/2021 | Params: 1.2T across 64 experts, but only 96B activated for inference | Lab: Google
Extension: GLaM introduces a mixture of 64 experts to increase parameter count and generalization properties in an otherwise fairly standard decoder-only transformer architecture. Only two experts are activated at a time per token, which also makes the model more efficient in training and inference (see the routing sketch below).
Application: General language modeling.
Corpus: 1.6T tokens including web pages filtered by Wikipedia and books for quality.
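The top-2 expert routing described above can be sketched in a few lines: a router scores each token against all experts, the two highest-scoring experts process the token, and their outputs are mixed by the normalized gate weights. The layer sizes, the gating details, and the absence of load-balancing losses are assumptions for illustration, not the published GLaM implementation.

```python
# An illustrative top-2 mixture-of-experts layer of the kind GLaM describes.
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model: int = 64, d_ff: int = 256, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its two highest-scoring experts.
        gate_logits = self.router(x)
        weights, indices = gate_logits.topk(2, dim=-1)   # (tokens, 2)
        weights = weights.softmax(dim=-1)                # mix the two experts
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top2MoE()
mixed = moe(torch.randn(16, 64))   # 16 tokens, each processed by only 2 of 8 experts
```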
GLIDE
Family: Diffusion models | Architecture: Encoder | Task: Caption prediction | Date: 12/2021 | Params: 3.5B diffusion model (2.3B for visual encoding, 1.2B for textual) + a 1.5B model for upsampling | Lab: OpenAI
Extension: GLIDE can be seen as an extension of the ADM (Ablated Diffusion Model) by the same authors. However, ADM is not per se a transformer architecture, although it does resemble one in some of the configurations the authors use. Given that ADM is by the same authors and was quickly followed up by GLIDE, I think it is fair to consider GLIDE as the first of its kind.
Application: Text to image.
Corpus: Same as DALL-E.
GLM
Family: GLM (General Language Model) | Architecture: Encoder and decoder | Task: Autoregressive blank infilling | Date: 03/2022 | Params: 130B | Lab: Tsinghua
Extension: GLM has a bidirectional encoder and a unidirectional decoder in a unified model.
Application: A general language model pretrained with an autoregressive blank-filling objective that can be finetuned on various natural language understanding and generation tasks.
Corpus: Pile, GLM-130B Chinese corpora, P3, DeepStruct finetuning dataset.
License: Open, MIT license
Global Context ViT
Family: ViT | Architecture: Encoder | Task: Image classification | Date: 06/2022 | Params: 90M | Lab: NVidia
Extension: Hierarchical ViT architecture consisting of local and global self-attention modules.
Application: Image tasks (object detection, image classification, ...).
Corpus: ImageNet-1K and other task-dependent datasets.
Gopher
Family: GPT | Architecture: Decoder | Task: LM | Date: 12/2021 | Params: 280B | Lab: DeepMind
Extension: Same as GPT-2 but uses RMSNorm instead of LayerNorm and relative positional encoding rather than absolute.
Application: Mostly language modeling and NLU, but also extensible like GPT.
Corpus: MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, GitHub, News, C4, and Wikipedia).
GopherCite
Family: Gopher | Architecture: Decoder | Task: LM | Date: 03/2022 | Params: 280B | Lab: DeepMind
Extension: GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn not only whether a response is plausible but also whether it is supported.
Application: Dialog systems, Q&A, general language generation tasks.
Corpus: Same as Gopher, plus a specific dataset generated in the RLHP process.
GPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 06/2018 | Params: 117M | Lab: OpenAI
Application: Text generation, but adaptable to many other NLP tasks when fine-tuned.
Corpus: Unsupervised pretraining on the BookCorpus dataset; supervised finetuning on several task-specific datasets including SNLI, RACE, and Quora, among others.
GPT-2
Family: GPT | Architecture: Decoder | Task: LM | Date: 02/2019 | Params: 1.5B | Lab: OpenAI
Extension: Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, and context size increased from 512 to 1024 tokens).
Application: Text generation, but adaptable to many other NLP tasks when fine-tuned (see the generation sketch below).
Corpus: 8 million web pages (40 GB), roughly 10x GPT. The WebText dataset was created by crawling all links posted on Reddit with at least 3 karma.
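Decoder-only models like GPT-2 (and most of the GPT descendants in this catalog) are used through autoregressive sampling. A minimal sketch, assuming the Hugging Face `transformers` package and the public `gpt2` checkpoint; sampling parameters are arbitrary illustrative choices.

```python
# A minimal sketch of autoregressive text generation with the public GPT-2
# checkpoint, assuming the Hugging Face `transformers` library is installed.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The transformer architecture", return_tensors="pt")
# Sample up to 30 new tokens with nucleus (top-p) sampling.
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```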
GPT-3
Family: GPT | Architecture: Decoder | Task: LM | Date: 05/2020 | Params: 175B | Lab: OpenAI
Extension: Same as GPT-2, with the only addition being alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer.
Application: Initially text generation, but it has over time been used for a large range of applications in areas such as code generation, as well as image and audio generation.
Corpus: ~500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B).
GPT-3.5
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2022 | Params: 175B | Lab: OpenAI
Extension: The GPT-3.5 series includes a number of models, like Davinci-003, which are essentially versions of the InstructGPT model.
Application: Dialog and general language tasks; there is also a code-specific model.
Corpus: Same as InstructGPT.
GPT-J
Family: GPT | Architecture: Decoder | Task: LM | Date: 05/2021 | Params: 6B | Lab: EleutherAI
Extension: GPT-J 6B is a transformer model trained using Mesh Transformer JAX and the same tokenizer as GPT-2/3.
Application: Same as GPT-3.
Corpus: The Pile, a large-scale curated dataset created by EleutherAI.
License: Open, Apache-2.0
GPT-Neo
Family: GPT | Architecture: Decoder | Task: LM | Date: 03/2021 | Params: 1.5B, 2.7B (XL) | Lab: EleutherAI
Extension: Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens.
Application: Text generation, but adaptable to many other NLP tasks when fine-tuned.
Corpus: The Pile, an 840 GB open-source text dataset that combines 22 pre-existing datasets.
GPT-NeoX-20B
Family: GPT | Architecture: Decoder | Task: LM | Date: 04/2022 | Params: 20B | Lab: EleutherAI
Extension: Similar to GPT-3, with rotary encoders instead of positional embeddings, parallel attention and feed-forward layers, a different initialization, and all dense layers instead of alternating dense/sparse layers.
Application: Same as GPT-3.
Corpus: The Pile (22 data sources).
InstructGPT
Family: GPT | Architecture: Decoder | Task: LM | Date: 01/2022 | Params: Same as GPT-3 | Lab: OpenAI
Extension: InstructGPT starts from a pretrained GPT-3 model and adds reward modeling through reinforcement learning after a supervised finetuning step.
Application: Knowledge-intensive dialog or language tasks.
Corpus: Same as GPT-3 for pretraining, but finetuned and optimized using labeler data and prompts.
InstructOR
Family: T5 | Architecture: Encoder/Decoder | Task: Fine-tuned with a wide variety of instruction-based text-to-text tasks | Date: 12/2022 | Params: 330M | Lab: University of Hong Kong, University of Washington, Meta AI
Extension: Fine-tunes T5 explicitly to optimize the encoder to produce a general-purpose text string embedding useful for many NLU tasks.
Application: Any NLU task requiring a single text string embedding. As of April 2023, InstructOR is the top-ranked system on the Massive Text Embedding Benchmark (MTEB).
Corpus: Finetuned on MEDI.
License: Open, Apache-2.0
HTML
Family: BART | Architecture: Encoder/Decoder | Task: DAE | Date: 07/2021 | Params: 400M | Lab: Facebook
Extension: As opposed to BART, it does not do sentence shuffling.
Application: General-purpose language model that allows structured HTML prompting.
Corpus: 23 TB of simplified HTML extracted from CommonCrawl.
Imagen
Family: T5, CLIP, Diffusion models | Architecture: T5 (or CLIP or BERT) as a frozen text encoder + a U-Net architecture for the cascaded text-to-image diffusion models | Task: Image/text pair prediction | Date: 06/2022 | Params: 2B | Lab: Google
Extension: Imagen adds a few extensions to the U-Net diffusion architecture (pooled embedding vector, cross-attention over text embeddings, and layer normalizations).
Application: Text to image.
Corpus: A combination of internal datasets, with ≈460M image-text pairs, and the publicly available LAION dataset, with ≈400M image-text pairs.
Jurassic-1
Family: GPT | Architecture: Decoder | Task: LM | Date: 09/2021 | Params: 178B (Jumbo), 7.5B (Large) | Lab: AI21
Extension: Very similar to GPT-3, but with far more parameters and improved training efficiency, mostly thanks to an improved tokenizer. It also uses a different depth-to-width ratio.
Application: Similar to GPT-3.
Corpus: 300B tokens (same as GPT-3).
LaMDA
Family: Transformer | Architecture: Decoder | Task: LM | Date: 01/2022 | Params: 137B | Lab: Google
Extension: LaMDA focuses on how to improve safety, quality, and groundedness using different fine-tuning strategies.
Application: General language modeling.
Corpus: 1.56T words from public dialog data and other public web documents.
LLaMA
Family: Transformer | Architecture: Decoder | Task: LM | Date: 02/2023 | Params: 65B | Lab: Meta
Extension: LLaMA uses a transformer architecture with several extensions: pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through an efficient implementation of causal multi-head attention, checkpointing to reduce the amount of activations recomputed during the backward pass, and model and sequence parallelism to reduce memory usage. It is trained on 1.4T BPE tokens after tokenization.
Application: Zero- and few-shot commonsense reasoning, question answering, code generation, and reading comprehension.
Corpus: English CommonCrawl + C4 + GitHub + Wikipedia + Gutenberg and Books3 + ArXiv + Stack Exchange.
License: Limited, non-commercial bespoke license
mBART
Family: BART | Architecture: Encoder/Decoder | Task: DAE | Date: 01/2020 | Params: Same as BART | Lab: Facebook
Application: Translation.
Corpus: CC25 corpus, which includes 25 monolingual corpora in different languages; the largest are English (300 GB) and Russian (280 GB).
Megatron
Family: GPT/BERT/T5 | Architecture: Encoder or decoder, depending on the base model | Task: Same as the base model | Date: 03/2020 | Params: 8.3B (GPT-like), 3.9B (BERT-like) | Lab: NVidia
Extension: Megatron is a family of models that extend previously known architectures (namely GPT-2 and BERT originally, but also T5 more recently) by introducing model parallelism primitives. In the case of BERT, the authors also replace the next sentence prediction head with sentence order prediction and use whole-word n-gram masking.
Application: Same as the base model.
Corpus: The original paper uses an aggregate dataset consisting of Wikipedia, CC-Stories, RealNews, and OpenWebText.
Minerva
Family: PaLM | Architecture: Decoder | Task: LM | Date: 06/2022 | Params: 540B | Lab: Google
Extension: Extends PaLM by fine-tuning it on mathematical datasets.
Application: Mathematical reasoning.
Corpus: Same as PaLM + a 118 GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats.
MT-NLG (Megatron-Turing NLG)
Family: GPT | Architecture: Decoder | Task: LM | Date: 10/2021 | Params: 530B | Lab: NVidia
Extension: Uses parallelization similar to Megatron to train an LM double the size of GPT-3.
Application: Language generation and others (similar to GPT-3).
Corpus: The Pile (800 GB dataset) + 2 Common Crawl snapshots.
OpenAssistant LLaMA
Family: LLaMA | Architecture: Decoder | Task: Supervised fine-tuning on crowdsourced conversation/assistant data | Date: 04/2023 | Params: 30B | Lab: Various open-source contributors
Application: Same as ChatGPT, but open source. Compared to alternatives, it uses human-generated conversation data.
Corpus: Conversations collected by volunteers, available at https://huggingface.co/datasets/OpenAssistant/oasst1
License: Limited, non-commercial bespoke license. There is also a version based on Pythia which is Apache licensed.
OPT
Family: GPT-3 | Architecture: Decoder | Task: LM | Date: 05/2022 | Params: 175B (and other smaller versions) | Lab: Facebook
Extension: Basically the same architecture as GPT-3, but with some training improvements introduced in Megatron-LM.
Application: Same as GPT-3.
Corpus: 180B tokens = RoBERTa + The Pile + PushShift.io Reddit.
PaLM
Family: Transformer | Architecture: Decoder | Task: LM | Date: 04/2022 | Params: 540B | Lab: Google
Extension: PaLM uses a typical decoder-only transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, shared input-output embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data.
Application: Language understanding and generation.
Corpus: 780B tokens from filtered webpages, books, Wikipedia, news articles, source code, and social media conversations. Code includes 24 programming languages.
Pegasus
Architecture: Encoder/Decoder | Task: DAE (more concretely GSG) and MLM | Date: 12/2019 | Params: Base = 223M, Large = 568M | Lab: UCL/Google
Extension: Extends the vanilla Transformer by using a different pretraining task (GSG: Gap Sentence Generation) that is better suited for summarization.
Application: Summarization.
Corpus: C4 (750 GB) + HugeNews (3.8 TB).
RoBERTa
Family: BERT | Architecture: Encoder | Task: MLM (dynamic) | Date: 07/2019 | Params: 356M | Lab: UW/Facebook
Extension: Extension of BERT with an optimized training procedure and more data.
Application: Same as BERT.
Corpus: Same as BERT + CC-News + OpenWebText + Stories (~33B tokens).
SeeKer
Family: GPT (but can extend any family) | Architecture: Encoder/decoder or decoder only, depending on the base model it extends | Date: 03/2022 | Params: Depends on the base model | Lab: Facebook
Extension: SeeKer is an extension that can be applied to any transformer architecture by introducing "search", "knowledge", and "response" modules that are introduced during pretraining.
Application: Same as the base models.
Corpus: Same as the base models.
Sparrow
Family: GPT | Architecture: Decoder | Task: LM | Date: 09/2022 | Params: 70B | Lab: DeepMind
Extension: Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning from Human Feedback). It also adds inline evidence, a la GopherCite.
Application: Dialog agents and general language generation applications like Q&A.
Corpus: Same as Chinchilla + interactive data gathered with human annotators during the RLHF process.
Stable Diffusion
Family: Diffusion | Architecture: Encoder/Decoder | Task: Caption prediction | Date: 12/2021 | Params: 890M (although there are different, smaller variants) | Lab: LMU Munich + Stability.ai + Eleuther.ai
Extension: Stable Diffusion is basically the latent diffusion model developed by LMU Munich researchers, plus some learnings on conditional diffusion from DALL-E and Imagen.
Application: Text to image.
Corpus: LAION-5B, a publicly available dataset derived from Common Crawl.
Swin Transformer
Family: ViT | Architecture: Encoder | Task: Same as ViT | Date: 03/2021 | Params: 29M-197M | Lab: Microsoft
Extension: Extends ViT by replacing the standard multi-head self-attention (MSA) module with a module based on shifted windows (Swin), allowing ViT-like architectures to generalize to higher-resolution images.
Application: Image tasks (object detection, image classification, ...).
Corpus: ImageNet and ImageNet-22k.
Switch
Family: T5 | Architecture: Encoder/Decoder | Task: DAE | Date: 01/2021 | Params: 1T | Lab: Google
Extension: The goal is to increase parameter count while keeping FLOPs constant by using efficient routing of a MoE (Mixture of Experts).
Application: General language tasks (e.g. question answering).
Corpus: Colossal Clean Crawled Corpus (C4).
T0
Family: T5 | Architecture: Encoder/Decoder | Task: Fine-tuned with natural language prompts | Date: 03/2022 | Params: 11B (largest) | Lab: BigScience
Extension: T0 stands for "T5 for Zero Shot", obtained by fine-tuning the T5 model on a multitask mixture covering many different NLP tasks. Compared with T0, T0p and T0pp were fine-tuned with more datasets. T0pp is recommended, as it leads (on average) to the best performance on a variety of NLP tasks.
Application: Perform zero-shot inference tasks by specifying the query in natural language, and the models will generate a prediction.
Corpus: T0 (multiple-choice QA, extractive QA, closed-book QA, structure-to-text, sentiment, summarization, topic classification, paraphrase identification); T0p (same as T0, with additional datasets from GPT-3's evaluation suite); T0pp (same as T0p, with additional datasets from SuperGLUE, excluding NLI sets).
License: Open, Apache-2.0
T5
Family: Transformer | Architecture: Encoder/Decoder | Task: DAE | Date: 10/2019 | Params: Up to 11B | Lab: Google
Extension: Same as the original Transformer with some additions, such as relative positional embeddings like Transformer XL.
Application: General language tasks including machine translation, question answering, abstractive summarization, and text classification.
Corpus: Colossal Clean Crawled Corpus (C4), a cleaned-up version of the Common Crawl dataset (750 GB).
Trajectory Transformers
Family: GPT, "Control Transformers" (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Architecture: Decoder | Task: Predict the most likely sequence | Date: 06/2021 | Params: Smaller architecture than GPT | Lab: UC Berkeley
Extension: Similarly to the Decision Transformers, the main extension introduced by Trajectory Transformers is a way to encode a trajectory (states, actions, rewards).
Application: General RL (reinforcement learning) tasks.
Corpus: D4RL dataset and other RL datasets, depending on the task at hand.
Transformer XL
Family: Transformer | Architecture: Decoder | Task: LM | Date: 01/2019 | Params: 151M | Lab: CMU/Google
Extension: Relative positional embeddings enable longer-context attention compared to the vanilla Transformer model.
Application: General language tasks.
Corpus: Different training datasets depending on the experiment, but the baseline is WikiText-103.
Turing-NLG
Family: GPT | Architecture: Decoder | Task: LM | Date: 02/2020 | Params: 17B originally, up to 530B more recently | Lab: Microsoft
Extension: Optimized version of GPT-2 with optimal hyperparameters and a software/hardware platform to improve training.
Application: Same as GPT-2/3.
Corpus: Highest-quality subset of The Pile + 2 CC snapshots (339B tokens).
UL2
Family: Transformer | Architecture: Encoder/Decoder | Task: Mixture-of-Denoisers, which combines diverse pre-training paradigms | Date: 05/2022 | Params: 20B | Lab: Google
Extension: UL2-20B (Unifying Language Learning) can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs.
Application: A unified framework for pre-training models that are universally effective across datasets and setups.
Corpus: 1 trillion tokens on C4.
License: Open, Apache-2.0
Vicuna
Family: LLaMA | Architecture: Decoder | Task: Human instructions | Date: 03/2023 | Params: 13B | Lab: UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI
Extension: LLaMA fine-tuned on user-shared conversations collected from ShareGPT.
Application: Same as ChatGPT.
Corpus: Conversations collected from ShareGPT.
License: Limited, non-commercial bespoke license
ViT
Family: BERT | Architecture: Encoder | Task: Image classification | Date: 10/2020 | Params: 86M (Base) to 632M (Huge) | Lab: Google
Extension: Extension of the BERT architecture to train on patches of images (see the sketch below).
Application: Image classification.
Corpus: From standard ImageNet up to JFT-300M (a large in-house dataset).
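The patch-based input described above can be sketched directly: the image is cut into fixed-size patches, each patch is flattened, and a linear projection turns it into a token embedding for the encoder. The patch size and embedding width below are arbitrary illustrative choices, not the published ViT configuration.

```python
# An illustrative sketch of ViT-style patch embedding: image -> patch tokens.
import torch
import torch.nn as nn

patch, channels, d_model = 16, 3, 64
image = torch.randn(1, channels, 224, 224)            # (batch, C, H, W)

# unfold extracts non-overlapping patch x patch blocks and flattens each one.
patches = nn.functional.unfold(image, kernel_size=patch, stride=patch)
patches = patches.transpose(1, 2)                      # (1, num_patches, C*patch*patch)

# A linear projection maps each flattened patch to a d_model-dimensional token.
project = nn.Linear(channels * patch * patch, d_model)
tokens = project(patches)                              # (1, 196, d_model)
print(tokens.shape)
```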
Wu Dao 2.0
Family: GLM (General Language Model) | Architecture: Decoder | Task: Autoregressive blank infilling | Date: 06/2021 | Params: 1.75T | Lab: Beijing Academy of Artificial Intelligence
Extension: Similar to GPT in that it uses a decoder/autoregressive architecture, but applies a different pretraining task proposed in the GLM family of models. Besides, Wu Dao uses a "Fast Mixture of Experts" approach to scale training to trillions of parameters.
Application: Language and multimodal (particularly image) tasks.
Corpus: N/A.
XLM-RoBERTa
Family: RoBERTa | Architecture: Encoder | Task: MLM (dynamic) | Date: 10/2019 | Params: Base = 270M, Large = 550M | Lab: Facebook
Extension: An extension of RoBERTa that introduces small parameter-tuning insights in the context of multilingual applications.
Application: Translation and other cross-lingual language tasks.
Corpus: Cleaned Common Crawl in 100 languages.
XLNet
Family: Transformer XL | Architecture: Decoder | Task: PLM | Date: 05/2019 | Params: Base = 117M, Large = 360M | Lab: CMU/Google
Extension: This model basically adapts the Transformer XL architecture to permutation-based language modeling.
Application: General language tasks.
Corpus: Same as BERT + Giga5 (16 GB of text) + aggressively filtered ClueWeb 2012-B (19 GB) and Common Crawl (110 GB).