Name of Model | Family | Pretraining Architecture (Encoder/Decoder/E-D) | Pretraining or Fine-tuning Task | Extension | Application | Date (of first known publication) | Num. Params | Corpus | License | Lab |
---|---|---|---|---|---|---|---|---|---|---|
ALBERT | BERT | Encoder | MLM/NSP | Compressed version of BERT using parameter sharing, which is much more efficient given the same number of parameters | Same as BERT | 09/2019 | Base = 12M, Large = 18M, XLarge = 60M | Same as BERT | | |
AlexaTM 20B | Transformer | Encoder/Decoder | Denoising and prefix LM | Derived from BART, with layer norms located exactly at the beginning of each layer. Encoder initialized with an internal 10B pre-trained encoder. | Summarization, multi-lingual machine translation and NLU tasks | 08/2022 | 20B | Wikipedia and mC4 datasets in 12 languages | | Amazon |
Alpaca | LLaMA | Decoder | LM | Alpaca is fine-tuned from a 7B LLaMA model. | Evaluated on a variety of text generation and classification tasks. | 03/2023 | 7B | 52K instruction-following examples generated using a self-instruct mechanism from 175 human-written instruction-output pairs. | | Stanford |
AlphaFold | SE(3)-Transformer | Encoder | Protein folding prediction | The original AlphaFold used a BERT-style transformer. The details of AlphaFold's Transformer are not known, but it is believed to be an extension of the SE(3)-Transformer, a 3-D equivariant Transformer. | Protein folding | 07/2021 | 21M | 170,000 proteins from a public repository of protein sequences and structures | | Deepmind |
Anthropic Assistant | GPT | Decoder | LM | These models are based on GPT-3 and do not introduce novelties at the architecture/pretraining level; instead they focus on improving alignment through fine-tuning and prompting. Note that the Anthropic Assistant includes several models optimized for different tasks. Later versions of this work focus on the benefits of RLHF. | Different models with different applications, from general dialog to code assistant. | 12/2021 | 10M to 52B | 400B tokens from filtered Common Crawl and Books. They also create several Dialogue Preference datasets for the RLHF training. | | Anthropic |
BART | BERT for encoder, GPT for decoder | Encoder/Decoder | DAE | It can be seen as a generalization of BERT and GPT in that it combines ideas from both in the encoder and decoder | Mostly text generation but also some text understanding tasks | 10/2019 | 10% more than BERT | Same as RoBERTa (160GB of news, books, stories, and web text) | | |
BERT | BERT | Encoder | MLM/NSP | | General Language Understanding and Question Answering. Many other language applications followed | 10/2018 | Base = 110M, Large = 340M | Toronto Book Corpus and Wikipedia (3.3B tokens) | | |
Big Bird | | Encoder AND Encoder/Decoder (BigBird is mostly a way to implement sparse attention that can be used both in an encoder-only as well as an encoder/decoder architecture) | MLM | Big Bird can extend other architectures such as BERT, Pegasus, or RoBERTa by using a sparse attention mechanism that eliminates the quadratic dependency, thus making it more suitable for longer sequences | Particularly well suited for longer sequences, not only in text but also e.g. in genomics | 07/2020 | Depends on the overall architecture | Books, CC-News, Stories and Wikipedia | | |
BlenderBot 3 | GPT | Decoder | LM | BlenderBot 3 is based on a pre-trained OPT. It adds features needed for a dialog agent such as long-term memory or the ability to search the internet. It is also fine-tuned for some specific tasks given human feedback on them. | Same as GPT-3 | 08/2022 | 175B | 180B tokens = RoBERTa + the Pile + PushShift.io Reddit | | |
BLOOM | GPT | Decoder | LM | Main difference to GPT-3 is that it uses full attention instead of sparse attention | Same as GPT-3 | 07/2022 | 176B | 366B tokens (1.5 TB of text data), multilingual dataset | | BigScience/Huggingface |
ChatGPT | GPT | Decoder | LM | ChatGPT takes a GPT-3.5 (aka GPT-3 Davinci-003) pretrained model and uses RLHF to finetune it mostly as described in InstructGPT, but with slight differences in the data collection. ChatGPT is also more than a model, since it includes extensions for memory store and retrieval similar to BlenderBot 3 | Dialog agents | 10/2022 | Same as GPT-3 | Same as GPT-3 + datasets generated for RLHF | | OpenAI |
Chinchilla | GPT | Decoder | LM | Same as Gopher but with optimizations to reduce model size and therefore training/inference time with equal or superior performance | Same as Gopher/GPT-3 | 03/2022 | 70B | MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, GitHub, News, C4, and Wikipedia) | | Deepmind |
CLIP | CLIP (also using ResNet, ViT, and vanilla Transformer for text) | Encoder | Predict which of the N × N possible (image, text) pairings across a batch actually occurred (see the contrastive-loss sketch after this table) | Combines ResNet and ViT for the visual encoding with a Transformer for the textual encoder | Image/object classification | 02/2021 | | WIT (WebImageText) - 400 million (text, image) pairs | | OpenAI |
CM3 | HTML | Decoder | Causally-masked LM | This is somewhat similar to HTML in its use of structured training data. However, it is a different architecture and uses causal masking | Multimodal language model with the ability to do structured prompting | 01/2022 | 13B (largest) | CC-News, English Wikipedia | | |
CTRL | | Decoder | | Model can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior | Controllable text generation | 09/2019 | 1.63B | 140 GB of text including: Wikipedia (En, De, Es, Fr), Project Gutenberg, 45 subreddits, OpenWebText2, Amazon Reviews, Europarl and UN data from WMT, question-answer pairs from ELI5, and the MRQA shared task, which includes the Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions | | Salesforce |
DALL-E | GPT | Decoder | Caption prediction | A discrete variational auto-encoder (dVAE) is used to learn the visual codebook. The transformer is a variation of GPT-3 | Text to image | 01/2021 | 12B | 250 million text-image pairs from the internet | | OpenAI |
DALL-E 2 | CLIP, GLIDE | Encoder/Decoder | Caption prediction | Combines CLIP encoder and Diffusion decoder similar to GLIDE | Text to image | 04/2022 | 3.5B | Combination of the DALL-E and CLIP datasets | | OpenAI |
DeBERTa | BERT | Encoder | MLM | Separate positional embedding vector independent from the content embedding, using disentangled attention matrices for contents and relative positions | Same as BERT | 06/2020 | 750M (xlarge) | English Wikipedia, BookCorpus, OPENWEBTEXT and STORIES | Open, MIT license | Microsoft |
Decision Transformers | GPT, “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Decoder | Next action prediction | Decision Transformers use a GPT architecture and extend it by encoding trajectories in a way that they can be learned by an auto-regressive task (see the trajectory-encoding sketch after this table) | General RL (reinforcement learning tasks) | 06/2021 | Same as GPT | Different corpus for different experiments | | Google/UC Berkeley/FAIR |
DialoGPT | GPT | Decoder | LM | GPT-2 architecture trained on dialog data | Text generation in dialog settings | 10/2019 | 1.5B | 140M Reddit conversations | | Microsoft |
DistilBERT | BERT | Encoder | MLM/NSP | Compressed version of BERT using distillation, which is much more efficient given the same number of parameters | Same as BERT | 10/2019 | 66M | Same as BERT | | Huggingface |
DQ-BART | BART | Encoder/Decoder | DAE | Adds quantization and distillation to a BART model to improve performance and reduce model size | Text generation and understanding | 03/2022 | Up to 30x reduction in parameters compared to standard BART | CNN/DM, XSUM, ELI5, WMT16 En-Ro (~1M tokens) | | Amazon |
Dolly | GPT | Decoder | Fine-tuned on Q&A pairs to follow human instructions | Fine-tuned from GPT-J-6B (V1) and Pythia (V2) | Similar to Alpaca | 03/2023 | 6B | V1: instruction corpus same as Alpaca; V2: Databricks' own dataset. | Open | Databricks, Inc. |
E5 | BERT | Encoder | Fine-tuned on semantic similarity using contrastive loss | Fine-tunes BERT-based models to create text string embeddings optimized for semantic relatedness. | Text embeddings for semantic relatedness tasks such as text clustering or search retrieval. | 12/2022 | 300M (large version) | MS-MARCO, NQ, NLI | Open, MIT | Microsoft |
ELECTRA | | Encoder | RTD | | Same as BERT | 03/2020 | Base = 110M, Large = 330M | Same as BERT, except for Large, which uses the same data as XLNet | | Stanford/Google |
ERNIE | BERT | Encoder | MLM | Uses BERT for the encoder architecture, but stacks and aggregates two of them for text and entities. This architecture could be understood as BERT for text + knowledge graphs | Knowledge-intensive tasks that might benefit from knowledge graphs or entities, such as entity recognition | 05/2019 | 114M | English Wikipedia + Wikidata for entities (note that they initialize the model to the original BERT parameter values) | | Various Chinese institutions |
Flamingo | Chinchilla | Decoder | Log likelihood of text given some visual input | It uses a frozen textual language model (like Chinchilla) conditioned on the visual representation, which is encoded from a Normalizer-Free ResNet | Image and video to text tasks such as captioning and visual question answering | 04/2022 | 80B (largest) | MultiModal MassiveWeb (M3W): 185 million images and 182 GB text + a number of text paired with image datasets: ALIGN + LTIP (Long Text & Image Pairs) = 312 million images, and VTP (Video & Text Pairs) = 27 million short videos (approximately 22 seconds on average) | | Deepmind |
Flan-T5 | T5 | Encoder/Decoder | Fine-tuned on instructions for zero-shot and few-shot tasks | Flan-T5 is generated by "Flan finetuning" the T5 models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data. | The primary use is to understand how to improve large language models with the right kind of instruction fine-tuning. The focus is research on zero-shot and in-context few-shot learning NLP tasks, such as reasoning and question answering; advancing fairness and safety research; and understanding limitations of current large language models | 11/2022 | 11B (XXL) | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT. | Open, Apache-2.0 | |
Flan-PaLM | PaLM | Decoder | Fine-tuned on instructions for zero-shot and few-shot tasks | Flan-PaLM is generated by "Flan finetuning" the PaLM models: (1) scaling the number of tasks to 1,836, (2) scaling the model size, and (3) finetuning on chain-of-thought data. | Same as Flan-T5. The goal is to show Flan finetuning can even improve on the largest Google LMs (+9.4% improvement average across tasks), with improvements to chain of thought, self consistency, multilingual tasks, and arithmetic reasoning | 11/2022 | 540B (largest) | Flan finetuned with tasks in Muffin, T0-SF, NIV2, and CoT. | Limited | |
Galactica | Transformer | Decoder | LM for scientific domain | Transformer-based architecture in a decoder-only setup with a few modifications. Data extensions include special tokens for working memory, citations, genetic data, and a few other biology-related tasks. | The models are designed to perform scientific tasks, including but not limited to citation prediction, scientific QA, mathematical reasoning, summarization, document generation, molecular property prediction and entity extraction. | 11/2022 | 120B (huge) | Trained on 106 billion tokens of open-access scientific text and data. This includes papers, textbooks, scientific websites, encyclopedias, reference material, knowledge bases, and more | Limited, non-commercial CC BY-NC 4.0 license | Meta |
Gato | “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Decoder | MLM (where tokens are either text or agent actions) | The standard decoder-only transformer architecture is preceded by an embedding layer that can embed text and images, plus add position encodings to add spatial information when applicable. | Gato presents a generalizable agent that can be used beyond text for tasks such as playing Atari or controlling a robot arm. | 05/2022 | 1.2B | 1.5T tokens including standard text (e.g. MassiveText), vision (e.g. ALIGN), and simulation environments (e.g. ALE Atari, or RGB Stacking Real Robot) | | Deepmind |
GLaM | Transformer | Decoder | LM | GLaM introduces a mixture of 64 experts to increase parameter count and generalization properties in a somewhat standard decoder-only Transformer architecture. Only two experts get activated at a time per token, which makes the model also more efficient in training and inference (see the top-2 routing sketch after this table). | General language modeling | 12/2021 | 1.2T across 64 experts, but only 96B get activated for inference | 1.6T tokens including web pages filtered by Wikipedia and books for quality | | |
GLIDE | Diffusion models | Encoder | Caption prediction | GLIDE can be seen as an extension of the ADM (Ablated Diffusion Model) by the same authors. However, ADM is not per se a transformer architecture, although it does resemble one in some of the configurations the authors use. Given that ADM is by the same authors and was quickly followed up by GLIDE, I think it is fair to consider GLIDE as the first of its kind. | Text to image | 12/2021 | 3.5B diffusion model (2.3B for visual encoding, 1.2B for textual) + 1.5B model for upsampling | Same as DALL-E | | OpenAI |
GLM | GLM (General Language Model) | Encoder and decoder | Autoregressive blank infilling | GLM has a bidirectional encoder and a unidirectional decoder in a unified model. | General language model pretrained with an autoregressive blank-infilling objective, which can be finetuned on various natural language understanding and generation tasks. | 03/2022 | 130B | Pile, GLM-130B Chinese corpora, P3, DeepStruct finetuning dataset | Open, MIT license | Tsinghua |
Global Context ViT | ViT | Encoder | Image classification | Hierarchical ViT architecture consisting of local and global self-attention modules | Image (object detection, image classification...) | 06/2022 | 90M | Imagenet-1K and other task-dependent datasets | | NVidia |
Gopher | GPT | Decoder | LM | Same as GPT-2 but uses RMSNorm instead of LayerNorm and relative positional encoding rather than absolute | Mostly language modeling and NLU, but also extensible like GPT | 12/2021 | 280B | MassiveText (2.35 billion documents, or about 10.5 TB of text, including MassiveWeb, Books, GitHub, News, C4, and Wikipedia) | | Deepmind |
GopherCite | Gopher | Decoder | LM | GopherCite is based on Gopher but adds a step using RLHP (Reinforcement Learning from Human Preferences) to learn not only whether a response is plausible but also whether it is supported | Dialog systems, Q&A, general language generation tasks | 03/2022 | 280B | Same as Gopher plus a specific dataset generated in the RLHP process | | Deepmind |
GPT | GPT | Decoder | LM | | Text generation, but adaptable to many other NLP tasks when fine-tuned. | 06/2018 | 117M | Unsupervised pretraining on the BookCorpus dataset. Supervised finetuning on several task-specific datasets including SNLI, RACE, Quora... | | OpenAI |
GPT-2 | GPT | Decoder | LM | Minor extensions to the GPT architecture (e.g. layer normalization moved to the input of each sub-layer, or increased context size from 512 to 1024) | Text generation, but adaptable to many other NLP tasks when fine-tuned. | 02/2019 | 1.5B | 8 million web pages (40 GB), about 10x GPT. The WebText dataset is created by crawling all links posted on Reddit with at least 3 karma points. | | OpenAI |
GPT-3 | GPT | Decoder | LM | Same as GPT-2 with the only addition of alternating dense and locally banded sparse attention patterns, inspired by the Sparse Transformer | Initially text generation, but has over time been used for a large range of applications in areas such as code generation, but also image and audio generation | 05/2020 | 175B | ~500B tokens including CommonCrawl (410B), WebText2 (19B), Books1 (12B), Books2 (55B), and Wikipedia (3B) | | OpenAI |
GPT-3.5 | GPT | Decoder | LM | The GPT-3.5 series includes a number of models like Davinci-003. They are basically versions of the InstructGPT model. | Dialog and general language, but there is a code-specific model too | 10/2022 | 175B | Same as InstructGPT | | OpenAI |
GPT-J | GPT | Decoder | LM | GPT-J 6B is a Transformer model trained using Mesh Transformer JAX, with the same tokenizer as GPT-2/3 | Same as GPT-3 | 05/2021 | 6B | Pile corpus, a large-scale curated dataset created by EleutherAI. | Open, Apache-2.0 | EleutherAI |
GPT-Neo | GPT | Decoder | LM | Similar to GPT-2 but uses local attention in every other layer with a window size of 256 tokens | Text generation, but adaptable to many other NLP tasks when fine-tuned. | 03/2021 | 1.3B, 2.7B (XL) | Pile - 840 GB open source text dataset that combines 22 pre-existing datasets | | EleutherAI |
GPT-NeoX-20B | GPT | Decoder | LM | Similar to GPT-3 with rotary positional embeddings instead of learned ones, parallel attention and feed-forward layers, different initialization, and all dense layers instead of alternating dense/sparse | Same as GPT-3 | 04/2022 | 20B | Pile (22 data sources) | | EleutherAI |
GPTInstruct | GPT | Decoder | LM | GPTInstruct starts off with a pretrained GPT-3 model and adds reward modeling through reinforcement learning after a supervised finetuning step | Knowledge-intensive dialog or language tasks | 01/2022 | Same as GPT-3 | Same as GPT-3 for pretraining, but finetuned and optimized using labeler data and prompts | | OpenAI |
InstructOR | T5 | Encoder/Decoder | Fine-tuned with a wide variety of instruction-based text-to-text tasks | Fine-tunes T5 explicitly to optimize the encoder to produce a general-purpose text string embedding useful for many NLU tasks. | Any NLU task requiring a single text string embedding. As of April 2023, InstructOR is the top-ranked system on the Massive Text Embedding Benchmark (MTEB) | 12/2022 | 330M | Finetuned on MEDI | Open, Apache-2.0 | University of Hong Kong, University of Washington, Meta AI |
HTML | BART | Encoder/Decoder | DAE | As opposed to BART, they don’t do sentence shuffling | General purpose language model that allows structured HTML prompting | 07/2021 | 400M | 23TB of simplified HTML extracted from CommonCrawl | | |
Imagen | T5, CLIP, Diffusion models | T5 (or CLIP or BERT) for frozen text encoder + U-net architecture for cascaded diffusion models for text to image | Image/text pair prediction | Imagen adds a few extensions to the U-net diffusion architecture (pooled embedding vector, cross attention over text embeddings, and Layer Normalizations) | Text to image | 06/2022 | 2B | A combination of internal datasets, with ≈ 460M image-text pairs, and the publicly available Laion dataset, with ≈ 400M image-text pairs | | |
Jurassic-1 | GPT | Decoder | LM | Very similar to GPT-3, but far more parameters and improved training efficiency, mostly because of the improved tokenizer. Also, different ratio of depth to breadth | Similar to GPT-3 | 09/2021 | 178B (Jumbo), 7.5B (Large) | 300B tokens (same as GPT-3) | | AI21 |
LaMDA | Transformer | Decoder | LM | LaMDA focuses on how to improve safety, quality, and groundedness using different fine-tuning strategies | General language modeling | 01/2022 | 137B | 1.56T words from public dialog data and other public web documents | | |
LLaMA | Transformer | Decoder | LM | LLaMA uses a Transformer architecture with several extensions: pre-normalization, SwiGLU activations, RoPE embeddings, reduced memory usage and runtime through an efficient implementation of the causal multi-head attention, checkpointing to reduce the amount of activations that are recomputed during the backward pass, model and sequence parallelism to reduce memory usage of the model, and 1.4T BPE tokens after tokenization. | Zero- and few-shot commonsense reasoning, question answering, code generation and reading comprehension. | 02/2023 | 65B | English CommonCrawl + C4 + Github + Wikipedia + Gutenberg and Books3 + ArXiv + Stack Exchange | Limited, Non-commercial bespoke license | Meta |
mBART | BART | Encoder/Decoder | DAE | | Translation | 01/2020 | Same as BART | CC25 Corpus includes 25 monolingual corpora in different languages. Largest corpora are English (300 GB) and Russian (280 GB) | | |
Megatron | GPT/BERT/T5 | Encoder or Decoder, depending on the base model | Same as base model | Megatron is a family of models that extend previously known architectures (namely GPT-2 and BERT originally, but also T5 more recently) by introducing model parallelism primitives. In the case of BERT, the authors also replace the next sentence prediction head with sentence order prediction and use whole word n-gram masking. | Same as base model | 03/2020 | 8.3B (GPT-like), 3.9B (BERT-like) | Original paper uses an aggregate dataset consisting of Wikipedia, CC-Stories, RealNews, and OpenWebtext | | NVidia |
Minerva | PaLM | Decoder | LM | Extends PaLM by fine-tuning on a mathematical dataset | Mathematical reasoning | 06/2022 | 540B | Same as PaLM + 118GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats | | |
MT-NLG (Megatron-Turing NLG) | GPT | Decoder | LM | Uses parallelization similar to Megatron to train an LM double the size of GPT-3 | Language generation and others (similar to GPT-3) | 10/2021 | 530B | The Pile (800GB dataset) + 2 Common Crawl snapshots | | NVidia |
OpenAssistant LLaMA | LLaMA | Decoder | Supervised fine-tuning on crowd-sourced conversation/assistant data | | Same as ChatGPT, but open source. Compared to alternatives, it uses human-generated conversation data | 04/2023 | 30B | Conversations collected by volunteers, available at https://huggingface.co/datasets/OpenAssistant/oasst1 | Limited, Non-commercial bespoke license. There is also a version based on Pythia which is Apache licensed. | Various open source contributors |
OPT | GPT-3 | Decoder | LM | Basically same architecture as GPT-3, but with some training improvements introduced in Megatron-LM | Same as GPT-3 | 05/2022 | 175B (and other smaller versions) | 180B tokens = RoBERTa + the Pile + PushShift.io Reddit | | |
PaLM | Transformer | Decoder | LM | PaLM uses a typical decoder-only Transformer architecture, but adds quite a few extensions: SwiGLU activations, parallel layers, multi-query attention, RoPE embeddings, shared input-output embeddings, no biases, and a 256k SentencePiece vocabulary generated from the training data | Language understanding and generation | 04/2022 | 540B | 780B tokens from filtered webpages, books, Wikipedia, news articles, source code, and social media conversations. Code includes 24 programming languages. | | |
Pegasus | | Encoder/Decoder | DAE (more concretely GSG) and MLM | Extends vanilla Transformer by using a different pretraining task (GSG: Gap Sentence Generation) that is better suited for summarization | Summarization | 12/2019 | Base = 223M, Large = 568M | C4 (750GB) + HugeNews (3.8 TB) | | UCL/Google |
RoBERTa | BERT | Encoder | MLM (Dynamic) | Extension of BERT with optimized training procedure and more data | Same as BERT | 07/2019 | 356M | Same as BERT + CC News + OpenWebText + Stories (~33B tokens) | | UW/Facebook |
SeeKer | GPT (but can extend any family) | Encoder/decoder or decoder only, depending on the base model it’s extending | | SeeKer is an extension that can be applied to any Transformer architecture by introducing “search”, “knowledge”, and “response” modules that are introduced during pretraining | Same as base models | 03/2022 | Depends on the base model | Same as base model | | |
Sparrow | GPT | Decoder | LM | Starts from the Chinchilla 70B model but adds RLHF (Reinforcement Learning from Human Feedback). It also adds inline evidence a la GopherCite | Dialog agents and general language generation applications like Q&A | 09/2022 | 70B | Same as Chinchilla + interactive data gathering with human annotators during the RLHF process | | Deepmind |
StableDiffusion | Diffusion | Encoder/Decoder | Caption prediction | Stable Diffusion is basically the Latent Diffusion model developed by LMU Munich researchers + some learnings on conditional diffusion from DALL-E and Imagen | Text to image | 12/2021 | 890M (although there are different, smaller, variants) | LAION-5B, a publicly available dataset derived from Common Crawl | | LMU Munich + Stability.ai + Eleuther.ai |
Swin Transformer | ViT | Encoder | Same as ViT | Extends ViT by replacing the standard multi-head self-attention (MSA) module with a module based on shifted windows (Swin), allowing ViT-like architectures to generalize to higher-resolution images | Image (object detection, image classification...) | 03/2021 | 29M-197M | Imagenet and Imagenet-22k | | Microsoft |
Switch | T5 | Encoder/Decoder | DAE | Aims to increase parameter count while keeping FLOPs constant by using efficient routing of MoE (Mixture of Experts) | General language tasks (e.g. question answering) | 01/2021 | 1T | Colossal Clean Crawled Corpus | | |
T0 | T5 | Encoder/Decoder | Fine-tuned with natural language prompts | T0 stands for "T5 for Zero Shot", obtained by fine-tuning the T5 model on a multitask mixture covering many different NLP tasks. Compared with T0, T0p and T0pp were fine-tuned with more datasets. T0pp is recommended as it leads (on average) to the best performance on a variety of NLP tasks. | Perform zero-shot inference tasks by specifying the query in natural language, and the models will generate a prediction. | 03/2022 | 11B (largest) | T0 (Multiple-choice QA, Extractive QA, Closed-Book QA, Structure-To-Text, Sentiment, Summarization, Topic Classification, Paraphrase Identification). T0p (same as T0, with additional datasets from GPT-3's evaluation suite). T0pp (same as T0p, with additional datasets from SuperGLUE, excluding NLI sets) | Open, Apache-2.0 | BigScience |
T5 | Transformer | Encoder/Decoder | DAE | Same as original Transformer with some additions such as relative positional embeddings like Transformer XL | General language tasks including machine translation, question answering, abstractive summarization, and text classification | 10/2019 | Up to 11B | Colossal Clean Crawled Corpus (C4) - cleaned up version of the Common Crawl dataset - 750 GB | | |
Trajectory Transformers | GPT, “Control Transformers” (not per se a family, but grouping here those transformers that try to model more general control, RL-like, tasks) | Decoder | Predict most likely sequence | Similarly to the Decision Transformers, the main extension introduced by Trajectory Transformers is a way to encode a trajectory (states, actions, rewards) | General RL (reinforcement learning tasks) | 06/2021 | Smaller architecture than GPT | D4RL dataset and other RL datasets depending on the task at hand | | UC Berkeley |
Transformer XL | Transformer | Decoder | LM | Relative positional embeddings enable longer-context attention when compared to the vanilla Transformer model | General language tasks | 01/2019 | 151M | Different training datasets depending on experiments, but baseline is Wikitext-103 | | CMU/Google |
Turing-NLG | GPT | Decoder | LM | Optimized version of GPT-2 with optimal hyperparameters and software/hardware platform to improve training | Same as GPT-2/3 | 02/2020 | 17B originally, up to 530B more recently | Highest quality subset from The Pile + 2 CC snapshots (339B tokens) | | Microsoft |
UL2 | Transformer | Encoder/Decoder | Mixture-of-Denoisers, which combines diverse pre-training paradigms together | UL2-20B (Unifying Language Learning) can be interpreted as a model that is quite similar to T5 but trained with a different objective and slightly different scaling knobs. | A unified framework for pre-training models that are universally effective across datasets and setups. | 05/2022 | 20B | 1 trillion tokens on C4 | Open, Apache-2.0 | |
Vicuna | LLaMA | Decoder | Fine-tuned on human instructions | LLaMA fine-tuned on user-shared conversations collected from ShareGPT. | Same as ChatGPT | 03/2023 | 13B | Conversations collected from ShareGPT | Limited, Non-commercial bespoke license | UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI |
ViT | BERT | Encoder | Image classification | Extension of BERT architecture to train on patches of images | Image classification | 10/2020 | 86M (Base) to 632M (Huge) | From standard Imagenet to JFT-300M (large inhouse dataset) | | |
Wu Dao 2.0 | GLM (General Language Model) | Decoder | Autoregressive blank infilling | Similar to GPT in that it uses a decoder/autoregressive architecture, but applies a different pretraining task proposed in the GLM family of models. Besides, Wu Dao uses a “Fast Mixture of Experts” approach to scale training to trillions of parameters | Language and multimodal (particularly image) | 06/2021 | 1.75T | N/A | | Beijing Academy of Artificial Intelligence |
XLM-RoBERTa | RoBERTa | Encoder | MLM (Dynamic) | An extension of RoBERTa that introduces small parameter tuning insights in the context of multilingual applications | Translation and other cross-lingual language tasks | 10/2019 | Base = 270M, Large = 550M | Cleaned Common Crawl in 100 languages | | |
XLNet | Transformer XL | Decoder | PLM | This model basically adapts the Transformer XL architecture to permutation-based LM | General language tasks | 05/2019 | Base = 117M, Large = 360M | Same as BERT + Giga5 (16GB text) + aggressively filtered ClueWeb 2012-B (19GB) and Common Crawl (110 GB) | | CMU/Google |
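The CLIP row above describes its pretraining task as predicting which of the N × N possible (image, text) pairings in a batch actually occurred. Below is a minimal sketch of that symmetric contrastive objective; it is an illustrative PyTorch implementation under simplifying assumptions (pre-computed embeddings, a fixed temperature), not OpenAI's actual training code.

```python
# Hypothetical, simplified CLIP-style contrastive loss (not OpenAI's implementation).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_features, text_features: (N, d) embeddings from the two encoders."""
    # L2-normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix; entry [i, j] scores image i against text j.
    logits = image_features @ text_features.t() / temperature

    # The real (matching) pairs sit on the diagonal.
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Symmetric cross-entropy over rows (image -> text) and columns (text -> image).
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```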
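The GLaM and Switch rows describe mixture-of-experts layers in which only a small number of experts are activated per token (two for GLaM, one for Switch), which is what keeps per-token compute roughly constant as the parameter count grows. The sketch below illustrates top-2 routing under simplifying assumptions (no load-balancing loss, no expert capacity limits, a naive per-expert loop); the `Top2MoE` class and its parameters are hypothetical and not the papers' optimized implementations.

```python
# Hypothetical top-2 mixture-of-experts layer, in the spirit of GLaM
# (Switch Transformer routes each token to a single expert instead).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 64, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Each token only visits its top-k experts,
        # so compute scales with k rather than with the total number of experts.
        gate_probs = F.softmax(self.router(x), dim=-1)                # (T, E)
        top_probs, top_idx = gate_probs.topk(self.k, dim=-1)          # (T, k)
        top_probs = top_probs / top_probs.sum(dim=-1, keepdim=True)   # renormalize gates

        out = torch.zeros_like(x)
        for rank in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, rank] == e                          # tokens routed to expert e
                if mask.any():
                    out[mask] += top_probs[mask, rank].unsqueeze(-1) * expert(x[mask])
        return out
```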
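The Decision Transformers row notes that trajectories are encoded so that a GPT-style model can learn them autoregressively. Below is a minimal sketch of that encoding, interleaving return-to-go, state, and action tokens as described in the paper; the `Step` and `to_sequence` names are illustrative only, and real implementations embed each modality separately before feeding the sequence to the transformer.

```python
# Hypothetical flattening of an RL trajectory into a Decision-Transformer-style sequence.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    reward: float
    state: List[float]
    action: int

def to_sequence(trajectory: List[Step]) -> List[Tuple[str, object]]:
    """Interleave (return-to-go, state, action) triples for autoregressive modeling."""
    remaining_return = sum(step.reward for step in trajectory)
    sequence: List[Tuple[str, object]] = []
    for step in trajectory:
        sequence.append(("return_to_go", remaining_return))  # conditioning signal
        sequence.append(("state", step.state))
        sequence.append(("action", step.action))             # target the model learns to predict
        remaining_return -= step.reward                      # return left after this step
    return sequence
```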