ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago et al.
2021 · IEEE Transactions on Pattern Analysis and Machine Intelligence · 2,230 citations
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and on a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM-em…