Efficient Scaling and Pre-Training of Language Models with Electra and Reformers
Published:
The transformer architecture has been central to much of the recent research in natural language processing. Yet scaling the network incurs steep memory costs and demands significant computational resources and budget. In this paper, we repurpose the highly efficient Reformer encoder architecture as the building block for the Electra pre-training methodology, allowing the network to scale to 8 times the size of its transformer counterpart while keeping the same memory requirements. The downstream performance of this scaled-up architecture is on par with the transformer-based Electra benchmark, despite being pre-trained on only a third of the data.
Download paper here.
GitHub: https://github.com/keshavbhandari/ElectraReformer
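The core idea is to swap the standard transformer encoder inside the Electra generator/discriminator setup for a Reformer encoder. The sketch below illustrates the replaced-token-detection objective that Electra pre-training uses; it is a minimal illustration, not the repository's implementation. The `TinyEncoder` class is a hypothetical placeholder (a vanilla transformer encoder) standing in for the Reformer blocks, and all class and parameter names here are assumptions for illustration only.

```python
# Minimal sketch of Electra-style replaced-token detection. In the paper the
# discriminator backbone would be a Reformer encoder (LSH attention +
# reversible layers); here a plain transformer encoder stands in for it.
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical placeholder encoder standing in for the Reformer blocks."""

    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ids):
        return self.encoder(self.embed(ids))


class ElectraPretrainer(nn.Module):
    def __init__(self, vocab_size, mask_id, mask_prob=0.15):
        super().__init__()
        self.mask_id, self.mask_prob = mask_id, mask_prob
        self.generator = TinyEncoder(vocab_size, d_model=64)       # small MLM generator
        self.gen_head = nn.Linear(64, vocab_size)                  # predicts masked tokens
        self.discriminator = TinyEncoder(vocab_size, d_model=128)  # Reformer encoder in the paper
        self.disc_head = nn.Linear(128, 1)                         # original vs. replaced

    def forward(self, ids):
        # 1) Mask a random subset of positions.
        mask = torch.rand_like(ids, dtype=torch.float) < self.mask_prob
        masked = ids.masked_fill(mask, self.mask_id)

        # 2) Generator fills in the masked positions (masked language modeling).
        gen_logits = self.gen_head(self.generator(masked))
        mlm_loss = nn.functional.cross_entropy(gen_logits[mask], ids[mask])

        # 3) Sample replacements from the generator and build the corrupted input.
        with torch.no_grad():
            sampled = torch.distributions.Categorical(logits=gen_logits).sample()
        corrupted = torch.where(mask, sampled, ids)
        is_replaced = (corrupted != ids).float()

        # 4) Discriminator predicts, per token, whether it was replaced.
        disc_logits = self.disc_head(self.discriminator(corrupted)).squeeze(-1)
        rtd_loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, is_replaced)

        # Electra weights the replaced-token-detection loss heavily relative to MLM.
        return mlm_loss + 50.0 * rtd_loss


# Usage on toy data
model = ElectraPretrainer(vocab_size=1000, mask_id=999)
ids = torch.randint(0, 999, (2, 32))
loss = model(ids)
loss.backward()
```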