Efficient Scaling and Pre-Training of Language Models with Electra and Reformers
Published:
The transformer architecture has been central to much of the recent research in natural language processing. Yet scaling the network incurs steep memory costs and demands significant computational resources and budget. In this paper, we repurpose the highly efficient Reformer encoder architecture as the building block for the Electra pre-training methodology, allowing the network to scale to 8 times the size of its transformer counterpart while keeping the same memory requirements. The downstream performance of this scaled-up architecture is on par with the transformer-based Electra benchmark, despite being pre-trained on only a third of the data.
Download paper here.
GitHub: https://github.com/keshavbhandari/ElectraReformer
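The core idea is to swap the standard transformer encoder inside the Electra generator/discriminator setup for a Reformer encoder. The sketch below illustrates the replaced-token-detection objective that Electra pre-training uses; it is a minimal illustration, not the repository's implementation. The `TinyEncoder` class is a hypothetical placeholder (a vanilla transformer encoder) standing in for the Reformer blocks, and all class and parameter names here are assumptions for illustration only.

```python
# Minimal sketch of Electra-style replaced-token detection. In the paper the
# discriminator backbone would be a Reformer encoder (LSH attention +
# reversible layers); here a plain transformer encoder stands in for it.
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical placeholder encoder standing in for the Reformer blocks."""

    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ids):
        return self.encoder(self.embed(ids))


class ElectraPretrainer(nn.Module):
    def __init__(self, vocab_size, mask_id, mask_prob=0.15):
        super().__init__()
        self.mask_id, self.mask_prob = mask_id, mask_prob
        self.generator = TinyEncoder(vocab_size, d_model=64)       # small MLM generator
        self.gen_head = nn.Linear(64, vocab_size)                  # predicts masked tokens
        self.discriminator = TinyEncoder(vocab_size, d_model=128)  # Reformer encoder in the paper
        self.disc_head = nn.Linear(128, 1)                         # original vs. replaced

    def forward(self, ids):
        # 1) Mask a random subset of positions.
        mask = torch.rand_like(ids, dtype=torch.float) < self.mask_prob
        masked = ids.masked_fill(mask, self.mask_id)

        # 2) Generator fills in the masked positions (masked language modeling).
        gen_logits = self.gen_head(self.generator(masked))
        mlm_loss = nn.functional.cross_entropy(gen_logits[mask], ids[mask])

        # 3) Sample replacements from the generator and build the corrupted input.
        with torch.no_grad():
            sampled = torch.distributions.Categorical(logits=gen_logits).sample()
        corrupted = torch.where(mask, sampled, ids)
        is_replaced = (corrupted != ids).float()

        # 4) Discriminator predicts, per token, whether it was replaced.
        disc_logits = self.disc_head(self.discriminator(corrupted)).squeeze(-1)
        rtd_loss = nn.functional.binary_cross_entropy_with_logits(disc_logits, is_replaced)

        # Electra weights the replaced-token-detection loss heavily relative to MLM.
        return mlm_loss + 50.0 * rtd_loss


# Usage on toy data
model = ElectraPretrainer(vocab_size=1000, mask_id=999)
ids = torch.randint(0, 999, (2, 32))
loss = model(ids)
loss.backward()
```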