We present library software implementing a parallel small-bulge multishift QR algorithm with aggressive early deflation (AED), targeting distributed-memory high-performance computing systems. Starting from recent developments of the parallel QR algorithm [Granat et al. 2010], we describe a number of algorithmic and implementation improvements. These include communication-avoiding algorithms via data redistribution and a refined strategy for balancing multishift QR sweeps against AED. Guidelines concerning several important tunable algorithmic parameters are provided. As a result of these improvements, AED is no longer a computational bottleneck in the parallel QR algorithm. A performance model is established to explain the scalability behavior of the new parallel QR algorithm. Numerous computational experiments confirm that our new implementation significantly outperforms previous parallel implementations of the QR algorithm. The new software is available as part of ScaLAPACK version 2.0.