

Poster

AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine

Carlo Siebenschuh · Kyle Hippe · Ozan Gokdemir · Alexander Brace · Arham Khan · Khalid Hossain · Yadu Babuji · Nicholas Chia · Venkatram Vishwanath · Arvind Ramanathan · Rick Stevens · Ian Foster · Robert Underwood


Abstract: Language models for scientific tasks are trained on text from scientific publications, most of which are distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics for simple, clean PDFs to expensive machine-vision models able to parse challenging documents with complex layouts or degraded quality. The choice of the "best" parser for a particular document thus depends not only on (1) the accuracy and (2) the computational cost of different parsers, but also on (3) human preferences as to how to balance accuracy and cost. To address these issues, we introduce the Adaptive Parallel Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through Direct Preference Optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then combines the hardware requirements and (preference-aligned) predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse improves throughput by $11$x while achieving $0.3$\% higher accuracy than state-of-the-art parsers. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora and thereby support the development of high-quality, trillion-token-scale text datasets.
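The core idea (pick a parser per document by trading off predicted accuracy against compute cost) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parser names, accuracy numbers, degradation feature, and trade-off weight are all hypothetical stand-ins for the learned, preference-aligned predictor described in the abstract.

```python
# Hypothetical sketch of AdaParse-style per-document parser selection:
# choose the parser maximizing predicted accuracy minus a weighted compute
# cost. All names, scores, and weights here are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Parser:
    name: str
    cost: float  # relative compute cost per document


def predicted_accuracy(parser: Parser, doc_features: dict) -> float:
    # Stand-in for a learned, human-preference-aligned accuracy predictor.
    # Cheap heuristics do well on clean PDFs; vision models handle
    # degraded or complex-layout documents better.
    base = {"heuristic": 0.90, "vision": 0.97}[parser.name]
    if parser.name == "heuristic":
        base -= 0.25 * doc_features["degradation"]  # heuristics fail on messy scans
    return base


def select_parser(parsers, doc_features, cost_weight=0.002):
    # Score = predicted accuracy minus weighted cost; highest score wins.
    return max(
        parsers,
        key=lambda p: predicted_accuracy(p, doc_features) - cost_weight * p.cost,
    )


parsers = [Parser("heuristic", 1.0), Parser("vision", 40.0)]

clean = select_parser(parsers, {"degradation": 0.0})     # clean born-digital PDF
degraded = select_parser(parsers, {"degradation": 0.5})  # noisy scanned PDF
print(clean.name, degraded.name)  # → heuristic vision
```

In a large-scale campaign, routing most clean documents to the cheap parser is what yields the throughput gain, while reserving the expensive model for hard documents preserves (or improves) aggregate accuracy.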
