Implementing a System at US Patent and Trademark Office to Fully Automate the Conversion of Filing Documents to XML
Supervisory Program Manager
US Patent and Trademark Office
Data Conversion Laboratory
Many governmental and private organizations, such as the US Patent and Trademark Office (USPTO), gather massive collections of content, including legal documents, filings, and contracts. Most such collections consist of images and image-based PDFs, so they cannot be searched or mined for the critical information that these organizations need to function. As data collections grow into the terabytes, conventional conversion techniques, however efficient, are no longer economically feasible. The Holy Grail has always been a fully automated process without human interaction. This paper describes the implementation of such a system at the USPTO. The system, which has been fully functional for over two years, processes millions of pages each month, with turnaround often measured in minutes.
We will describe the approach taken to digitally pre-process incoming page images to improve Optical Character Recognition (OCR) quality, clean up extraneous materials, and convert the extracted text into fully formed XML with tagged structures and linked images. We also present an implementation approach for assembling banks of standard servers that operate in parallel, with software that routes and load-balances incoming materials; this design facilitates rapid expansion, flexibility, and scaling as production volume increases.
Finally, we will highlight the impact of these processes and the resulting transformation within the USPTO since the inception of the program.
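As a rough illustration of the routing and load-balancing idea mentioned above, the following minimal sketch assigns each incoming batch of pages to whichever server currently has the fewest pending pages. It is not the USPTO implementation: the function name `route_batches`, the server names, and the use of page count as the measure of load are all assumptions made for this example.

```python
import heapq


def route_batches(batches, server_names):
    """Assign each (batch_id, page_count) pair to the least-loaded server.

    Hypothetical sketch: load is approximated by pending page count,
    and ties are broken by the server's position in server_names.
    """
    # Heap entries are (pending_pages, server_index).
    heap = [(0, i) for i in range(len(server_names))]
    heapq.heapify(heap)

    assignments = {}
    for batch_id, page_count in batches:
        pending, i = heapq.heappop(heap)       # server with the least pending work
        assignments[batch_id] = server_names[i]
        heapq.heappush(heap, (pending + page_count, i))
    return assignments


if __name__ == "__main__":
    incoming = [("a", 100), ("b", 10), ("c", 10), ("d", 50)]
    print(route_batches(incoming, ["s1", "s2"]))
```

Under this scheme, adding capacity amounts to appending another name to the server list, which is one simple way to picture how a bank of standard servers could be expanded as volume grows.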