Overview

Vast amounts of business-critical information appear only on public websites that are constantly updated to present both new and modified content. While the information on many of these websites is extremely valuable, no standards exist today for the way content is organized, presented and formatted, or for how individual websites are constructed or accessed.

This creates a significant challenge for companies that need data from these websites in a timely manner, downloaded and structured to support business practices and downstream systems.

The paper will focus on specific impediments that we typically encounter, and the tactics we have adopted to overcome them, in creating a streamlined, automated process to crawl websites, scrape content and metadata, and transform the content into a standardized XML format. Our comments and recommendations are based on having successfully traversed hundreds of varied, multilingual, multi-platform, global websites.

We will elaborate on our methodology and the bots we use to facilitate high-volume data retrieval in a variety of source formats (HTML, RTF, DOCX, TXT, XML, etc.), in English and in other European and Asian languages, and with varying organizational approaches.

White Hat versus Black Hat - Good Guys or Bad Guys?

While web crawling sounds somewhat nefarious, there is an important white-hat side to it. Much original source material today appears only on the web. For many government agencies and NGOs, the web version is the “document of record,” the most current version available, and where you are referred when you make inquiries regarding reports, articles, whitepapers, etc.

While there are many tools available to handle the basic crawling and scraping of websites, they mostly work on one website at a time. Analyzing and traversing volumes of complex websites - somewhat like developing autonomous vehicles - requires the ability to adapt to changing conditions, across websites and over time. The presentation will examine the thought processes behind our approaches, including website analysis, techniques to detect and deal with website and content anomalies, methods to detect meaningful content changes, and approaches to verifying results.

Why Do We Need This Information?

There is a vast amount of data available on websites, from informative to entertaining to legal, with critical content that serves a wide variety of purposes, depending on the business need. The most common endgame is the normalization, decomposition, and transformation of the information into a structured format to power derivative databases, data analytics platforms, and other downstream systems.

Why All the Fuss?

It is estimated that there are 4.52 billion webpages out in the wild (http://www.worldwidewebsize.com/). Many of these are maintained by webmasters who are certain that their architecture for running a website is the best one, as opposed to the guy one page over.

Of course, it would be nice if all websites offered a convenient, reliable method to download and monitor their content, but most do not. It would also be helpful if the different websites complied with standards to make new and modified content easier to find and extract, but no such luck; the variations are endless and often at the whim of the developer and content owner. It would also be helpful if, once a website was in place, its design and structure remained static; that does not happen either. Compounding this, in our hack-worried world, some websites restrict or limit access to prevent malicious intruders, at the expense of legitimate users. Finally, software bugs introduced inadvertently by developers add to the challenge.

It is Not “One Size Fits All”

“We want the content from www.very_important_content.com.” These marching orders launch our focused functional and technical analysis of each new website.

Understanding the design of each individual website is a prerequisite for successful crawler automation and content harvesting. Our methodology guides both the website analysts and, later, the developers through a series of questions designed to derive the best approach for each unique website and content set.

Some critical questions to ask are the following:

  • How does the website work? Where is the data of interest and how is it accessed? The possibilities are endless and include traversing menus, sequencing through tables of content, clicking on headlines, and entering search terms.

  • How is the website content organized? Date order, subject matter, etc.? Understanding the way a website is organized is critical to locating the content you need and avoiding duplicates.

  • What is the website depth? How many links do we need to traverse to access the content? Depending on your business need, you may want to limit your search depth.

  • Is all the required metadata available on the website or does it need to be extracted from the content itself? Metadata is often even more important than the content itself and is needed for validation and search. Getting the metadata from the best source is key.

  • How large is the website? Is one crawler able to process it in its entirety? The resulting crawl process must be executed in a timely manner; this is a major consideration for the developer when configuring each website crawl.

  • How consistent is the design of the entire website? How large a sampling is required to successfully specify requirements for the crawler automation? Some websites are highly structured and organized. Others have a surprise on every page.

One Best Approach? Wishful Thinking….

One learns quickly that one size does not fit all. It is not feasible to design one approach to intelligently crawl even a small subset of these websites. Even within the same department, the web page layout and backend technology often varies, requiring frequent customizations.

Modern websites have progressed far beyond simple HTML pages to interactive, database-driven applications, with logic residing both in the client page and in server-based code. This forced a transition from NCSA Mosaic (the last released OS X version was 1.7 MB) to the current version of Google Chrome, weighing in at a hefty 554 MB at the time of this writing. This growth in application size reflects the ever-expanding feature set supported in modern browsers.

Our Methodology - How to Focus on What Matters

In order to attack these problems, we’ve identified a focused series of questions that guide the developer through the decision process and determine an optimized approach to extract content and metadata from each specific website. These include:

Does the website use a standard CMS (e.g. Drupal, Joomla or WordPress)?

Consistency is the primary advantage of crawling a website that uses a standard CMS. The page layouts follow a pattern, the lists of content are organized with the same tagging scheme, and the same metadata tagging is often shared across pages.
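As a rough illustration, a quick check for a standard CMS can often be made from the page markup itself. The Python sketch below looks for the common <meta name="generator"> hint; the URL is a placeholder, and many sites strip or spoof this tag, so it is a heuristic at best.

    import requests
    from lxml import html

    def detect_cms(url):
        # Fetch the page and look for the generator hint; WordPress, Drupal,
        # and Joomla frequently advertise themselves here.
        page = html.fromstring(requests.get(url, timeout=30).content)
        generators = page.xpath('//meta[@name="generator"]/@content')
        return generators[0] if generators else None

    print(detect_cms("https://www.example.com/"))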

What is the Underlying Technology Stack?

If the website is hosted using ASP.NET Web Forms, paging and navigation are typically implemented as form posts. If it is an Angular website, it may make heavy use of Ajax or a Single Page Application (SPA) paradigm. The actual URL holding the content may not be immediately obvious, requiring emulation of a JavaScript-enabled browser or monitoring requests in the browser's HTML debugging tools to see how the data is being sourced. A similar situation occurs when a website makes heavy use of frames; often the actual content URL is not the URL in the address bar.
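As a simple illustration, once the browser's network panel reveals the JSON endpoint an SPA page calls, the crawler can request that endpoint directly rather than the HTML shell. This Python sketch is hypothetical: the /api/articles path, the page parameter, and the response fields are stand-ins for whatever the debugging tools actually reveal.

    import requests

    # Call the JSON endpoint observed in the browser's network panel
    # (endpoint and parameters here are illustrative placeholders).
    resp = requests.get(
        "https://www.example.com/api/articles",
        params={"page": 1},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    for item in resp.json().get("items", []):
        print(item.get("title"), item.get("url"))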

What security and authentication are in place?

Does the website require a logon? Does it require cookies or other headers that accompany the call and must be maintained between calls? The fastest way to crawl a website is to connect to the specific web address (URI) with an HTTP GET, retrieve the response, and stream the results to a file. If the interaction between the server and browser is complex, it is unlikely that this approach will work.
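The fast path is sketched below in Python with illustrative URLs: a plain GET streamed to a file, with a session object preserving any cookies between calls. Sites requiring logons or more complex handshakes will not yield to this approach.

    import requests

    session = requests.Session()
    # Hit the landing page first so any session cookies are captured.
    session.get("https://www.example.com/", timeout=30)

    # Retrieve a content page with a plain GET and stream it to disk.
    with session.get("https://www.example.com/reports/report1.html",
                     stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open("report1.html", "wb") as out:
            for chunk in resp.iter_content(chunk_size=8192):
                out.write(chunk)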

Rules for polite web crawling to avoid being blocked

The difference between a DDoS attack and an aggressive crawler is slim. It is a fairly simple task to write a web crawler that spawns many threads, all simultaneously grabbing content from a given website to extract everything quickly. However, this method will quickly get your IP address blacklisted and block you from the website. A preferred method is to minimize simultaneous connections and insert artificial pauses between requests, mimicking normal user browsing behavior. Even so, some websites will limit the number of files you can download in a given day from the same IP address. To avoid this, you either have to request files from multiple addresses or hook into the Tor network to use a different IP address on every request.
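A minimal polite-crawling sketch in Python: one connection, a randomized pause between requests, and a User-Agent that identifies the bot. The delay values and contact address are illustrative, not prescriptive.

    import random
    import time
    import requests

    HEADERS = {"User-Agent": "ExampleCrawler/1.0 (crawler-admin@example.com)"}

    def polite_fetch(urls, min_delay=2.0, max_delay=6.0):
        session = requests.Session()
        session.headers.update(HEADERS)
        for url in urls:
            yield url, session.get(url, timeout=30)
            # Mimic normal browsing: pause a few seconds before the next request.
            time.sleep(random.uniform(min_delay, max_delay))

Even this will not defeat per-IP download quotas; for those, the choice is between requesting from multiple addresses or routing through Tor, as noted above.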

Is there an API/RSS feed available?

Some websites, including a few Federal websites, have a clean API available that allows you to pull the data in via a simple REST or SOAP call. Others expose their content via RSS (Really Simple Syndication), eliminating the need to parse the HTML pages.
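Where a feed exists, the harvest can be as simple as the following Python sketch using the standard library. The feed URL is hypothetical, and real-world feeds (Atom, namespaced RSS extensions) may need extra handling.

    import urllib.request
    import xml.etree.ElementTree as ET

    with urllib.request.urlopen("https://www.example.com/feed.rss") as resp:
        root = ET.fromstring(resp.read())

    # A plain RSS 2.0 feed lists its entries as <item> elements.
    for item in root.iter("item"):
        print(item.findtext("pubDate"), item.findtext("title"), item.findtext("link"))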

Does the website have bugs - and how severe?

Bugs can range from simple broken links and unavailable images to flawed paging logic that only manifests itself when you are well into the development of a crawl. In some cases, webmasters are responsive and will address, or at least acknowledge, the flaws in their website - but often you simply have to find a way to work around them.

Crawler Magic - From Their Website to Ours

Rarely are two websites alike. A viable crawl solution must accommodate the unique aspects of a website without starting from scratch each time we face a new nuance. Our toolchain approach, in which a set of components are assembled into a crawler, is our preferred method for crawling large numbers of diverse websites in an efficient, timely manner and has proven very effective.

Some of the components we configure in our toolchain approach include:

Page Downloading

At its core, a web crawler is a mechanism for bulk downloading pages. The simplest mechanism is an HTTP GET, the HTTP command to access a URI and retrieve a response. This only returns the full page for simpler websites, but has a tremendous speed advantage and is our default mechanism. For sites that require cookies, we supplement the HTTP GET accordingly.

Pages are often loaded or changed dynamically by client-side scripts. Sections of text may be appended, deleted, or expanded. As our ultimate goal is downloading the complete contents of the page, we may need to emulate a browser.
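When emulation is needed, one common approach - sketched here with Selenium and headless Chrome, assuming a chromedriver installation and an illustrative URL - is to let the client-side scripts run and then capture the rendered DOM.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.example.com/dynamic-listing")
        # page_source reflects the DOM after client-side scripts have run.
        rendered_html = driver.page_source
    finally:
        driver.quit()

The trade-off is speed: driving a full browser is far slower than a plain GET, which is why we reserve it for pages that cannot be downloaded any other way.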

Page Parsing

Parsing will grab elements from within the page and intelligently process them. There are several common approaches for selecting and navigating elements within web pages.

  • CSS selectors are commonly used by many JavaScript tools to quickly grab HTML elements and act on them. However, many elements have no class, lack a distinct identifier, or repeat frequently.

  • Some pages rely on unique identifiers for the elements in question, but developers often only uniquely tag the elements they intend to manipulate via CSS or JavaScript.

  • Many developer tools, e.g. Firebug and the Chrome Developer Tools, let you query via XPath and interactively preview your result, providing a more robust query language to quickly filter and navigate between elements.

The strength of using XPath is based on the similarities between HTML and XML. HTML is relatively unstructured compared to XML and may not be well-formed. Thankfully, most languages have a forgiving parser that allows you to treat HTML as if it were XML. These parsers support a generic, XPath-based mechanism for narrowing down the relevant elements of a page and walking the elements of the page for metadata extraction and more complex filtering.
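As a small illustration, the Python sketch below uses lxml, whose HTML parser tolerates malformed markup, to narrow a page to a content region with XPath. The class names and page structure are hypothetical.

    from lxml import html

    def extract_articles(raw_html):
        doc = html.fromstring(raw_html)
        # Narrow to the listing region, then walk each entry within it.
        for article in doc.xpath('//div[@class="article-list"]//article'):
            title = article.xpath('string(.//h2)').strip()
            links = article.xpath('.//a/@href')
            yield title, links[0] if links else None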

Metadata Extraction

In addition to the HTML documents, we are usually required to extract metadata from index or other pages. By walking the elements surrounding the link that led us to a page, much like walking up and down the document object model, we can extract the associated metadata.
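A sketch of that technique in Python with lxml: climb from the link to its enclosing row and collect the neighboring cells as metadata. The table layout assumed here (date and title cells alongside the link) is hypothetical.

    from lxml import html

    def index_metadata(raw_html):
        doc = html.fromstring(raw_html)
        for link in doc.xpath('//table[@id="documents"]//a[@href]'):
            node = link.getparent()
            # Walk up the DOM to the enclosing table row.
            while node is not None and node.tag != "tr":
                node = node.getparent()
            cells = node.xpath('./td//text()') if node is not None else []
            # The sibling cells typically hold the date, title, and other metadata.
            yield link.get("href"), [c.strip() for c in cells if c.strip()]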

Page Filtering

There are several options for filtering pages in a crawl (a combined sketch follows the list):

  • Limit the section of the page examined for links using XPath.

  • Examine the link itself for keywords that indicate that the content is not in scope or duplicated.

  • A final option is to apply logical filters, e.g. filtering out historic versions of a page.
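The combined sketch referenced above, in Python with lxml: restrict link harvesting to one region via XPath, screen the URL for out-of-scope keywords, and apply a logical filter such as skipping archived versions. The XPath, keywords, and paths are illustrative only.

    from lxml import html

    SKIP_KEYWORDS = ("login", "print-version", "/archive/")

    def in_scope_links(raw_html):
        doc = html.fromstring(raw_html)
        # Option 1: only examine links inside the main content region.
        for href in doc.xpath('//div[@id="main-content"]//a/@href'):
            # Options 2 and 3: drop out-of-scope, duplicate, or historic pages.
            if any(word in href.lower() for word in SKIP_KEYWORDS):
                continue
            yield href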

Page Differencing

As you advance beyond simple file comparison, determining whether a page on a website has changed is a complex task, often requiring a multi-step process (sketched after the list):

  • Isolate only those areas of interest on the page.

  • Strip tags that do not affect the meaning of the page such as head elements, style tags, JavaScript, and attributes within the tag.

  • Assess if the difference is material.

    Switching from straight quotes to curly quotes, or from normal spaces to non-breaking spaces, is not usually meaningful. Other changes are more subtle, such as paragraph transitions from preformatted text (<pre>) to lines contained within a paragraph, or lines split by breaks - with no actual text differences.

  • Apply intelligence to chunk sentences and sentence fragments to compare each word.
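The sketch referenced above covers the first three steps in simplified form, using Python's lxml and difflib. The XPath and the similarity threshold are illustrative choices, not fixed values.

    import difflib
    from lxml import etree, html

    def normalized_text(raw_html):
        doc = html.fromstring(raw_html)
        # Strip elements that do not affect the meaning of the page.
        etree.strip_elements(doc, "script", "style", "head", with_tail=False)
        # Isolate the region of interest; fall back to the whole document.
        region = doc.xpath('//div[@id="content"]')
        text = region[0].text_content() if region else doc.text_content()
        # Normalize characters that carry no meaning for comparison purposes.
        text = (text.replace("\u201c", '"').replace("\u201d", '"')
                    .replace("\u2019", "'").replace("\xa0", " "))
        return " ".join(text.split())   # collapse whitespace and line breaks

    def is_materially_changed(old_html, new_html, threshold=0.995):
        ratio = difflib.SequenceMatcher(
            None, normalized_text(old_html), normalized_text(new_html)).ratio()
        return ratio < threshold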

No plan survives contact with a webmaster

Sites change, pages are updated with new character sets, update notification pages are frequently wrong, and links die or are changed. We couple our automated crawling with automated validation to ensure that we have all the required files and metadata. When we find discrepancies, alerts are issued and our website analysts often reach out to the webmasters. Whether we will receive a response or a resolution is uncertain, so we often have to implement workarounds.

What is Next? Beyond 2018….

Over the past two and a half years, we have developed a series of best practices for web crawling and harvesting technologies, achieving fully automated processing against a wide range of diverse, complex and often poorly structured websites. Our methodology has been iteratively refined to accommodate the ever-changing landscape of internet content and facilitate a model of continuous improvement.

We are far from done. While not there yet, we are well on our way to eliminating manual intervention and further automating website analysis, greatly reducing the manual effort to research and resolve problems. We are starting to leverage our volumes of data to create training sets for machine-learning-based troubleshooting and information extraction, and this is already demonstrating significant potential.

Our current road map includes utilizing TensorFlow, NLP, and supervised machine learning to classify sections of text, extract references and metadata, and supplement our quality control, all targeted at improving the consistency and reliability of our results - and doing it faster and better.

Mark Gross

President

Data Conversion Laboratory

Mark Gross, President of Data Conversion Laboratory, is a recognized authority on XML implementation, document conversion, and data mining. Prior to founding DCL in 1981, he was with the consulting practice of Arthur Young & Co. Mark has a BS in Engineering from Columbia University and an MBA from New York University, and has taught at the New York University Graduate School of Business, the New School, and Pace University.

Tammy Bilitzky

Chief Information Officer

Data Conversion Laboratory

Tammy Bilitzky is Data Conversion Laboratory (DCL)’s Chief Information Officer. Serving with DCL since 2013, Tammy is responsible for managing the company’s technology department; continuing its focus on resilient, high-quality, and innovative products; and helping to expand the business. She has extensive experience in using technology to deliver client value, supporting business-process transformation and managing complex, large-scale programs on and off shore. She holds a BS in computer science and business administration from Northeastern Illinois University and is a Project Management Professional, Six Sigma Green Belt, and Certified Scrum Master.

Rich Dominelli

Lead Software Engineer

Data Conversion Laboratory

As Lead Software Engineer, Rich brings over 25 years of system architecture experience to Data Conversion Laboratory (DCL). Applying his education from Iona College, the University of Phoenix, and Stony Brook University, he has been solving problems and designing resilient solutions on everything from microcontrollers to mobile phones to state-of-the-art web-based meter data management systems. Most recently, Rich has been focusing on creating intelligent, targeted web crawlers.

Allan Lieberman

Special Projects Manager

Data Conversion Laboratory

With a comprehensive technical background in both computer software development and large scale database design and applications, Special Projects Manager Allan Lieberman currently oversees Data Conversion Laboratory (DCL)'s efforts in identifying and accessing legal content on websites worldwide, and provides technical guidance both in-house and to clients. Allan joined DCL in 2012, following 25 years with the Information Systems department of Davis Polk & Wardwell, a leading global law firm, where his most recent position was Manager of Software Design and Systems Development. He holds a BA in Mathematics from City College of New York, and an MS in Computer Science from Polytechnic University of New York.