Class SourceOptionsWebCrawl

public class SourceOptionsWebCrawl
Object defining which URL to crawl and how to crawl it.
  • Method Details

    • newBuilder

      public SourceOptionsWebCrawl.Builder newBuilder()
      New builder.
      a SourceOptionsWebCrawl builder
    • url

      public String url()
      Gets the url.

      The starting URL to crawl.

      the url
    • limitToStartingHosts

      public Boolean limitToStartingHosts()
      Gets the limitToStartingHosts.

      When `true`, crawls of the specified URL are limited to the host part of the **url** field.

      the limitToStartingHosts
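
      As a rough illustration of what limiting a crawl to the starting host could mean, the sketch below compares the host part of a candidate link with the host part of the starting URL. The class `StartingHostFilter`, the method `sameHost`, and the example URLs are hypothetical and are not part of this SDK; the service's actual matching rules may differ.

```java
import java.net.URI;

// Hypothetical helper, not part of the SDK: illustrates host-based link filtering.
class StartingHostFilter {

    // Returns true when the link's host equals the starting URL's host (case-insensitive).
    static boolean sameHost(String startUrl, String link) {
        String startHost = URI.create(startUrl).getHost();
        String linkHost = URI.create(link).getHost();
        // A null host (for example, a relative link) is treated as "not the same host".
        return startHost != null && startHost.equalsIgnoreCase(linkHost);
    }

    public static void main(String[] args) {
        String start = "https://example.com/docs";
        System.out.println(sameHost(start, "https://example.com/about"));   // true
        System.out.println(sameHost(start, "https://other.example.org/"));  // false
    }
}
```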
    • crawlSpeed

      public String crawlSpeed()
      Gets the crawlSpeed.

      The number of concurrent URLs to fetch. `gentle` means one URL is fetched at a time with a delay between each call. `normal` means as many as two URLs are fetched concurrently with a short delay between fetch calls. `aggressive` means that up to ten URLs are fetched concurrently with a short delay between fetch calls.

      the crawlSpeed
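
      The three speed settings described above can be summarized as upper bounds on concurrent fetches. The following is a minimal sketch of that mapping; the class `CrawlSpeedLimits` and the method `maxConcurrentFetches` are hypothetical names used for illustration, and the delays between fetch calls are not modeled because the description does not give their values.

```java
import java.util.Map;

// Hypothetical helper, not part of the SDK: maps a crawl speed name to its concurrency bound.
class CrawlSpeedLimits {

    // Upper bounds taken from the crawlSpeed description: one, two, and ten concurrent URLs.
    static final Map<String, Integer> MAX_CONCURRENT =
            Map.of("gentle", 1, "normal", 2, "aggressive", 10);

    static int maxConcurrentFetches(String crawlSpeed) {
        Integer limit = MAX_CONCURRENT.get(crawlSpeed);
        if (limit == null) {
            throw new IllegalArgumentException("Unknown crawl speed: " + crawlSpeed);
        }
        return limit;
    }

    public static void main(String[] args) {
        System.out.println(maxConcurrentFetches("gentle"));      // 1
        System.out.println(maxConcurrentFetches("normal"));      // 2
        System.out.println(maxConcurrentFetches("aggressive"));  // 10
    }
}
```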
    • allowUntrustedCertificate

      public Boolean allowUntrustedCertificate()
      Gets the allowUntrustedCertificate.

      When `true`, allows the crawl to interact with HTTPS sites whose SSL certificates have untrusted signers.

      the allowUntrustedCertificate
    • maximumHops

      public Long maximumHops()
      Gets the maximumHops.

      The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within **maximum_hops** of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.

      the maximumHops
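
      The hop counting described above behaves like a breadth-first traversal with a depth limit. The sketch below applies that idea to a small in-memory link graph instead of real web pages. The class `HopLimitedCrawl`, the method `crawl`, and the page names are hypothetical, and skipping pages that were already seen is an assumption of this sketch rather than documented service behavior.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the SDK: breadth-first traversal bounded by a hop limit.
class HopLimitedCrawl {

    // Returns each reachable page with its hop count, visiting nothing beyond maximumHops.
    static Map<String, Integer> crawl(Map<String, List<String>> links, String start, long maximumHops) {
        Map<String, Integer> hops = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        hops.put(start, 0);          // the first page crawled is 0 hops
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            int depth = hops.get(page);
            if (depth >= maximumHops) {
                continue;            // links from this page would exceed the limit
            }
            for (String next : links.getOrDefault(page, List.of())) {
                if (!hops.containsKey(next)) {   // assumption: each page is visited once
                    hops.put(next, depth + 1);
                    queue.add(next);
                }
            }
        }
        return hops;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "A", List.of("B", "C"),
                "B", List.of("D"),
                "D", List.of("E"));
        // With a limit of 2 hops, page E (3 hops from A) is never reached.
        System.out.println(crawl(links, "A", 2)); // {A=0, B=1, C=1, D=2}
    }
}
```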
    • requestTimeout

      public Long requestTimeout()
      Gets the requestTimeout.

      The maximum time, in milliseconds, to wait for a response from the web server.

      the requestTimeout
    • overrideRobotsTxt

      public Boolean overrideRobotsTxt()
      Gets the overrideRobotsTxt.

      When `true`, the crawler ignores any `robots.txt` that it encounters. This should only ever be done when crawling a web site the user owns. This must be set to `true` when a **gateway_id** is specified in the **credentials**.

      the overrideRobotsTxt
    • blacklist

      public List<String> blacklist()
      Gets the blacklist.

      Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing `` also excludes ``.

      the blacklist
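
      Because exclusion is described as a "contains" check, one blacklist entry can exclude every link that includes it as a substring. The sketch below shows that matching rule in isolation. The class `BlacklistFilter`, the method `isExcluded`, and the `example.com` URLs are hypothetical placeholders chosen for illustration; they are not taken from the SDK or from the original example.

```java
import java.util.List;

// Hypothetical helper, not part of the SDK: substring-based link exclusion.
class BlacklistFilter {

    // Returns true when the link contains any blacklist entry as a substring.
    static boolean isExcluded(String link, List<String> blacklist) {
        for (String entry : blacklist) {
            if (link.contains(entry)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> blacklist = List.of("https://example.com/private");
        System.out.println(isExcluded("https://example.com/private", blacklist));        // true
        System.out.println(isExcluded("https://example.com/private/notes", blacklist));  // true
        System.out.println(isExcluded("https://example.com/public", blacklist));         // false
    }
}
```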