SourceOptionsWebCrawl

public struct SourceOptionsWebCrawl : Codable, Equatable

Object defining which URL to crawl and how to crawl it.

  • The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means up to two URLs are fetched concurrently with a short delay between fetch calls. aggressive means up to ten URLs are fetched concurrently with a short delay between fetch calls.

    Declaration

    Swift

    public enum CrawlSpeed : String
  • url

    The starting URL to crawl.

    Declaration

    Swift

    public var url: String
  • When true, crawls of the specified URL are limited to the host part of the url field.

    Declaration

    Swift

    public var limitToStartingHosts: Bool?
  • The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means up to two URLs are fetched concurrently with a short delay between fetch calls. aggressive means up to ten URLs are fetched concurrently with a short delay between fetch calls.

    Declaration

    Swift

    public var crawlSpeed: String?
  • When true, allows the crawler to connect to HTTPS sites whose SSL certificates have untrusted signers.

    Declaration

    Swift

    public var allowUntrustedCertificate: Bool?
  • The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.

    Declaration

    Swift

    public var maximumHops: Int?
  • The maximum time, in milliseconds, to wait for a response from the web server.

    Declaration

    Swift

    public var requestTimeout: Int?
  • When true, the crawler ignores any robots.txt it encounters. Do this only when crawling a website that the user owns. This must be set to true when a gateway_id is specified in the credentials.

    Declaration

    Swift

    public var overrideRobotsTxt: Bool?
  • Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.

    Declaration

    Swift

    public var blacklist: [String]?
  • Initialize a SourceOptionsWebCrawl with member variables.

    Declaration

    Swift

    public init(
        url: String,
        limitToStartingHosts: Bool? = nil,
        crawlSpeed: String? = nil,
        allowUntrustedCertificate: Bool? = nil,
        maximumHops: Int? = nil,
        requestTimeout: Int? = nil,
        overrideRobotsTxt: Bool? = nil,
        blacklist: [String]? = nil
    )

    Parameters

    url

    The starting URL to crawl.

    limitToStartingHosts

    When true, crawls of the specified URL are limited to the host part of the url field.

    crawlSpeed

    The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means up to two URLs are fetched concurrently with a short delay between fetch calls. aggressive means up to ten URLs are fetched concurrently with a short delay between fetch calls.

    allowUntrustedCertificate

    When true, allows the crawler to connect to HTTPS sites whose SSL certificates have untrusted signers.

    maximumHops

    The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.

    requestTimeout

    The maximum time, in milliseconds, to wait for a response from the web server.

    overrideRobotsTxt

    When true, the crawler ignores any robots.txt it encounters. Do this only when crawling a website that the user owns. This must be set to true when a gateway_id is specified in the credentials.

    blacklist

    Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.

    Return Value

    An initialized SourceOptionsWebCrawl.
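
Example

The initializer and the blacklist rule described above can be sketched as follows. This is a minimal illustration, not SDK code: the struct is a local stand-in that copies the documented declarations so the snippet compiles on its own, the start URL and option values are made up, and isExcluded is a hypothetical helper that reads the blacklist description as a plain substring match. The crawler's actual matching logic may differ.

```swift
import Foundation

// Local stand-in mirroring the declarations documented on this page, so the
// example compiles without the SDK. In a real project, use the SDK's own type.
struct SourceOptionsWebCrawl: Codable, Equatable {
    var url: String
    var limitToStartingHosts: Bool?
    var crawlSpeed: String?
    var allowUntrustedCertificate: Bool?
    var maximumHops: Int?
    var requestTimeout: Int?
    var overrideRobotsTxt: Bool?
    var blacklist: [String]?

    init(
        url: String,
        limitToStartingHosts: Bool? = nil,
        crawlSpeed: String? = nil,
        allowUntrustedCertificate: Bool? = nil,
        maximumHops: Int? = nil,
        requestTimeout: Int? = nil,
        overrideRobotsTxt: Bool? = nil,
        blacklist: [String]? = nil
    ) {
        self.url = url
        self.limitToStartingHosts = limitToStartingHosts
        self.crawlSpeed = crawlSpeed
        self.allowUntrustedCertificate = allowUntrustedCertificate
        self.maximumHops = maximumHops
        self.requestTimeout = requestTimeout
        self.overrideRobotsTxt = overrideRobotsTxt
        self.blacklist = blacklist
    }
}

// Hypothetical helper (not part of the SDK): a link is treated as excluded
// when it contains any blacklist entry as a substring.
func isExcluded(_ link: String, by blacklist: [String]) -> Bool {
    return blacklist.contains { entry in link.contains(entry) }
}

// Only `url` is required; every other option defaults to nil.
let options = SourceOptionsWebCrawl(
    url: "https://example.com",             // illustrative start URL
    limitToStartingHosts: true,
    crawlSpeed: "gentle",                   // one of: gentle, normal, aggressive
    maximumHops: 2,                         // start page, its links, and their links
    requestTimeout: 30000,                  // milliseconds
    blacklist: ["https://ibm.com/watson"]
)

let excluded = isExcluded("https://ibm.com/watson/discovery", by: options.blacklist ?? [])
print("excluded:", excluded)
```

Options left out of the call, such as allowUntrustedCertificate and overrideRobotsTxt here, stay nil.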