SourceOptionsWebCrawl Structure Reference


                    
                    
CrawlSpeed

The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means up to two URLs are fetched concurrently with a short delay between fetch calls. aggressive means up to ten URLs are fetched concurrently with a short delay between fetch calls.

Declaration

Swift

public enum CrawlSpeed : String
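
As a rough illustration of how a String-backed enum like this maps to the three speeds described above, the sketch below uses a hypothetical stand-in type, `CrawlSpeedSketch`. The case names are assumed from the values named in the description, and `maxConcurrentFetches` is an invented helper, not part of the SDK.

```swift
// Hypothetical stand-in for illustration; not the SDK's own declaration.
// Case names are assumed from the values "gentle", "normal", and "aggressive".
enum CrawlSpeedSketch: String {
    case gentle
    case normal
    case aggressive

    /// Upper bound on concurrent fetches, taken from the description above.
    /// This helper is invented for the example.
    var maxConcurrentFetches: Int {
        switch self {
        case .gentle: return 1
        case .normal: return 2
        case .aggressive: return 10
        }
    }
}

// The raw value is the kind of string the optional `crawlSpeed` property holds.
let speed = CrawlSpeedSketch.normal
print(speed.rawValue, speed.maxConcurrentFetches)   // prints "normal 2"
```

Because the enum is String-backed, `rawValue` gives a value suitable for a `String?` field such as `crawlSpeed`.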

url

The starting URL to crawl.

Declaration

Swift

public var url: String


                    
                    
limitToStartingHosts

When true, crawls of the specified URL are limited to the host part of the url field.

Declaration

Swift

public var limitToStartingHosts: Bool?
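
To make the idea of limiting a crawl to the host part of the starting URL concrete, here is a minimal sketch using Foundation's `URLComponents`. The function `isOnStartingHost` and the example URLs are hypothetical; the service's actual host-matching rules may differ (for example, in how subdomains are treated).

```swift
import Foundation

/// Returns true when `link` has the same host as `start`.
/// Illustrative only: an exact, case-insensitive host comparison is an assumption.
func isOnStartingHost(_ link: String, start: String) -> Bool {
    guard let startHost = URLComponents(string: start)?.host?.lowercased(),
          let linkHost = URLComponents(string: link)?.host?.lowercased() else {
        return false   // treat unparseable or host-less URLs as off-host
    }
    return startHost == linkHost
}

let start = "https://example.com/docs"
print(isOnStartingHost("https://example.com/docs/api", start: start))     // true
print(isOnStartingHost("https://other.example.org/page", start: start))   // false
```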


                    
                    
crawlSpeed

The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means up to two URLs are fetched concurrently with a short delay between fetch calls. aggressive means up to ten URLs are fetched concurrently with a short delay between fetch calls.

Declaration

Swift

public var crawlSpeed: String?


                    
                    
allowUntrustedCertificate

When true, allows the crawl to interact with HTTPS sites whose SSL certificates have untrusted signers.

Declaration

Swift

public var allowUntrustedCertificate: Bool?


                    
                    
maximumHops

The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.

Declaration

Swift

public var maximumHops: Int?
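
The hop counting described above can be sketched as a breadth-first walk over a small in-memory link graph. The function `pagesWithinHops` and the sample graph are hypothetical and only illustrate the counting rule; they say nothing about how the service itself schedules fetches.

```swift
/// Breadth-first walk over an in-memory link graph, returning the hop count
/// of every page reachable within `maximumHops` of `start`.
func pagesWithinHops(from start: String,
                     links: [String: [String]],
                     maximumHops: Int) -> [String: Int] {
    var hops = [start: 0]   // the first page crawled is 0 hops
    var queue = [start]
    var index = 0
    while index < queue.count {
        let page = queue[index]
        index += 1
        // Do not follow links from pages already at the hop limit.
        guard let depth = hops[page], depth < maximumHops else { continue }
        for next in links[page] ?? [] where hops[next] == nil {
            hops[next] = depth + 1
            queue.append(next)
        }
    }
    return hops
}

// Hypothetical site: "/" links to "/a" and "/b", "/a" to "/c", "/c" to "/d".
let links = [
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/c": ["/d"]
]
let reached = pagesWithinHops(from: "/", links: links, maximumHops: 2)
print(reached.keys.sorted())   // ["/", "/a", "/b", "/c"]
```

With a limit of 2, `/d` is left out because it sits 3 hops from the starting page.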


                    
                    
requestTimeout

The maximum time, in milliseconds, to wait for a response from the web server.

Declaration

Swift

public var requestTimeout: Int?


                    
                    
overrideRobotsTxt

When true, the crawler ignores any robots.txt it encounters. This should only ever be done when crawling a website the user owns. This must be set to true when a gateway_id is specified in the credentials.

Declaration

Swift

public var overrideRobotsTxt: Bool?


                    
                    
blacklist

Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.

Declaration

Swift

public var blacklist: [String]?
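
The exclusion rule above reads as a substring test, which can be sketched as follows. `isBlacklisted` is a hypothetical helper written for this example; plain, case-sensitive substring matching is an assumption based on the description, not a statement of the service's exact behavior.

```swift
import Foundation

/// Returns true when `link` contains any blacklist entry as a substring.
/// Assumes plain, case-sensitive substring matching.
func isBlacklisted(_ link: String, blacklist: [String]) -> Bool {
    return blacklist.contains { entry in link.contains(entry) }
}

let blacklist = ["https://ibm.com/watson"]
print(isBlacklisted("https://ibm.com/watson/discovery", blacklist: blacklist))   // true
print(isBlacklisted("https://ibm.com/cloud", blacklist: blacklist))              // false
```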


                    
                    
init(url:limitToStartingHosts:crawlSpeed:allowUntrustedCertificate:maximumHops:requestTimeout:overrideRobotsTxt:blacklist:)

Initialize a SourceOptionsWebCrawl with member variables.

Declaration

Swift

public init(
    url: String,
    limitToStartingHosts: Bool? = nil,
    crawlSpeed: String? = nil,
    allowUntrustedCertificate: Bool? = nil,
    maximumHops: Int? = nil,
    requestTimeout: Int? = nil,
    overrideRobotsTxt: Bool? = nil,
    blacklist: [String]? = nil
)

Parameters

`url`	The starting URL to crawl.
`limitToStartingHosts`	When `true`, crawls of the specified URL are limited to the host part of the url field.
`crawlSpeed`	The number of concurrent URLs to fetch. `gentle` means one URL is fetched at a time with a delay between each call. `normal` means up to two URLs are fetched concurrently with a short delay between fetch calls. `aggressive` means up to ten URLs are fetched concurrently with a short delay between fetch calls.
`allowUntrustedCertificate`	When `true`, allows the crawl to interact with HTTPS sites whose SSL certificates have untrusted signers.
`maximumHops`	The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within `maximum_hops` of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
`requestTimeout`	The maximum time, in milliseconds, to wait for a response from the web server.
`overrideRobotsTxt`	When `true`, the crawler ignores any `robots.txt` it encounters. This should only ever be done when crawling a website the user owns. This must be set to `true` when a `gateway_id` is specified in the credentials.
`blacklist`	Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing `https://ibm.com/watson` also excludes `https://ibm.com/watson/discovery`.

Return Value

An initialized SourceOptionsWebCrawl.
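
A call to this initializer might look like the sketch below. So that the snippet compiles without the SDK, it declares a local stand-in struct with the same property names and types as documented above; in real code the stand-in would be dropped in favor of the SDK's own type. The URLs and values are placeholders.

```swift
// Stand-in mirroring the documented properties so the example is self-contained.
// Swift synthesizes a memberwise initializer in which the optional `var`
// properties default to nil, matching the documented defaults.
struct SourceOptionsWebCrawl {
    var url: String
    var limitToStartingHosts: Bool?
    var crawlSpeed: String?
    var allowUntrustedCertificate: Bool?
    var maximumHops: Int?
    var requestTimeout: Int?
    var overrideRobotsTxt: Bool?
    var blacklist: [String]?
}

// Only `url` is required; omitted arguments stay nil.
let options = SourceOptionsWebCrawl(
    url: "https://example.com",
    limitToStartingHosts: true,
    crawlSpeed: "gentle",
    maximumHops: 2,
    requestTimeout: 30_000,   // milliseconds
    blacklist: ["https://example.com/private"]
)
print(options.url, options.maximumHops ?? -1)   // prints "https://example.com 2"
```

Arguments must appear in the declared order, but any optional one can be skipped, as `allowUntrustedCertificate` and `overrideRobotsTxt` are here.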