SourceOptionsWebCrawl
public struct SourceOptionsWebCrawl : Codable, Equatable
Object defining which URL to crawl and how to crawl it.
The number of concurrent URLs to fetch.
gentle means one URL is fetched at a time with a delay between each call. normal means as many as two URLs are fetched concurrently with a short delay between fetch calls. aggressive means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
Declaration
Swift
public enum CrawlSpeed : String
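As a minimal sketch of how the three speeds relate to concurrency, the enum below is a local stand-in rather than the SDK type. Its case names and raw values are assumed from the descriptions above, and maxConcurrentFetches is a hypothetical helper introduced only for illustration.

```swift
// Local stand-in for illustration only; the case names and raw values are
// assumed from the descriptions above, not copied from the SDK source.
enum CrawlSpeed: String {
    case gentle
    case normal
    case aggressive
}

// Hypothetical helper: upper bound on concurrent fetches, per the text above.
func maxConcurrentFetches(for speed: CrawlSpeed) -> Int {
    switch speed {
    case .gentle: return 1
    case .normal: return 2
    case .aggressive: return 10
    }
}

// The crawlSpeed property is declared as String?, so the raw value is what
// would be passed along.
let crawlSpeed: String? = CrawlSpeed.normal.rawValue
```

Going through an enum before handing over the raw string keeps misspelled speed names from reaching the service.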
The starting URL to crawl.
Declaration
Swift
public var url: String
When true, crawls of the specified URL are limited to the host part of the url field.
Declaration
Swift
public var limitToStartingHosts: Bool?
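The host restriction can be pictured with a small sketch. The helper isOnStartingHost is hypothetical, and the service's actual comparison rules (for example, how subdomains are treated) are not documented here; this version assumes a simple case-insensitive match on the host component.

```swift
import Foundation

// Hypothetical helper: true when `link` has the same host as `startingURL`.
// Assumes a plain case-insensitive host comparison; the real crawler may differ.
func isOnStartingHost(_ link: String, startingURL: String) -> Bool {
    guard let startHost = URLComponents(string: startingURL)?.host,
          let linkHost = URLComponents(string: link)?.host else {
        return false
    }
    return startHost.lowercased() == linkHost.lowercased()
}

// example.com is an illustrative off-host URL, not taken from the source.
let sameHost = isOnStartingHost("https://ibm.com/watson/discovery",
                                startingURL: "https://ibm.com/watson")
let otherHost = isOnStartingHost("https://example.com/page",
                                 startingURL: "https://ibm.com/watson")
```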
The number of concurrent URLs to fetch.
gentle means one URL is fetched at a time with a delay between each call. normal means as many as two URLs are fetched concurrently with a short delay between fetch calls. aggressive means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
Declaration
Swift
public var crawlSpeed: String?
When true, allows the crawl to interact with HTTPS sites whose SSL certificates have untrusted signers.
Declaration
Swift
public var allowUntrustedCertificate: Bool?
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
Declaration
Swift
public var maximumHops: Int?
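The hop counting described above can be sketched as a breadth-first walk. The link graph, the page names, and the pagesWithinHops helper are all hypothetical; this only illustrates how hop counts are assigned, not how the service crawls.

```swift
// Hypothetical in-memory link graph: page -> links found on that page.
let links: [String: [String]] = [
    "start": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": []
]

// Breadth-first walk that records the hop count of every page reached
// without exceeding maximumHops. The starting page is 0 hops.
func pagesWithinHops(from start: String,
                     maximumHops: Int,
                     links: [String: [String]]) -> [String: Int] {
    var hops: [String: Int] = [start: 0]
    var queue: [String] = [start]
    var index = 0
    while index < queue.count {
        let page = queue[index]
        index += 1
        let depth = hops[page] ?? 0
        // Links on a page at the hop limit are not followed.
        if depth >= maximumHops { continue }
        for next in (links[page] ?? []) where hops[next] == nil {
            hops[next] = depth + 1
            queue.append(next)
        }
    }
    return hops
}

let crawled = pagesWithinHops(from: "start", maximumHops: 2, links: links)
```

With maximumHops set to 2 in this graph, page "c" (2 hops) is reached but page "d" (3 hops) is not.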
The maximum time, in milliseconds, to wait for a response from the web server.
Declaration
Swift
public var requestTimeout: Int?
When true, the crawler ignores any robots.txt it encounters. This should only ever be done when crawling a web site the user owns. This must be set to true when a gateway_id is specified in the credentials.
Declaration
Swift
public var overrideRobotsTxt: Bool?
Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
Declaration
Swift
public var blacklist: [String]?
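The exclusion rule reads as a substring test, which the following sketch illustrates. The isExcluded helper and the https://ibm.com/cloud URL are hypothetical; the service's exact matching (case sensitivity, normalization) is not specified here.

```swift
import Foundation

// Hypothetical helper: a link is skipped when it contains any blacklist entry
// as a substring. This assumes a plain, case-sensitive match.
func isExcluded(_ link: String, blacklist: [String]) -> Bool {
    return blacklist.contains { entry in link.contains(entry) }
}

let blacklist = ["https://ibm.com/watson"]

// The deeper path is excluded because it contains the listed string.
let deeperPathExcluded = isExcluded("https://ibm.com/watson/discovery", blacklist: blacklist)
// An unrelated path (illustrative URL) is still followed.
let unrelatedExcluded = isExcluded("https://ibm.com/cloud", blacklist: blacklist)
```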
init(url:limitToStartingHosts:crawlSpeed:allowUntrustedCertificate:maximumHops:requestTimeout:overrideRobotsTxt:blacklist:)
Initialize a SourceOptionsWebCrawl with member variables.
Declaration
Swift
public init(url: String, limitToStartingHosts: Bool? = nil, crawlSpeed: String? = nil, allowUntrustedCertificate: Bool? = nil, maximumHops: Int? = nil, requestTimeout: Int? = nil, overrideRobotsTxt: Bool? = nil, blacklist: [String]? = nil)
Parameters
url: The starting URL to crawl.
limitToStartingHosts: When true, crawls of the specified URL are limited to the host part of the url field.
crawlSpeed: The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means as many as two URLs are fetched concurrently with a short delay between fetch calls. aggressive means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
allowUntrustedCertificate: When true, allows the crawl to interact with HTTPS sites whose SSL certificates have untrusted signers.
maximumHops: The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
requestTimeout: The maximum time, in milliseconds, to wait for a response from the web server.
overrideRobotsTxt: When true, the crawler ignores any robots.txt it encounters. This should only ever be done when crawling a web site the user owns. This must be set to true when a gateway_id is specified in the credentials.
blacklist: Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
Return Value
An initialized SourceOptionsWebCrawl.
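A minimal usage sketch of the initializer follows. To keep it runnable without the SDK, it declares a local stand-in struct that mirrors the declaration above; in real code the SDK's own SourceOptionsWebCrawl would be used instead. The argument values are illustrative, and the JSON key names produced by this stand-in are the Swift property names, which may not match what the SDK sends to the service.

```swift
import Foundation

// Local stand-in mirroring the declaration above so the sketch compiles
// without the SDK. Not the SDK implementation.
struct SourceOptionsWebCrawl: Codable, Equatable {
    var url: String
    var limitToStartingHosts: Bool?
    var crawlSpeed: String?
    var allowUntrustedCertificate: Bool?
    var maximumHops: Int?
    var requestTimeout: Int?
    var overrideRobotsTxt: Bool?
    var blacklist: [String]?

    init(url: String,
         limitToStartingHosts: Bool? = nil,
         crawlSpeed: String? = nil,
         allowUntrustedCertificate: Bool? = nil,
         maximumHops: Int? = nil,
         requestTimeout: Int? = nil,
         overrideRobotsTxt: Bool? = nil,
         blacklist: [String]? = nil) {
        self.url = url
        self.limitToStartingHosts = limitToStartingHosts
        self.crawlSpeed = crawlSpeed
        self.allowUntrustedCertificate = allowUntrustedCertificate
        self.maximumHops = maximumHops
        self.requestTimeout = requestTimeout
        self.overrideRobotsTxt = overrideRobotsTxt
        self.blacklist = blacklist
    }
}

// Only url is required; every other parameter defaults to nil.
let options = SourceOptionsWebCrawl(
    url: "https://ibm.com/watson",
    limitToStartingHosts: true,
    crawlSpeed: "gentle",
    maximumHops: 2,
    requestTimeout: 30000, // milliseconds; illustrative value
    blacklist: ["https://ibm.com/watson/discovery"]
)

// Codable and Equatable conformance allow a JSON round trip and comparison.
let data = try! JSONEncoder().encode(options)
let decoded = try! JSONDecoder().decode(SourceOptionsWebCrawl.self, from: data)
```

Parameters left out of the call, such as allowUntrustedCertificate and overrideRobotsTxt here, stay nil.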