SourceOptionsWebCrawl
public struct SourceOptionsWebCrawl : Codable, Equatable
Object defining which URL to crawl and how to crawl it.
-
The number of concurrent URLs to fetch.
gentle means one URL is fetched at a time with a delay between each call.
normal means as many as two URLs are fetched concurrently with a short delay between fetch calls.
aggressive means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
Declaration
Swift
public enum CrawlSpeed : String
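The three speeds described above map naturally onto a string-backed enum. The following is a minimal sketch, not the SDK's own type: `CrawlSpeedSketch` is a stand-in, its case names are assumed to match the documented values, and `maxConcurrentFetches` is a hypothetical helper that only restates the documented limits.

```swift
// Stand-in for illustration only; not the SDK's own CrawlSpeed type.
// Case names are assumed to match the documented values.
enum CrawlSpeedSketch: String {
    case gentle
    case normal
    case aggressive

    // Hypothetical helper: upper bound on concurrent fetches, as documented above.
    var maxConcurrentFetches: Int {
        switch self {
        case .gentle: return 1
        case .normal: return 2
        case .aggressive: return 10
        }
    }
}

// The crawlSpeed property is declared as String?, so the raw value is what gets passed along.
let crawlSpeedValue: String? = CrawlSpeedSketch.normal.rawValue
```

Using the enum's raw value rather than a hand-typed string avoids misspelling a speed name.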
-
The starting URL to crawl.
Declaration
Swift
public var url: String
-
When true, crawls of the specified URL are limited to the host part of the url field.
Declaration
Swift
public var limitToStartingHosts: Bool?
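To make "limited to the host part of the url field" concrete, here is a minimal sketch of a host comparison. `isOnStartingHost` is a hypothetical helper written for illustration; the crawler's actual matching rules (subdomains, ports, redirects) are not documented here and may differ.

```swift
import Foundation

// Hypothetical helper: true when `link` has the same host as `startingURL`.
// This only illustrates the idea; it is not the service's implementation.
func isOnStartingHost(_ link: String, startingURL: String) -> Bool {
    guard let startHost = URLComponents(string: startingURL)?.host,
          let linkHost = URLComponents(string: link)?.host else {
        return false   // unparseable URLs are treated as off-host
    }
    return startHost.lowercased() == linkHost.lowercased()
}
```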
-
The number of concurrent URLs to fetch.
gentle means one URL is fetched at a time with a delay between each call.
normal means as many as two URLs are fetched concurrently with a short delay between fetch calls.
aggressive means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
Declaration
Swift
public var crawlSpeed: String?
-
When true, allows the crawl to interact with HTTPS sites with SSL certificates with untrusted signers.
Declaration
Swift
public var allowUntrustedCertificate: Bool?
-
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
Declaration
Swift
public var maximumHops: Int?
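The hop counting described above can be sketched as a breadth-first walk over a link graph. Everything in this example is hypothetical: the `links` dictionary stands in for real pages, and `pagesWithinHops` is an illustrative helper, not part of the SDK or the crawler.

```swift
// Hypothetical in-memory link graph: page -> links found on that page.
let links: [String: [String]] = [
    "start": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": []
]

// Breadth-first walk that records the hop count of each page reached
// within `maximumHops` of the starting page (which is 0 hops).
func pagesWithinHops(from start: String, maximumHops: Int, links: [String: [String]]) -> [String: Int] {
    var hops: [String: Int] = [start: 0]
    var queue: [String] = [start]
    var index = 0
    while index < queue.count {
        let page = queue[index]
        index += 1
        let depth = hops[page]!                 // always set before a page is queued
        if depth >= maximumHops { continue }    // links from this page would exceed the limit
        for link in links[page] ?? [] where hops[link] == nil {
            hops[link] = depth + 1
            queue.append(link)
        }
    }
    return hops
}

let reached = pagesWithinHops(from: "start", maximumHops: 2, links: links)
// reached == ["start": 0, "a": 1, "b": 1, "c": 2]; "d" is 3 hops away and is skipped.
```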
-
The maximum time, in milliseconds, to wait for a response from the web server.
Declaration
Swift
public var requestTimeout: Int?
-
When true, the crawler will ignore any robots.txt encountered by the crawler. This should only ever be done when crawling a web site the user owns. This must be set to true when a gateway_id is specified in the credentials.
Declaration
Swift
public var overrideRobotsTxt: Bool?
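The gateway rule above lends itself to a quick client-side check before a configuration is submitted. This is a minimal sketch under stated assumptions: `robotsSettingIsConsistent` and its `gatewayID` parameter are hypothetical names, and the service may validate this differently.

```swift
// Hypothetical client-side check of the documented rule:
// when a gateway ID is present in the credentials, overrideRobotsTxt must be true.
func robotsSettingIsConsistent(gatewayID: String?, overrideRobotsTxt: Bool?) -> Bool {
    guard gatewayID != nil else { return true }   // no gateway: any setting is acceptable
    return overrideRobotsTxt == true              // nil and false both fail the rule
}
```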
-
Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
Declaration
Swift
public var blacklist: [String]?
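The exclusion rule reads as a substring match, which is why listing a parent path also excludes its children. A minimal sketch of that reading follows; `isExcluded` is a hypothetical helper, and the service's exact matching behavior is an assumption based on the description above.

```swift
import Foundation

// Hypothetical helper: a link is skipped when it contains any blacklist entry.
func isExcluded(_ link: String, blacklist: [String]) -> Bool {
    return blacklist.contains(where: { entry in link.contains(entry) })
}
```

With `["https://ibm.com/watson"]` as the list, `https://ibm.com/watson/discovery` is excluded because it contains the listed string.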
-
init(url:limitToStartingHosts:crawlSpeed:allowUntrustedCertificate:maximumHops:requestTimeout:overrideRobotsTxt:blacklist:)
Initialize a SourceOptionsWebCrawl with member variables.
Declaration
Swift
public init( url: String, limitToStartingHosts: Bool? = nil, crawlSpeed: String? = nil, allowUntrustedCertificate: Bool? = nil, maximumHops: Int? = nil, requestTimeout: Int? = nil, overrideRobotsTxt: Bool? = nil, blacklist: [String]? = nil )
Parameters
url
The starting URL to crawl.
limitToStartingHosts
When true, crawls of the specified URL are limited to the host part of the url field.
crawlSpeed
The number of concurrent URLs to fetch.
gentle means one URL is fetched at a time with a delay between each call.
normal means as many as two URLs are fetched concurrently with a short delay between fetch calls.
aggressive means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
allowUntrustedCertificate
When true, allows the crawl to interact with HTTPS sites with SSL certificates with untrusted signers.
maximumHops
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
requestTimeout
The maximum time, in milliseconds, to wait for a response from the web server.
overrideRobotsTxt
When true, the crawler will ignore any robots.txt encountered by the crawler. This should only ever be done when crawling a web site the user owns. This must be set to true when a gateway_id is specified in the credentials.
blacklist
Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
Return Value
An initialized SourceOptionsWebCrawl.
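As a usage illustration, the sketch below builds a value with the same fields and round-trips it through JSON to exercise the Codable and Equatable conformances. `WebCrawlOptionsSketch` is a local stand-in that mirrors the declaration above so the example compiles without the SDK, and `roundTrip` is a hypothetical helper. The JSON key names the real type uses on the wire are not shown in this reference and are not assumed here.

```swift
import Foundation

// Local stand-in mirroring the documented properties; not the SDK type itself.
struct WebCrawlOptionsSketch: Codable, Equatable {
    var url: String
    var limitToStartingHosts: Bool?
    var crawlSpeed: String?
    var allowUntrustedCertificate: Bool?
    var maximumHops: Int?
    var requestTimeout: Int?
    var overrideRobotsTxt: Bool?
    var blacklist: [String]?
}

// Only url is required; omitted optional members default to nil.
let options = WebCrawlOptionsSketch(
    url: "https://ibm.com/watson",
    limitToStartingHosts: true,
    crawlSpeed: "gentle",
    maximumHops: 2,
    requestTimeout: 30000,   // milliseconds
    blacklist: ["https://ibm.com/watson/discovery"]
)

// Hypothetical helper: encode to JSON and decode back.
func roundTrip(_ value: WebCrawlOptionsSketch) throws -> WebCrawlOptionsSketch {
    let data = try JSONEncoder().encode(value)
    return try JSONDecoder().decode(WebCrawlOptionsSketch.self, from: data)
}

let decoded = try? roundTrip(options)
```

Leaving an optional member out of the call is equivalent to passing nil, which matches the default arguments in the documented initializer.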