Object defining which URL to crawl and how to crawl it.

Classes

class CrawlSpeedEnumValue
    Constants for the possible values of CrawlSpeed: gentle, normal, and aggressive.

Properties

string CrawlSpeed [get, set]
    The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means up to two URLs are fetched concurrently with a short delay between fetch calls. aggressive means up to ten URLs are fetched concurrently with a short delay between fetch calls. Constants for the possible values can be found in SourceOptionsWebCrawl.CrawlSpeedEnumValue.

string Url [get, set]
    The starting URL to crawl.

bool LimitToStartingHosts [get, set]
    When true, crawls of the specified URL are limited to the host part of the url field.

bool AllowUntrustedCertificate [get, set]
    When true, allows the crawl to interact with HTTPS sites with SSL certificates signed by untrusted signers.

long MaximumHops [get, set]
    The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.

long RequestTimeout [get, set]
    The maximum number of milliseconds to wait for a response from the web server.

bool OverrideRobotsTxt [get, set]
    When true, the crawler ignores any robots.txt it encounters. This should only ever be done when crawling a web site the user owns. This must be set to true when a gateway_id is specified in the credentials.

List<string> Blacklist [get, set]
    Array of URLs to exclude while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
Detailed Description

Object defining which URL to crawl and how to crawl it.
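As a quick orientation, here is a minimal sketch of populating this model in C#. The object-initializer style and the CrawlSpeedEnumValue.NORMAL constant name are assumptions inferred from the property descriptions below, not verbatim SDK guarantees; check the generated CrawlSpeedEnumValue class for the exact constant names.

    using System.Collections.Generic;
    using IBM.Watson.Discovery.v1.Model;

    // Minimal sketch: configure a web crawl starting from a hypothetical URL.
    var webCrawl = new SourceOptionsWebCrawl()
    {
        Url = "https://example.com/docs",      // starting URL (hop 0)
        LimitToStartingHosts = true,           // stay on the example.com host
        CrawlSpeed = SourceOptionsWebCrawl.CrawlSpeedEnumValue.NORMAL,  // assumed constant name
        AllowUntrustedCertificate = false,
        MaximumHops = 2,                       // start page, its links, and their links
        RequestTimeout = 30000,                // wait up to 30 seconds per request
        OverrideRobotsTxt = false,
        Blacklist = new List<string>() { "https://example.com/docs/archive" }
    };

An object like this is typically attached to a Discovery source configuration as part of its source options describing a web crawl; see the SourceOptions model for how it is supplied.

Property Documentation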
◆ AllowUntrustedCertificate
bool IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.AllowUntrustedCertificate [get, set]

When true, allows the crawl to interact with HTTPS sites with SSL certificates signed by untrusted signers.
◆ Blacklist
List<string> IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.Blacklist [get, set]

Array of URLs to exclude while crawling. The crawler will not follow links that contain any of these strings. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
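Continuing the sketch from the detailed description:

    // Links containing a listed string are skipped, so one entry can
    // exclude an entire subtree of URLs.
    webCrawl.Blacklist = new List<string>()
    {
        "https://ibm.com/watson"  // also excludes https://ibm.com/watson/discovery
    };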
◆ CrawlSpeed
string IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.CrawlSpeed [get, set]

The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means up to two URLs are fetched concurrently with a short delay between fetch calls. aggressive means up to ten URLs are fetched concurrently with a short delay between fetch calls. Constants for the possible values can be found in SourceOptionsWebCrawl.CrawlSpeedEnumValue.
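For example, continuing the sketch from the detailed description (the GENTLE constant name is an assumption mirroring the "gentle" string value):

    // Assumed constant; CrawlSpeedEnumValue should expose string constants
    // corresponding to "gentle", "normal", and "aggressive".
    webCrawl.CrawlSpeed = SourceOptionsWebCrawl.CrawlSpeedEnumValue.GENTLE;  // one URL at a time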
◆ LimitToStartingHosts
bool IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.LimitToStartingHosts [get, set]

When true, crawls of the specified URL are limited to the host part of the url field.
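Continuing the sketch above:

    // Only pages whose host matches the starting URL's host are crawled.
    webCrawl.Url = "https://example.com/docs";
    webCrawl.LimitToStartingHosts = true;  // follows example.com links, skips other hosts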
◆ MaximumHops
long IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.MaximumHops [get, set]

The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
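Continuing the sketch above, a short illustration of hop counting:

    // Hop 0 = the start page, hop 1 = pages it links to, hop 2 = pages
    // those link to. MaximumHops = 2 stops the crawl after hop 2.
    webCrawl.MaximumHops = 2;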
◆ OverrideRobotsTxt
bool IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.OverrideRobotsTxt [get, set]

When true, the crawler ignores any robots.txt it encounters. This should only ever be done when crawling a web site the user owns. This must be set to true when a gateway_id is specified in the credentials.
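Continuing the sketch above:

    // Ignore robots.txt; do this only for sites you own. Must also be true
    // when the source credentials specify a gateway_id.
    webCrawl.OverrideRobotsTxt = true;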
◆ RequestTimeout
long IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.RequestTimeout [get, set]

The maximum number of milliseconds to wait for a response from the web server.
◆ Url
string IBM.Watson.Discovery.v1.Model.SourceOptionsWebCrawl.Url [get, set]

The starting URL to crawl.