
Interface SourceOptionsWebCrawl

Object defining which URL to crawl and how to crawl it.

Hierarchy

  • SourceOptionsWebCrawl

Index

Properties

Optional allow_untrusted_certificate

allow_untrusted_certificate: boolean

When true, allows the crawl to interact with HTTPS sites whose SSL certificates are signed by untrusted authorities.

Optional blacklist

blacklist: string[]

Array of URLs to be excluded while crawling. The crawler will not follow links that contain this string. For example, listing https://ibm.com/watson also excludes https://ibm.com/watson/discovery.

Optional crawl_speed

crawl_speed: string

The number of concurrent URLs to fetch. gentle means one URL is fetched at a time with a delay between each call. normal means as many as two URLs are fetched concurrently with a short delay between fetch calls. aggressive means that up to ten URLs are fetched concurrently with a short delay between fetch calls.

Optional limit_to_starting_hosts

limit_to_starting_hosts: boolean

When true, the crawl is restricted to the host specified in the url field.

Optional maximum_hops

maximum_hops: number

The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within maximum_hops of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
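
As an illustration of the hop counting, the sketch below uses hypothetical URLs and assumes maximum_hops is set to 1; only pages within one hop of the starting URL would be crawled.

```typescript
// Hypothetical illustration of hop counting (URLs are examples only).
// With maximum_hops: 1, the crawl stops one link away from the starting URL:
//   https://example.com/      -> 0 hops (starting URL, crawled)
//   https://example.com/a     -> 1 hop  (linked from the starting page, crawled)
//   https://example.com/a/b   -> 2 hops (linked from /a, not crawled)
const options = {
  url: 'https://example.com/',
  maximum_hops: 1,
};
```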

Optional override_robots_txt

override_robots_txt: boolean

When true, the crawler ignores any robots.txt files that it encounters. This should only ever be done when crawling a website that the user owns. This must be set to true when a gateway_id is specified in the credentials.

Optional request_timeout

request_timeout: number

The maximum number of milliseconds to wait for a response from the web server.

url

url: string

The starting URL to crawl.
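
A minimal sketch of an object that conforms to this interface is shown below. The interface declaration simply repeats the properties documented above; the example values and URLs are hypothetical, and how the object is ultimately passed to the Discovery service is an assumption that may differ by SDK version.

```typescript
// Local re-declaration of the interface documented above, for a self-contained sketch.
interface SourceOptionsWebCrawl {
  allow_untrusted_certificate?: boolean;
  blacklist?: string[];
  crawl_speed?: string;
  limit_to_starting_hosts?: boolean;
  maximum_hops?: number;
  override_robots_txt?: boolean;
  request_timeout?: number;
  url: string;
}

// Example web crawl options; all values and URLs below are illustrative only.
const webCrawlOptions: SourceOptionsWebCrawl = {
  url: 'https://example.com/docs',            // starting URL (required)
  limit_to_starting_hosts: true,              // stay on the starting host
  crawl_speed: 'normal',                      // 'gentle', 'normal', or 'aggressive'
  allow_untrusted_certificate: false,         // reject untrusted SSL signers
  maximum_hops: 2,                            // follow links up to two hops deep
  request_timeout: 30000,                     // wait up to 30,000 ms per request
  override_robots_txt: false,                 // respect robots.txt
  blacklist: ['https://example.com/private'], // skip URLs containing this string
};
```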

Generated using TypeDoc