Class SourceOptionsWebCrawl
- All Implemented Interfaces:
com.ibm.cloud.sdk.core.service.model.ObjectModel
public class SourceOptionsWebCrawl
extends com.ibm.cloud.sdk.core.service.model.GenericModel
-
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | SourceOptionsWebCrawl.Builder | Builder.
static interface | SourceOptionsWebCrawl.CrawlSpeed | The number of concurrent URLs to fetch.
-
Method Summary
Modifier and Type | Method | Description
Boolean | allowUntrustedCertificate() | Gets the allowUntrustedCertificate.
List<String> | blacklist() | Gets the blacklist.
String | crawlSpeed() | Gets the crawlSpeed.
Boolean | limitToStartingHosts() | Gets the limitToStartingHosts.
Long | maximumHops() | Gets the maximumHops.
SourceOptionsWebCrawl.Builder | newBuilder() | New builder.
Boolean | overrideRobotsTxt() | Gets the overrideRobotsTxt.
Long | requestTimeout() | Gets the requestTimeout.
String | url() | Gets the url.
Methods inherited from class com.ibm.cloud.sdk.core.service.model.GenericModel
equals, hashCode, toString
-
Method Details
-
newBuilder
New builder.
- Returns:
- a SourceOptionsWebCrawl builder
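For orientation, a minimal sketch of building a SourceOptionsWebCrawl with its builder. The import package and the builder setter names are assumptions inferred from the getters documented below (IBM SDK builders conventionally mirror their getters); verify both against the SourceOptionsWebCrawl.Builder class for your SDK version.

```java
import java.util.Arrays;

// Package name is an assumption; adjust to the SDK version in use.
import com.ibm.watson.discovery.v1.model.SourceOptionsWebCrawl;

public class WebCrawlOptionsExample {
  public static void main(String[] args) {
    // Builder setter names are assumed to mirror the getters documented below.
    SourceOptionsWebCrawl webCrawl = new SourceOptionsWebCrawl.Builder()
        .url("https://example.com")            // the starting URL to crawl
        .limitToStartingHosts(true)            // stay on the example.com host
        .crawlSpeed("normal")                  // "gentle", "normal", or "aggressive"
        .maximumHops(2L)                       // follow links up to two hops from the start URL
        .requestTimeout(30000L)                // wait at most 30 seconds per request
        .blacklist(Arrays.asList("https://example.com/private"))
        .build();

    System.out.println(webCrawl);              // GenericModel supplies a readable toString()
  }
}
```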
-
url
Gets the url.
The starting URL to crawl.
- Returns:
- the url
-
limitToStartingHosts
Gets the limitToStartingHosts.
When `true`, crawls of the specified URL are limited to the host part of the **url** field.
- Returns:
- the limitToStartingHosts
-
crawlSpeed
Gets the crawlSpeed.
The number of concurrent URLs to fetch. `gentle` means one URL is fetched at a time with a delay between each call. `normal` means as many as two URLs are fetched concurrently with a short delay between fetch calls. `aggressive` means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
- Returns:
- the crawlSpeed
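As a fragment building on the sketch under newBuilder: the nested CrawlSpeed interface presumably defines string constants for these values, but the constant names used here are assumptions derived from the documented values; confirm them in the generated interface before relying on them.

```java
// Assumed constant names on SourceOptionsWebCrawl.CrawlSpeed; the underlying
// string values "gentle", "normal", and "aggressive" come from the description above.
SourceOptionsWebCrawl slowCrawl = new SourceOptionsWebCrawl.Builder()
    .url("https://example.com")
    .crawlSpeed(SourceOptionsWebCrawl.CrawlSpeed.GENTLE) // one URL at a time, with a delay
    .build();
```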
-
allowUntrustedCertificate
Gets the allowUntrustedCertificate.
When `true`, allows the crawl to interact with HTTPS sites whose SSL certificates have untrusted signers.
- Returns:
- the allowUntrustedCertificate
-
maximumHops
Gets the maximumHops.
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within **maximum_hops** of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
- Returns:
- the maximumHops
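A short fragment illustrating the hop count, again using the assumed builder setters from the newBuilder sketch.

```java
// With the assumed setters from the newBuilder sketch:
// maximumHops = 0 crawls only the starting URL (0 hops);
// maximumHops = 1 adds every page linked from it (1 hop), and so on.
SourceOptionsWebCrawl shallowCrawl = new SourceOptionsWebCrawl.Builder()
    .url("https://example.com")
    .maximumHops(1L)
    .build();
```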
-
requestTimeout
Gets the requestTimeout.
The maximum number of milliseconds to wait for a response from the web server.
- Returns:
- the requestTimeout
-
overrideRobotsTxt
Gets the overrideRobotsTxt.
When `true`, the crawler ignores any `robots.txt` it encounters. This should only ever be done when crawling a web site that the user owns. This must be set to `true` when a **gateway_id** is specified in the **credentials**.
- Returns:
- the overrideRobotsTxt
-
blacklist
Gets the blacklist.
Array of URLs to be excluded while crawling. The crawler will not follow links that contain any of these strings. For example, listing `https://ibm.com/watson` also excludes `https://ibm.com/watson/discovery`.
- Returns:
- the blacklist
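A fragment showing the exclusion behavior described above, again using the assumed builder setters and the `blacklist` setter inferred from the getter.

```java
// Links containing any listed string are not followed, so excluding
// https://ibm.com/watson also excludes https://ibm.com/watson/discovery.
SourceOptionsWebCrawl filteredCrawl = new SourceOptionsWebCrawl.Builder()
    .url("https://ibm.com")
    .blacklist(Arrays.asList("https://ibm.com/watson"))
    .build();
```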
-