public class SourceOptionsWebCrawl
extends com.ibm.cloud.sdk.core.service.model.GenericModel
| Modifier and Type | Class and Description | 
|---|---|
| static class  | SourceOptionsWebCrawl.Builder: Builder. | 
| static interface  | SourceOptionsWebCrawl.CrawlSpeed: The number of concurrent URLs to fetch. | 
| Modifier and Type | Method and Description | 
|---|---|
| Boolean | allowUntrustedCertificate(): Gets the allowUntrustedCertificate. | 
| List<String> | blacklist(): Gets the blacklist. | 
| String | crawlSpeed(): Gets the crawlSpeed. | 
| Boolean | limitToStartingHosts(): Gets the limitToStartingHosts. | 
| Long | maximumHops(): Gets the maximumHops. | 
| SourceOptionsWebCrawl.Builder | newBuilder(): New builder. | 
| Boolean | overrideRobotsTxt(): Gets the overrideRobotsTxt. | 
| Long | requestTimeout(): Gets the requestTimeout. | 
| String | url(): Gets the url. | 
public SourceOptionsWebCrawl.Builder newBuilder()
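As a usage illustration only (not part of the generated reference), the following is a minimal sketch of building a `SourceOptionsWebCrawl` instance. It assumes the `SourceOptionsWebCrawl.Builder` exposes setters that mirror the getters documented on this page, which is the usual pattern in this SDK; the exact setter names and the example values shown here should be checked against the Builder's own documentation.

```java
import java.util.Arrays;

// Minimal sketch: configure a web-crawl source. Setter names are assumed to
// mirror the getters listed above; URLs and values are example placeholders.
SourceOptionsWebCrawl webCrawlOptions = new SourceOptionsWebCrawl.Builder()
    .url("https://example.com/docs")      // starting URL to crawl
    .limitToStartingHosts(true)           // stay on the starting URL's host
    .crawlSpeed("normal")                 // "gentle", "normal", or "aggressive"
    .allowUntrustedCertificate(false)     // reject untrusted SSL signers
    .maximumHops(2L)                      // follow links at most 2 hops from the start
    .requestTimeout(30000L)               // wait up to 30 000 ms per request
    .overrideRobotsTxt(false)             // respect robots.txt
    .blacklist(Arrays.asList("https://example.com/docs/private"))
    .build();
```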
public String url()
The starting URL to crawl.
public Boolean limitToStartingHosts()
When `true`, crawls of the specified URL are limited to the host part of the **url** field.
public String crawlSpeed()
The number of concurrent URLs to fetch. `gentle` means one URL is fetched at a time with a delay between each call. `normal` means as many as two URLs are fetched concurrently with a short delay between fetch calls. `aggressive` means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
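The nested `SourceOptionsWebCrawl.CrawlSpeed` interface listed above typically supplies string constants for these values. A hedged sketch, assuming constants named `GENTLE`, `NORMAL`, and `AGGRESSIVE` exist (check the interface's own documentation):

```java
// Sketch: select a crawl speed via the assumed CrawlSpeed constants rather
// than a raw string literal.
SourceOptionsWebCrawl options = new SourceOptionsWebCrawl.Builder()
    .url("https://example.com")
    .crawlSpeed(SourceOptionsWebCrawl.CrawlSpeed.NORMAL)  // up to two concurrent fetches
    .build();
```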
public Boolean allowUntrustedCertificate()
When `true`, allows the crawl to interact with HTTPS sites whose SSL certificates are signed by untrusted signers.
public Long maximumHops()
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within **maximum_hops** of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
public Long requestTimeout()
The maximum number of milliseconds to wait for a response from the web server.
public Boolean overrideRobotsTxt()
When `true`, the crawler ignores any `robots.txt` it encounters. This should only ever be done when crawling a web site the user owns. This must be set to `true` when a **gateway_id** is specified in the **credentials**.
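For completeness, a short sketch of reading the configuration back through the getters documented on this page, using the hypothetical `webCrawlOptions` instance built in the earlier sketch:

```java
// Read the configured values back through the getters on this page.
System.out.println("Starting URL:    " + webCrawlOptions.url());
System.out.println("Crawl speed:     " + webCrawlOptions.crawlSpeed());
System.out.println("Maximum hops:    " + webCrawlOptions.maximumHops());
System.out.println("Request timeout: " + webCrawlOptions.requestTimeout() + " ms");
System.out.println("Override robots: " + webCrawlOptions.overrideRobotsTxt());
```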
Copyright © 2023 IBM Cloud. All rights reserved.