public class SourceOptionsWebCrawl
extends com.ibm.cloud.sdk.core.service.model.GenericModel
Modifier and Type | Class and Description
---|---
`static class` | `SourceOptionsWebCrawl.Builder`: Builder.
`static interface` | `SourceOptionsWebCrawl.CrawlSpeed`: The number of concurrent URLs to fetch.
Modifier and Type | Method and Description
---|---
`Boolean` | `allowUntrustedCertificate()`: Gets the allowUntrustedCertificate.
`List<String>` | `blacklist()`: Gets the blacklist.
`String` | `crawlSpeed()`: Gets the crawlSpeed.
`Boolean` | `limitToStartingHosts()`: Gets the limitToStartingHosts.
`Long` | `maximumHops()`: Gets the maximumHops.
`SourceOptionsWebCrawl.Builder` | `newBuilder()`: New builder.
`Boolean` | `overrideRobotsTxt()`: Gets the overrideRobotsTxt.
`Long` | `requestTimeout()`: Gets the requestTimeout.
`String` | `url()`: Gets the url.
public SourceOptionsWebCrawl.Builder newBuilder()
public String url()
The starting URL to crawl.
public Boolean limitToStartingHosts()
When `true`, crawls of the specified URL are limited to the host part of the **url** field.
public String crawlSpeed()
The number of concurrent URLs to fetch. `gentle` means one URL is fetched at a time with a delay between each call. `normal` means as many as two URLs are fetched concurrently with a short delay between fetch calls. `aggressive` means that up to ten URLs are fetched concurrently with a short delay between fetch calls.
public Boolean allowUntrustedCertificate()
When `true`, allows the crawl to interact with HTTPS sites whose SSL certificates are signed by untrusted authorities.
public Long maximumHops()
The maximum number of hops to make from the initial URL. When a page is crawled, each link on that page is also crawled if it is within **maximum_hops** of the initial URL. The first page crawled is 0 hops, each link crawled from the first page is 1 hop, each link crawled from those pages is 2 hops, and so on.
public Long requestTimeout()
The maximum number of milliseconds to wait for a response from the web server.
public Boolean overrideRobotsTxt()
When `true`, the crawler ignores any `robots.txt` file that it encounters. This should only ever be done when crawling a website that the user owns. This must be set to `true` when a **gateway_id** is specified in the **credentials**.
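The getters above map one-to-one onto setters of `SourceOptionsWebCrawl.Builder` in the SDK's usual builder pattern. The following is a minimal, self-contained sketch of that pattern using a stand-in class (not the real SDK type); the field and method names mirror the getters documented above, and the specific builder-setter signatures are an assumption based on the SDK's conventions:

```java
// Stand-in sketch of the builder pattern used by SourceOptionsWebCrawl.
// The real class extends GenericModel and lives in the IBM Discovery SDK.
public class WebCrawlSketch {
    private final String url;
    private final Boolean limitToStartingHosts;
    private final String crawlSpeed;
    private final Long maximumHops;

    private WebCrawlSketch(Builder b) {
        this.url = b.url;
        this.limitToStartingHosts = b.limitToStartingHosts;
        this.crawlSpeed = b.crawlSpeed;
        this.maximumHops = b.maximumHops;
    }

    // Getters named after the documented accessors.
    public String url() { return url; }
    public Boolean limitToStartingHosts() { return limitToStartingHosts; }
    public String crawlSpeed() { return crawlSpeed; }
    public Long maximumHops() { return maximumHops; }

    // Returns a builder pre-populated from this instance, like newBuilder().
    public Builder newBuilder() {
        return new Builder()
                .url(url)
                .limitToStartingHosts(limitToStartingHosts)
                .crawlSpeed(crawlSpeed)
                .maximumHops(maximumHops);
    }

    public static class Builder {
        private String url;
        private Boolean limitToStartingHosts;
        private String crawlSpeed;
        private Long maximumHops;

        public Builder url(String url) { this.url = url; return this; }
        public Builder limitToStartingHosts(Boolean v) { this.limitToStartingHosts = v; return this; }
        public Builder crawlSpeed(String v) { this.crawlSpeed = v; return this; }
        public Builder maximumHops(Long v) { this.maximumHops = v; return this; }
        public WebCrawlSketch build() { return new WebCrawlSketch(this); }
    }

    public static void main(String[] args) {
        WebCrawlSketch options = new Builder()
                .url("https://example.com")
                .limitToStartingHosts(true)
                .crawlSpeed("normal") // "gentle", "normal", or "aggressive"
                .maximumHops(2L)
                .build();
        System.out.println(options.url() + " " + options.crawlSpeed());
    }
}
```

With the real SDK, the same chain of setters ends in `build()`, and `newBuilder()` on an existing instance yields a builder pre-populated with its current values, so individual fields can be changed without restating the rest.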
Copyright © 2024 IBM Cloud. All rights reserved.