Error Handling

Failures in a Research Agent or Workflows request fall into two layers, and they require different handling. HTTP-level errors arrive before the SSE stream starts and surface as conventional HTTP status codes. Stream-level errors arrive inside the SSE stream as typed messages and may or may not terminate the request. This guide covers both, and the recommended response for each.

HTTP-level errors

These are returned by the API before any SSE event is emitted. The response body is typically a JSON object with a detail field (for validation errors) or a plain error message. Your HTTP client should check the response status before consuming the stream.

| Status | Cause | Recommended action |
| --- | --- | --- |
| `400` | Malformed request body or invalid field value. | Fix the request payload. The response body usually identifies the offending field via the validation `loc` array. Do not retry as-is. |
| `401` | Missing or invalid API key. | Verify the `X-API-Key` header. Do not retry until the credential is rotated or restored. |
| `403` | The API key is valid but not entitled to this resource (model tier, template, or feature). | Surface the entitlement message; ask the user to upgrade or grant access. Do not retry. |
| `404` | Resource not found. Common when `from_checkpoint_id`, `chat_id`, or a workflow `template_id` does not exist or has been deleted. | Validate the identifier; reset and create a new conversation if appropriate. |
| `422` | Request body failed schema validation. | Inspect `detail[].loc` and `detail[].msg`; correct the payload. |
| `429` | Rate limit exceeded. | Back off and retry. See Retry and backoff below. |
| `5xx` | Transient server-side failure. | Retry with backoff; alert if persistent. |

Call `response.raise_for_status()` (in Python’s `requests`), or an equivalent status check in your client, to convert HTTP errors into exceptions before entering the streaming loop. Consuming the SSE stream from a non-2xx response typically yields zero events without raising, so the failure goes unnoticed.
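The `detail` array mentioned in the table can be flattened into readable lines before it is logged or attached to an exception. The sketch below assumes each entry carries `loc` and `msg` as described above; the function name `describe_validation_errors` and the fallback for plain-text bodies are illustrative choices, not part of the API.

```python
import json
from typing import List


def describe_validation_errors(body: str) -> List[str]:
    """Turn an error response body into one readable line per problem.

    Assumes a JSON body shaped like {"detail": [{"loc": [...], "msg": "..."}]};
    anything else is returned as a single raw line.
    """
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        # Plain error message rather than JSON.
        return [body.strip()] if body.strip() else []

    detail = parsed.get("detail") if isinstance(parsed, dict) else None
    if isinstance(detail, str):
        return [detail]
    if not isinstance(detail, list):
        return [body.strip()]

    lines = []
    for item in detail:
        if not isinstance(item, dict):
            lines.append(str(item))
            continue
        # "loc" is a path such as ["body", "research_effort"].
        path = ".".join(str(part) for part in item.get("loc", [])) or "<request>"
        lines.append(f"{path}: {item.get('msg', 'invalid value')}")
    return lines
```

Passing `r.text` to this helper just before raising keeps the offending field visible in logs and exception messages.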

Stream-level errors

Once the HTTP response is established with a 2xx status, errors are delivered as typed messages inside the stream. Three message types matter:

| Type | Severity | Effect on stream | Recommended action |
| --- | --- | --- | --- |
| `LLM_RETRY` | Informational | Continues. Agent retries internally. | Log if useful; otherwise ignore. No user-facing impact. |
| `TOOL_ERROR` | Recoverable | Continues. Agent may try a different tool or proceed without that source. | Log with `tool_name`. Do not raise. Surface to the user only if the final answer is materially degraded. |
| `ERROR` | Fatal | Terminates. No `COMPLETE` event will follow. | Surface to the user. Abort the stream consumer. Log with full context. |

`LLM_RETRY`

The agent’s upstream LLM call hit a transient failure (rate limit, timeout, short-lived error) and is being retried automatically. The stream will resume without further action.

```json
{"type": "LLM_RETRY", "message": "Retrying after upstream rate limit"}
```

Consumers should not act on this event other than emitting a debug-level log line. There is no user-facing surface beyond an optional “agent is retrying…” indicator.

`TOOL_ERROR`

A specific tool failed. The tool_name field identifies which one. The agent will continue, possibly calling a different tool or synthesizing without that source. Most integrations log these events and only surface a user-visible warning if the final answer is materially degraded (for example, if every search call failed).

```json
{"type": "TOOL_ERROR", "tool_name": "search", "error": "Upstream search timed out"}
```

A common pattern is to count TOOL_ERROR events per tool name and, if the count exceeds a threshold during the same request, surface a soft warning at the end (“Some sources could not be retrieved; the answer may be incomplete.”).
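The counting pattern can be kept in a small per-request object. This is a minimal sketch: the class name `ToolErrorTally` and the default threshold of 3 are assumptions chosen for illustration, not values recommended by the API.

```python
from collections import Counter
from typing import Optional

SOFT_WARNING = "Some sources could not be retrieved; the answer may be incomplete."


class ToolErrorTally:
    """Counts TOOL_ERROR events per tool name within a single request."""

    def __init__(self, threshold: int = 3) -> None:
        # Illustrative default; tune it to how noisy your tools are.
        self.threshold = threshold
        self.counts = Counter()

    def record(self, msg: dict) -> None:
        """Register one TOOL_ERROR message."""
        self.counts[msg.get("tool_name") or "unknown"] += 1

    def warning(self) -> Optional[str]:
        """Return the soft warning if any tool exceeded the threshold."""
        if any(count > self.threshold for count in self.counts.values()):
            return SOFT_WARNING
        return None
```

Call `record()` on every `TOOL_ERROR` and check `warning()` once the stream completes, so the user sees at most one notice per request.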

`ERROR`

The request cannot continue. The stream terminates immediately after this event; no COMPLETE will follow. Surface the error to the user and stop reading the stream.

```json
{"type": "ERROR", "error": "Request failed: invalid checkpoint id"}
```

The error field is a human-readable string suitable for logging, but not always suitable for direct display to end users. Wrap it in your own UX-friendly message where appropriate.
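One way to do that wrapping is to log the raw string and return a separate display string. The mapping below is hypothetical: the `checkpoint` substring is taken from the sample payload above, and both user-facing texts are placeholders to replace with your own copy.

```python
import logging

log = logging.getLogger(__name__)

GENERIC_MESSAGE = "The request could not be completed. Please try again."

# Hypothetical substring -> friendly text mapping; extend it for your product.
FRIENDLY_MESSAGES = {
    "checkpoint": "This conversation can no longer be resumed. Please start a new one.",
}


def user_facing_error(msg: dict) -> str:
    """Log the raw ERROR payload and return text that is safe to display."""
    raw = str(msg.get("error") or "Unknown error")
    log.error("Research Agent ERROR: %s", raw)  # full detail stays in the logs
    lowered = raw.lower()
    for needle, friendly in FRIENDLY_MESSAGES.items():
        if needle in lowered:
            return friendly
    return GENERIC_MESSAGE
```

Substring matching is brittle if the error wording changes, so the generic fallback should always be acceptable on its own.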

Worked example: error-aware streaming handler

The handler below distinguishes HTTP-level failures, stream-level errors, and informational retries. It logs per-event details and only raises when the request can no longer succeed.

```python
import json
import logging
import requests

log = logging.getLogger(__name__)


class ResearchAgentError(Exception):
    """Raised for unrecoverable Research Agent failures."""


def stream_with_error_handling(api_key: str, message: str) -> str:
    endpoint = "https://agents.bigdata.com/v1/research-agent"
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    payload = {"message": message, "research_effort": "standard"}

    answer = ""
    tool_errors: dict[str, int] = {}

    with requests.post(endpoint, headers=headers, json=payload, stream=True, timeout=300) as r:
        # HTTP-level errors: convert to typed exceptions before streaming.
        if r.status_code == 401:
            raise ResearchAgentError("API key is missing or invalid")
        if r.status_code == 403:
            raise ResearchAgentError(f"Not entitled: {r.text}")
        if r.status_code == 404:
            raise ResearchAgentError(f"Resource not found: {r.text}")
        if r.status_code == 429:
            # The API does not currently set a Retry-After header on 429.
            # Use exponential backoff with jitter (see stream_with_retries below).
            raise ResearchAgentError("Rate limited")
        if r.status_code >= 500:
            raise ResearchAgentError(f"Upstream failure: {r.status_code}")
        # Any remaining 4xx (400, 422, ...) surfaces as requests.HTTPError.
        r.raise_for_status()

        # SSE is UTF-8 by specification. Without a charset in Content-Type,
        # requests falls back to ISO-8859-1 for text/* and garbles non-ASCII.
        r.encoding = "utf-8"

        for raw_line in r.iter_lines(decode_unicode=True):
            if not raw_line or not raw_line.startswith("data: "):
                continue
            try:
                event = json.loads(raw_line[6:])
            except json.JSONDecodeError:
                log.debug("Skipping malformed SSE line")
                continue

            msg = event.get("message", {})
            msg_type = msg.get("type")

            if msg_type == "ANSWER":
                answer += msg.get("content", "")

            elif msg_type == "LLM_RETRY":
                # Informational only: the agent recovers on its own.
                log.debug("Agent retrying: %s", msg.get("message"))

            elif msg_type == "TOOL_ERROR":
                tool = msg.get("tool_name") or "unknown"
                tool_errors[tool] = tool_errors.get(tool, 0) + 1
                log.warning("Tool error from %s: %s", tool, msg.get("error"))

            elif msg_type == "ERROR":
                raise ResearchAgentError(msg.get("error", "Unknown error"))

            elif msg_type == "COMPLETE":
                if tool_errors:
                    log.warning(
                        "Completed with tool errors: %s",
                        ", ".join(f"{k}={v}" for k, v in tool_errors.items()),
                    )
                return answer

    # Reached end of stream without COMPLETE: treat as truncated.
    raise ResearchAgentError("Stream ended without COMPLETE event")
```

This handler intentionally distinguishes:

- Pre-stream HTTP errors: raised immediately so callers do not waste cycles reading an empty stream.
- In-stream recoverable events: logged at appropriate levels but never raised.
- In-stream fatal events: raised so callers can surface the failure.
- Truncated streams: detected by the absence of a COMPLETE event, preventing silent “empty answer” bugs.
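These four behaviours are easier to verify when the per-line logic is separated from the network call. The sketch below mirrors the handler's loop as a pure function over an iterable of lines. `StreamError`, `consume_sse_lines`, and `sse` are hypothetical names introduced so the block runs on its own, and the `{"message": {...}}` envelope is assumed from the handler above.

```python
import json
from typing import Iterable


class StreamError(Exception):
    """Stand-in for ResearchAgentError so this sketch is self-contained."""


def consume_sse_lines(lines: Iterable[str]) -> str:
    """Apply the handler's per-line logic to any iterable of SSE lines."""
    answer = ""
    for raw_line in lines:
        if not raw_line or not raw_line.startswith("data: "):
            continue
        try:
            event = json.loads(raw_line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip the malformed line, keep the stream
        msg = event.get("message", {})
        msg_type = msg.get("type")
        if msg_type == "ANSWER":
            answer += msg.get("content", "")
        elif msg_type == "ERROR":
            raise StreamError(msg.get("error", "Unknown error"))
        elif msg_type == "COMPLETE":
            return answer
        # LLM_RETRY, TOOL_ERROR and unknown types fall through: keep reading.
    raise StreamError("Stream ended without COMPLETE event")


def sse(message: dict) -> str:
    """Build one data: line in the envelope assumed above."""
    return "data: " + json.dumps({"message": message})


lines = [
    sse({"type": "ANSWER", "content": "Hel"}),
    "data: {not json",  # malformed: skipped
    sse({"type": "TOOL_ERROR", "tool_name": "search", "error": "timeout"}),
    sse({"type": "ANSWER", "content": "lo"}),
    sse({"type": "COMPLETE"}),
]
print(consume_sse_lines(lines))  # Hello
```

Dropping the final `COMPLETE` line from `lines` makes the same call raise `StreamError`, which is the truncated-stream case.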

Retry and backoff

For 429 and 5xx HTTP responses, exponential backoff with jitter is the safe default:

```python
import random
import time


def stream_with_retries(api_key: str, message: str, max_attempts: int = 4) -> str:
    delay = 1.0
    last_exc: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            return stream_with_error_handling(api_key, message)
        except ResearchAgentError as exc:
            # Only 429 ("Rate limited") and 5xx ("Upstream failure") are retryable;
            # this relies on the message text set in stream_with_error_handling.
            text = str(exc)
            if "Rate limited" not in text and "Upstream failure" not in text:
                raise
            last_exc = exc
            if attempt == max_attempts:
                break
            # Exponential backoff with jitter: sleep between delay and 2 * delay.
            sleep_for = delay + random.uniform(0, delay)
            log.info("Retry %d/%d after %.1fs", attempt, max_attempts, sleep_for)
            time.sleep(sleep_for)
            delay *= 2

    # All attempts failed with a retryable error: re-raise the last one.
    assert last_exc is not None
    raise last_exc
```

Do not retry on 400, 401, 403, 404, or 422: those reflect client-side or identity problems that will not resolve on retry. In-stream events do not benefit from request-level retry either. LLM_RETRY and TOOL_ERROR mean the agent is already retrying or compensating internally; the request is doing the right thing without your help.
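To see how long the loop above can wait, the sleep window before each retry can be computed directly. This sketch assumes the same schedule as `stream_with_retries` (starting delay 1.0 s, doubling each time, jitter of up to one extra delay); `backoff_windows` is a hypothetical helper name.

```python
from typing import List, Tuple


def backoff_windows(max_attempts: int = 4, base: float = 1.0) -> List[Tuple[float, float]]:
    """Return the (min, max) sleep in seconds before each retry.

    Mirrors the loop above: sleep = delay + uniform(0, delay), then delay doubles.
    There is one sleep fewer than attempts, since the last failure is re-raised.
    """
    windows = []
    delay = base
    for _ in range(max_attempts - 1):
        windows.append((delay, 2 * delay))
        delay *= 2
    return windows


windows = backoff_windows()
worst_case = sum(high for _, high in windows)
print(windows)     # [(1.0, 2.0), (2.0, 4.0), (4.0, 8.0)]
print(worst_case)  # 14.0
```

With the defaults, sleeping adds at most 14 seconds across all retries, on top of however long each attempt itself takes.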

When to abort versus continue

| Situation | Abort | Continue |
| --- | --- | --- |
| HTTP `400` / `401` / `403` / `404` / `422` | Yes; fix and re-issue. | - |
| HTTP `429` | Yes; back off, then retry. | - |
| HTTP `5xx` | Yes; back off, then retry. | - |
| `ERROR` message | Yes; stream is over. | - |
| `LLM_RETRY` message | - | Yes; the agent is recovering. |
| `TOOL_ERROR` message | - | Yes; the agent may compensate. Tally counts for end-of-stream UX. |
| Truncated stream (no `COMPLETE`) | Yes; the response is incomplete. | - |
| Malformed `data:` line (`JSONDecodeError`) | - | Yes; skip the line only, not the stream. |
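The table can also be encoded as two small lookup functions, which keeps the decision in one place instead of scattering it through the stream loop. The names `Action`, `action_for_status`, and `action_for_message` are illustrative. Treating unrecognised message types as "continue" is an assumption of this sketch, and `COMPLETE` is expected to be handled separately as normal termination.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"      # keep reading the stream
    ABORT = "abort"            # stop; fix the request before re-issuing
    ABORT_AND_RETRY = "retry"  # stop; back off, then retry


def action_for_status(status: int) -> Optional[Action]:
    """Map an HTTP status to an action; None means the status is not an error."""
    if status == 429 or status >= 500:
        return Action.ABORT_AND_RETRY
    if status >= 400:
        return Action.ABORT
    return None


def action_for_message(msg_type: Optional[str]) -> Action:
    """Map a stream message type to an action."""
    if msg_type == "ERROR":
        return Action.ABORT
    # LLM_RETRY, TOOL_ERROR, ANSWER and anything unrecognised: keep reading.
    return Action.CONTINUE
```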

Next steps

Streaming responses

Full reference for every message type the stream may emit.

Conversation continuity

Recover from a 404 on from_checkpoint_id by resetting the thread.
