Failures in a Research Agent or Workflows request fall into two layers, and they require different handling. HTTP-level errors arrive before the SSE stream starts and surface as conventional HTTP status codes. Stream-level errors arrive inside the SSE stream as typed messages and may or may not terminate the request. This guide covers both, and the recommended response for each.

HTTP-level errors

These are returned by the API before any SSE event is emitted. The response body is typically a JSON object with a detail field (for validation errors) or a plain error message. Your HTTP client should check the response status before consuming the stream.
| Status | Cause | Recommended action |
| --- | --- | --- |
| 400 | Malformed request body or invalid field value. | Fix the request payload. The response body usually identifies the offending field via the validation loc array. Do not retry as-is. |
| 401 | Missing or invalid API key. | Verify the X-API-Key header. Do not retry until the credential is rotated or restored. |
| 403 | The API key is valid but not entitled to this resource (model tier, template, or feature). | Surface the entitlement message; ask the user to upgrade or grant access. Do not retry. |
| 404 | Resource not found. Common when from_checkpoint_id, chat_id, or a workflow template_id does not exist or has been deleted. | Validate the identifier; reset and create a new conversation if appropriate. |
| 422 | Request body failed schema validation. | Inspect detail[].loc and detail[].msg; correct the payload. |
| 429 | Rate limit exceeded. | Back off and retry. See Retry and backoff below. |
| 5xx | Transient server-side failure. | Retry with backoff; alert if persistent. |
Call response.raise_for_status() (in Python’s requests), or an equivalent status check in your own client, to convert HTTP errors into exceptions before entering the streaming loop. A non-2xx response carries an error body rather than SSE events, so a consumer that reads it as a stream will silently see zero events.
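As a minimal sketch of reading a validation failure, the helper below flattens a `detail` array into one line per offending field. It assumes each entry carries a `loc` list and a `msg` string, as the table above suggests; the exact body shape may differ, and `describe_validation_errors` is a hypothetical name rather than part of any SDK.

```python
from typing import Any


def describe_validation_errors(body: Any) -> list:
    """Flatten a validation error body into 'path: message' strings.

    Assumes the body looks like {"detail": [{"loc": [...], "msg": "..."}]}.
    Falls back to the raw value when the shape is different.
    """
    if not isinstance(body, dict):
        return [str(body)]
    detail = body.get("detail")
    if not isinstance(detail, list):
        # A plain error message, or no detail field at all.
        return [str(detail)] if detail is not None else []
    lines = []
    for item in detail:
        if not isinstance(item, dict):
            lines.append(str(item))
            continue
        path = ".".join(str(part) for part in item.get("loc", []))
        msg = item.get("msg", "invalid value")
        lines.append(f"{path}: {msg}" if path else msg)
    return lines


# Illustrative body only; the message text here is made up for the example.
sample = {"detail": [{"loc": ["body", "research_effort"], "msg": "invalid value"}]}
for line in describe_validation_errors(sample):
    print(line)
```

Calling this on the parsed body of a 400 or 422 response, before raising, gives a log line that points at the field to fix.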

Stream-level errors

Once the HTTP response is established with a 2xx status, errors are delivered as typed messages inside the stream. Three message types matter:
| Type | Severity | Effect on stream | Recommended action |
| --- | --- | --- | --- |
| LLM_RETRY | Informational | Continues. Agent retries internally. | Log if useful; otherwise ignore. No user-facing impact. |
| TOOL_ERROR | Recoverable | Continues. Agent may try a different tool or proceed without that source. | Log with tool_name. Do not raise. Surface to the user only if the final answer is materially degraded. |
| ERROR | Fatal | Terminates. No COMPLETE event will follow. | Surface to the user. Abort the stream consumer. Log with full context. |

LLM_RETRY

The agent’s upstream LLM call hit a transient failure (rate limit, timeout, short-lived error) and is being retried automatically. The stream will resume without further action.
{"type": "LLM_RETRY", "message": "Retrying after upstream rate limit"}
Consumers should not act on this event other than emitting a debug-level log line. There is no user-facing surface beyond an optional “agent is retrying…” indicator.

TOOL_ERROR

A specific tool failed. The tool_name field identifies which one. The agent will continue, possibly calling a different tool or synthesizing without that source. Most integrations log these events and only surface a user-visible warning if the final answer is materially degraded (for example, if every search call failed).
{"type": "TOOL_ERROR", "tool_name": "search", "error": "Upstream search timed out"}
A common pattern is to count TOOL_ERROR events per tool name and, if the count exceeds a threshold during the same request, surface a soft warning at the end (“Some sources could not be retrieved; the answer may be incomplete.”).
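The counting pattern can be sketched as two small helpers. The threshold value of 2 and the function names are assumptions chosen for illustration; the message fields (`type`, `tool_name`) follow the example payload above.

```python
from collections import Counter
from typing import Iterable, Optional

SOFT_WARNING = "Some sources could not be retrieved; the answer may be incomplete."


def tally_tool_errors(messages: Iterable[dict]) -> Counter:
    """Count TOOL_ERROR messages per tool_name; other message types are ignored."""
    counts: Counter = Counter()
    for msg in messages:
        if msg.get("type") == "TOOL_ERROR":
            counts[msg.get("tool_name") or "unknown"] += 1
    return counts


def soft_warning(counts: Counter, threshold: int = 2) -> Optional[str]:
    """Return the warning text when any single tool exceeded the threshold."""
    if any(n > threshold for n in counts.values()):
        return SOFT_WARNING
    return None
```

Keeping the tally separate from the decision makes it easy to tune the threshold per tool later, for example tolerating more failures from a flaky optional source than from the primary search tool.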

ERROR

The request cannot continue. The stream terminates immediately after this event; no COMPLETE will follow. Surface the error to the user and stop reading the stream.
{"type": "ERROR", "error": "Request failed: invalid checkpoint id"}
The error field is a human-readable string suitable for logging, but not always suitable for direct display to end users. Wrap it in your own UX-friendly message where appropriate.
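One way to keep the raw string out of the UI is to log it and return a separate display message. The sketch below is an assumption-heavy illustration: the display strings are invented, and matching on a substring of the error text is fragile, so treat the `"checkpoint"` check as a placeholder for whatever classification your integration can rely on.

```python
import logging

log = logging.getLogger(__name__)

GENERIC_MESSAGE = "Something went wrong while researching your question. Please try again."
CHECKPOINT_MESSAGE = "This conversation could not be resumed. Please start a new one."


def user_facing_error(msg: dict) -> str:
    """Log the raw ERROR payload and return a display-safe message."""
    raw = msg.get("error") or "Unknown error"
    # The raw text goes to logs only; it may contain internal detail.
    log.error("Research Agent ERROR: %s", raw)
    # Hypothetical mapping based on the example payload in this guide.
    if "checkpoint" in raw.lower():
        return CHECKPOINT_MESSAGE
    return GENERIC_MESSAGE
```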

Worked example: error-aware streaming handler

The handler below distinguishes HTTP-level failures, stream-level errors, and informational retries. It logs per-event details and only raises when the request can no longer succeed.
import json
import logging
import requests

log = logging.getLogger(__name__)


class ResearchAgentError(Exception):
    """Raised for unrecoverable Research Agent failures."""


def stream_with_error_handling(api_key: str, message: str) -> str:
    endpoint = "https://agents.bigdata.com/v1/research-agent"
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    payload = {"message": message, "research_effort": "standard"}

    answer = ""
    tool_errors: dict[str, int] = {}

    with requests.post(endpoint, headers=headers, json=payload, stream=True, timeout=300) as r:
        # HTTP-level errors: convert to typed exceptions before streaming.
        if r.status_code == 401:
            raise ResearchAgentError("API key is missing or invalid")
        if r.status_code == 403:
            raise ResearchAgentError(f"Not entitled: {r.text}")
        if r.status_code == 404:
            raise ResearchAgentError(f"Resource not found: {r.text}")
        if r.status_code == 429:
            # The API does not currently set a Retry-After header on 429.
            # Use exponential backoff with jitter (see stream_with_retries below).
            raise ResearchAgentError("Rate limited")
        if r.status_code >= 500:
            raise ResearchAgentError(f"Upstream failure: {r.status_code}")
        r.raise_for_status()

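        # SSE payloads are UTF-8. Without an explicit charset, requests may
        # assume ISO-8859-1 for text/* responses and garble non-ASCII text.
        r.encoding = "utf-8"

        for raw_line in r.iter_lines(decode_unicode=True):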
            if not raw_line or not raw_line.startswith("data: "):
                continue
            try:
                event = json.loads(raw_line[6:])
            except json.JSONDecodeError:
                log.debug("Skipping malformed SSE line")
                continue

            msg = event.get("message", {})
            msg_type = msg.get("type")

            if msg_type == "ANSWER":
                answer += msg.get("content", "")

            elif msg_type == "LLM_RETRY":
                # Informational only: debug level, as recommended for LLM_RETRY.
                log.debug("Agent retrying: %s", msg.get("message"))

            elif msg_type == "TOOL_ERROR":
                tool = msg.get("tool_name") or "unknown"
                tool_errors[tool] = tool_errors.get(tool, 0) + 1
                log.warning("Tool error from %s: %s", tool, msg.get("error"))

            elif msg_type == "ERROR":
                raise ResearchAgentError(msg.get("error", "Unknown error"))

            elif msg_type == "COMPLETE":
                if tool_errors:
                    log.warning(
                        "Completed with tool errors: %s",
                        ", ".join(f"{k}={v}" for k, v in tool_errors.items()),
                    )
                return answer

    # Reached end of stream without COMPLETE: treat as truncated.
    raise ResearchAgentError("Stream ended without COMPLETE event")
This handler intentionally distinguishes:
  • Pre-stream HTTP errors — raised immediately so callers do not waste cycles reading an empty stream.
  • In-stream recoverable events — logged at appropriate levels but never raised.
  • In-stream fatal events — raised so callers can surface the failure.
  • Truncated streams — detected by the absence of a COMPLETE event, preventing silent “empty answer” bugs.
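The same distinctions can be exercised without a network call by separating line parsing from the HTTP request. The sketch below is one possible arrangement, not the handler's actual structure: it assumes the `{"message": {...}}` envelope used in the handler above, and it raises `RuntimeError` as a stand-in for `ResearchAgentError` so the block stays self-contained.

```python
import json
from typing import Iterable, Iterator


def iter_messages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the inner message dict from each well-formed `data: ` line."""
    for line in lines:
        if not line or not line.startswith("data: "):
            continue
        try:
            event = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # skip the malformed line, keep reading the stream
        msg = event.get("message") if isinstance(event, dict) else None
        if isinstance(msg, dict):
            yield msg


def collect_answer(lines: Iterable[str]) -> str:
    """Concatenate ANSWER content; raise on ERROR or on a missing COMPLETE."""
    answer = ""
    for msg in iter_messages(lines):
        kind = msg.get("type")
        if kind == "ANSWER":
            answer += msg.get("content", "")
        elif kind == "ERROR":
            raise RuntimeError(msg.get("error", "Unknown error"))
        elif kind == "COMPLETE":
            return answer
    raise RuntimeError("Stream ended without COMPLETE event")


# Simulated stream: two answer chunks, one malformed line, then COMPLETE.
fake_stream = [
    'data: {"message": {"type": "ANSWER", "content": "Hel"}}',
    "data: not json",
    'data: {"message": {"type": "ANSWER", "content": "lo"}}',
    'data: {"message": {"type": "COMPLETE"}}',
]
print(collect_answer(fake_stream))
```

Dropping the last line from `fake_stream` triggers the truncated-stream path, which is a convenient unit test for the "no COMPLETE" case.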

Retry and backoff

For 429 and 5xx HTTP responses, exponential backoff with jitter is the safe default:
import random
import time


def stream_with_retries(api_key: str, message: str, max_attempts: int = 4) -> str:
    delay = 1.0
    last_exc: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            return stream_with_error_handling(api_key, message)
        except ResearchAgentError as exc:
            text = str(exc)
            if "Rate limited" not in text and "Upstream failure" not in text:
                raise
            last_exc = exc
            if attempt == max_attempts:
                break
            sleep_for = delay + random.uniform(0, delay)
            log.info("Retry %d/%d after %.1fs", attempt, max_attempts, sleep_for)
            time.sleep(sleep_for)
            delay *= 2

    assert last_exc is not None
    raise last_exc
Do not retry on 400, 401, 403, 404, or 422: those reflect client-side or identity problems that will not resolve on retry. In-stream events do not benefit from retry at the request level. LLM_RETRY and TOOL_ERROR already represent the agent’s own internal retry behavior; the request is doing the right thing without your help.
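The retry rule and the sleep schedule can also be written as two pure functions, which are easier to test than matching on exception text. This is a sketch under the same assumptions as the loop above (base delay of 1 second, doubling per attempt, jitter of up to one extra delay); the function names are hypothetical.

```python
def is_retryable_status(status: int) -> bool:
    """Only 429 and 5xx responses are worth retrying at the request level."""
    return status == 429 or 500 <= status <= 599


def backoff_bounds(attempt: int, base: float = 1.0) -> tuple:
    """Return (min, max) sleep in seconds after failed attempt `attempt` (1-based).

    Mirrors `delay + random.uniform(0, delay)` with `delay` doubling each time.
    """
    delay = base * 2 ** (attempt - 1)
    return delay, 2 * delay


# Sleep ranges after attempts 1..3 (no sleep follows the final attempt).
for attempt in range(1, 4):
    low, high = backoff_bounds(attempt)
    print(f"after attempt {attempt}: sleep between {low:.0f}s and {high:.0f}s")
```

With four attempts the loop sleeps at most 2 + 4 + 8 = 14 seconds in total, on top of whatever time each request itself takes.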

When to abort versus continue

| Situation | Abort | Continue |
| --- | --- | --- |
| HTTP 400 / 401 / 403 / 404 / 422 | Yes; fix and re-issue. | |
| HTTP 429 | Yes; back off, then retry. | |
| HTTP 5xx | Yes; back off, then retry. | |
| ERROR message | Yes; stream is over. | |
| LLM_RETRY message | | Yes; the agent is recovering. |
| TOOL_ERROR message | | Yes; the agent may compensate. Tally counts for end-of-stream UX. |
| Truncated stream (no COMPLETE) | Yes; the response is incomplete. | |
| Malformed data: line (JSONDecodeError) | | Yes; skip the line only, not the stream. |

Next steps

Streaming responses

Full reference for every message type the stream may emit.

Conversation continuity

Recover from a 404 on from_checkpoint_id by resetting the thread.