Structured Output Extraction

The Research Agent generates a Markdown answer by default. For programmatic consumers — screeners, dashboards, downstream pipelines — a Markdown blob is often the wrong shape. The structured_output_schema field on a request asks the agent to produce a parallel JSON object that conforms to a schema you provide, extracted from the same research material that produced the answer. The schema acts as a typed contract between the agent and your code. This page covers how structured_output_schema works end-to-end: when the extraction runs, the schema constraints, a worked example, schema-design tips, and how to combine structured output with the other Research Agent features.

How it works

structured_output_schema is a JSON Schema document attached to the request. When set, the agent runs a dedicated extraction step after the main answer has been produced. The extracted object is emitted as a single STRUCTURED_OUTPUT message before the final COMPLETE event:

... ANSWER chunks ...
STRUCTURED_OUTPUT      <- one event, contains the parsed JSON
COMPLETE

Because extraction runs after the answer, the agent has the full research context (search results, intermediate reasoning, the answer text itself) available when populating the schema. This makes structured output fundamentally different from prompt-engineering “return JSON in your answer” patterns — the JSON does not have to be part of the visible answer, and the agent uses native function-calling / JSON-mode to produce it. Extraction is all-or-nothing: when it succeeds you get one schema-conformant object, and in the rare case it fails the request still completes but no STRUCTURED_OUTPUT event is emitted (see Schema conformance).
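The event ordering described above can be exercised offline. The following is a minimal sketch, assuming each event arrives as a line prefixed with "data: " whose JSON body carries a message object with type and content keys (the shape the worked example on this page parses), and assuming COMPLETE uses the same envelope. The helper name collect_events and the sample lines are illustrative and not part of the API.

```python
import json

def collect_events(lines):
    """Fold an iterable of event lines into (answer_text, structured_or_None)."""
    answer_parts = []
    structured = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and anything that is not a data line
        message = json.loads(line[len("data: "):]).get("message", {})
        kind = message.get("type")
        if kind == "ANSWER":
            answer_parts.append(message.get("content", ""))
        elif kind == "STRUCTURED_OUTPUT":
            structured = message.get("content")
        elif kind == "COMPLETE":
            break  # assumed to be the last event of the request
    return "".join(answer_parts), structured

# Hand-written sample lines standing in for a real stream (hypothetical content).
sample = [
    'data: {"message": {"type": "ANSWER", "content": "NVIDIA "}}',
    '',
    'data: {"message": {"type": "ANSWER", "content": "grew."}}',
    'data: {"message": {"type": "STRUCTURED_OUTPUT", "content": {"company": "NVIDIA"}}}',
    'data: {"message": {"type": "COMPLETE"}}',
]
answer, structured = collect_events(sample)
```

If the stream ends without a STRUCTURED_OUTPUT message, the second return value stays None, which matches the all-or-nothing behavior described above.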

Schema constraints

A few rules to keep in mind when authoring your schema:

Top-level must be a JSON object. If you need a list of items, wrap them in an object with a single array property (see List outputs below).
Nested objects, arrays, and primitives are all supported. Use type, properties, required, items, and enum as you would in any JSON Schema.
Field descriptions matter. The agent reads each property’s description to know what to extract. Specific, instruction-shaped descriptions ("Net income in millions of USD for the most recent fiscal year") produce dramatically better extractions than generic ones ("net income").
Required fields should reflect what must always be present. Marking every field as required can cause extraction failures when one piece of data is genuinely absent; mark only fields you can guarantee.
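Some of these rules can be checked before a request is sent. The following is a minimal client-side sketch, not something the API itself exposes: the helper name schema_problems is hypothetical, and it only inspects the top level of the schema for a non-object type, undeclared required fields, and missing descriptions.

```python
def schema_problems(schema):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if schema.get("type") != "object":
        problems.append("top-level type must be 'object'")
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in properties:
            problems.append(f"required field '{name}' is not declared in properties")
    for name, spec in properties.items():
        if not spec.get("description", "").strip():
            problems.append(f"property '{name}' has no description")
    return problems

good = {
    "type": "object",
    "properties": {
        "net_income_musd": {
            "type": "number",
            "description": "Net income in millions of USD for the most recent fiscal year.",
        }
    },
    "required": ["net_income_musd"],
}
# A bare array at the top level breaks the first rule.
bad = {"type": "array", "items": {"type": "string"}}

print(schema_problems(good))
print(schema_problems(bad))
```

A check like this catches structural slips early; it cannot tell you whether a description is specific enough, which still needs a human read.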

Worked example: extract financial metrics

The request below asks for a one-paragraph credit summary and a parallel JSON object with quantitative metrics. The Markdown answer is for human readers; the JSON is for downstream code.

import json
import requests
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("BIGDATA_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "string",
            "description": "Common name of the company, e.g. 'NVIDIA'."
        },
        "fiscal_year": {
            "type": "string",
            "description": "The fiscal year the metrics refer to, e.g. 'FY2025'."
        },
        "revenue_billions_usd": {
            "type": "number",
            "description": "Total revenue for the fiscal year in billions of USD."
        },
        "gross_margin_pct": {
            "type": "number",
            "description": "Gross margin as a percentage (e.g. 69.85 for 69.85%)."
        },
        "operating_margin_pct": {
            "type": "number",
            "description": "Operating margin as a percentage."
        },
        "key_risk_factors": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Top 3-5 risk factors disclosed in the period, each as a short noun phrase."
        }
    },
    "required": ["company", "fiscal_year"]
}

payload = {
    "message": "Write a one-paragraph credit summary of NVIDIA covering FY2025 results.",
    "research_effort": "standard",
    "structured_output_schema": schema,
}

headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}

answer_text = ""
structured: dict | None = None

with requests.post(
    "https://agents.bigdata.com/v1/research-agent",
    headers=headers,
    json=payload,
    stream=True,
    timeout=300,
) as r:
    r.raise_for_status()
    for raw_line in r.iter_lines(decode_unicode=True):
        if not raw_line or not raw_line.startswith("data: "):
            continue
        try:
            event = json.loads(raw_line[6:])
        except json.JSONDecodeError:
            continue
        msg = event.get("message", {})
        if msg.get("type") == "ANSWER":
            answer_text += msg.get("content", "")
        elif msg.get("type") == "STRUCTURED_OUTPUT":
            structured = msg.get("content")
        elif msg.get("type") == "ERROR":
            raise RuntimeError(msg.get("error"))

print("Markdown answer:\n", answer_text)
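# An empty dict is still a valid extraction, so compare against None explicitly.
print("\nStructured output:\n", json.dumps(structured, indent=2) if structured is not None else "(no STRUCTURED_OUTPUT event received)")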

When the stream includes a STRUCTURED_OUTPUT message, the structured variable holds a dict matching the schema, ready to be persisted or passed downstream; otherwise it remains None.
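Because only company and fiscal_year are required in this schema, downstream code should treat every other key as optional. One way to do that, sketched here under the assumption that you want a typed record, is to map the dict onto a dataclass. The CreditMetrics class, the to_metrics helper, and the sample values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CreditMetrics:
    company: str
    fiscal_year: str
    revenue_billions_usd: Optional[float] = None
    gross_margin_pct: Optional[float] = None
    operating_margin_pct: Optional[float] = None
    key_risk_factors: List[str] = field(default_factory=list)

def to_metrics(structured: dict) -> CreditMetrics:
    """Map a schema-conformant dict onto the record, tolerating absent optional keys."""
    return CreditMetrics(
        company=structured["company"],          # required by the schema
        fiscal_year=structured["fiscal_year"],  # required by the schema
        revenue_billions_usd=structured.get("revenue_billions_usd"),
        gross_margin_pct=structured.get("gross_margin_pct"),
        operating_margin_pct=structured.get("operating_margin_pct"),
        key_risk_factors=list(structured.get("key_risk_factors", [])),
    )

# Made-up payload in which the research surfaced only some of the optional fields.
metrics = to_metrics({
    "company": "ExampleCo",
    "fiscal_year": "FY2025",
    "revenue_billions_usd": 1.5,
})
```

Absent optional keys become None or an empty list, so later code can distinguish "not surfaced" from a real zero.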

List outputs

The top level of a structured_output_schema must be an object (see Schema constraints), so to return a list of items, wrap the array in a single property:

schema = {
    "type": "object",
    "properties": {
        "competitors": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Competitor company name."},
                    "ticker": {"type": "string", "description": "Stock ticker, if public."},
                    "differentiator": {
                        "type": "string",
                        "description": "One-line summary of what makes this competitor distinct."
                    }
                },
                "required": ["name"]
            },
            "description": "Up to five direct competitors, ordered by competitive intensity."
        }
    },
    "required": ["competitors"]
}

A request using this schema returns {"competitors": [...]}. Downstream code accesses the array via structured["competitors"].
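Since only name is required on each item, code that reads the array should not assume ticker or differentiator is present. The sketch below flattens the list into tuples; the competitor_rows helper and the sample payload are hypothetical, and the empty-string default is one possible choice.

```python
def competitor_rows(structured):
    """Flatten {"competitors": [...]} into (name, ticker, differentiator) tuples."""
    rows = []
    for item in structured["competitors"]:
        rows.append((
            item["name"],                    # required on every item
            item.get("ticker", ""),          # absent for private companies
            item.get("differentiator", ""),
        ))
    return rows

# Made-up payload shaped like the schema above.
sample = {
    "competitors": [
        {"name": "Alpha Corp", "ticker": "ALPH", "differentiator": "Lower-cost hardware."},
        {"name": "Beta Labs"},
    ]
}
rows = competitor_rows(sample)
```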

Schema-design tips

Lead with the description. A property’s description is the single most important field for extraction quality. Write it like an instruction to the agent: what to find, where to find it, and the expected unit or format.
Use enum to constrain categorical fields. For sentiment, ratings, or classification fields, list the allowed values explicitly: {"type": "string", "enum": ["positive", "neutral", "negative"]}. The agent will pick one rather than inventing an out-of-band value.
Prefer flat schemas. Deep nesting (3+ levels) tends to reduce extraction reliability. If you need rich structure, split it across multiple requests rather than packing one massive schema.
Keep arrays short and bounded. Specify expected length in the description (“up to five”, “exactly three”) so the agent does not over- or under-fill.
Mark required minimally. Only fields you are certain the research will surface should be required. Optional fields gracefully degrade to missing keys when the data is absent.
Test on representative prompts. Different prompts can elicit different shapes of evidence; an extraction schema that works for “summarize NVIDIA’s earnings” may behave differently for “compare NVIDIA and AMD.”
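Several of these tips can be combined in one small schema and checked mechanically. The sketch below shows an enum-constrained field, a bounded array whose description states its expected length, and a helper that counts object nesting so schemas with three or more levels can be flagged. The object_depth helper and both schemas are illustrative assumptions, not part of the API.

```python
def object_depth(schema):
    """Number of nested object levels in a schema (a flat object counts as 1)."""
    if schema.get("type") == "array":
        return object_depth(schema.get("items", {}))  # arrays add no level themselves
    if schema.get("type") != "object":
        return 0
    children = schema.get("properties", {}).values()
    return 1 + max((object_depth(child) for child in children), default=0)

sentiment_schema = {
    "type": "object",
    "properties": {
        "sentiment": {
            "type": "string",
            "enum": ["positive", "neutral", "negative"],
            "description": "Overall tone of management commentary in the most recent earnings call.",
        },
        "drivers": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Up to three drivers of that tone, each as a short noun phrase.",
        },
    },
    "required": ["sentiment"],
}

# A deliberately deep schema, used only to show what the helper would flag.
deep_schema = {
    "type": "object",
    "properties": {
        "a": {"type": "object", "properties": {
            "b": {"type": "object", "properties": {"c": {"type": "string"}}},
        }},
    },
}

print(object_depth(sentiment_schema), object_depth(deep_schema))
```

A depth of 3 or more is a signal to consider splitting the schema across requests, as suggested above.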

Schema conformance

Structured output is produced with the model’s native structured-output mode, which enforces your schema. When a STRUCTURED_OUTPUT message is emitted, its content already conforms to the schema you supplied: you do not need to re-validate it client-side, and you will not receive a partially-filled object. Optional fields you did not mark required may simply be absent when the research did not surface them; that is normal and expected.

Extraction is all-or-nothing. In the rare case the model cannot map the research onto your schema, the request still completes but no STRUCTURED_OUTPUT message is emitted. This is not a routine outcome to build fallbacks around. If you hit it repeatedly for a given schema, simplify the schema (see the design tips above) and report it, rather than absorbing it silently.
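One way to avoid absorbing a missing extraction silently is to count and log it per schema. This is a minimal sketch under stated assumptions: the record_extraction helper, the schema name passed to it, and the threshold of three are all hypothetical choices, and the structured argument is whatever your stream-handling code collected (None when no STRUCTURED_OUTPUT message arrived).

```python
import logging
from collections import Counter

logger = logging.getLogger("structured_output")
missing_by_schema = Counter()

def record_extraction(schema_name, structured, alert_after=3):
    """Return True when an object was extracted; count and log the miss otherwise."""
    if structured is not None:
        return True  # an empty dict is still a successful extraction
    missing_by_schema[schema_name] += 1
    count = missing_by_schema[schema_name]
    # Escalate once the same schema has missed repeatedly.
    level = logging.ERROR if count >= alert_after else logging.WARNING
    logger.log(level, "no STRUCTURED_OUTPUT for schema %r (%d so far)", schema_name, count)
    return False

ok = record_extraction("snapshot", {"company": "ExampleCo"})
missed = record_extraction("snapshot", None)
```

Repeated ERROR-level entries for one schema name are the cue to simplify that schema and report the problem.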

Combining with other features

With `tools_configs`

Use a tightly scoped search config alongside a strict schema to build deterministic, narrow pipelines. For example, a “company snapshot” pipeline might scope to one entity ID and extract a fixed financial-metrics schema:

payload = {
    "message": "Build a financial snapshot of the company.",
    "research_effort": "standard",
    "tools_configs": {
        "search": {"query_filters": {"entities": {"all_of": ["D8442A"]}}}
    },
    "structured_output_schema": snapshot_schema,
}

This pattern is the foundation for screener-style applications.
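For a screener that runs the same snapshot over many companies, the payload can be built by a small function. The sketch below reuses the field names from the payload above; the snapshot_payload helper and the placeholder schema are hypothetical, and the real snapshot_schema is whatever you define for your pipeline.

```python
def snapshot_payload(entity_id, schema, effort="standard"):
    """Build a request body scoped to one entity ID with a fixed extraction schema."""
    return {
        "message": "Build a financial snapshot of the company.",
        "research_effort": effort,
        "tools_configs": {
            "search": {"query_filters": {"entities": {"all_of": [entity_id]}}}
        },
        "structured_output_schema": schema,
    }

# Placeholder schema for illustration only.
placeholder_schema = {
    "type": "object",
    "properties": {
        "company": {"type": "string", "description": "Common name of the company."}
    },
    "required": ["company"],
}

# "D8442A" is the entity ID used in the example above.
payloads = [snapshot_payload(eid, placeholder_schema) for eid in ["D8442A"]]
```

Keeping the message, the search scope, and the schema fixed while only the entity ID varies is what makes the outputs comparable across companies.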

With conversation continuity

Structured output is computed once per request. A follow-up turn with the same chat_id runs its own extraction against its own answer; previous turns’ structured outputs are not re-emitted. If you need a running structured state across turns, accumulate the values on the client side.
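Client-side accumulation can be as simple as merging each turn's object into a running dict. The following sketch assumes a "latest turn wins" policy for keys that appear more than once; the merge_turn helper and the sample values are hypothetical, and other policies (keeping a per-turn history, for example) may suit your application better.

```python
def merge_turn(state, structured):
    """Return a new dict in which keys from the latest turn overwrite earlier ones."""
    merged = dict(state)  # copy so earlier snapshots are left untouched
    if structured is not None:
        merged.update(structured)
    return merged

state = {}
state = merge_turn(state, {"company": "ExampleCo", "rating": "neutral"})  # turn 1
state = merge_turn(state, {"rating": "positive"})                         # turn 2
state = merge_turn(state, None)  # a turn with no STRUCTURED_OUTPUT leaves state as-is
```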

Next steps

Streaming responses

See where STRUCTURED_OUTPUT falls in the message lifecycle.

Search tool configuration

Pair a strict schema with a tightly scoped search for deterministic pipelines.

Multi-turn conversations

Run structured extraction across multiple turns of a continuing conversation.

Conversation continuity

chat_id, checkpoint_id, and the persistence model for follow-up turns.
