Grounding and Citations

Every claim the Research Agent and Workflows produce is grounded in retrieved source documents. The stream surfaces those source attributions as GROUNDING messages, each containing one or more references that link a specific span of the answer text to a specific source. Used correctly, GROUNDING lets you render verifiable inline citations and footnote-style source lists. Used incorrectly, it silently attaches citations to the wrong spans of text. This guide explains the structure of a GroundingReference, the buffering rule clients must follow, and how grounding interacts with AUDIT traces, then walks through a complete worked example that turns a streamed response into a Markdown answer with inline citations.

Anatomy of a `GroundingReference`

A GROUNDING message carries a references array. Each reference describes one attribution.

Field	Description
`start`	Inclusive character offset where the cited span begins in the cumulative answer text.
`end`	Exclusive character offset where the cited span ends.
`tool_name`	The tool that produced this reference (commonly `"search"`).
`audit_id`	Identifier linking to a specific entry in a preceding `AUDIT` message.
`source`	The cited document or web result. Populated only for search results; `null` for every other tool (earnings calendar, company tearsheet, charts, and so on), where the citation is anchored at the whole-tool level via `audit_id` rather than to a single document.

When populated, source is one of two shapes, discriminated on its type field — both produced only by the search tool:

type: "BIGDATA" — a document from the Bigdata.com content platform.
type: "EXTERNAL" — a result from an external web source.

For the full field-level schema of each (the BigdataDocument and ExternalResult schemas), see the Research Agent API reference rather than relying on a copy here. A null source is normal and expected — it does not mean the reference is broken. It means the citing tool is not a search tool, so the reference grounds an answer span to that tool’s result as a whole. Tie it back to the matching AUDIT trace through audit_id, and render it as a tool-level attribution (or leave it out of a document-only citation list).
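As a rough illustration of the cases described above, the sketch below sorts a reference into a document citation (BIGDATA or EXTERNAL) or a tool-level attribution. The helper name `classify_reference`, its return labels, and the sample data (including the `company_tearsheet` tool name) are hypothetical and chosen for this example; only the `source`, `type`, `tool_name`, and `audit_id` fields come from the description above.

```python
def classify_reference(ref: dict) -> str:
    """Label a reference by the shape of its source (hypothetical helper)."""
    source = ref.get("source")
    if source is None:
        # Normal for non-search tools: the citation is anchored via audit_id.
        return "tool-level"
    kind = source.get("type")
    if kind in ("BIGDATA", "EXTERNAL"):
        return kind
    # Defensive default in case a shape not covered in this guide appears.
    return "unknown"


# Made-up sample references, for illustration only.
samples = [
    {"start": 0, "end": 10, "tool_name": "search", "audit_id": "a1",
     "source": {"type": "BIGDATA"}},
    {"start": 11, "end": 20, "tool_name": "search", "audit_id": "a1",
     "source": {"type": "EXTERNAL"}},
    {"start": 21, "end": 30, "tool_name": "company_tearsheet", "audit_id": "a2",
     "source": None},
]

for ref in samples:
    print(ref["tool_name"], "->", classify_reference(ref))
```

Branching on `source is None` first keeps the tool-level case from being mistaken for a malformed document citation.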

The buffering rule

start and end are offsets into the cumulative answer text accumulated across all ANSWER chunks. Apply them only after you have finished reading the stream (or at least all of the answer chunks). Treating them as offsets into a single chunk or a partial buffer will misalign every citation.

Concretely, build the answer text by concatenating ANSWER.content values in arrival order, exactly as they appear. Do not trim, normalize, or rewrap the text between chunks: any character-level edit invalidates downstream offsets. Once the stream is fully consumed (or the answer phase is finished), slice the buffer using the reference offsets to extract the cited spans.

answer = ""
references = []

# ... in the streaming loop ...
if msg_type == "ANSWER":
    answer += msg.get("content", "")
elif msg_type == "GROUNDING":
    references.extend(msg.get("references", []))

# After the stream is fully consumed (or every ANSWER chunk has arrived):
for ref in references:
    cited_span = answer[ref["start"]:ref["end"]]
    source = ref.get("source")
    # source is populated only for search results; other tools ground at the
    # tool level, where source is null.
    if source:
        label = source.get("hd") or "(untitled source)"
    else:
        label = f"(tool-level: {ref['tool_name']})"
    print(f"{cited_span!r} -> {label}")

GROUNDING messages typically arrive interleaved with ANSWER chunks. Resolving references only after the stream ends guarantees that the answer text up to ref["end"] exists when you slice, which is why post-stream resolution is the safest pattern.
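To make the buffering rule concrete, here is a minimal sketch that joins the chunks first and only then slices, with a bounds check that catches references applied to an incomplete buffer. The function name `resolve_spans` and the sample chunks are invented for illustration, and the sketch assumes that offsets count characters the same way Python string indexing does.

```python
def resolve_spans(chunks: list, references: list) -> list:
    """Slice cited spans out of the full answer (hypothetical helper)."""
    # Concatenate exactly as received: no trimming or normalization.
    answer = "".join(chunks)
    spans = []
    for ref in references:
        start, end = ref["start"], ref["end"]
        # end is exclusive, so it may equal len(answer) but not exceed it.
        if not 0 <= start <= end <= len(answer):
            raise ValueError(
                f"offsets {start}:{end} fall outside the "
                f"{len(answer)}-character answer"
            )
        spans.append(answer[start:end])
    return spans


# Made-up chunks: the cited span crosses the chunk boundary.
chunks = ["Revenue grew ", "12% year over year."]
references = [{"start": 8, "end": 16}]

print(resolve_spans(chunks, references))  # ['grew 12%']
```

Applying the same offsets to the second chunk alone would return unrelated text, which is the misalignment the rule guards against.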

Linking AUDIT traces

The audit_id field on each reference matches the tool_id field on entries in preceding AUDIT messages. This linkage lets you retrieve the full tool execution context for a citation — for example, the search query that produced the source document.

# Build a lookup keyed by tool_id.
audits = {}

# ... in the streaming loop ...
if msg_type == "AUDIT":
    for trace in msg.get("audit_traces", []):
        audits[trace["tool_id"]] = trace

# At rendering time:
for ref in references:
    trace = audits.get(ref["audit_id"])
    if trace and trace.get("audit_type") == "SearchAuditV1":
        query = trace.get("query", {}).get("text", "")
        print(f"  cited via search query: {query!r}")

Audit traces are also useful for diagnostics: if a citation looks wrong, the matching trace shows exactly which query produced the source.
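For diagnostics, it can help to check up front which references have a matching trace at all. The sketch below assumes an `audits` lookup keyed by `tool_id`, built as in the snippet above; the helper name `split_by_trace` and the sample values are hypothetical.

```python
def split_by_trace(references: list, audits: dict) -> tuple:
    """Separate references with a matching AUDIT trace from those without."""
    matched, unmatched = [], []
    for ref in references:
        if ref.get("audit_id") in audits:
            matched.append(ref)
        else:
            unmatched.append(ref)
    return matched, unmatched


# Made-up data: one trace, two references, only one of which links to it.
audits = {"t1": {"tool_id": "t1", "audit_type": "SearchAuditV1"}}
references = [
    {"start": 0, "end": 5, "audit_id": "t1"},
    {"start": 6, "end": 9, "audit_id": "t9"},
]

matched, unmatched = split_by_trace(references, audits)
print(len(matched), "linked;", len(unmatched), "without a trace")
```

An unmatched reference is not necessarily an error in this sketch; it only flags citations whose execution context you cannot display.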

Worked example: render inline citations

The script below executes a Research Agent request, buffers the streamed answer, collects GROUNDING references, and produces a Markdown document where each cited span is annotated with a numeric superscript that points at a footnote-style source list.

import json
import requests
from collections import OrderedDict


def render_with_citations(api_key: str, prompt: str) -> str:
    endpoint = "https://agents.bigdata.com/v1/research-agent"
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    payload = {"message": prompt, "research_effort": "standard"}

    answer = ""
    references = []

    with requests.post(endpoint, headers=headers, json=payload, stream=True, timeout=300) as r:
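        r.raise_for_status()
        # Decode the stream as UTF-8. Without an explicit charset, requests can
        # fall back to ISO-8859-1 for text/* responses, which would corrupt
        # non-ASCII characters and shift every offset after them.
        r.encoding = "utf-8"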
        for raw_line in r.iter_lines(decode_unicode=True):
            if not raw_line or not raw_line.startswith("data: "):
                continue
            try:
                event = json.loads(raw_line[6:])
            except json.JSONDecodeError:
                continue

            msg = event.get("message", {})
            msg_type = msg.get("type")

            if msg_type == "ANSWER":
                answer += msg.get("content", "")
            elif msg_type == "GROUNDING":
                references.extend(msg.get("references", []))
            elif msg_type == "ERROR":
                raise RuntimeError(msg.get("error"))

    return _build_markdown(answer, references)


def _build_markdown(answer: str, references: list) -> str:
    # Deduplicate sources, preserving first-seen order.
    source_index = OrderedDict()
    for ref in references:
        src = ref.get("source")
        if not src:
            continue
        key = src.get("id") or src.get("url") or src.get("hd")
        if key and key not in source_index:
            source_index[key] = src

    # Map each source key to its citation number.
    numbering = {key: i + 1 for i, key in enumerate(source_index)}

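    # Insert footnote markers from right to left so earlier offsets stay valid.
    # A set drops duplicate (offset, number) pairs, which occur when two
    # references cite the same source at the same end offset.
    insertions = set()
    for ref in references:
        src = ref.get("source")
        if not src:
            continue
        key = src.get("id") or src.get("url") or src.get("hd")
        n = numbering.get(key)
        if n is None:
            continue
        insertions.add((ref["end"], n))

    annotated = answer
    # Descending order on (offset, n) also keeps markers that share an offset
    # in ascending numeric order in the final text.
    for offset, n in sorted(insertions, reverse=True):
        annotated = annotated[:offset] + f"[^{n}]" + annotated[offset:]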

    # Append the footnote list using the brand-standard attribution format.
    footnotes = []
    for key, src in source_index.items():
        n = numbering[key]
        footnotes.append(_format_footnote(n, src))

    return annotated + "\n\n" + "\n".join(footnotes)


def _format_footnote(n: int, src: dict) -> str:
    # EXTERNAL sources keep their name and URL under the nested "action" object;
    # "or {}" guards against an explicit null.
    action = src.get("action") or {}
    name = src.get("src_name") or action.get("name") or "Unknown source"
    ts = src.get("ts") or ""
    date = ts[:10]  # assumes an ISO 8601 timestamp, so this is YYYY-MM-DD
    url = src.get("url") or action.get("url") or ""
    label = f"{name} - {date}" if date else name
    return f"[^{n}]: [{label}]({url})" if url else f"[^{n}]: {label}"

The resulting Markdown carries inline [^1], [^2], etc. markers at the end of each cited span, followed by a footnote list whose entries use the ISO-date variant of the Bigdata.com brand-standard format, Source name - YYYY-MM-DD, linked to the source URL when one is available. _build_markdown skips references whose source is null, which are the tool-level citations described above. If you want to surface those too, render them from their tool_name and audit_id rather than expecting a document.
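One possible way to surface the tool-level citations is to list each distinct tool call once, below the document footnotes. This is a sketch under assumptions: the helper name `tool_level_notes`, the wording of each note, and the sample tool names are invented here, and only `tool_name`, `audit_id`, and `source` come from the reference structure.

```python
def tool_level_notes(references: list) -> list:
    """Build one Markdown bullet per distinct tool call with a null source."""
    seen = set()
    notes = []
    for ref in references:
        if ref.get("source"):
            continue  # document citations are handled by the footnote list
        key = (ref.get("tool_name"), ref.get("audit_id"))
        if key in seen:
            continue
        seen.add(key)
        notes.append(f"- Data from {key[0]} (audit {key[1]})")
    return notes


# Made-up references: two spans grounded in the same tool call, one document.
references = [
    {"start": 0, "end": 5, "tool_name": "earnings_calendar",
     "audit_id": "t2", "source": None},
    {"start": 6, "end": 9, "tool_name": "earnings_calendar",
     "audit_id": "t2", "source": None},
    {"start": 10, "end": 15, "tool_name": "search",
     "audit_id": "t1", "source": {"type": "BIGDATA", "id": "doc-1"}},
]

print("\n".join(tool_level_notes(references)))
```

Keying on the (tool_name, audit_id) pair is appropriate here because these references have no document to distinguish them.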

Deduplication strategies

A single source often supports multiple claims in the answer, so the same source may appear in many references. Most renderings should map each unique source to a single footnote number (as the example above does), so users see one entry per source rather than a duplicate-laden list. Good deduplication keys, in order of preference:

source.id — present on BIGDATA and EXTERNAL sources alike; the most reliable identifier.
source.url — stable for EXTERNAL sources; not always present for BIGDATA.
source.hd (headline) — the fallback when no id or url exists; can collide between unrelated documents that share a title.

Avoid deduplicating on audit_id: a single tool call can return many sources, so two references with the same audit_id may point at different documents.
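The preference order above can be sketched as a small key function. The names `dedup_key` and `unique_sources` are hypothetical, and the sample documents are made up. Unlike the worked example, this variant returns the field name together with the value, so that an id can never collide with a URL or headline that happens to contain the same string.

```python
def dedup_key(source: dict):
    """Return (field, value) for the first usable key, or None."""
    for field in ("id", "url", "hd"):
        value = source.get(field)
        if value:
            return (field, value)
    return None


def unique_sources(references: list) -> list:
    """Collect distinct sources in first-seen order."""
    unique = {}  # dicts preserve insertion order in Python 3.7+
    for ref in references:
        source = ref.get("source")
        if not source:
            continue
        key = dedup_key(source)
        if key is not None and key not in unique:
            unique[key] = source
    return list(unique.values())


# Made-up data: one tool call (same audit_id) returning two documents,
# one of which is cited twice.
references = [
    {"audit_id": "t1", "source": {"id": "doc-1", "hd": "Quarterly results"}},
    {"audit_id": "t1", "source": {"id": "doc-2", "hd": "Quarterly results"}},
    {"audit_id": "t1", "source": {"id": "doc-1", "hd": "Quarterly results"}},
]

print(len(unique_sources(references)))  # 2
```

Deduplicating the same data on `audit_id` or on the headline would collapse both documents into one entry.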

Source attribution format

The Bigdata.com brand-standard inline format for source citations is:

Source name - MMM DD, YYYY

linked to the source’s canonical URL when available. For BIGDATA sources, derive the name from source.src_name and the date from source.ts. For EXTERNAL sources, derive both from the nested source.action object. Use ISO-style YYYY-MM-DD when locale-aware month abbreviations are not feasible. Reports and end-of-response source sections should additionally aggregate every unique source into a “Sources” (or “Data sources”) section at the bottom of the output, matching the same format.
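A minimal sketch of the two date styles follows. It assumes the timestamp is an ISO 8601 string whose first ten characters are YYYY-MM-DD, and it uses a fixed English month table so the output does not depend on the process locale. The helper name `format_attribution`, the `iso` flag, and the sample values are hypothetical; how the name and timestamp are read from a given source shape is left to the caller.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_attribution(name: str, ts: str, iso: bool = False) -> str:
    """Format 'Source name - MMM DD, YYYY', or the ISO variant if iso=True."""
    try:
        day = date.fromisoformat((ts or "")[:10])
    except ValueError:
        # Missing or unparseable timestamp: fall back to the name alone.
        return name
    if iso:
        return f"{name} - {day.isoformat()}"
    return f"{name} - {MONTHS[day.month - 1]} {day.day:02d}, {day.year}"


# Made-up source name and timestamp, for illustration only.
print(format_attribution("Example Newswire", "2024-03-07T10:15:00Z"))
print(format_attribution("Example Newswire", "2024-03-07T10:15:00Z", iso=True))
```

Under these assumptions the first call prints the month-abbreviation form and the second prints the YYYY-MM-DD fallback.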

Next steps

Streaming responses

Error handling