Bigdata Briefs 2.0 generates short, materially relevant bullet-point briefs for any company on any date. There is no need to supply the topics or event types to track: the pipeline works out what is relevant for each entity and searches for it on its own, though topic and source customization would also be possible. An LLM verifies whether each claim is genuinely new, and no prior history is required, because novelty is checked against news retrieved directly from Bigdata. For each entity and date window, it extracts material bullet points from the retrieved news and filters them for relevance and novelty before writing them to the database. A universe or custom portfolio is processed entity by entity, in parallel, and results can be consumed through a Web App, a REST API, or an integration with Claude via MCP.

GitHub

Setup, configuration, and API reference

Web App

Try it live: real companies, real briefs, updated daily.

Pipeline Overview

The core of the pipeline is five sequential phases that every entity moves through:
  1. Search: exploratory pass to discover active themes, fiscal quarter resolution, targeted per-theme retrieval
  2. Bullet Generation: LLM generates bullets from each theme’s evidence, scored for relevance
  3. Grounding Check: each bullet is validated against its cited source text
  4. Novelty Check via Embedding: embedding-based retrieval of past bullets, LLM coarse decision
  5. Novelty Check via Search: claim-level verification against current evidence
These five phases produce the published bullets. An optional narrative step can summarise them afterwards.

1. Search

The search phase runs in three sequential steps, with fiscal-quarter resolution running in parallel:
  1. Exploratory search: a broad query retrieves news for the entity over the target window. The goal is discovery: which topics and events were active for this entity in this period.
  2. Concept extraction: an LLM reads the exploratory results and identifies up to five distinct thematic areas, extracting the specific concepts and terms that characterize each within the current news flow. These become the inputs to targeted retrieval.
  3. Concept-driven search: each theme gets its own retrieval pass using its extracted concepts. This separates the evidence pool by theme, so a dominant news story does not crowd out coverage of other developments that happened in the same window.
In parallel with discovery, fiscal quarter resolution queries the Bigdata /v1/events-calendar/query endpoint to resolve the entity’s current fiscal quarter title (e.g. “Q1 2026”). This label is injected into the bullet-generation and novelty prompts so the LLM understands the fiscal context of the window being processed, and can reason correctly about references to “the latest quarter”, guidance, earnings, or period-specific disclosures.

Figure: Theme extraction. An example of the Concept Extraction phase in which the LLM identified five themes from the exploratory search results and assigned specific search terms to each.
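The control flow of the search phase can be sketched as follows. This is a minimal illustration only: `run_search_phase` and the four callables it receives (`explore`, `extract_themes`, `retrieve`, `resolve_quarter`) are hypothetical stand-ins for the real search, LLM, and events-calendar calls, and the only detail taken from the description above is the ordering, the five-theme cap, and the parallel fiscal-quarter lookup.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_THEMES = 5  # "up to five distinct thematic areas"


def run_search_phase(entity_id, window, explore, extract_themes, retrieve, resolve_quarter):
    """Run the three search steps in order while the fiscal quarter resolves in parallel.

    All four callables are hypothetical placeholders; only the control flow
    described in the text is illustrated here.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Fiscal-quarter resolution starts immediately and runs alongside discovery.
        quarter_future = pool.submit(resolve_quarter, entity_id)

        # Step 1: broad exploratory search over the target window.
        documents = explore(entity_id, window)

        # Step 2: theme name -> list of concepts, capped at five themes.
        themes = dict(list(extract_themes(documents).items())[:MAX_THEMES])

        # Step 3: one retrieval pass per theme keeps the evidence pools separate.
        evidence = {
            name: retrieve(entity_id, window, concepts)
            for name, concepts in themes.items()
        }

        # Block until the parallel lookup finishes, then return both results.
        return evidence, quarter_future.result()
```

Passing the callables in as arguments keeps the sketch independent of any particular client library and makes the flow easy to exercise with stubs.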

2. Bullet Generation

Bullet generation runs as a subgraph loop, once per theme. For each theme, the LLM receives the retrieved evidence and generates a set of bullet points. Bullets from all previously processed themes are included in the prompt so the model avoids surfacing the same information again. Evidence naturally overlaps across themes, and without this cross-theme context the same fact can appear twice phrased differently. Each generated bullet is then scored on a 1-to-5 relevance scale, reflecting three criteria:
  • Actionability: can a reader act on it?
  • Materiality: does it matter for the entity?
  • Direct impact: is the entity the subject, not a passing mention?
Bullets below the threshold are dropped before the grounding phase.
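The per-theme loop and the relevance cut-off can be sketched like this. It is an illustration under assumptions: `generate` and `score` are hypothetical placeholders for the LLM calls, and the threshold value of 3 is assumed, since the text does not state the actual cut-off.

```python
RELEVANCE_THRESHOLD = 3  # assumed value; the text does not state the real cut-off


def generate_and_filter(evidence_by_theme, generate, score, threshold=RELEVANCE_THRESHOLD):
    """Generate bullets theme by theme, then drop those scored below the threshold.

    `generate(theme, evidence, previous)` and `score(text)` are hypothetical
    stand-ins for the LLM generation and relevance-scoring calls.
    """
    previous = []  # bullets from themes already processed (cross-theme context)
    kept = []
    for theme, evidence in evidence_by_theme.items():
        # The generator sees earlier bullets so it can avoid repeating them.
        new_bullets = list(generate(theme, evidence, tuple(previous)))
        previous.extend(new_bullets)
        for text in new_bullets:
            relevance = score(text)
            if not 1 <= relevance <= 5:
                raise ValueError(f"relevance must be between 1 and 5, got {relevance}")
            if relevance >= threshold:
                kept.append((theme, text, relevance))
    return kept
```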

3. Grounding Check

Before entering the novelty pipeline, each bullet is validated against the source chunks it cites. An LLM reads the bullet alongside the full text and headline of each cited source chunk and decides whether every substantive claim in the bullet is directly supported by the cited evidence. The check also verifies that the bullet is about the correct entity and contains nothing hallucinated or nonsensical. The decision is binary, valid or invalid, with no rewrite path: invalid bullets are discarded outright, and only bullets with all claims traceable to their cited sources proceed to the novelty phases.

4. Novelty Check via Embedding

This phase runs a fast, inexpensive novelty check against the entity’s publication history to eliminate clear repeats before they reach the more precise, and more costly, search-based verification. It requires a database of previously published bullets, so it is skipped in stateless mode (where no history is stored). Each bullet is embedded and used to retrieve the most semantically similar previously published bullets for the same entity from a vector index. The lookback window is configurable (NOVELTY_LOOKBACK_DAYS, default 30 days). An LLM reads the current bullet alongside these retrieved candidates and assigns one of three decisions:
  • Keep: no significant overlap with prior published content.
  • Rewrite: the bullet contains new information but overlaps with something already published.
  • Discard: the bullet is a repeat of already-published content.
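The retrieval half of this check can be sketched in a few lines. This is a minimal sketch, assuming an in-memory list in place of the real vector index: the record layout (`text`, `vector`, `published`), the function names, and the top-k value are all hypothetical, and only the 30-day default comes from NOVELTY_LOOKBACK_DAYS above. The LLM decision (keep, rewrite, discard) is not modelled.

```python
import math
from datetime import date, timedelta

NOVELTY_LOOKBACK_DAYS = 30  # default from the text


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_past_bullets(query_vector, history, today, lookback_days=NOVELTY_LOOKBACK_DAYS, k=3):
    """Return the k most similar past bullets published within the lookback window.

    `history` is a hypothetical list of dicts with "text", "vector", and
    "published" (a datetime.date) keys, standing in for the vector index.
    """
    cutoff = today - timedelta(days=lookback_days)
    recent = [item for item in history if item["published"] >= cutoff]
    ranked = sorted(recent, key=lambda item: cosine(query_vector, item["vector"]), reverse=True)
    return ranked[:k]
```

The candidates returned here are what the LLM would read alongside the current bullet before assigning its decision.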
Keep and rewrite bullets both move to the next phase, where they are re-evaluated against live evidence with a higher-precision judgment. As an entity accumulates a history of published bullets, this check becomes increasingly effective: more past content is available for comparison, more repeats are caught early, and fewer bullets reach the search phase. Over time this translates into lower cost and faster runs for entities that are monitored regularly.

Figure: Novelty check via embedding, discard example. An example of the Novelty Check via Embedding in which a forward-guidance bullet was discarded because the same figures had already been published the day before, as identified by comparing against previously generated bullets.

5. Novelty Check via Search

This phase verifies novelty at the claim level using targeted evidence retrieval, in four steps:
  1. Sentence splitting: if a bullet contains multiple clearly distinct events (facts that could stand as separate news items), it is first split into parts, each processed independently. Bullets that describe one situation with context, including contrast or prior-fact clauses, are not split.
  2. Claim decomposition: each part is decomposed into its individual factual claims. A claim is any discrete assertion that can be independently verified: a figure, a decision, a disclosure, a corporate action.
  3. Per-claim evidence retrieval: for each claim, a targeted search retrieves the evidence most relevant to that specific assertion from the Bigdata API. The search is scoped to the claim’s subject and date context, not the full bullet text.
  4. Per-claim novelty judgment: an LLM reads the claim alongside its retrieved evidence and assigns one of five labels:
Label | Meaning
novel | The claim is new and supported by evidence
old | The claim was already reported in prior coverage
partially_novel | The topic is known but the claim adds a specific new detail (a figure, a name, a date)
novel_trivial | The claim is new but carries no material information
novel_unsupported | The claim cannot be verified against the evidence because it is an inference, opinion, comparative judgment, forward-looking projection, or downstream consequence, or the evidence actively contradicts it
The per-claim labels are then aggregated into a bullet-level verdict that determines the final action:
Verdict | Condition | Action
novel | All claims are novel | Publish as-is
novel_with_context | At least one novel + at least one old or partially_novel | Rewrite: foreground the novel element, subordinate the known context
novel_noisy | At least one novel + only novel_trivial or novel_unsupported noise | Rewrite: strip noise, publish the novel claims as a clean sentence
partial_update | Exactly one partially_novel claim, no old (any novel_trivial/novel_unsupported noise is dropped) | Rewrite: known topic becomes subordinate context, new detail introduced after pivot marker
partial_update_with_context | At least one partially_novel + at least one old | Rewrite: old claims become the context clause, partially-novel claims introduced as the new material
multi_partial_update | 2+ partially_novel claims, no novel, no old | Rewrite: synthesise shared known baseline as subordinate clause, introduce all new details after pivot
discard_not_new | Only old and/or novel_trivial: no new information of any kind | Discard
discard_unsupported | No novel or partially_novel, but at least one novel_unsupported | Discard
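The aggregation rules in the table can be expressed as a small decision function. This is a sketch of one reading of the table, not the service's actual code: the function name is hypothetical, the order in which the conditions are tested is an assumption, and so is the choice to ignore novel_trivial and novel_unsupported noise in multi_partial_update.

```python
from collections import Counter

CLAIM_LABELS = {"novel", "old", "partially_novel", "novel_trivial", "novel_unsupported"}


def aggregate_verdict(claim_labels):
    """Map per-claim labels to a bullet-level verdict, following the table above."""
    if not claim_labels:
        raise ValueError("a bullet needs at least one claim")
    unknown = set(claim_labels) - CLAIM_LABELS
    if unknown:
        raise ValueError(f"unknown claim labels: {sorted(unknown)}")

    counts = Counter(claim_labels)
    novel = counts["novel"]
    partial = counts["partially_novel"]
    old = counts["old"]
    noise = counts["novel_trivial"] + counts["novel_unsupported"]

    if novel:
        if old or partial:
            return "novel_with_context"
        return "novel_noisy" if noise else "novel"
    if partial:
        if old:
            return "partial_update_with_context"
        # Noise labels are dropped here (assumed for the multi-claim case too).
        return "partial_update" if partial == 1 else "multi_partial_update"
    # Nothing novel or partially novel remains: discard.
    if counts["novel_unsupported"]:
        return "discard_unsupported"
    return "discard_not_new"
```

For example, `aggregate_verdict(["novel", "old"])` yields `novel_with_context`, which corresponds to a rewrite that foregrounds the new claim.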
All rewrite paths follow the same pattern: known context is subordinated in the opening clause, and the novel element is foregrounded after a pivot marker. For example: “While X has been a known concern, the company disclosed Y in its Q1 2026 filing.” Each verdict maps to its own targeted rewrite prompt rather than a generic instruction, so the final text reflects exactly what is new, what is context, and how to frame the relationship between the two. Rewritten bullets are rescored for relevance before being saved. A bullet that still carries already-known context alongside its new information (the “mixed” verdicts above) is flagged in the API response with is_fully_novel: false.

Figure: Novelty check via search, rewrite example. An example of the Novelty Check via Search in which an NVIDIA bullet was rewritten: the Vera CPU launch was already reported (as confirmed by evidence retrieved from Bigdata), while the Oracle deployment scale was new and was foregrounded.

Optional Narrative

A run can optionally produce a short editorial summary per entity: one flowing sentence (max 32 words) that captures the day’s brief, useful as a headline or preview. It is off by default and is generated only when generate_narrative: true is passed to POST /api/v1/batch/run-parallel. The summary is built from all active bullets published for that entity on the same UTC calendar day, not just the bullets from the current run. If multiple runs complete on the same day, each subsequent run produces a new summary reflecting the cumulative picture. A run that produces no active bullets of its own does not generate a summary, even if other runs on the same day have content. Narratives are retrieved separately via POST /api/v1/reports/narratives.

Running Locally

For the complete setup, all environment variables, and configuration options, see the project README.

Prerequisites

  • A Bigdata.com API key
  • An OpenAI API key
  • Docker (option A) or uv (option B)

Option A: Docker

docker build -t bigdata_briefs .

docker run -d \
  --name bigdata_briefs \
  -p 8000:8000 \
  -e BIGDATA_API_KEY=<your-bigdata-api-key> \
  -e OPENAI_API_KEY=<your-openai-api-key> \
  bigdata_briefs

Option B: uv

uv sync

cp .env.example .env
# Edit .env to set BIGDATA_API_KEY and OPENAI_API_KEY

uv run uvicorn bigdata_briefs.api.app:app --host 0.0.0.0 --port 8000

Verify

curl http://localhost:8000/health
The interactive Swagger UI is available at http://localhost:8000/docs when the service is started with ENABLE_DOCS=1 (it is off by default).

Ways to use Briefs

Briefs 2.0 can be used in three ways: through the Web App, the REST API, or an MCP server.

Web App

A desk-style Web App is served at http://localhost:8000/app/desk/ and is built around a custom portfolio you monitor daily. It has three areas:
  • The Brief: a reading view with a company picker, a top-movers portfolio brief, and an upcoming-events calendar. Clicking a company opens its tearsheet: the day’s bullets grouped by theme with inline source citations, an editorial narrative, signal and sentiment history, plus Audit and Archive tabs.
  • Portfolio: manage the tracked companies (the my_portfolio universe).
  • Costs: per-run cost forensics.
No code required. The app displays briefs but does not generate them: a run has to be triggered to populate and refresh it. There are two ways to keep it current:
  • Automatically, with the built-in cron job that fires every weekday morning when the service runs under Docker (ENABLE_CRON=1, see REST API).
  • On demand, by sending a run request to the running service.
The daily run refreshes the portfolio, so the briefs are ready as soon as you open the app.

MCP server

Briefs is exposed over the Model Context Protocol, so an assistant such as Claude can generate and read briefs through tools. Two servers are provided:
  • briefs-mcp: a thin client to a running service, so every run is persisted to that service’s database. Use it when you want a single, accumulating history of everything generated. As an entity builds up published bullets, the embedding-based novelty check (phase 4) catches repeats early, before the more expensive search-based stage, which makes subsequent runs both faster and cheaper.
  • briefs-mcp-stateless: self-contained, runs the pipeline in-process with no database. Nothing is written and there is no prior state to manage, so it is the simpler option when you do not care about what was generated before. Because it keeps no publication history, it falls back to search-only novelty: the embedding-based novelty check (phase 4) is skipped, and novelty is verified entirely against live evidence in phase 5.
Both expose start_briefs_run and get_run_results; the stateful server adds get_bullets and get_narratives. Configuration and client setup are in the project README.

REST API

The primary programmatic interface: trigger runs, control the date window, and retrieve published bullets and narratives over HTTP. Every capability of the pipeline is reachable here. For hands-off daily coverage, runs can also be scheduled with cron: the Docker image ships a built-in job (opt-in with ENABLE_CRON=1) that triggers run-parallel every weekday morning, keeping the Web App current. See the project README for setup and details. The full endpoint reference continues just below.

API Calls

Triggering runs

Two endpoints launch pipeline runs:
  • run-parallel: for one-off or scheduled batches.
  • scan: for backfilling a historical record.

Run Parallel

POST /api/v1/batch/run-parallel runs the pipeline for a set of entities. Target them in one of three ways:
  • entity_ids: a list of specific entities, e.g. "entity_ids": ["D8442A", "0157B1", "228D42"].
  • universe: a named universe, e.g. "universe": "dow_30".
  • Neither: omit both to run every entity in the database.
The window is set in one of two ways:
  • Explicit window. Pass force_window_start and force_window_end for a specific period. Wide windows are not recommended: one day is ideal (see Parameters).
  • Automatic (window_mode). When no forced dates are given, the start is computed from the entity’s run history (the end is always now), in one of two modes:
    • continuous (default): start exactly where the previous run ended, with no cap, so consecutive runs tile the timeline with no gaps. The first run on a new entity starts at UTC midnight of the current day.
    • update: cover the trailing 24 hours (72 on Mondays, UTC), resuming from the previous run with no overlap. Self-initializing: the first run on a new entity just works, and the same call the next day continues from where it left off. Ideal for hands-off daily coverage.
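The two automatic modes can be sketched as a function that computes the window start. This is one reading of the rules above, not the service's implementation: the function name is hypothetical, and treating update mode as "the later of the trailing cut-off and the previous run's end" is an assumption drawn from the phrase "resuming from the previous run with no overlap".

```python
from datetime import datetime, timedelta, timezone


def window_start(mode, now, previous_end=None):
    """Compute the start of an automatic window; the end is always `now`.

    `now` and `previous_end` must be timezone-aware datetimes.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)

    if mode == "continuous":
        if previous_end is not None:
            return previous_end  # resume exactly where the last run ended, no cap
        # First run on a new entity: UTC midnight of the current day.
        return now.replace(hour=0, minute=0, second=0, microsecond=0)

    if mode == "update":
        hours = 72 if now.weekday() == 0 else 24  # Monday (UTC) covers the weekend
        trailing = now - timedelta(hours=hours)
        # Assumed: never reach back past the previous run, so windows do not overlap.
        if previous_end is not None and previous_end > trailing:
            return previous_end
        return trailing

    raise ValueError(f"unknown window_mode: {mode!r}")
```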
Explicit window:
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": ["D8442A", "0157B1", "228D42"],
    "force_window_start": "2026-04-22T00:00:00",
    "force_window_end": "2026-04-22T23:59:59"
  }'
Automatic window (update mode):
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": ["D8442A"],
    "window_mode": "update"
  }'
In either mode, swap entity_ids for a "universe", or omit both to run the whole database.
Universe ID | Entities | Description
dow_30 | 30 | Dow Jones Industrial Average
eurostoxx_50 | 50 | Eurozone blue-chip index
top_eu_100 | 100 | European large caps
top_eu_500 | 500 | European wide large-cap universe
top_us_10 | 10 | Ten largest US listings
top_us_100 | 100 | Top US listings by market cap
top_us_500 | 500 | Broad US large-cap universe
my_portfolio | dynamic | Your custom watchlist, managed via the API or app (see Portfolio)
Beyond entity_ids / universe and the window, run-parallel accepts:
Parameter | Default | Description
window_mode | continuous | How to compute the window when no forced dates are given: continuous (resume from the last run’s end, no gaps) or update (the trailing 24h, 72h on Mondays).
categories | ["news"] | Source categories: news, news_premium.
force_overlap | false | Run even if the window overlaps an already-completed run for the entity (use to re-run or backfill).
generate_narrative | false | Also produce a one-sentence editorial summary per entity (see Narrative).
ranking_metric | null | When set (e.g. media_attention_momentum), generate a top-5 portfolio brief after the batch completes.
The full parameter reference is in the project README.
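For callers working from Python rather than curl, the same request can be assembled with the standard library. This is a minimal sketch: the helper name and `BASE_URL` constant are hypothetical, and rejecting a call that sets both entity_ids and universe is an assumption (the text describes them as alternatives but does not say how the service handles both). The field names are the ones documented above.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # local service address used in the examples above


def build_run_parallel_request(entity_ids=None, universe=None, **options):
    """Build (but do not send) a POST request for /api/v1/batch/run-parallel.

    `options` carries documented fields such as window_mode, categories,
    force_overlap, generate_narrative, or the force_window_* pair.
    """
    if entity_ids is not None and universe is not None:
        # Assumed to be mutually exclusive; omit both to run the whole database.
        raise ValueError("pass entity_ids or universe, not both")

    payload = dict(options)
    if entity_ids is not None:
        payload["entity_ids"] = list(entity_ids)
    if universe is not None:
        payload["universe"] = universe

    return Request(
        f"{BASE_URL}/api/v1/batch/run-parallel",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be passed to `urllib.request.urlopen` to send it to a running service.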
When multiple entities run in parallel, they share two process-wide limits: a 450 QPM cap on Bigdata API calls, enforced by a token bucket that blocks any pipeline thread trying to exceed the budget, and a connection semaphore that caps concurrent in-flight Bigdata requests to 40. OpenAI calls are not explicitly rate-limited by the service but are indirectly throttled by the entity concurrency setting (MAX_CONCURRENT_ENTITIES, default 10). For universe-scale runs, the throughput ceiling is the available QPM budget divided by the average search calls per entity per day. This ceiling can be raised by using multiple Bigdata API keys across separate service instances, each with its own 450 QPM budget.
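The two shared limits can be illustrated with a generic token bucket and a semaphore. This is a sketch of the general technique, not the service's own code: the class and function names are hypothetical, and the figure of 30 search calls per entity in the usage note is an assumed number chosen only to make the arithmetic concrete.

```python
import threading
import time


class TokenBucket:
    """Thread-safe token bucket: at most `rate_per_minute` acquisitions per minute on average."""

    def __init__(self, rate_per_minute, capacity=None, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens refilled per second
        self.capacity = float(rate_per_minute if capacity is None else capacity)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()
        self.lock = threading.Lock()

    def try_acquire(self):
        """Take one token if available; return False instead of waiting."""
        with self.lock:
            now = self.clock()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    def acquire(self, sleep=time.sleep):
        """Block the calling thread until a token is available."""
        while not self.try_acquire():
            sleep(1.0 / self.rate)


def limited_call(bucket, slots, fn):
    """Run fn() under both limits: the QPM budget and the in-flight connection cap."""
    bucket.acquire()
    with slots:
        return fn()


def max_entities_per_minute(qpm_budget, avg_calls_per_entity):
    """Throughput ceiling: available QPM divided by average search calls per entity."""
    return qpm_budget / avg_calls_per_entity


bigdata_bucket = TokenBucket(rate_per_minute=450)  # 450 QPM cap from the text
bigdata_slots = threading.BoundedSemaphore(40)     # 40 concurrent in-flight requests
```

Under the assumed 30 calls per entity, `max_entities_per_minute(450, 30)` gives a ceiling of 15 entities per minute for a single API key.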
Poll progress via the batch status endpoint:
curl http://localhost:8000/api/v1/batch/parallel/<batch_id>/status

Scan: build a historical record

Use POST /api/v1/scan when you need to build or backfill a historical record for a portfolio. It takes a single entity_id or a universe together with an explicit date range, splits the range into windows, and processes them sequentially, producing a separate brief per window.
  • Window boundary. By default each window spans one UTC calendar day (midnight to midnight). Set boundary_time (HH:MM UTC) to shift the daily split point: for example 13:30 aligns each window to the US market open, so each brief covers one trading session. Friday windows automatically extend through the weekend to Monday, producing five windows per week with no gaps.
  • Edges. start_time and end_time are optional: start_time sets the clock on start_date only (the opening of the first window); end_time sets the clock on end_date only (the close of the last window). When omitted, the first window opens at midnight and the last closes at boundary_time of the final day, producing a partial window at each edge.
  • Sources. Source categories are set with source_categories.
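The window-splitting rules can be sketched as follows. This is an illustration under assumptions rather than the service's logic: the function name is hypothetical, it takes explicit UTC start and end datetimes instead of modelling the start_date / end_date defaults, and it implements the weekend rule by skipping split points that fall on Saturday or Sunday, which is one way to make a Friday window run through to Monday.

```python
from datetime import datetime, time, timedelta, timezone


def split_windows(start, end, boundary=time(0, 0)):
    """Split [start, end) into consecutive windows that close at `boundary` (UTC).

    `start` and `end` must be timezone-aware UTC datetimes. Split points that
    fall on a weekend are skipped, so a window opened on Friday closes on Monday.
    """
    windows = []
    cursor = start
    while cursor < end:
        # Next split point strictly after the cursor.
        candidate = datetime.combine(cursor.date(), boundary, tzinfo=timezone.utc)
        if candidate <= cursor:
            candidate += timedelta(days=1)
        # Assumed weekend rule: no window closes on Saturday (5) or Sunday (6).
        while candidate.weekday() >= 5:
            candidate += timedelta(days=1)
        close = min(candidate, end)
        windows.append((cursor, close))
        cursor = close
    return windows
```

With a 13:30 boundary and a scan that starts at midnight, the first window returned is a partial one that closes at 13:30 on the first day, matching the edge behaviour described above.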
Historical range (midnight boundary, default):
curl -X POST http://localhost:8000/api/v1/scan \
  -H "Content-Type: application/json" \
  -d '{
    "universe": "dow_30",
    "start_date": "2026-04-01",
    "end_date": "2026-04-30",
    "source_categories": ["news"]
  }'
Market-open to market-open (13:30 UTC):
curl -X POST http://localhost:8000/api/v1/scan \
  -H "Content-Type: application/json" \
  -d '{
    "universe": "dow_30",
    "start_date": "2026-04-01",
    "end_date": "2026-04-30",
    "boundary_time": "13:30"
  }'
Up to now (omit end_date: splits into windows and runs until the current time):
curl -X POST http://localhost:8000/api/v1/scan \
  -H "Content-Type: application/json" \
  -d '{
    "universe": "dow_30",
    "start_date": "2026-04-01"
  }'
For each entity, the scan automatically resolves the effective start from the last completed run, so re-running a scan over an already-covered range is safe: windows that already have a run are skipped.
Poll per-entity, per-day progress:
curl "http://localhost:8000/api/v1/scan/status?entity_ids=D8442A,0157B1&start_date=2026-04-01&end_date=2026-04-30"

Parameters

Two parameters have the largest effect on both output quality and pipeline cost: the source set used during retrieval, and the date window each run covers. Both are worth configuring deliberately before running at scale.
Source selection. The available source categories are news (the default) and news_premium.
  • Premium sources produce cleaner input: content is more focused, and the stories that surface are more likely to be material for the entity. Fewer bullets are generated per run, and a higher proportion pass the relevance and novelty filters, so the cost per published bullet is lower because less work is discarded.
  • General news is useful for entities with limited premium coverage, where it can meaningfully increase recall and surface developments that would otherwise be missed. The tradeoff is more noise: more bullets are generated per run and a larger share are filtered out during relevance scoring and novelty verification, which increases both compute-token and grounding-token costs per published bullet. The effect is most pronounced for entities already well covered by premium sources, where adding general news brings in more duplicate and low-relevance content without a proportional gain in new information.
Date window. The date window controls how much news is retrieved per run. A 24-hour window is a good baseline: it keeps prompts focused, search coverage complete, and costs predictable. Wider windows degrade all three and also weaken temporal coherence:
  • Prompt size: all retrieved evidence is assembled into LLM prompts. More news means larger prompts, which can exceed context limits or reduce the model’s ability to focus on the most relevant material.
  • Search coverage: each query has a maximum result count. As the window grows, more relevant results may exist than the cap allows, so some developments in the period will not be represented.
  • Cost: wider windows produce more bullet candidates, each of which goes through relevance scoring and two novelty stages. Cost grows roughly in proportion to the volume of news in the window.
  • Temporal coherence: over a wide period, bullets may reflect different states of a developing situation. Within a 24-hour window this is rarely an issue; over multi-week windows it can produce briefs where earlier expectations and later outcomes sit side by side without a clear progression.
As an entity accumulates a publication history, per-run costs also decrease naturally: the embedding-based novelty check becomes increasingly effective at catching repeats early, before they reach the more expensive search-based verification stage.

Retrieving results

Three endpoints read back what the pipeline produced: published bullets, full pipeline detail (including discarded bullets), and editorial narratives.

Published bullets

POST /api/v1/reports/bullets returns published bullet points for one or more entities, grouped by run and ordered newest-first. Each bullet includes the final text, source citations (headline, chunk text, date), and novelty metadata. Bullets that are not fully novel (rewritten to carry new information alongside already-known context) are flagged with is_fully_novel: false. Use the optional max_runs to limit how many runs are returned per entity (1 for the latest run only, omit for all).
curl -X POST http://localhost:8000/api/v1/reports/bullets \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": ["D8442A", "0157B1"]
  }'
Passing an empty entity_ids list retrieves all entities in the database:
curl -X POST http://localhost:8000/api/v1/reports/bullets \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": []
  }'

Full pipeline detail

POST /api/v1/reports/bullets/detail returns every bullet considered by the pipeline, both published and discarded, for one or more entities. It accepts optional from_date and to_date filters (ISO 8601). For discarded bullets, it includes the stage that eliminated them and the reason:
  • relevance_score: scored too low on financial materiality
  • grounding: text not verifiable against cited sources
  • novelty_embedding: already reported in a previous run
  • novelty_search: per-claim verdicts with the evidence chunks that already covered the information
curl -X POST http://localhost:8000/api/v1/reports/bullets/detail \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": ["D8442A"],
    "from_date": "2026-04-01T00:00:00",
    "to_date": "2026-04-30T23:59:59"
  }'
As with reports/bullets, passing an empty entity_ids list returns detail for all entities in the database.

Narratives

POST /api/v1/reports/narratives returns the per-entity editorial summaries produced when a run was launched with generate_narrative: true. Accepts entity_ids or universe (mutually exclusive; omit both for all entities), plus optional from_date / to_date (ISO 8601) to bound the report-date range. Results are newest-first.
curl -X POST http://localhost:8000/api/v1/reports/narratives \
  -H "Content-Type: application/json" \
  -d '{
    "entity_ids": ["D8442A", "0157B1"],
    "from_date": "2026-04-01"
  }'

Portfolio

my_portfolio is a special, database-backed universe: a custom watchlist you curate once and then run like any other universe. Unlike the pre-defined universes (static CSV files), it reflects live state, so edits take effect on the next run. It backs the Web App and can be passed as "universe": "my_portfolio" anywhere a universe is accepted: run-parallel, scan, and the report endpoints.

Members are managed over HTTP. The write endpoints (add and remove) require the API key when PIPELINE_API_KEY is set.

List the portfolio:
curl http://localhost:8000/api/frontend/portfolio
Add entities (name and ticker are resolved automatically if the entity has already been processed):
curl -X POST http://localhost:8000/api/frontend/portfolio \
  -H "Content-Type: application/json" \
  -d '{"entity_ids": ["0157B1", "D8442A", "228D42"]}'
Remove entities (one or many in a single call):
curl -X DELETE http://localhost:8000/api/frontend/portfolio \
  -H "Content-Type: application/json" \
  -d '{"entity_ids": ["0157B1", "D8442A"]}'
Both add and remove take the same entity_ids list and return a per-entity results list (added / already_exists on add, removed / not_found on remove).

Run it like any universe:
curl -X POST http://localhost:8000/api/v1/batch/run-parallel \
  -H "Content-Type: application/json" \
  -d '{"universe": "my_portfolio", "window_mode": "continuous"}'

Large-Scale Portfolio Generation

For generating briefs for large portfolios (hundreds of companies), see the Large-Scale Portfolio Briefs Generation notebook in the bigdata-cookbook repository. This notebook demonstrates how to:
  • Process large numbers of companies in configurable batches
  • Load company identifiers from CSV files
  • Monitor batch processing with status polling
  • Export results to JSON and Excel formats
The notebook uses this briefs service API to generate briefs programmatically, making it ideal for portfolio managers and analysts who need to monitor many companies simultaneously. The service can handle large numbers of companies in a single request, and the notebook shows how to organize batch processing for scheduling across time zones or running concurrent service instances.