Why It Matters

Traditional RAG and search workflows are built for a handful of entities at a time: you point at a few tickers or names, run a query, get results. Scaling to a full universe (e.g. Global All-Cap, ~10,000 names) means orchestrating thousands of HTTP requests: rate limits (QPS caps), connection pools, retries, backoff, and timeouts become your problem. You either throttle down and wait, or you push QPS and risk 429s and failed runs.

What It Does

The Batch Search API removes that burden. You submit one file with your queries. The service processes them asynchronously and returns one result file. No client-side loops, no QPS management, no thousands of round-trips.

How It Works

This guide walks you through a complete workflow:
  • Load a universe from a CSV and configure topic and time range
  • Build queries and submit a single batch job (create, upload JSONL, poll, download)
  • Post-process results with by-query chunk assignment and aggregate scores
  • Inspect top positive and negative companies, top chunks for the most negative, and a sector–country heatmap as a bottom-up macro indicator

A Real-World Use Case

We ask how the Global All-Cap universe (~10,000 companies) has been affected by a given topic (e.g. Trump administration policies) over a recent window (e.g. the last six months). The goal is a ranked list of top-affected names and bottom-up macro signals (a sector–country heatmap) grounded in document-level evidence. The topic used in this guide is “The company is affected by the Trump administration’s policies” (configurable in section 3). Ready to get started? Let’s dive in!

1. Setup

Load dependencies and paths. Authentication uses BIGDATA_API_KEY from .env. We also define where the universe CSV and result files live.
import os
import json
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv

load_dotenv(".env")

import sys
sys.path.insert(0, str(Path(".").resolve()))
from batch_api_client import BatchAPIClient, poll_until_complete, MetricsTracker

NOTEBOOK_DIR = Path(".").resolve()
CSV_PATH = NOTEBOOK_DIR / "global_all_caps.csv"
RESULTS_DIR = NOTEBOOK_DIR / "results"
RESULTS_DIR.mkdir(exist_ok=True)

2. Load the Global All-Cap universe

Read the universe CSV. Columns RP_ENTITY_ID, COMPANY_NAME, COUNTRY, and SECTOR are used to build queries (10 entities per query) and later to join results for the sector–country heatmap.
df_universe = pd.read_csv(CSV_PATH)
print(f"Loaded {len(df_universe):,} companies. Columns: {list(df_universe.columns)}")
df_universe.head(10)
Loaded 10,000 companies. Columns: ['RP_ENTITY_ID', 'COMPANY_NAME', 'COUNTRY', 'SECTOR']

3. Configuration: topic and time range

Set the search topic (e.g. companies affected by the Trump administration’s policies) and the time window (ISO timestamps). These are applied to every query in the batch.
TOPIC = "The company is affected by the Trump administration's policies"

from datetime import datetime, timedelta

TIME_END = datetime.now()
TIME_START = (TIME_END - timedelta(days=182)).strftime("%Y-%m-%dT%H:%M:%S")
TIME_END = TIME_END.strftime("%Y-%m-%dT%H:%M:%S")

print(f"Topic: {TOPIC}")
print(f"Time range: {TIME_START} to {TIME_END}")

4. Build queries

Here we generate one query per group of 10 companies from the universe, using entity.any_of to provide the list of entity IDs. Each query uses the same topic, filters, and time window. This is specific to this example; the service can handle queries with entirely different filters, topics, and time ranges within the same batch job. For a universe of 10,000 companies, this approach yields 1,000 total queries, all sent together in a single batch job.

In principle you could generate one query per company (10,000 queries), but that can substantially increase total latency, and many smaller companies may return few or no hits. Batching entities (e.g. 10 per query) keeps the job size manageable. An optimal approach is to group companies with similar levels of media attention together within each query, so high-visibility names do not dominate and smaller companies have a better chance of returning relevant results.

The filter settings here aim for precision: an entity control filter for Trump (22C3AF) and a reranker threshold of 0.7 favor more relevant results but may reduce recall. Other use cases may opt for higher recall and apply verification (e.g. via LLMs) in post-processing.
ENTITIES_PER_QUERY = 10
MAX_CHUNKS_PER_QUERY = 20

def build_query(entity_ids, topic, time_start, time_end, max_chunks=100):
    return {
        "auto_enrich_filters": False,
        "text": topic,
        "filters": {
            "timestamp": {"start": time_start, "end": time_end},
            "entity": {"any_of": entity_ids, "all_of": ["22C3AF"]}
        },
        "ranking_params": {"source_boost": 0, "freshness_boost": 0, "reranker": {"enabled": True, "threshold": 0.7}},
        "max_chunks": max_chunks,
    }

n = len(df_universe)
queries = []
for start in range(0, n, ENTITIES_PER_QUERY):
    chunk = df_universe.iloc[start : start + ENTITIES_PER_QUERY]
    entity_ids = chunk["RP_ENTITY_ID"].astype(str).tolist()
    queries.append(build_query(entity_ids, TOPIC, TIME_START, TIME_END, MAX_CHUNKS_PER_QUERY))
query_lines = [{"query": q} for q in queries]
num_queries = len(query_lines)
print(f"Total queries: {num_queries:,} (each with {ENTITIES_PER_QUERY} entities) → one batch job")
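The attention-balanced grouping suggested above could be sketched as follows. This is an illustration under assumptions, not part of the guide's pipeline: the media_volume column (e.g. a historical chunk count per entity) is hypothetical and does not exist in the real universe CSV.

```python
import pandas as pd

ENTITIES_PER_QUERY = 3  # 3 for illustration; the guide uses 10

# Hypothetical universe with an assumed media_volume column
df_universe = pd.DataFrame({
    "RP_ENTITY_ID": ["A1", "B2", "C3", "D4", "E5", "F6"],
    "media_volume": [950, 12, 430, 7, 880, 45],
})

# Sort by attention so each consecutive slice groups similarly covered
# names: high-visibility entities share queries with each other and
# cannot crowd out small caps in the per-query chunk budget.
ids_by_attention = (
    df_universe.sort_values("media_volume", ascending=False)["RP_ENTITY_ID"]
    .astype(str)
    .tolist()
)
stratified_batches = [
    ids_by_attention[i : i + ENTITIES_PER_QUERY]
    for i in range(0, len(ids_by_attention), ENTITIES_PER_QUERY)
]
print(stratified_batches)
```

Each inner list would then be passed to build_query in place of the positional slices used above.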

5. Submit one batch job

This section handles the end-to-end process of submitting a batch search job to the API. The steps are:
  • Create a new batch job on the API. You receive a batch job ID and a presigned URL to upload the queries.
  • Upload the queries (packed as a JSONL string) using the provided URL.
  • Poll until the batch job status is marked complete.
  • Extract the final output location from the API’s status response.
  • Download the results from the provided URL and save them to disk (e.g. in results/, using a filename based on the batch ID).
This workflow assumes all queries are uploaded at once; the Batch API handles them asynchronously. There is no client-side rate limiting and there are no request loops: server-side batching is responsible for resource throttling and progress management.
api_key = os.getenv("BIGDATA_API_KEY")
if not api_key:
    raise ValueError("Set BIGDATA_API_KEY in .env")

jsonl_content = ("\n".join(json.dumps(line) for line in query_lines) + "\n").encode("utf-8")

client = BatchAPIClient(base_url="https://api.bigdata.com", api_key=api_key)
try:
    batch_id, presigned_url, _ = client.create_batch_job()
    print(f"Created batch: {batch_id}")
    client.upload_queries_file(presigned_url, jsonl_content)
    print(f"Uploaded {num_queries:,} queries.")
    metrics = MetricsTracker(batch_id, num_queries)
    status_data = poll_until_complete(client, batch_id, metrics, poll_interval=15)
    RESULTS_PATH = RESULTS_DIR / f"batch_{batch_id}_results.jsonl"
    data = status_data.get("data") if isinstance(status_data.get("data"), dict) else status_data
    if isinstance(data.get("body"), str):
        try:
            data = json.loads(data["body"])
        except json.JSONDecodeError:
            pass
    output_file = data.get("outputFile") or data.get("output_file") or {}
    output_url = output_file.get("url") if isinstance(output_file, dict) else None
    if not output_url:
        output_url = data.get("downloadUrl") or data.get("download_url")
    if output_url:
        results_bytes, _ = client.download_results(output_url)
        RESULTS_PATH.write_bytes(results_bytes)
        print(f"Results saved: {RESULTS_PATH}")
finally:
    client.close()

6. Post-processing: deduplicate chunks, entity detections, join to universe, aggregate

In this step we post-process the batch API results to obtain entity-level metrics. The goal is to combine the quantity of media attention (how many relevant chunks mention each entity) with the qualitative sentiment of those mentions.

We first deduplicate chunks by their unique identifiers (each (doc_id, cnum) pair) so each passage is counted once. We assign each chunk only to entities that were in the query’s entity.any_of for the response line that returned that chunk, and that also appear in the chunk’s detections and in our universe. That avoids co-mention bias: each chunk contributes only to the entities we asked about in that query. We then cross-reference entity IDs with the universe to attach SECTOR, COUNTRY, and COMPANY_NAME, and build a long table in which each row is a chunk–entity pair with relevance and sentiment.

The core aggregation is a score per entity: the sum of (sentiment × relevance) over all associated chunks. That captures not only how often an entity is mentioned but how strongly and in which direction, and it weights mentions by relevance to the search theme. The result is a wide table: one row per entity (with at least one chunk), with score and metadata. Because we sum (not average) sentiment × relevance, scores are unbounded and reflect total weighted sentiment. This table supports rankings and sector–country heatmaps where each cell reflects both strength and breadth of media attention.
path_to_use = RESULTS_PATH  # or pick latest from RESULTS_DIR
universe_ids = set(df_universe["RP_ENTITY_ID"].astype(str))
long_rows = []
seen = set()

with open(path_to_use) as f:
    for line_idx, line in enumerate(f):
        rec = json.loads(line)
        q = query_lines[line_idx].get("query", {}) if line_idx < len(query_lines) else {}
        query_ids = {str(e) for e in ((q.get("filters") or {}).get("entity") or {}).get("any_of") or []}
        for doc in rec.get("response", {}).get("results") or rec.get("results") or []:
            doc_id = doc.get("id", "")
            for ch in doc.get("chunks") or []:
                rel, sent, cnum = ch.get("relevance"), ch.get("sentiment"), ch.get("cnum")
                if rel is None or sent is None:
                    continue
                for d in ch.get("detections") or []:
                    if d.get("type") != "entity" or not (eid := d.get("id")) or eid not in universe_ids or eid not in query_ids:
                        continue
                    key = (eid, doc_id, cnum)
                    if key not in seen:
                        seen.add(key)
                        long_rows.append({"entity_id": eid, "relevance": rel, "sentiment": sent})
                        break

df_long = pd.DataFrame(long_rows)
entity_score = df_long.groupby("entity_id").apply(lambda g: (g["sentiment"] * g["relevance"]).sum(), include_groups=False).rename("score")
entity_volume = df_long.groupby("entity_id").size().rename("volume")
entity_agg = pd.concat([entity_score, entity_volume], axis=1).reset_index().rename(columns={"entity_id": "RP_ENTITY_ID"})
df_metrics = entity_agg.merge(df_universe[["RP_ENTITY_ID", "COMPANY_NAME", "SECTOR", "COUNTRY"]], on="RP_ENTITY_ID", how="inner")

7. Results and visuals

Top 5 positive and top 5 negative by score

A single, broad topic prompt was enough to surface 1,847 companies (about 19% of the top 10,000 publicly listed globally) in chunks linked to Trump policies, illustrating the approach’s value for global screening. To highlight the most notable cases, we showcase the top 5 positive and top 5 negative entities by score. The chart below presents these results, with the five most negative scores (shown in red, on the left) and the five most positive scores (shown in green, on the right). The score for each company is calculated by summing the product of sentiment and relevance across all its chunks.

[Figure: Top 5 positive and top 5 negative companies by score]
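A minimal sketch of producing that ranking (with made-up names and scores standing in for the df_metrics table built in section 6):

```python
import pandas as pd

# Hypothetical stand-in for df_metrics from the post-processing step
df_metrics = pd.DataFrame({
    "COMPANY_NAME": ["Alpha", "Beta", "Gamma", "Delta", "Epsilon", "Zeta"],
    "score": [4.1, -6.3, 0.2, -2.8, 3.0, -0.5],
})

# Sort ascending: most negative names first, most positive last
ranked = df_metrics.sort_values("score", ascending=True)
bottom5 = ranked.head(5)            # most negative first
top5 = ranked.tail(5).iloc[::-1]    # most positive first

print(bottom5[["COMPANY_NAME", "score"]])
print(top5[["COMPANY_NAME", "score"]])
```

With the real table, the two slices feed directly into a diverging bar chart.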

Top chunks for the most negative company

To explore what underlies the scores, we can select and display the top chunks (ranked by relevance) for the company with the lowest aggregate score. Each row shows the full chunk text along with relevance and sentiment for that chunk. Sentiment is computed at the chunk level. Additional weighting or advanced techniques can be applied to improve entity-level accuracy and interpretability.

Dominion Energy Inc. (top 5 chunks)

Relevance | Sentiment | Chunk Text
0.985 | -0.85 | Dominion Energy (NYSE:D) closed -2.7% on Monday after the Trump administration ordered Danish wind farm owner and operator Ørsted to halt all activities on its Revolution Wind project off the coast of Rhode Island, which is under construction
0.975 | -0.63 | 1% this hour. Independent power producers, Vistra and Constellation Energy are both down right now. Constellation by as much as 10%. Dominion Energy can restart construction of a wind project off the coast of Virginia while it continues a legal fight over the Trump administration’s order to stop the $11 billion development. A federal judge issued a preliminary injunction today blocking the government from enforcing its stop work order after Dominion claimed it was suffering irreparable harm. The company says it’s losing millions every day that project sits idle. And OpenAI will start testing advertisements in its chat GPT app, marking a major shift for the company as it seeks to bolster revenue and offset some of the costs associated with building and supporting AI. The ads will appear in the coming weeks for some U.S. users who use its free version as well as its newer low-cost go plan. That one costs about eight bucks a month. More expensive plans will go ad-free.
0.968 | -0.79 | Investment management platform Builder Clearwater Analytics added 8% after agreeing to be acquired by private equity firms Permira and Warburg Pincus for $8.4 billion, including debt. And now, Monday’s Unfortunates. Trump Media and Technology Group lost nearly 14%. Dominion energy shares fell nearly 5% after the Trump administration halted five East Coast wind projects, including Dominion Energy’s coastal Virginia offshore wind. In Stokart, parent Maple Bear dropped more than 3% after it said it would end the use of artificial intelligence-driven pricing tests for its grocery platform. The pricing tests had caused some customers to pay more for identical items in the same store than other customers. Shares of Dollar Tree fell 4.2%, Meme Stock GameStop dropped 3.5%, and Honeywell shares fell more than 1% after a regulatory filing stated that it expects to take a one-time charge in the fourth quarter, lowering gap sales by $310 million, and operating income by $370 million. That’s all for today. My name is Tony Jackson, and we’ll talk more stocks tomorrow.
0.946 | -0.72 | Dominion Energy (D) shares fell 3.7%. The Trump administration paused the lease for the company’s Coastal Virginia Offshore Wind project, along with four other projects under construction in the US, citing national security risks.
0.946 | -0.80 | On Monday, the Trump administration ordered Dominion Energy Inc. (NYSE:D), which owns the Coastal Virginia Offshore Wind, alongside other wind developers, to temporarily halt their construction of wind projects amid national security concerns raised by the Pentagon.
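Selecting such chunks can be sketched by re-walking the result lines for the lowest-scoring entity and ranking its chunks by relevance. This sketch uses tiny made-up records that mirror the structure parsed in section 6; the chunk-level "text" field name is an assumption.

```python
import json

# Hypothetical response lines mirroring the section 6 structure;
# the "text" field name is an assumption
lines = [
    json.dumps({"response": {"results": [{"id": "doc1", "chunks": [
        {"cnum": 0, "relevance": 0.98, "sentiment": -0.85,
         "text": "Regulator halts project.",
         "detections": [{"type": "entity", "id": "ENT1"}]},
        {"cnum": 1, "relevance": 0.62, "sentiment": -0.40,
         "text": "Shares slipped in late trading.",
         "detections": [{"type": "entity", "id": "ENT2"}]},
    ]}]}}),
    json.dumps({"response": {"results": [{"id": "doc2", "chunks": [
        {"cnum": 0, "relevance": 0.75, "sentiment": -0.50,
         "text": "Costs rise as work stalls.",
         "detections": [{"type": "entity", "id": "ENT1"}]},
    ]}]}}),
]

target = "ENT1"  # the entity with the lowest aggregate score
chunks = []
for line in lines:
    rec = json.loads(line)
    for doc in rec.get("response", {}).get("results", []):
        for ch in doc.get("chunks", []):
            # Keep chunks whose entity detections include the target
            ids = {d.get("id") for d in ch.get("detections", [])
                   if d.get("type") == "entity"}
            if target in ids:
                chunks.append(ch)

# Rank the target's chunks by relevance and keep the top 5
top_chunks = sorted(chunks, key=lambda c: c.get("relevance", 0.0),
                    reverse=True)[:5]
for ch in top_chunks:
    print(f'{ch["relevance"]:.3f}  {ch["sentiment"]:+.2f}  {ch["text"]}')
```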

Bottom-up macro indicators

The large-scale batch is especially powerful here. We aggregate entity-level scores by sector and country to create bottom-up macro indicators. We build a heatmap using all sectors and a G12 selection of countries. Each cell shows the sum of score over entities in that sector–country combination. Color scale: blue = negative, white = neutral, red = positive.

For a topic like “how companies are affected by the Trump administration’s policies,” this heatmap surfaces which sector–country combinations are most negatively or positively associated with media coverage and sentiment. Sectors and countries with strongly negative scores may indicate higher exposure to regulatory, tariff, or policy risk; strongly positive ones may reflect perceived benefits or resilience. The view is built from entity-level evidence rather than pre-aggregated statistics, so it stays grounded in the underlying documents and chunks returned by the Batch API.

[Figure: Sector–country heatmap (sum of score)]
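The sector–country aggregation behind the heatmap can be sketched as follows, using a tiny made-up df_metrics in place of the real table from section 6:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe on headless machines
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical entity-level metrics (stand-in for df_metrics)
df_metrics = pd.DataFrame({
    "SECTOR": ["Energy", "Energy", "Tech", "Tech"],
    "COUNTRY": ["US", "DE", "US", "DE"],
    "score": [-3.2, -1.1, 2.5, 0.4],
})

# Sum scores into one cell per sector-country pair
pivot = df_metrics.pivot_table(index="SECTOR", columns="COUNTRY",
                               values="score", aggfunc="sum")

# Diverging colormap centered at 0: blue = negative, red = positive
fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(pivot, cmap="coolwarm", center=0, annot=True, fmt=".1f", ax=ax)
ax.set_title("Sector-country heatmap (sum of score)")
plt.tight_layout()
```

With the real df_metrics, restricting columns to the G12 selection is a simple filter on COUNTRY before pivoting.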

Practical Applications

This kind of enterprise-strength data processing opens up possibilities that simply aren’t available when you’re limited to a few searches:
  • Global All-Cap thematic screening: Run one batch job to score how the full Global All-Cap universe (or any large universe) is affected by a topic, e.g. policy shifts, tariffs, or sector-specific events, then rank companies and build sector–country heatmaps from document-level evidence, as demonstrated in this guide.
  • Comprehensive market sentiment: Track sentiment across the entire Russell 1000 or any equity universe on a regular cadence. Quickly identify names where coverage is turning positive or negative, and aggregate to sector or country level for a top-down view.
  • Macro and country-level monitoring: Screen sovereign entities, central banks, or country-tagged news to build bottom-up macro indicators, e.g. policy-risk scores per country, fiscal-sentiment heatmaps, or cross-border contagion signals, without relying on pre-aggregated macro data.
  • Supply chain and geopolitical risk: Detect early warnings across global supply networks by screening thousands of suppliers, logistics providers, and raw-material producers in one batch. Flag concentration risk by geography or sector before disruptions propagate.
  • Competitive intelligence: Monitor every competitor and adjacent industry simultaneously. Score how a product launch, regulatory ruling, or M&A rumour affects each name in a peer group and compare exposure across sectors.
  • Regulatory and policy impact: Analyze how new legislation, sanctions, or trade policy affects every sector at once. Aggregate entity-level scores to a sector–country heatmap to see where regulatory risk concentrates.
  • Extension to other asset classes: The same workflow applies beyond equities. Screen commodities (e.g. metals, energy, agriculture) or FX pairs by replacing entity IDs with commodity or currency entities and adjusting the topic. For example, score how tariff announcements affect the full commodity complex, or how central-bank rhetoric shifts sentiment across G10 currencies. One batch job, same post-processing pipeline.

Summary

This guide demonstrates an end-to-end workflow for analyzing how the global all-cap universe is affected by a configurable topic (here, Trump administration policies). We load ~10,000 companies, build queries (e.g. 10 entities per query), and submit a single batch job: one upload, one download, no client-side rate-limit handling. Post-processing assigns chunks by query (avoiding co-mention bias), aggregates entity scores as the sum of sentiment × relevance, and joins results to sector and country metadata.

The same batch output supports analysis at multiple levels: entity-level (rank companies, drill into top chunks), sector-level and country-level (aggregate exposure by industry or geography), and sector–country (heatmap tying both dimensions together). We produce a ranked list of companies (top positive and negative), a selection of top chunks for the most negative entity, and a sector–country heatmap as a bottom-up macro indicator.

Potential enhancements. This workflow is basic but can be extended: e.g. smart batching or post-processing to correct volume bias (oversampling low-coverage entities, normalizing scores by expected coverage, or stratified sampling). Verification LLM layers can filter false positives. Entity-level refinement (mention-level extraction or advanced techniques) can improve accuracy and interpretability.

For full code and to run the analysis yourself, see the Batch Search API notebook in the bigdata-cookbook repository (Batch_Search_API.ipynb). For API details, refer to the Batch Search API reference in the Bigdata documentation.
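One simple variant of the coverage normalization mentioned among the enhancements (an illustration, not the guide's implementation) is to divide each entity's summed score by its chunk volume, yielding an average weighted sentiment per chunk:

```python
import pandas as pd

# Hypothetical per-entity scores and chunk volumes
df = pd.DataFrame({
    "RP_ENTITY_ID": ["E1", "E2", "E3"],
    "score": [-6.0, 2.0, 0.5],
    "volume": [30, 4, 1],
})

# Average weighted sentiment per chunk: heavily covered names are no
# longer rewarded (or punished) for raw coverage alone
df["score_per_chunk"] = df["score"] / df["volume"]
```

This trades the "breadth of attention" signal for comparability across entities with very different media footprints; which is preferable depends on the use case.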