Why It Matters

Large-universe screening is easy to describe and expensive to run naively: take a theme, apply it to thousands of companies, split the time range into windows, and collect the Search results. That full-grid pattern is comprehensive and dependable, but it allocates the same request budget to every entity and period even when coverage is highly uneven. Smart Batching adds a planning layer before execution. It estimates where the content is likely to be, groups similar-volume entities together, splits high-volume periods more carefully, and then allocates Search budget proportionally. Standard Bigdata.com Search still does the retrieval and ranking; Smart Batching decides how to spend the requests when the universe is large.

What It Does

This cookbook walks through a complete Smart Batching workflow:
  • Load a company universe from CSV.
  • Build a reusable search plan with plan_search().
  • Execute that plan at a chosen chunk_percentage.
  • Deduplicate documents and convert chunks to a DataFrame.
  • Save and reload plans so different sampling levels can be tested without replanning.
  • Interpret benchmark results across speed, relevance distribution, and coverage.

How It Works

Smart Batching uses a plan-then-execute flow.

Step 1: Plan

plan_search() has two planning substeps.

Step 1a: Group entities into volume baskets

The planner queries the Bigdata.com Search Volume API to estimate how many chunks each entity will produce for the topic over the full time range. It then groups entities into baskets based on expected volume: low-volume entities can be merged together, while high-volume entities can be isolated or placed in smaller baskets. This preserves the cross-sectional distribution of the universe, so companies are not flattened into a single bucket and high-coverage names do not crowd out the long tail.

Step 1b: Break each basket down by time

For each basket, the planner pulls the expected-volume time series and partitions the full date range into batch periods. High-volume periods get shorter windows; low-volume periods can use longer windows. This preserves temporal structure in the same way Step 1a preserves cross-sectional structure: periods with heavy coverage get their own batches, and quiet periods are consolidated so execution does not spend unnecessary requests on sparse windows.

The result is a search plan with expected chunk counts and basket definitions. You can inspect it before making the larger retrieval run.
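To make the basket-grouping idea concrete, here is a minimal conceptual sketch. It is not the library's algorithm: the thresholds, the greedy merge, and the toy volume numbers are all illustrative assumptions.

# Conceptual sketch only -- not the bigdata-smart-batching implementation.
# Assumes per-entity expected chunk counts are already known (here faked
# as a plain dict; in the real flow they come from the Search Volume API).
expected_volume = {"ACME": 4200, "BETA": 35, "GAMMA": 12, "DELTA": 3900, "EPS": 8}

HIGH_VOLUME = 1000    # illustrative threshold: isolate heavy entities
TARGET_BASKET = 500   # illustrative target volume per merged basket

baskets = []
current, current_total = [], 0

for entity, volume in sorted(expected_volume.items(), key=lambda kv: kv[1]):
    if volume >= HIGH_VOLUME:
        # Step 1a: a high-volume entity gets its own basket so it
        # cannot crowd out the long tail.
        baskets.append([entity])
        continue
    # Step 1a: low-volume entities are merged until the basket is full.
    current.append(entity)
    current_total += volume
    if current_total >= TARGET_BASKET:
        baskets.append(current)
        current, current_total = [], 0

if current:
    baskets.append(current)

for basket in baskets:
    print(basket)

Step 1b would then apply the same logic along the time axis within each basket, splitting busy periods into shorter windows.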

Step 2: Execute

execute_search() turns the plan into Search API calls. The chunk_percentage parameter controls the retrieval budget:
  • 1.0 requests the full planned budget.
  • 0.1 requests about 10% of expected chunks per basket.
  • 0.01 requests about 1% of expected chunks per basket.
Because the Search API returns ranked chunks, reducing the per-basket budget focuses execution on the higher-ranked part of each basket while keeping the universe and time structure represented.
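The per-basket scaling is simple proportional arithmetic. As a sketch of the effect (the exact rounding and minimum rules inside execute_search() are assumptions here, not documented behavior):

# Illustration of how a chunk_percentage budget scales per basket.
# The real rounding/minimum rules in execute_search() may differ.
expected_chunks_per_basket = [1200, 450, 90, 15]

for pct in (1.0, 0.1, 0.01):
    budgets = [max(1, round(n * pct)) for n in expected_chunks_per_basket]
    print(f"chunk_percentage={pct}: {budgets} (total {sum(budgets)})")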

1. Setup

The notebook in the cookbook repository is the best way to run the full workflow. From the Smart_Batching directory, create a local environment, install the cookbook requirements, and launch the notebook:
uv venv
uv pip install -r requirements.txt
uv run jupyter lab test_smart_batching.ipynb
If you are adding the library to your own Python project, install it with uv:
uv add bigdata-smart-batching
Create a .env file in the Smart_Batching directory:
BIGDATA_API_KEY=your_api_key_here
BIGDATA_API_BASE_URL=https://api.bigdata.com
Load BIGDATA_API_BASE_URL before importing bigdata_smart_batching. The package reads the base URL at import time, so restart the notebook kernel if you change it.
import os
from pathlib import Path

from dotenv import load_dotenv

env_path = Path.cwd() / ".env"
if env_path.exists():
    load_dotenv(env_path)

api_base_url = os.getenv("BIGDATA_API_BASE_URL", "https://api.bigdata.com")
os.environ["BIGDATA_API_BASE_URL"] = api_base_url

from bigdata_smart_batching import (
    convert_to_dataframe,
    deduplicate_documents,
    execute_search,
    load_plan,
    load_universe_from_csv,
    plan_search,
    save_plan,
)

2. Configure a search

The example notebook uses a customer-confidence screen over the historical US top-company universe.
api_key = os.getenv("BIGDATA_API_KEY")
if not api_key:
    raise ValueError("Set BIGDATA_API_KEY in your .env file")

search_text = "Decline in customer confidence in the company"
universe_csv = "id_name_mapping_us_top_3000.csv"
start_date = "2021-01-01"
end_date = "2021-06-30"
chunk_percentage = 0.1
You can load the universe directly from CSV before planning:
companies = load_universe_from_csv(universe_csv)
print(f"Loaded {len(companies):,} companies")

3. Create a search plan

Planning estimates the expected retrieval footprint before Search execution begins. Step 1a creates volume baskets across entities, and Step 1b splits those baskets across time so both cross-sectional and temporal structure are represented. In the captured notebook run, the planner loaded 4,731 entity IDs, found chunks for 3,745 companies, estimated 69,426 chunks, and created 77 baskets.
plan = plan_search(
    text=search_text,
    universe=universe_csv,
    start_date=start_date,
    end_date=end_date,
    api_key=api_key,
    api_base_url=api_base_url,
    volume_query_mode="iterative",
    max_iterations_per_batch=10,
)

print(f"Expected chunks: {plan['total_expected_chunks']:,}")
print(f"Baskets: {len(plan['baskets']):,}")
Save the plan when you want to rerun the same screen at different sampling levels:
save_plan(plan, "customer_confidence_plan.json")
plan = load_plan("customer_confidence_plan.json")

4. Execute and deduplicate

Execution runs the Search calls represented by the plan. Use chunk_percentage to select the retrieval budget.
raw_results = execute_search(
    search_plan=plan,
    chunk_percentage=chunk_percentage,
    requests_per_minute=100,
)

results = deduplicate_documents(raw_results)
print(f"Retrieved {len(results):,} deduplicated documents")
Then convert the output into a chunk-level DataFrame:
df_chunks = convert_to_dataframe(results)
df_chunks.head()
In the captured 10% notebook run, execution returned 6,173 raw documents, 4,958 deduplicated documents, and completed the Search phase in about 47 seconds.
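A typical next step is entity-level aggregation over the chunk DataFrame. A small sketch; the column names "entity_name" and "relevance" are assumptions, so check df_chunks.columns for the actual schema produced by convert_to_dataframe():

# Count chunks and average relevance per entity. Column names are
# assumed for illustration; adjust to the real df_chunks schema.
counts = (
    df_chunks.groupby("entity_name")
    .agg(chunks=("relevance", "size"), mean_relevance=("relevance", "mean"))
    .sort_values("chunks", ascending=False)
)
print(counts.head(10))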

5. Choose chunk_percentage

Treat chunk_percentage as a budget control:
  • Use 0.01 for alerting, LLM verification, and signal-first workflows where the highest-ranked chunks are enough.
  • Use 0.1 for balanced research workflows, entity-level scoring, trend analysis, and dashboards.
  • Use 1.0 when completeness matters more than latency, while still benefiting from volume-aware grouping.
The right setting depends on the downstream task. A portfolio-monitoring alert can usually favor precision and speed. A research dashboard may need more moderate-relevance material to support aggregation and auditability.
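Because plans are saved and reloaded independently of execution, you can sweep several budgets against the same plan without replanning. A quick sketch using only the functions shown above:

# Planning runs once; only execution is repeated per budget level.
plan = load_plan("customer_confidence_plan.json")

for pct in (0.01, 0.1, 1.0):
    raw = execute_search(
        search_plan=plan,
        chunk_percentage=pct,
        requests_per_minute=100,
    )
    docs = deduplicate_documents(raw)
    print(f"chunk_percentage={pct}: {len(docs):,} deduplicated documents")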

6. Benchmark Results

The benchmark compared a full-grid baseline with Smart Batching at 100%, 10%, and 1% across three screening topics. Timing includes planning and execution for Smart Batching.
Figure: benchmark execution time comparison for full grid and Smart Batching
  • Tariffs China 2025: full grid took 452.2 seconds. Smart Batching took 42.5 seconds at 100%, 32.0 seconds at 10%, and 30.9 seconds at 1%.
  • Leadership 2023: full grid took 323.2 seconds. Smart Batching took 20.7 seconds at 100%, 10.2 seconds at 10%, and 9.4 seconds at 1%.
  • Confidence Decline 2021: full grid took 451.6 seconds. Smart Batching took 20.9 seconds at 100%, 11.0 seconds at 10%, and 10.2 seconds at 1%.
The scatter plots show how relevance distributions change as the retrieval budget is reduced. The main pattern is not that every Smart Batching setting has higher average relevance; at 100%, averages can be lower because the grouping and allocation differ from the full-grid baseline. The useful signal is that lower chunk_percentage settings concentrate the returned set into higher-ranked chunks while preserving date coverage.
Figure: Tariffs China 2025 date versus relevance scatter plots
Figure: Leadership 2023 date versus relevance scatter plots
Figure: Confidence Decline 2021 date versus relevance scatter plots
The coverage charts compare Smart Batching results against the full-grid reference by relevance bin. They are useful for deciding whether a sampling setting is appropriate for your task.
Figure: Tariffs China 2025 coverage by relevance range
Figure: Leadership 2023 coverage by relevance range
Figure: Confidence Decline 2021 coverage by relevance range

Summary

Smart Batching is a practical optimization layer for large Search workflows. It keeps the retrieval quality of Bigdata.com Search, adds a planning step that understands expected volume, and gives you a direct budget control through chunk_percentage. For full code and a runnable notebook, see the Smart Batching folder in the bigdata-cookbook repository, especially test_smart_batching.ipynb. For direct integration, install the bigdata-smart-batching Python package.