Why?

In an investment landscape increasingly shaped by long-term structural trends — such as decarbonization, AI adoption, or supply chain reshaping — investors need tools to quickly identify which companies are exposed to the themes driving markets. Whether you’re looking for new investment ideas, managing thematic portfolios, or assessing risks tied to specific trends, thematic screening enables you to systematically map theme-company connections using vast amounts of unstructured data.

The bigdata-research-tools package provides a specialised class, ThematicScreener, designed to help you uncover these connections across your chosen universe of companies. By applying thematic logic across different document types, you can assess how companies are positioned relative to the themes that matter:

  • DocumentType.NEWS → capturing thematic mentions in news coverage
  • DocumentType.TRANSCRIPTS → spotting companies’ strategic alignment to a theme in earnings calls
  • DocumentType.FILINGS → identifying disclosures, risks, or opportunities in SEC filings

Each Thematic Screener workflow follows a clear process:

  1. Select the company universe to screen for thematic exposure
  2. Define your own thematic labels that describe the target theme, or use our LLM-generated mindmapper to build a theme taxonomy
  3. Retrieve and process relevant documents using Bigdata’s search tools
  4. Apply LLM-based classification to tag theme mentions at the company and industry level
  5. Analyze the outputs to quantify exposure, trends, and relative positioning

This notebook will walk you through using the Thematic Screener to generate actionable insights on how specific companies link to major investment themes — turning unstructured data into structured, decision-ready intelligence.

Ready to get started? Let’s dive in!

Setup and Imports

Below is the Python code required for setting up our environment and importing necessary libraries.

import os
from logging import Logger, getLogger
from typing import Dict, List, Optional

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import numpy as np
import pandas as pd
from pandas import DataFrame, merge
import plotly
import plotly.io as pio
import seaborn as sns
pio.renderers.default = 'colab' # Change this if you're running in a different environment
plotly.offline.init_notebook_mode()

from bigdata_client import Bigdata
from bigdata_client.models.entities import Company
from bigdata_client.models.search import DocumentType

from bigdata_research_tools.labeler.screener_labeler import ScreenerLabeler
from bigdata_research_tools.search.screener_search import search_by_companies
from bigdata_research_tools.themes import (
    SourceType,
    generate_theme_tree,
    stringify_label_summaries,
)

from bigdata_research_tools.screeners import ExecutiveNarrativeFactor
from bigdata_research_tools.screeners.utils import get_scored_df

# For Excel functionality
from bigdata_research_tools.excel import ExcelManager

Define Output Paths

We define the output paths for our thematic screening results.

# Define output file paths for our results
output_dir = "output"
os.makedirs(output_dir, exist_ok=True)

export_path = f"{output_dir}/thematic_screener_results.xlsx"

Load Environment Variables

The Thematic Screener requires API credentials for both the Bigdata API and the LLM API (in this case, OpenAI). Make sure you have these credentials available as environment variables or in a secure credential store.

Never hardcode credentials directly in your notebook or scripts.

# Secure way to access credentials
from google.colab import userdata

BIGDATA_USERNAME = userdata.get('BIGDATA_USERNAME')
BIGDATA_PASSWORD = userdata.get('BIGDATA_PASSWORD')

# Set environment variables for any new client instances
os.environ["BIGDATA_USERNAME"] = BIGDATA_USERNAME
os.environ["BIGDATA_PASSWORD"] = BIGDATA_PASSWORD

# Use them in your code
bigdata = Bigdata(BIGDATA_USERNAME, BIGDATA_PASSWORD)

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
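
Outside Colab, google.colab.userdata is not available. The following is a minimal sketch of one alternative, assuming the same variable names are already set in the process environment. require_env is a hypothetical helper, not part of bigdata_client or bigdata_research_tools, and the values passed in the example are placeholders.

```python
import os


def require_env(names, env=None):
    """Return the requested variables, failing early if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# Example with an explicit mapping of placeholder values;
# omit `env` to read from the real process environment instead.
creds = require_env(
    ["BIGDATA_USERNAME", "BIGDATA_PASSWORD"],
    env={"BIGDATA_USERNAME": "user@example.com", "BIGDATA_PASSWORD": "placeholder"},
)
print(sorted(creds))
```

Failing early with the list of missing names tends to be easier to debug than an authentication error raised later by a client.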

Defining your Screening Parameters

  • Main Theme (main_theme): The central concept to explore
  • Company Universe (companies): The set of companies to screen
  • Time Period (start_date and end_date): The date range over which to run the search
  • Document Type (document_type): Specify which documents to search over (transcripts, filings, news)
  • Sources (sources): Specify a set of sources within a document type, for example which news outlets (available via Bigdata API) you wish to search over
  • Fiscal Year (fiscal_year): If the document type is transcripts or filings, fiscal year needs to be specified
  • Model Selection (llm_model): The LLM model used to mindmap the theme and label the search result chunks
  • Rerank Threshold (rerank_threshold): By setting this value, you’re enabling the cross-encoder which reranks the results and selects those whose relevance is above the percentile you specify (0.7 being the 70th percentile). More information on the re-ranker can be found here.
  • Focus (focus): Specify a focus within the main theme. This is then used when building the LLM-generated mindmapper

# ===== Theme Definition =====
main_theme = "Supply Chain Reshaping"

# ===== Company Universe (from Watchlist) =====
# Get S&P 100 watchlist from Bigdata.com
watchlist_id = "393b09cb-ecff-4625-b92c-18561a8d8bb4"
watchlist = bigdata.watchlists.get(watchlist_id)
companies = bigdata.knowledge_graph.get_entities(watchlist.items)

# ===== LLM Specification =====
llm_model = "openai::gpt-4o-mini"

# ===== Transcript Configuration =====
document_type = DocumentType.TRANSCRIPTS
fiscal_year = 2024

# ===== Enable/Disable Reranker =====
rerank_threshold = None

# ===== Specify Time Range =====
start_date = "2024-03-01"
end_date = "2025-03-01"
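
The percentile reading of rerank_threshold described above can be illustrated with a small sketch. It assumes each retrieved chunk carries a relevance score and uses a nearest-rank percentile. filter_by_percentile is a hypothetical helper; the cross-encoder's actual scoring and cutoff logic inside the library may differ.

```python
import math


def filter_by_percentile(scored_chunks, threshold):
    """Keep (text, score) pairs whose score is at or above the given percentile.

    `threshold` is a fraction between 0 and 1 (0.7 means the 70th percentile),
    computed here with the nearest-rank method.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    if not scored_chunks:
        return []
    scores = sorted(score for _, score in scored_chunks)
    rank = max(math.ceil(threshold * len(scores)), 1)
    cutoff = scores[rank - 1]
    return [(text, score) for text, score in scored_chunks if score >= cutoff]


# Illustrative scores only; real relevance scores come from the reranker.
chunks = [("a", 0.10), ("b", 0.40), ("c", 0.60), ("d", 0.80), ("e", 0.90)]
kept = filter_by_percentile(chunks, 0.5)
print(kept)
```

Because the cutoff is relative to the retrieved set, a higher threshold keeps fewer chunks regardless of the absolute score values.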

Mindmap a Theme Taxonomy with Bigdata Research Tools

You can leverage Bigdata Research Tools to generate a comprehensive theme taxonomy with an LLM, breaking down a megatrend into smaller, well-defined concepts for more targeted analysis. More details on the implementation can be found in the API Reference here.

theme_tree = generate_theme_tree(
    main_theme=main_theme,
    dataset=SourceType.CORPORATE_DOCS,
)

theme_tree.visualize()

The taxonomy tree includes descriptive sentences that explicitly connect each sub-theme back to the “Supply Chain Reshaping” main theme, ensuring all search results remain contextually relevant to our central trend.

node_summaries = theme_tree.get_summaries()
node_summaries
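
To make the structure of such a taxonomy concrete, here is a minimal sketch of a tree whose nodes hold a label and a descriptive sentence. ThemeNode and its methods are hypothetical stand-ins, not the object returned by generate_theme_tree, and the two sub-themes are made-up placeholders rather than actual LLM output.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThemeNode:
    """Hypothetical taxonomy node: a short label plus a descriptive sentence."""

    label: str
    summary: str
    children: List["ThemeNode"] = field(default_factory=list)

    def all_summaries(self) -> List[str]:
        """Summaries of this node and every descendant, depth-first."""
        result = [self.summary]
        for child in self.children:
            result.extend(child.all_summaries())
        return result

    def terminal_summaries(self) -> Dict[str, str]:
        """Map label -> summary for leaf nodes only."""
        if not self.children:
            return {self.label: self.summary}
        leaves: Dict[str, str] = {}
        for child in self.children:
            leaves.update(child.terminal_summaries())
        return leaves


# Placeholder taxonomy for illustration.
tree = ThemeNode(
    "Supply Chain Reshaping",
    "Companies restructuring how and where they source, make, and ship goods.",
    [
        ThemeNode("Nearshoring", "Moving production closer to end markets as part of supply chain reshaping."),
        ThemeNode("Supplier Diversification", "Adding alternative suppliers as part of supply chain reshaping."),
    ],
)
print(tree.all_summaries())
print(list(tree.terminal_summaries()))
```

In this sketch the summaries serve as search sentences, while the leaf labels serve as the categories a classifier would assign.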

Retrieve Content

With the theme taxonomy and screening parameters, you can leverage the Bigdata API to run a search on company transcripts. We need to define 3 more parameters for searching:

  • Frequency (freq): The frequency of the date ranges to search over. Defaults to ‘3M’. A unit can be prefixed with a multiplier, as in ‘3M’ or ‘6M’. Supported units:
    • ‘Y’: Yearly intervals.
    • ‘M’: Monthly intervals.
    • ‘W’: Weekly intervals.
    • ‘D’: Daily intervals.
  • Document Limit (document_limit): The maximum number of documents to return per query to Bigdata API.
  • Batch Size (batch_size): The number of entities to include in a single batched query.

freq = "6M"
document_limit = 100
batch_size = 10

df_sentences = search_by_companies(
    companies=companies,
    sentences=node_summaries,
    start_date=start_date,
    end_date=end_date,
    scope=document_type,
    fiscal_year=fiscal_year,
    rerank_threshold=rerank_threshold,
    freq=freq,
    document_limit=document_limit,
    batch_size=batch_size,
)

df_sentences.head(5)

DataFrame Summary: 5 rows × 16 columns
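
To make freq and batch_size more concrete, the sketch below splits a search window into 6-month intervals and groups a list of entities into batches. split_date_range and batched are hypothetical helpers written with the standard library; the interval boundaries and batching used internally by search_by_companies may differ.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date forward by whole months, clamping the day to the month length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def split_date_range(start, end, months):
    """Split [start, end) into consecutive windows of `months` months each."""
    if months < 1:
        raise ValueError("months must be at least 1")
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(add_months(cursor, months), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


def batched(items, batch_size):
    """Group items into consecutive lists of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


windows = split_date_range(date(2024, 3, 1), date(2025, 3, 1), months=6)
batches = batched([f"entity_{n}" for n in range(25)], batch_size=10)
print(windows)
print([len(batch) for batch in batches])
```

Shorter intervals and smaller batches generally mean more queries, each capped by document_limit, so these settings trade coverage against API usage.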

Label the Results

Use an LLM to analyze each text chunk and determine its relevance to the sub-themes. Any chunks that are not explicitly linked to supply chain reshaping are filtered out.

labeler = ScreenerLabeler(llm_model=llm_model)
df_labels = labeler.get_labels(
    main_theme=main_theme,
    labels=list(theme_tree.get_terminal_label_summaries().keys()),
    texts=df_sentences["masked_text"].tolist(),
)

# Merge and process results
df = merge(df_sentences, df_labels, left_index=True, right_index=True)
df = labeler.post_process_dataframe(df)
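
The filtering idea can be sketched without the library: pair each chunk with its label by position and keep only chunks whose label is one of the known sub-themes. keep_labeled is a hypothetical helper, and "unrelated" is a placeholder; the label that ScreenerLabeler actually assigns to off-theme chunks is not shown in this notebook.

```python
def keep_labeled(chunks, labels, allowed_labels):
    """Pair chunks with labels by position and keep only those with an allowed label."""
    if len(chunks) != len(labels):
        raise ValueError("chunks and labels must align one-to-one")
    allowed = set(allowed_labels)
    return [
        {"text": text, "label": label}
        for text, label in zip(chunks, labels)
        if label in allowed
    ]


# Placeholder chunks and labels for illustration.
chunks = ["chunk about moving plants", "chunk about dividends", "chunk about new suppliers"]
labels = ["Nearshoring", "unrelated", "Supplier Diversification"]
labeled = keep_labeled(chunks, labels, ["Nearshoring", "Supplier Diversification"])
print(labeled)
```

The positional pairing mirrors the index-based merge above, which relies on the labels being returned in the same order as the input texts.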

Assess Thematic Exposure

We’ll look at the 10 companies most exposed to supply chain reshaping. The get_scored_df function calculates the composite thematic score by summing the scores across the sub-themes for each company (df_company) or industry (df_industry).

df_company, df_industry = DataFrame(), DataFrame()

df_company = get_scored_df(
    df, index_columns=["Company", "Ticker", "Industry"], pivot_column="Theme"
)
df_industry = get_scored_df(
    df, index_columns=["Industry"], pivot_column="Theme"
)
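
As a rough illustration of this kind of scoring, the sketch below counts labeled chunks per company and theme, then sums the counts into a composite score. It assumes each labeled chunk contributes one point, which may not match how get_scored_df weights results. score_by_company is a hypothetical helper and the rows are placeholder data.

```python
from collections import Counter, defaultdict


def score_by_company(rows):
    """Count labeled chunks per (company, theme) and sum them into a composite score."""
    counts = Counter((row["Company"], row["Theme"]) for row in rows)
    per_company = defaultdict(dict)
    for (company, theme), n in counts.items():
        per_company[company][theme] = n
    table = [
        {"Company": company, **themes, "Composite Score": sum(themes.values())}
        for company, themes in per_company.items()
    ]
    # Highest composite score first
    return sorted(table, key=lambda row: row["Composite Score"], reverse=True)


# Placeholder rows: one entry per labeled chunk.
rows = [
    {"Company": "Company A", "Theme": "Nearshoring"},
    {"Company": "Company A", "Theme": "Nearshoring"},
    {"Company": "Company A", "Theme": "Supplier Diversification"},
    {"Company": "Company B", "Theme": "Nearshoring"},
]
ranking = score_by_company(rows)
print(ranking)
```

Under this assumption, a company mentioned often in connection with a single sub-theme can outrank one mentioned occasionally across many, which is worth keeping in mind when reading the composite score.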

Now, let’s visualize the results using Plotly to create an interactive dashboard:

import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
pio.renderers.default = 'colab'  # Change this if you're running the notebook in Jupyter or VS Code
plotly.offline.init_notebook_mode()


def create_thematic_exposure_dashboard(df_company, n_companies=10):
    """
    Creates a comprehensive dashboard for analyzing thematic exposure of companies.

    Parameters:
    -----------
    df_company : pandas.DataFrame
        DataFrame containing company data with columns for 'Company', 'Industry',
        'Composite Score', and multiple thematic exposure columns.
    n_companies : int, default=10
        Number of companies to include in the analysis.

    Returns:
    --------
    tuple
        A tuple containing two Plotly figures:
        - Main dashboard with four panels (heatmap, bar chart, scatter, bar chart)
        - Industry-level analysis heatmap
    """
    # Select top n companies and reset index
    df = df_company[:n_companies].reset_index(drop=True).copy()

    # Extract theme column names (all columns between 'Industry' and 'Composite Score')
    theme_columns = list(df.iloc[:, 3:-1].columns)

    # Create subplots layout
    fig = make_subplots(
        rows=4, cols=1,
        specs=[[{"type": "heatmap"}], [{"type": "bar"}],
               [{"type": "scatter"}], [{"type": "bar"}]],
        row_heights=[0.25, 0.25, 0.25, 0.25],
        column_widths=[2],
        vertical_spacing=0.18,
        horizontal_spacing=0.1,
        subplot_titles=(
            'Thematic Exposure Heatmap (Raw Scores)',
            'Total Thematic Exposure Score',
            'Top Thematic Exposures by Company',
            'Thematic Scores across Sub-Themes'
        )
    )

    # Add each visualization to the dashboard
    add_raw_scores_heatmap(fig, df, theme_columns)
    add_total_scores_barchart(fig, df)
    add_top_themes_by_company_scatter(fig, df, theme_columns)
    add_dominant_themes_barchart(fig, df, theme_columns)

    # Create industry-level analysis as a separate figure
    industry_fig = create_industry_analysis(df, theme_columns)

    # Format the main dashboard layout
    format_dashboard_layout(fig)

    return fig, industry_fig


def add_raw_scores_heatmap(fig, df, theme_columns):
    """
    Adds a heatmap of raw thematic scores to the dashboard.

    Parameters:
    -----------
    fig : plotly.graph_objects.Figure
        The figure to add the heatmap to
    df : pandas.DataFrame
        The data frame containing company and theme data
    theme_columns : list
        List of column names representing thematic categories
    """
    heatmap_z = df[theme_columns].values
    heatmap_x = theme_columns
    heatmap_y = df['Company'].tolist()

    fig.add_trace(
        go.Heatmap(
            z=heatmap_z,
            x=heatmap_x,
            y=heatmap_y,
            colorscale='YlGnBu',
            text=heatmap_z.astype(int),
            texttemplate="%{text}",
            showscale=True,
        ),
        row=1, col=1
    )


def add_total_scores_barchart(fig, df):
    """
    Adds a horizontal bar chart of total thematic scores by company.
    Companies are sorted by score in descending order.

    Parameters:
    -----------
    fig : plotly.graph_objects.Figure
        The figure to add the bar chart to
    df : pandas.DataFrame
        The data frame containing company scores
    """
    companies = df['Company'].tolist()
    total_scores = df['Composite Score'].tolist()

    # Sort by score for better visualization (highest first)
    sorted_indices = np.argsort(total_scores)[::-1]
    sorted_companies = [companies[i] for i in sorted_indices]
    sorted_scores = [total_scores[i] for i in sorted_indices]

    fig.add_trace(
        go.Bar(
            y=sorted_companies,
            x=sorted_scores,
            orientation='h',
            marker=dict(
                color=sorted_scores,
                colorscale='Viridis',
                showscale=False
            ),
            text=sorted_scores,
            textposition='outside',
            textfont=dict(size=10),  # Smaller text to avoid overlap
        ),
        row=2, col=1
    )


def add_top_themes_by_company_scatter(fig, df, theme_columns):
    """
    Adds a scatter plot showing the top 3 thematic exposures for each company.
    The size of each marker represents the thematic score value.

    Parameters:
    -----------
    fig : plotly.graph_objects.Figure
        The figure to add the scatter plot to
    df : pandas.DataFrame
        The data frame containing company and theme data
    theme_columns : list
        List of column names representing thematic categories
    """
    max_score = df[theme_columns].values.max()
    companies_unique = df['Company'].unique()

    # Create a unified trace per company to reduce clutter
    for i, company in enumerate(companies_unique):
        company_data = df[df['Company'] == company]
        if len(company_data) == 0:
            continue

        company_row = company_data.iloc[0]
        company_scores = company_row[theme_columns].values

        # Get indices of top 3 themes
        top_indices = np.argsort(company_scores)[-3:]

        x_values = []
        y_values = []
        sizes = []
        hover_texts = []

        for idx in top_indices:
            if company_scores[idx] > 0:  # Only plot if score > 0
                theme = theme_columns[idx]
                score = company_scores[idx]
                size = (score / max_score) * 80

                x_values.append(company)
                y_values.append(theme)
                sizes.append(size)
                hover_texts.append(f"{company}<br>{theme}: {int(score)}")

        if len(x_values) > 0:
            fig.add_trace(
                go.Scatter(
                    x=x_values,
                    y=y_values,
                    mode='markers',
                    marker=dict(
                        size=sizes,
                        sizemode='area',
                        sizeref=0.15,
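                        # Map the company index onto the Turbo scale; an explicit
                        # cmin/cmax keeps the mapping consistent across traces
                        color=[i] * len(x_values),
                        cmin=0,
                        cmax=max(len(companies_unique) - 1, 1),
                        colorscale='Turbo',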
                        showscale=False,
                        opacity=0.7,
                        line=dict(width=1, color='DarkSlateGrey'),
                    ),
                    text=hover_texts,
                    hoverinfo='text',
                    name=company,
                ),
                row=3, col=1
            )


def add_dominant_themes_barchart(fig, df, theme_columns):
    """
    Adds a horizontal bar chart showing the most dominant themes across all companies.

    Parameters:
    -----------
    fig : plotly.graph_objects.Figure
        The figure to add the bar chart to
    df : pandas.DataFrame
        The data frame containing theme data
    theme_columns : list
        List of column names representing thematic categories
    """
    # Calculate totals for each theme across all companies
    theme_totals = df[theme_columns].sum()
    theme_names = theme_totals.index.tolist()
    theme_values = theme_totals.values.tolist()

    # Sort themes by value (descending)
    sorted_indices = np.argsort(theme_values)[::-1]
    top_themes = [theme_names[i] for i in sorted_indices]
    top_values = [theme_values[i] for i in sorted_indices]

    fig.add_trace(
        go.Bar(
            y=top_themes,
            x=top_values,
            orientation='h',
            marker=dict(
                color=top_values,
                colorscale='Reds_r',
                showscale=False
            ),
            text=top_values,
            textposition='outside',
            textfont=dict(size=10),
        ),
        row=4, col=1
    )


def create_industry_analysis(df, theme_columns):
    """
    Creates a separate heatmap showing average thematic scores by industry.

    Parameters:
    -----------
    df : pandas.DataFrame
        The data frame containing company and industry data
    theme_columns : list
        List of column names representing thematic categories

    Returns:
    --------
    plotly.graph_objects.Figure
        A heatmap figure showing industry-level analysis
    """
    # Group by industry and calculate mean scores
    industry_data = []

    for industry, group in df.groupby('Industry'):
        for theme in theme_columns:
            industry_data.append({
                'Industry': industry,
                'Theme': theme,
                'Score': group[theme].mean()
            })

    industry_df = pd.DataFrame(industry_data)

    # Create a pivot table for the heatmap
    industry_pivot = industry_df.pivot(index='Industry', columns='Theme', values='Score')

    # Create the industry analysis figure
    industry_fig = go.Figure(data=go.Heatmap(
        z=industry_pivot.values,
        x=industry_pivot.columns,
        y=industry_pivot.index,
        colorscale='YlOrRd',
        text=np.round(industry_pivot.values, 1),
        texttemplate="%{text}",
    ))

    # Format the industry analysis figure
    industry_fig.update_layout(
        title='Industry-Level Thematic Exposure (Average Scores)',
        height=500,
        width=1000,
        margin=dict(l=60, r=50, t=80, b=50),
    )

    return industry_fig


def format_dashboard_layout(fig):
    """
    Formats the dashboard layout with appropriate titles, margins, and axis labels.

    Parameters:
    -----------
    fig : plotly.graph_objects.Figure
        The dashboard figure to format
    """
    # Update overall layout
    fig.update_layout(
        height=1600,
        width=1800,
        title_text="Thematic Exposure Analysis Dashboard",
        showlegend=False,
        margin=dict(l=60, r=50, t=100, b=50),
    )

    # Update axis titles and formatting
    fig.update_xaxes(
        title_text="",
        row=1, col=1,
        tickangle=45,
        tickfont=dict(size=9),
        automargin=True,
    )
    fig.update_yaxes(title_text="Company", row=1, col=1, automargin=True)
    fig.update_xaxes(title_text="Total Score", row=2, col=1, automargin=True)
    fig.update_yaxes(title_text="", row=2, col=1)
    fig.update_xaxes(title_text="", row=3, col=1, automargin=True)
    fig.update_yaxes(title_text="Theme", row=3, col=1)
    fig.update_xaxes(title_text="Total Score Across Companies", row=4, col=1, automargin=True)


# Create and display the dashboard
main_fig, industry_fig = create_thematic_exposure_dashboard(df_company)
main_fig.show()
industry_fig.show()
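
The dashboard picks theme columns by position (iloc[:, 3:-1]), which depends on the scored table having exactly three index columns first and the composite score last. A name-based selection is one possible alternative. select_theme_columns is a hypothetical helper, and the column names in the example are placeholders that follow the layout assumed by the code above.

```python
def select_theme_columns(columns, index_columns, score_column="Composite Score"):
    """Return the columns that are neither index columns nor the composite score."""
    excluded = set(index_columns) | {score_column}
    return [column for column in columns if column not in excluded]


# Placeholder column layout matching the positional assumption in the dashboard code.
columns = [
    "Company", "Ticker", "Industry",
    "Nearshoring", "Supplier Diversification",
    "Composite Score",
]
theme_columns = select_theme_columns(columns, ["Company", "Ticker", "Industry"])
print(theme_columns)
```

With a DataFrame, the same helper could be called as select_theme_columns(df.columns, ["Company", "Ticker", "Industry"]), so a reordered or extended table would not silently shift the selected themes.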

Extract Key Insights

The visualizations reveal key insights about how companies are positioning themselves within the supply chain reshaping theme:

AI and Machine Learning Emerges as the Core Enabler

With the highest cumulative score across all companies, AI and Machine Learning is the most dominant theme, highlighting its foundational role in predictive analytics, automation, and optimization within modern supply chains.

Circular Economy and Automation as Structural Shifts

The strong presence of Circular Economy Practices and Automation & Robotics indicates a structural shift toward sustainable and efficient supply chain models—companies are not just digitizing but rethinking operational design.

Tech-Centric Players Lead the Pack

Siemens AG, Infineon Technologies AG, and Qualcomm Inc. are the frontrunners in thematic exposure, underscoring that companies at the intersection of industrial technology and digital infrastructure are best positioned to drive—and benefit from—supply chain transformation.

IoT Integration as a Bridge Between Physical and Digital

IoT’s high ranking shows its critical role in connecting assets, enabling real-time visibility, and facilitating advanced automation, especially for manufacturers and hardware-driven firms.

Industry Polarisation

Sector Engagement

  • The Semiconductors and Computer Services industries show the strongest average exposure, reflecting their integral role in enabling supply chain tech (e.g., sensors, connectivity, software).
  • Traditional sectors such as Diversified Industrials show broader but shallower engagement, suggesting they are still in earlier phases of thematic adoption.

Strategic Focus

Concentration vs. Diversification in Exposure

Most companies exhibit thematic concentration, focusing efforts on a few high-impact areas rather than spreading across all themes—likely reflecting strategic prioritization rather than lack of alignment.
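
One simple way to put a number on concentration is the share of a company's total score that comes from its strongest theme. This is an illustrative measure, not something computed elsewhere in this notebook; top_theme_share is a hypothetical helper and the scores are placeholders.

```python
def top_theme_share(theme_scores):
    """Fraction of a company's total score that comes from its single strongest theme."""
    total = sum(theme_scores.values())
    if total == 0:
        return 0.0
    return max(theme_scores.values()) / total


# Placeholder scores: one concentrated profile and one diversified profile.
concentrated = top_theme_share({"Theme A": 8, "Theme B": 1, "Theme C": 1})
diversified = top_theme_share({"Theme A": 3, "Theme B": 3, "Theme C": 4})
print(concentrated, diversified)
```

Values close to 1 indicate exposure dominated by a single sub-theme, while values near 1 divided by the number of themes indicate an even spread.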

Export the Results

Export the data to an Excel workbook for further analysis or to share with the team.

try:
    # Create the Excel manager
    excel_manager = ExcelManager()

    # Define the dataframes and their sheet configurations
    df_args = [
        (df, "Semantic Labels", (0, 0)),
        (df_company, "By Company", (2, 4)),
        (df_industry, "By Industry", (2, 2))
    ]

    # Save the workbook
    excel_manager.save_workbook(df_args, export_path)

except Exception as e:
    print(f"Warning while exporting to Excel: {e}")

Conclusion

The Thematic Screener provides a powerful way to identify companies that are most aligned with or exposed to specific investment themes. By leveraging Bigdata’s search capabilities and applying LLM-based classification, you can:

  1. Discover thematic leaders - Find companies with the strongest strategic alignment to emerging trends
  2. Compare across industries - Identify which sectors are most proactive in addressing thematic challenges and opportunities
  3. Identify investment opportunities - Spot companies that may be undervalued relative to their thematic positioning
  4. Monitor thematic evolution - Track how themes gain or lose prominence across your investment universe over time

Whether you’re building thematic portfolios, conducting sector research, or seeking alpha through theme-based strategies, the Thematic Screener transforms unstructured data into structured, decision-ready intelligence.