Why It Matters

Technology companies face an increasingly complex regulatory landscape spanning AI governance, data privacy, antitrust scrutiny, and platform accountability. Tracking compliance risks across multiple companies and jurisdictions manually is time-consuming and fragmented, while regulatory developments appear scattered across news coverage, SEC filings, and earnings transcripts.

What It Does

The GenerateReport class in the bigdata-research-tools package systematically analyzes regulatory exposure across company watchlists using unstructured data from news, filings, and transcripts. Built for risk managers and investment professionals, it transforms scattered regulatory information into quantifiable risk intelligence and identifies proactive company mitigation strategies.

How It Works

The GenerateReport class combines automated theme taxonomies, multi-source content retrieval, and LLM-powered risk scoring to deliver:
  • Sector-wide regulatory mapping across technology domains (AI, Social Media, Hardware & Chips, E-commerce, Advertising)
  • Company-specific risk quantification using Media Attention, Risk/Financial Impact, and Uncertainty metrics
  • Mitigation strategy extraction from corporate communications to identify compliance approaches
  • Structured output for reports that rank regulatory issues by intensity and business impact
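As a sketch of the kind of ranking this structured output enables, the following scores a few invented issues on the three metrics and orders them by an equal-weight composite (the issue names, schema, and weighting are illustrative only, not the package's actual output format):

```python
import pandas as pd

# Hypothetical issue scores on the three metrics used in this cookbook
# (Media Attention, Risk/Financial Impact, Uncertainty), each on a 0-10 scale.
issues = pd.DataFrame({
    "issue": ["AI model transparency", "Ad-tech antitrust", "Chip export controls"],
    "media_attention": [8, 6, 9],
    "risk_impact": [7, 9, 8],
    "uncertainty": [9, 5, 6],
})

# Simple equal-weight composite; the real report ranks each metric separately.
issues["composite"] = issues[["media_attention", "risk_impact", "uncertainty"]].mean(axis=1)
ranked = issues.sort_values("composite", ascending=False).reset_index(drop=True)
print(ranked[["issue", "composite"]])
```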

A Real-World Use Case

This cookbook demonstrates the complete workflow by analyzing regulatory challenges across the “Magnificent 7” tech companies, showing how the generator automatically creates comprehensive risk assessments and extracts company response strategies from multiple document sources. Ready to get started? Let’s dive in!

Prerequisites

To run the Report Generator workflow, you can choose between two options:
  • 💻 GitHub cookbook
    • Use this if you prefer working locally or in a custom environment.
    • Follow the setup and execution instructions in the README.md.
    • API keys are required:
      • Option 1: Follow the key setup process described in the README.md
      • Option 2: Refer to this guide: How to initialise environment variables
        • ❗ When using this method, you must manually add the OpenAI API key:
          # OpenAI credentials
          OPENAI_API_KEY = "<YOUR_OPENAI_API_KEY>"
          
  • 🐳 Docker Installation
    • An alternative, containerized setup that simplifies environment configuration for those who prefer Docker-based deployment.
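However you set the keys, a minimal sketch of loading them in Python (with a placeholder fallback so the snippet runs anywhere; replace it with a real key before running the workflow) looks like:

```python
import os

# Read the OpenAI key from the environment; fall back to a placeholder so
# this sketch runs anywhere. Replace the placeholder with a real key before
# running the workflow.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "<YOUR_OPENAI_API_KEY>")
print("OpenAI key configured:", OPENAI_API_KEY != "<YOUR_OPENAI_API_KEY>")
```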

Setup and Imports

Below is the Python code required for setting up our environment and importing necessary libraries.
from src.report_generator import GenerateReport  # the source will be changed
from src.summary.summary import TopicSummarizerSector, TopicSummarizerCompany
from src.response.company_response import CompanyResponseProcessor
from src.tool import create_styled_table, prepare_data_report_0, generate_html_report


from bigdata_research_tools.themes import generate_theme_tree
from bigdata_research_tools.search.screener_search import search_by_companies
from bigdata_research_tools.labeler.screener_labeler import ScreenerLabeler
from bigdata_research_tools.excel import ExcelManager
from bigdata_client.models.search import DocumentType
from bigdata_client import Bigdata

import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import pandas as pd
from IPython.display import display, HTML

import asyncio  # used by the async summarizer and response-processor steps below
import os


# Connect to the Bigdata API and load the OpenAI key
# (environment variable names are assumed here; see the Prerequisites section)
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
bigdata = Bigdata(os.environ["BIGDATA_USERNAME"], os.environ["BIGDATA_PASSWORD"])

# Define output file paths for our report
current_dir = os.getcwd()
output_dir = f"{current_dir}/output"
os.makedirs(output_dir, exist_ok=True)

export_path = f"{output_dir}/regulatory_issues_report.xlsx"

Defining the Report Parameters

Fixed Parameters

  • General Theme (general_theme): The central regulatory concept to explore across all technology domains
  • Specific Focus Areas (list_specific_focus): Technology sectors where regulatory issues are particularly relevant
  • Bigdata (bigdata): The authenticated Bigdata API client connection

Customizable Parameters

  • Watchlist (my_watchlist_id): The set of companies to analyze. This is the ID of your watchlist in the watchlist section of the app.
  • Model Selection (llm_model): The LLM model used to label search result document chunks and generate summaries
  • Frequency (search_frequency): The frequency of the date ranges to search over. Defaults to 3M. Supported values:
    • Y: Yearly intervals
    • M: Monthly intervals
    • W: Weekly intervals
    • D: Daily intervals
  • Time Period (start_date and end_date): The date range over which to run the analysis
  • Fiscal Year (fiscal_year): If the document type is transcripts or filings, fiscal year needs to be specified
  • Focus (focus): A focus area within the main theme, used when building the LLM-generated mind map
  • Document Limits (document_limit_news, document_limit_filings, document_limit_transcripts): The maximum number of documents to return per query to the Bigdata API for each document category
  • Batch Size (batch_size): The number of entities to include in a single batched query
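To make search_frequency concrete: a frequency of 'M' splits the start/end window into monthly sub-ranges, one search per sub-range. A hedged sketch of that splitting with pandas (illustrating the idea, not the package's internal code):

```python
import pandas as pd

start_date, end_date = "2025-01-01", "2025-04-20"

# Month-start boundaries inside the window, plus the window edges, give the
# monthly sub-ranges that freq='M' implies for the searches.
edges = pd.date_range(start=start_date, end=end_date, freq="MS").tolist()
edges = [pd.Timestamp(start_date)] \
    + [e for e in edges if e > pd.Timestamp(start_date)] \
    + [pd.Timestamp(end_date)]
intervals = [(edges[i].date(), edges[i + 1].date()) for i in range(len(edges) - 1)]
for a, b in intervals:
    print(a, "→", b)
```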
# ===== Fixed Parameters =====

# General regulatory theme
general_theme = 'Regulatory Issues'

# Specific focus areas within technology sectors
list_specific_focus = ['AI', 'Social Media', 'Hardware and Chips', 'E-commerce', 'Advertising']

# ===== Customizable Parameters =====

# Company Universe (from Watchlist)
my_watchlist_id = "ff900c90-1007-4971-a91d-49e6f9bb798c" # Magnificent 7
watchlist = bigdata.watchlists.get(my_watchlist_id)
companies = bigdata.knowledge_graph.get_entities(watchlist.items)
company_names = [company.name for company in companies]

# LLM Specification
llm_model = "openai::gpt-4o-mini"

# Search Frequency
search_frequency = 'M'

# Specify Time Range
start_date = "2025-01-01"
end_date = "2025-04-20"

# Fiscal Year
fiscal_year = 2025

# Document Limits
document_limit_news = 10
document_limit_filings = 5
document_limit_transcripts = 5

# Others
batch_size = 1

Generate Report

We initialize the GenerateReport class; in the following sections of the cookbook, we will walk through each step this class uses to generate the report. In the Colab cookbook, you can skip the step-by-step process and run the generate_report() method directly in the Direct Method section.
report_generator = GenerateReport(
        watchlist_id=my_watchlist_id,
        general_theme='Regulatory Issues',
        list_specific_focus=['AI', 'Social Media', 'Hardware and Chips', 'E-commerce', 'Advertising'],
        llm_model=llm_model,
        api_key=OPENAI_API_KEY,
        start_date=start_date,
        end_date=end_date,
        fiscal_year=fiscal_year,
        search_frequency=search_frequency,
        document_limit_news=document_limit_news,
        document_limit_filings=document_limit_filings,
        document_limit_transcripts=document_limit_transcripts,
        batch_size=batch_size,
        bigdata=bigdata
)

Mindmap a Theme Taxonomy with Bigdata Research Tools

You can leverage Bigdata Research Tools to generate a comprehensive theme taxonomy with an LLM, breaking down regulatory themes into smaller, well-defined concepts for more targeted analysis across different technology focus areas.
# Generate the Theme Tree
themes_tree_dict = {}
for focus in list_specific_focus:
    theme_tree = generate_theme_tree(
                        main_theme=general_theme,
                        focus=focus
                    )
    themes_tree_dict[focus] = theme_tree

# Visualize one of the generated trees (here, the last focus area from the loop)
themes_tree_dict[list_specific_focus[-1]].visualize()
[Figure: Regulatory theme tree visualization showing Regulatory Issues broken down into sub-themes across tech sectors]
The taxonomy tree includes descriptive sentences that explicitly connect each sub-theme back to the Regulatory Issues general theme, ensuring all search results remain contextually relevant to our central trend.
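The later retrieval and labeling steps call get_terminal_label_summaries() on each tree, which returns a mapping from leaf-node labels to those descriptive sentences. A hypothetical example of its shape (labels and sentences invented for illustration; the real ones are generated by the LLM at run time):

```python
# Hypothetical terminal labels -> summaries mapping for the 'AI' focus area;
# the actual content is produced by generate_theme_tree at run time.
terminal_label_summaries = {
    "Data Privacy Enforcement": (
        "Regulatory Issues in AI arising from enforcement of data privacy "
        "rules on model training data."
    ),
    "Algorithmic Accountability": (
        "Regulatory Issues in AI concerning audits and accountability "
        "requirements for automated decision systems."
    ),
}

# The labeling step uses the keys, while the search step uses the values.
labels = list(terminal_label_summaries.keys())
sentences = list(terminal_label_summaries.values())
print(labels)
```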

Retrieve Content

With the theme taxonomy and screening parameters, you can leverage the Bigdata API to run searches on company news, filings, and transcripts across different regulatory focus areas.
# Run searches on News, Filings, and Transcripts
df_sentences_news = []
df_sentences_filings = []
df_sentences_transcripts = []

scopes_config = [
    (DocumentType.NEWS, document_limit_news, df_sentences_news, None),
    (DocumentType.FILINGS, document_limit_filings, df_sentences_filings, fiscal_year),
    (DocumentType.TRANSCRIPTS, document_limit_transcripts, df_sentences_transcripts, fiscal_year)
]

# Search using summaries
for scope, document_limit, df_list, year in scopes_config:
    for focus in list_specific_focus:
        df_sentences = search_by_companies(
            companies=companies,
            sentences=list(themes_tree_dict[focus].get_terminal_label_summaries().values()),
            fiscal_year=year,
            start_date=start_date,
            end_date=end_date,
            scope=scope,
            freq=search_frequency,
            document_limit=document_limit,
            batch_size=batch_size
        )
        df_sentences['theme'] = general_theme + ' in ' + focus
        df_list.append(df_sentences)

# Concatenate results
df_sentences_news = pd.concat(df_sentences_news)
df_sentences_filings = pd.concat(df_sentences_filings)
df_sentences_transcripts = pd.concat(df_sentences_transcripts)

Label the Results

Use an LLM to analyze each document chunk and determine its relevance to the regulatory themes. Any chunk that isn’t explicitly linked to Regulatory Issues is filtered out.
# Label the search results with our theme labels
labeler = ScreenerLabeler(llm_model=llm_model)

# Initialize empty lists for labeled data
df_news_labeled = []
df_filings_labeled = []
df_transcripts_labeled = []

# Configure data sources
sources_config = [
    (df_sentences_news, df_news_labeled),
    (df_sentences_filings, df_filings_labeled),
    (df_sentences_transcripts, df_transcripts_labeled)
]

for df_sentences, labeled_list in sources_config:
    for focus in list_specific_focus:
        # .copy() avoids pandas' SettingWithCopyWarning when resetting the index
        df_sentences_theme = df_sentences.loc[df_sentences.theme == general_theme + ' in ' + focus].copy()
        df_sentences_theme.reset_index(drop=True, inplace=True)
        df_labels = labeler.get_labels(
            main_theme=general_theme + ' in ' + focus,
            labels=list(themes_tree_dict[focus].get_terminal_label_summaries().keys()),
            texts=df_sentences_theme["masked_text"].tolist()
        )
        df_merged_labels = pd.merge(df_sentences_theme, df_labels, left_index=True, right_index=True)
        labeled_list.append(df_merged_labels)

# Concatenate results
df_news_labeled = pd.concat(df_news_labeled)
df_filings_labeled = pd.concat(df_filings_labeled)
df_transcripts_labeled = pd.concat(df_transcripts_labeled)

Document Distribution Visualization

The tables below show the count of each document type retrieved for every company in the given universe. This helps you understand the distribution and availability of regulatory information across different sources for each entity.

Table for All Retrieved Documents about Regulatory Issues
df_statistic_resources = pd.concat([df_news_labeled, df_filings_labeled, df_transcripts_labeled])
create_styled_table(df_statistic_resources, title='Retrieved Document Count by Company and Document Type', companies_list = company_names)
All Regulatory Document Count
Table for Relevant Documents about Regulatory Issues
df_statistic_resources_relevant = df_statistic_resources.loc[~df_statistic_resources.label.isin(['', 'unassigned', 'unclear'])]
create_styled_table(df_statistic_resources_relevant, title='Relevant Document Count by Company and Document Type', companies_list = company_names)
Relevant Regulatory Document Count

Summarizer

The following code is used to create summaries for regulatory themes at both sector-wide and company-specific levels using the information from the retrieved documents.
# Run the process to summarize the documents and compute media attention by topic, sector-wide
summarizer_sector = TopicSummarizerSector(
   model=llm_model.split('::')[1],
   api_key=OPENAI_API_KEY,
   df_labeled=df_news_labeled,
   general_theme=general_theme,
   list_specific_focus=list_specific_focus,
   themes_tree_dict=themes_tree_dict,
   logger=GenerateReport.logger
)
df_by_theme = summarizer_sector.summarize()

# Run the process to summarize the documents and score media attention, risk and uncertainty by topic at company level
summarizer_company = TopicSummarizerCompany(
   model=llm_model.split('::')[1],
   api_key=OPENAI_API_KEY,
   logger=GenerateReport.logger,
   verbose=True
)
df_by_company = asyncio.run(
   summarizer_company.process_topic_by_company(
       df_labeled=df_news_labeled,
       list_entities=companies
   )
)

Company Response Analysis

Extract company mitigation strategies and regulatory responses from filings and transcripts to understand how companies are proactively addressing regulatory challenges.
# Concatenate Filings and Transcripts dataframes
df_filings_labeled['doc_type'] = 'Filings'
df_transcripts_labeled['doc_type'] = 'Transcripts'
df_ft_labeled = pd.concat([df_filings_labeled, df_transcripts_labeled])
df_ft_labeled = df_ft_labeled.reset_index(drop=True)

# Run the process to extract company's mitigation plan from the documents (filings and transcripts)
response_processor = CompanyResponseProcessor(
   model=llm_model.split('::')[1],
   api_key=OPENAI_API_KEY,
   logger=GenerateReport.logger,
   verbose=True
)

df_response_by_company = asyncio.run(
   response_processor.process_response_by_company(
       df_labeled=df_ft_labeled,
       df_by_company=df_by_company,
       list_entities=companies
   )
)

# Merge the companies responses to the dataframe with issue summaries and scores
df_by_company_with_responses = pd.merge(df_by_company, df_response_by_company, on=['entity_id', 'entity_name', 'topic'], how='left')
df_by_company_with_responses['filings_response_summary'] = df_by_company_with_responses['response_summary']

# Extract the company's mitigation plan for each regulatory issue from the News documents
df_news_response_by_company = asyncio.run(
   response_processor.process_response_by_company(
       df_labeled=df_news_labeled,
       df_by_company=df_by_company,
       list_entities=companies
   )
)

df_news_response_by_company = df_news_response_by_company.rename(
   columns={'response_summary': 'news_response_summary', 'n_response_documents': 'news_n_response_documents'}
)
df_by_company_with_responses = pd.merge(df_by_company_with_responses, df_news_response_by_company,
                                       on=['entity_id', 'entity_name', 'topic'], how='left')

report_by_theme = df_by_theme
report_by_company = df_by_company_with_responses

Generate Final Report

The following code provides an example of how the final regulatory issues report can be formatted, ranking topics based on their Media Attention, Risk/Financial Impact, and Uncertainty. You can customize the ranking system by specifying the number of top themes to display with user_selected_nb_topics_themes.

# Generate the html report
top_by_theme, top_by_company = prepare_data_report_0(
     df_by_theme = df_by_theme,
     df_by_company_with_responses = df_by_company_with_responses,
     user_selected_nb_topics_themes = 3,
)

html_content = generate_html_report(top_by_theme, top_by_company, 'Regulatory Issues in the Tech Sector')

with open(output_dir+'/report.html', 'w') as file:
     file.write(html_content)

display(HTML(html_content))

Report: Regulatory Issues in the Tech Sector

Sector-Wide Issues

Company-Specific Issues

The report provides a detailed analysis of the most relevant sector-wide issues and analyzes individual companies, highlighting three key aspects:
  • Most Reported Issue: The regulatory topic receiving the highest volume of media coverage
  • Biggest Risk: The regulatory issue with the highest potential financial and business impact
  • Most Uncertain Issue: The regulatory matter with the greatest ambiguity and unpredictability
Each aspect is analyzed using its own ranking system, allowing for a tailored and detailed view of company-specific regulatory challenges and their strategic responses.
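A minimal sketch of that per-aspect ranking, assuming a company-level dataframe with one score column per metric (column names and values are illustrative, not the package's exact schema):

```python
import pandas as pd

# Illustrative company-level scores for one entity; one row per regulatory topic.
df = pd.DataFrame({
    "topic": ["AI transparency", "Antitrust", "Export controls"],
    "media_attention": [9, 6, 4],
    "risk_impact": [5, 8, 7],
    "uncertainty": [6, 5, 9],
})

# Each aspect gets its own ranking: pick the top-scoring topic per metric.
highlights = {
    "Most Reported Issue": df.loc[df["media_attention"].idxmax(), "topic"],
    "Biggest Risk": df.loc[df["risk_impact"].idxmax(), "topic"],
    "Most Uncertain Issue": df.loc[df["uncertainty"].idxmax(), "topic"],
}
print(highlights)
```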

Export the Results

Export the data as Excel files for further analysis or to share with the team.
try:
    # Create the Excel manager
    excel_manager = ExcelManager()

    # Define the dataframes and their sheet configurations
    df_args = [
        (df_by_company_with_responses, "Report Regulatory Issues by companies", (2, 3)),
        (df_by_theme, "Report Regulatory Issues by theme", (1, 1))
    ]

    # Save the workbook
    excel_manager.save_workbook(df_args, export_path)

except Exception as e:
    print(f"Warning while exporting to excel: {e}")

Conclusion

The Regulatory Issues Report Generator provides a comprehensive automated framework for analyzing regulatory risks and company mitigation strategies across the technology sector. By systematically combining advanced information retrieval with LLM-powered analysis, this workflow transforms unstructured regulatory information into structured, decision-ready intelligence. Through the automated analysis of regulatory challenges across multiple technology domains, you can:
  1. Analyze regulatory intensity - Compare regulatory scrutiny levels across different technology sectors (AI, Social Media, Hardware & Chips, E-commerce, Advertising) to identify compliance challenges
  2. Assess company-specific risk profiles - Compare how companies within your watchlist are exposed to different regulatory issues using quantitative scoring across Media Attention, Risk/Financial Impact, and Uncertainty dimensions
  3. Monitor proactive compliance strategies - Track how companies are responding to regulatory challenges through their filings, transcripts, and public communications, identifying best practices and strategic approaches
  4. Quantify regulatory uncertainty - Use the comprehensive scoring system's clear metrics to identify which regulatory issues pose the greatest ambiguity and unpredictability for strategic planning
  5. Generate sector-wide intelligence - Create comprehensive reports that inform regulatory strategy, compliance planning, and investment decisions across technology companies
  6. Analyze regulatory landscape for specific periods - Generate comprehensive snapshots of regulatory challenges and company responses for defined time periods, enabling informed risk assessment and strategic planning
From conducting regulatory due diligence to building compliance-focused investment strategies or assessing sector-wide regulatory risks, the Report Generator automates the research process while maintaining the depth and nuance required for professional regulatory intelligence. The standardized scoring methodology ensures consistent evaluation across companies, regulatory domains, and time periods, making it an invaluable tool for systematic regulatory risk assessment in the rapidly evolving technology sector.