This section provides an overview of the Bigdata Document Format (BDDF), which outlines the structure and requirements for submitting data in JSON format.

Schema

schema.version
string
required
The schema version that you are using. This is the first field in the document, and it identifies which version of the schema the rest of the document follows.
The schema described on this page is version 1.3.

Document

document.id
string
required
A unique ID that identifies this document in your universe. If you don’t have one, please generate a random one for each of your documents. We recommend a UUID (visit https://www.uuidgenerator.net/ for inspiration) or an alphanumeric string.
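As a minimal sketch of the random-ID recommendation above, the snippet below generates a version 4 UUID with Python's standard library. The helper name new_document_id and the nesting of id under a document key are assumptions made for illustration, inferred from the dotted field path.

```python
import uuid


def new_document_id() -> str:
    """Return a random (version 4) UUID as a string, usable as document.id."""
    return str(uuid.uuid4())


# Assumed nesting: the dotted path document.id maps to {"document": {"id": ...}}.
payload = {"document": {"id": new_document_id()}}
```

Any other scheme works as long as the resulting string is unique within your universe.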

Document - Revision

This object contains information about a specific revision or update to the document.
document.revision.chain_id
string
required
The document chain identifier from your universe. If you publish updates to your documents and want to keep just the latest version, use this field together with sequence_id to indicate which chain the updated document belongs to. Typically, this is the ID of the first document in the chain.
document.revision.sequence_id
string
required
The document sequence identifier within the chain. As you keep publishing updates, keep us informed about their order. That helps us keep the latest version for you if, for example, we receive multiple updates in a short period of time. Typically, this is a simple integer that you increase with each update you publish (1, 2, 3…), or a Unix timestamp (1720915200, 1721260800, 1721692800…).
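To illustrate why an increasing sequence_id matters, here is a small sketch that keeps only the latest revision per chain. It is not a description of how the platform resolves revisions internally. The function latest_revisions and the sample IDs are hypothetical. Because sequence_id is typed as a string, the sketch assumes the values are integer-like and compares them numerically (as plain strings, "10" would sort before "9").

```python
def latest_revisions(revisions):
    """Keep, for each chain_id, the revision with the highest numeric sequence_id."""
    latest = {}
    for rev in revisions:
        chain = rev["chain_id"]
        current = latest.get(chain)
        # Compare numerically: sequence_id arrives as a string.
        if current is None or int(rev["sequence_id"]) > int(current["sequence_id"]):
            latest[chain] = rev
    return latest


# Hypothetical revisions, deliberately out of order.
revisions = [
    {"chain_id": "doc-001", "sequence_id": "1"},
    {"chain_id": "doc-001", "sequence_id": "10"},
    {"chain_id": "doc-001", "sequence_id": "9"},
    {"chain_id": "doc-002", "sequence_id": "1720915200"},
]
result = latest_revisions(revisions)
```

Here result maps "doc-001" to the revision with sequence_id "10", even though it was not the last one received.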

Document - Source

This array contains source related information.
document.source.id
string
required
The unique ID of the source in your universe. If you have a source taxonomy at your end, use that. If not, we suggest you think about how you’d like your sources to be organized (per topic, per geography, per website you scrape, etc.). Typically, this is a UUID or an alphanumeric string.
document.source.name
string
required
The official canonical name for the source. If you don’t have a source taxonomy at your end, we suggest you use something descriptive here. You won’t be able to change it later on.

Document - Timestamps

This array contains information about the timestamps associated with the document in UTC timezone.
document.timestamps_utc.published
timestamp
required
If you maintain multiple different timestamps related to the document in your universe, store the main one here. Otherwise use the date the document was published to its intended audience (in UTC timezone).
document.timestamps_utc.created
timestamp
Add the timestamp that marks the exact moment the document was first created (in UTC timezone).
document.timestamps_utc.last_modified
timestamp
Add the date and time when the document was last updated or modified (in UTC timezone).
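The timestamp fields above are all expected in UTC. The sketch below converts a timezone-aware local time to UTC before storing it. The exact string format is not specified on this page, so the ISO 8601 form with a trailing Z used here is an assumption, and to_utc_string is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def to_utc_string(dt: datetime) -> str:
    """Convert an aware datetime to UTC and format it as ISO 8601 (assumed format)."""
    if dt.tzinfo is None:
        # A naive datetime has no offset, so converting it would be a guess.
        raise ValueError("naive datetime: attach a timezone first")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A document published at 02:00 local time in a UTC+2 zone.
local_published = datetime(2024, 7, 14, 2, 0, 0, tzinfo=timezone(timedelta(hours=2)))
timestamps_utc = {"published": to_utc_string(local_published)}
# timestamps_utc == {"published": "2024-07-14T00:00:00Z"}
```

Rejecting naive datetimes avoids silently treating local times as UTC.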

Document - Metadata

This object contains additional information about the document.
document.metadata.primary_entity_id
string
If you have the information about the main entity this document is about, store your ID here. It could be useful to us if we ever decide to do the mapping between your and our IDs, in order to further enhance our entity detections. For example, the primary entity could be the company related to the transcript - if this document is about Microsoft’s earnings call and you have an internal ID for Microsoft in your universe, this is the place to store it.
document.metadata.primary_entity_name
string
The canonical name in your system that corresponds to the primary entity. For example, the primary entity could be the company related to the transcript.
document.metadata.filename
string
Provide the document filename here. Typically, we’ve seen people building some logic based on the filenames so it’s a good practice to have it inside the document as well.
document.metadata.url
string
The document’s original URL, that is, the web address where the document was originally hosted or made available online.
document.metadata.media_type
string
Document media type.
document.metadata.language
string
Include the original language in which the document is written, following ISO standard as listed here.
Add any copyright information you possess. This could include details about ownership, usage and legal protection.
document.metadata.document_type
string
If you identify document types, this is the place to provide your classification values in free text form. Some of the examples of values we’ve seen here are: annual-report, interim-report, quarterly-report, earnings-release, shareholder-meetings-notice, earnings-call-transcript, earnings-call-slides, special-events-slides, sustainability-report, initial-registration-statement, podcast, audio-recording…

Document - Metadata - Reporting Period

document.metadata.reporting_period.fiscal_year
int
Fiscal year, for example: 2024
document.metadata.reporting_period.fiscal_period
enum
Fiscal period - should be an enum with the following values allowed: FY, H1, H2, Q1, Q2, Q3, Q4, M01, M02, M03, M04, M05, M06, M07, M08, M09, M10, M11, M12.
If you leave this field empty, we’ll assume it’s FY.
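The allowed values and the FY default can be checked on your side before submitting. The following is a sketch only: reporting_period is a hypothetical helper, and applying the default client-side is a design choice rather than something the format requires.

```python
# Allowed fiscal_period values as listed above: FY, H1, H2, Q1-Q4, M01-M12.
ALLOWED_PERIODS = (
    ("FY", "H1", "H2", "Q1", "Q2", "Q3", "Q4")
    + tuple(f"M{month:02d}" for month in range(1, 13))
)


def reporting_period(fiscal_year, fiscal_period=None):
    """Build a reporting_period object, defaulting an empty period to FY."""
    period = fiscal_period or "FY"
    if period not in ALLOWED_PERIODS:
        raise ValueError(f"unsupported fiscal_period: {period!r}")
    return {"fiscal_year": int(fiscal_year), "fiscal_period": period}


annual = reporting_period(2024)           # period omitted, so FY is assumed
quarterly = reporting_period(2024, "Q3")
```

Note that monthly values are zero-padded, so "M9" is rejected while "M09" is accepted.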

Document - Metadata - Codes

This is an array that helps our classification better understand what the document is about. We use it as a hint for our detections, so any standard identifier (such as a ticker or ISIN) can help us a lot.
document.metadata.codes.type
string
If you store any (standard) identifiers, such as ISIN, CUSIP etc., store the type here (and provide the value in the field below).
document.metadata.codes.value
string
Represents the value of the code associated with the document.
document.metadata.custom
string
Put here any additional metadata that doesn’t fit in other fields but describes the document such as labels or custom fields that will enhance its searchability.
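Since codes is an array of type/value pairs, a document about one company can carry several identifiers at once. The sketch below builds such an array from a plain mapping. build_codes is a hypothetical helper, and the type labels and identifier values are illustrative only, not a list of accepted types.

```python
def build_codes(identifiers):
    """Turn a {type: value} mapping into a list of {"type": ..., "value": ...} entries."""
    return [{"type": code_type, "value": value} for code_type, value in identifiers.items()]


# Illustrative identifiers for the Microsoft example used earlier on this page.
codes = build_codes({"ISIN": "US5949181045", "ticker": "MSFT"})
# codes == [{"type": "ISIN", "value": "US5949181045"},
#           {"type": "ticker", "value": "MSFT"}]
```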

Content

Both Title and Body objects follow the same block structure, with the same schema - don’t get confused if you see the same field names below ;)

Content - Title

content.title.content_type
string
required
IANA media type of the content block (text/plain, application/html).
content.title.value
string
required
Add the title of the document
content.title.role
string
The role of the content block in the document. For titles it’s typically HEADING
content.title.url
string
If the title links to other documents or contains graphical elements such as images, this is the place to put the URL.

Content - Title - Section

The array of sections within the text. Doesn’t make much sense for title so feel free to skip this element
content.title.section.name
string
required
If you detect sections within a document, specify which section this content block belongs to. For Title it can just be a simple Title
content.title.section.parents
string[]
The array of parents in order of nesting for this section. For Title it will most likely be empty
content.title.section.metadata
string
If you have any section metadata, put it here.
content.title.pages
string[]
Array of page numbers - doesn’t make much sense for Title so either put 1 or just skip this field completely

Content - Body

content.body.content_type
string
required
IANA media type of the content block (text/plain, application/html).
content.body.value
string
required
Add the text of the content block.
content.body.role
string
The role of the content block in the document (NORMAL, HEADING, FOOTER, HEADER…).
content.body.url
string
If you have images or other asset elements as a part of your documents, this is the place to put their URL

Content - Body - Section

If you detect sections and are able to provide them, then certain fields are mandatory (like content.body.section.name), otherwise feel free to skip the whole section.
content.body.section.name
string
required
If you detect sections within a document, specify which section this content block belongs to.
content.body.section.parents
string[]
The array of parents in order of nesting for this section.
content.body.section.metadata
string
If you have any section metadata, put it here.
content.body.pages
string[]
Array of page numbers - typically it will be a single int indicating the page of the original document (for example PDF) where this content block (paragraph) is. In case the paragraph spans pages, provide multiple page values in this array.
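Putting the body fields together, one content block could be assembled as sketched below. content_block is a hypothetical helper and the sample text and section names are placeholders. Two assumptions are made: the section is modeled as a single nested object (inferred from the dotted field paths), and page numbers are stored as strings because the field is typed string[].

```python
def content_block(value, content_type="text/plain", role=None,
                  section_name=None, parents=None, pages=None):
    """Build one content block, leaving out optional fields that were not given."""
    block = {"content_type": content_type, "value": value}
    if role is not None:
        block["role"] = role
    if section_name is not None:
        # section.name is required whenever a section is provided at all.
        section = {"name": section_name}
        if parents:
            section["parents"] = list(parents)
        block["section"] = section
    if pages:
        # pages is typed string[], so numeric page numbers are converted.
        block["pages"] = [str(page) for page in pages]
    return block


# A paragraph that starts on page 12 and continues on page 13.
paragraph = content_block(
    "Placeholder paragraph text.",
    role="NORMAL",
    section_name="Financial Highlights",
    parents=["Annual Report"],
    pages=[12, 13],
)
```

Omitting section_name skips the whole section object, which matches the guidance that sections are optional but name is mandatory once a section is present.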