This section provides an overview of the Bigdata Document Format
(BDDF), which outlines the structure and requirements for submitting data
in JSON format.
Schema
The schema version you are using. This is the first field in the document, and it tells us which version of the schema the document follows.
The schema described on this page is version 1.3.
Document
A unique ID that identifies this document in your universe. If you don’t have one, please generate a random one for each of your documents. We recommend a UUID (visit https://www.uuidgenerator.net/ for inspiration) or an alphanumeric string.
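If you have no identifier of your own, a random UUID can be produced with the Python standard library. This is a minimal sketch; the helper name `new_document_id` is hypothetical and not part of BDDF.

```python
import uuid


def new_document_id() -> str:
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


doc_id = new_document_id()
print(doc_id)
```

Generate the ID once per document and store it on your side, so that later updates can refer back to the same document.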
Document - Revision
This object contains information about a specific revision or update to the document.
document.revision.chain_id
The document chain identifier from your universe. If you publish updates to your documents and want to keep just the latest version, use this field together with sequence_id to indicate which chain the updated document belongs to. Typically, this is the ID of the first document in the chain.
document.revision.sequence_id
The document sequence identifier within the chain. As you keep publishing updates, tell us their order: that helps us keep the latest version for you if, for example, we receive multiple updates in a short period of time. Typically, this is a simple integer that you increase with each update you publish (1, 2, 3…), or a Unix timestamp (1720915200, 1721260800, 1721692800…).
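To illustrate how chain_id and sequence_id work together, here is a rough Python sketch that keeps only the highest sequence_id per chain. It is an illustration of the idea, not a description of how the platform resolves revisions internally. The function name `latest_revisions` and the sample values are hypothetical, and the nesting under `document` and `revision` is assumed from the dotted field names above.

```python
from typing import Dict, Iterable, List


def latest_revisions(documents: Iterable[dict]) -> List[dict]:
    """Keep only the document with the highest sequence_id in each chain."""
    latest: Dict[str, dict] = {}
    for doc in documents:
        revision = doc["document"]["revision"]
        chain = revision["chain_id"]
        current = latest.get(chain)
        if (
            current is None
            or revision["sequence_id"]
            > current["document"]["revision"]["sequence_id"]
        ):
            latest[chain] = doc
    return list(latest.values())


# Updates may arrive out of order; the sequence_id decides which one wins.
updates = [
    {"document": {"revision": {"chain_id": "doc-001", "sequence_id": 1}}},
    {"document": {"revision": {"chain_id": "doc-001", "sequence_id": 3}}},
    {"document": {"revision": {"chain_id": "doc-001", "sequence_id": 2}}},
    {"document": {"revision": {"chain_id": "doc-002", "sequence_id": 1720915200}}},
]
kept = latest_revisions(updates)
```

Because only the relative order within a chain matters in this sketch, plain counters and Unix timestamps both work, as long as you do not mix the two styles inside one chain.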
Document - Source
This array contains source-related information.
The unique ID of the source in your universe. If you have a source taxonomy at your end, use that. If not, we suggest you think about how you’d like your sources to be organized (per topic, per geography, per website you scrape, etc.). Typically, this is a UUID or an alphanumeric string.
The official canonical name for the source. If you don’t have a source taxonomy at your end, we suggest you use something descriptive here. You won’t be able to change it later on.
Document - Timestamps
This array contains information about the timestamps associated with the
document in UTC timezone.
document.timestamps_utc.published
If you maintain multiple different timestamps related to the document in your universe, store the main one here. Otherwise use the date the document was published to its intended audience (in UTC timezone).
document.timestamps_utc.created
Add the timestamp that marks the exact moment the document was first created (in UTC timezone).
document.timestamps_utc.last_modified
Add the date and time when the document was last updated or modified (in UTC timezone).
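The three timestamp fields above all expect UTC. The following Python sketch converts a timezone-aware local time to UTC before storing it. Note the assumptions: this page does not state the exact string format, so the ISO 8601 form with a trailing `Z` used here is a guess, and the helper name `to_utc_string` and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_string(moment: datetime) -> str:
    """Convert a timezone-aware datetime to an ISO 8601 string in UTC."""
    if moment.tzinfo is None:
        # A naive datetime carries no offset, so it cannot be converted safely.
        raise ValueError("naive datetime: attach a timezone before converting")
    # Output format is an assumption, not taken from the BDDF schema.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A document created at 02:00 local time in a UTC+2 timezone.
created_local = datetime(2024, 7, 14, 2, 0, 0, tzinfo=timezone(timedelta(hours=2)))
published_local = created_local + timedelta(hours=6)

timestamps_utc = {
    "published": to_utc_string(published_local),
    "created": to_utc_string(created_local),
    "last_modified": to_utc_string(published_local),
}
```

Rejecting naive datetimes is a deliberate choice in this sketch: silently treating a local time as UTC would shift every timestamp by the local offset.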
Document - Metadata
This object contains additional information about the document.
document.metadata.primary_entity_id
If you have the information about the main entity this document is about, store your ID here. It could be useful to us if we ever decide to do the mapping between your and our IDs, in order to further enhance our entity detections. For example, the primary entity could be the company related to the transcript - if this document is about Microsoft’s earnings call and you have an internal ID for Microsoft in your universe, this is the place to store it.
document.metadata.primary_entity_name
The canonical name in your system that corresponds to the primary entity. For example, the primary entity could be the company related to the transcript.
document.metadata.filename
Provide the document filename here. Typically, we’ve seen people building some logic based on the filenames so it’s a good practice to have it inside the document as well.
The original URL of the document, that is, the web address where the document was originally hosted or made available online.
document.metadata.media_type
Document media type.
document.metadata.language
Include the original language in which the document is written, following ISO standard as listed here.
document.metadata.copyright
Add any copyright information you possess. This could include details about ownership, usage and legal protection.
document.metadata.document_type
If you identify document types, this is the place to provide your classification values in free text form. Some of the examples of values we’ve seen here are:
annual-report, interim-report, quarterly-report, earnings-release, shareholder-meetings-notice, earnings-call-transcript, earnings-call-slides,
special-events-slides, sustainability-report, initial-registration-statement, podcast, audio-recording…
document.metadata.reporting_period.fiscal_year
Fiscal year, for example: 2024
document.metadata.reporting_period.fiscal_period
Fiscal period. This should be an enum with the following allowed values: FY, H1, H2, Q1, Q2, Q3, Q4, M01, M02, M03, M04, M05, M06, M07, M08, M09, M10, M11, M12. If you leave this field empty, we’ll assume it’s FY.
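The allowed values and the FY default can be checked before submission. A small Python sketch follows; the helper name `normalize_fiscal_period` is hypothetical, and storing fiscal_year as an integer is an assumption based on the example value 2024.

```python
# Allowed values as listed for fiscal_period: FY, H1, H2, Q1..Q4, M01..M12.
ALLOWED_FISCAL_PERIODS = (
    ["FY", "H1", "H2", "Q1", "Q2", "Q3", "Q4"]
    + [f"M{month:02d}" for month in range(1, 13)]
)


def normalize_fiscal_period(value=None) -> str:
    """Return a valid fiscal period, defaulting to FY when the value is empty."""
    if value is None or value == "":
        return "FY"
    if value not in ALLOWED_FISCAL_PERIODS:
        raise ValueError(f"unsupported fiscal period: {value!r}")
    return value


reporting_period = {
    "fiscal_year": 2024,  # integer type is an assumption
    "fiscal_period": normalize_fiscal_period("Q2"),
}
```

Note that monthly periods are zero-padded, so `M1` would be rejected by this check while `M01` passes.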
This is an array that helps our classification better understand what the document is about. We use it to hint our detections, so any standard identifier (such as a ticker or ISIN) can help us a lot.
document.metadata.codes.type
If you store any (standard) identifiers, such as ISIN, CUSIP etc., store the type here (and provide the value in the field below).
document.metadata.codes.value
Represents the value of the code associated with the document.
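As an illustration of the type and value pair, the Python sketch below builds a codes array for a document about Microsoft. Treat the details as assumptions: the exact spelling of the type labels (`ISIN`, `TICKER`) is not specified on this page, the helper `make_code` is hypothetical, and requiring both fields to be non-empty is a choice made for this example.

```python
import json


def make_code(code_type: str, value: str) -> dict:
    """Build one codes entry, rejecting an empty type or value."""
    code_type, value = code_type.strip(), value.strip()
    if not code_type or not value:
        raise ValueError("both type and value are required")
    return {"type": code_type, "value": value}


# Type labels are assumptions; use whatever identifiers you actually store.
codes = [
    make_code("ISIN", "US5949181045"),
    make_code("TICKER", "MSFT"),
]
print(json.dumps({"codes": codes}, indent=2))
```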
Put here any additional metadata that doesn’t fit in other fields but describes the document, such as labels or custom fields that will enhance its searchability.
Content
Both Title and Body objects follow the same block structure, with the same schema - don’t get confused if you see the same field names below ;)
Content - Title
content.title.content_type
IANA media type of the content block (text/plain, application/html).
Add the title of the document.
The role of the content block in the document. For titles it’s typically HEADING.
If the title links to other documents or contains graphical elements such as images, this is the place to put the URL.
Content - Title - Section
The array of sections within the text. It doesn’t make much sense for a title, so feel free to skip this element.
content.title.section.name
If you detect sections within a document, specify which section this content block belongs to. For Title it can just be a simple Title
content.title.section.parents
The array of parents in order of nesting for this section. For Title it will most likely be empty
content.title.section.metadata
If you have any section metadata, put it here.
Array of page numbers. It doesn’t make much sense for a title, so either put 1 or skip this field completely.
Content - Body
content.body.content_type
IANA media type of the content block (text/plain, application/html).
Add the text of the content block.
The role of the content block in the document (NORMAL, HEADING, FOOTER, HEADER…).
If you have images or other asset elements as a part of your documents, this is the place to put their URL
Content - Body - Section
If you detect sections and are able to provide them, then certain fields are mandatory (like content.body.section.name), otherwise feel free to skip the whole section.
content.body.section.name
If you detect sections within a document, specify which section this content block belongs to.
content.body.section.parents
The array of parents in order of nesting for this section.
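If your parser already tracks a heading path for each paragraph, the name and parents fields can be derived from it. A hedged Python sketch: the helper `section_from_path` and the sample headings are hypothetical, and listing parents from the outermost heading inward is an assumption about what “order of nesting” means.

```python
from typing import List


def section_from_path(path: List[str]) -> dict:
    """Split a heading path (outermost first) into section name and parents."""
    if not path:
        raise ValueError("path must contain at least the section's own name")
    return {"name": path[-1], "parents": path[:-1]}


# Heading path for a paragraph nested two levels deep.
section = section_from_path(["Annual Report", "Financial Statements", "Note 3"])
print(section)
```

A top-level section simply ends up with an empty parents array.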
content.body.section.metadata
If you have any section metadata, put it here.
Array of page numbers. Typically it will be a single integer indicating the page of the original document (for example, a PDF) where this content block (paragraph) is located. If the paragraph spans pages, provide multiple page values in this array.