Format Schema
This section provides an overview of the Bigdata Document Format (BDDF), which outlines the structure and requirements for submitting data in JSON format. The table below details the base nodes, field names, data types, and whether the fields are required or optional.
Schema
The schema version you are using. This is the first field in the document, and it identifies which version of the format the document follows.
The schema described on this page is version 1.2.
Document
A unique ID that identifies this document in your universe. If you don’t have one, please generate a random ID for each of your documents. We recommend a UUID (visit https://www.uuidgenerator.net/ for inspiration) or an alphanumeric string.
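If you have no document IDs of your own, a random UUID can be generated with Python's standard library. This is a minimal sketch; the helper name `new_document_id` is hypothetical and not part of the format.

```python
import uuid


def new_document_id() -> str:
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


doc_id = new_document_id()
print(doc_id)
```

Generate the ID once per document and store it on your side, so that later updates to the same document can refer back to it.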
Revision
This array contains information about a specific revision or update to the document.
The document chain identifier from your universe. If you publish updates to your documents and want to keep just the latest version, use this field (together with sequence_id) to indicate which chain the updated document belongs to. Typically, this is the ID of the first document in the chain.
The document sequence identifier within the chain. As you keep publishing updates, tell us their order; this helps us keep the latest version for you if, for example, we receive multiple updates in a short period of time. Typically, this is a simple integer that you increase with each update you publish.
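To illustrate why an increasing sequence number matters, the sketch below picks the latest revision per chain from updates that arrive out of order. It is an illustration only and does not describe how updates are actually resolved on the receiving side. The key `chain_id` and the function `latest_per_chain` are hypothetical names; only `sequence_id` appears in the text above.

```python
from typing import Dict, Iterable


def latest_per_chain(revisions: Iterable[dict]) -> Dict[str, dict]:
    """Keep, for each chain, the revision with the highest sequence_id."""
    latest: Dict[str, dict] = {}
    for rev in revisions:
        chain = rev["chain_id"]  # hypothetical key name
        current = latest.get(chain)
        if current is None or rev["sequence_id"] > current["sequence_id"]:
            latest[chain] = rev
    return latest


# Updates received out of order within a short period of time.
updates = [
    {"chain_id": "doc-001", "sequence_id": 2},
    {"chain_id": "doc-001", "sequence_id": 1},
    {"chain_id": "doc-002", "sequence_id": 1},
]
result = latest_per_chain(updates)
```

Because the comparison relies only on the integer order, arrival order does not affect which revision is kept.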
Source
This array contains source-related information.
The unique ID of the source in your universe. If you have a source taxonomy at your end, use that. If not, we suggest you think about how you’d like your sources to be organized (per topic, per geography, per website you scrape, etc.). Typically, this is a UUID or an alphanumeric string.
The official canonical name for the source. If you don’t have a source taxonomy at your end, we suggest you use something descriptive here. You won’t be able to change it later on.
Timestamps
This array contains the timestamps associated with the document, all in the UTC time zone.
If you maintain multiple timestamps related to the document in your universe, store the main one here. Otherwise, use the date the document was published to its intended audience (in UTC timezone).
Add the timestamp that marks the exact moment the document was first created (in UTC timezone).
Add the date and time when the document was last updated or modified (in UTC timezone).
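Since all timestamps are expected in UTC, local times need converting before submission. The sketch below does this with Python's standard library. The ISO 8601 output format and the helper name `to_utc_iso` are assumptions; this section does not state the exact string format required.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(dt: datetime) -> str:
    """Convert a timezone-aware datetime to a UTC ISO 8601 string."""
    if dt.tzinfo is None:
        # A naive datetime has no offset, so converting it would be a guess.
        raise ValueError("naive datetime; attach a timezone first")
    return dt.astimezone(timezone.utc).isoformat()


# 09:30 at UTC+2 corresponds to 07:30 UTC.
local = datetime(2024, 3, 1, 9, 30, tzinfo=timezone(timedelta(hours=2)))
utc_string = to_utc_iso(local)
print(utc_string)  # 2024-03-01T07:30:00+00:00
```

Rejecting naive datetimes avoids silently treating a local time as UTC.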
Metadata
This array contains additional information about the document.
If you know the main entity this document is about, store your ID for it here. It could be useful to us if we ever decide to map your IDs to ours in order to further enhance our entity detections. For example, the primary entity could be the company related to the transcript: if this document is about Microsoft’s earnings call and you have an internal ID for Microsoft in your universe, this is the place to store it.
The canonical name in your system that corresponds to the primary entity. For example, the primary entity could be the company related to the transcript.
Provide the document filename here. Typically, we’ve seen people building some logic based on the filenames so it’s a good practice to have it inside the document as well.
The document’s original URL, that is, the web address where the document was originally hosted or made available online.
Document media type.
Include the original language in which the document is written, following the ISO standard as listed here.
Add any copyright information you possess. This could include details about ownership, usage, and legal protection.
Codes
If you store any (standard) identifiers, such as ISIN, CUSIP, etc., store the type here (and provide the value in the field below).
Represents the value of the code associated with the document.
Add any additional metadata that doesn’t fit in other fields but describes the document, such as labels or custom fields that will enhance its searchability.
Content
Add the title/headline of the document.
Body
IANA media type of the content block (text/plain, application/html).
Add the text of the content block.
The role of the content block in the document (NORMAL, HEADING, FOOTER, HEADER).
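A content block combines the three pieces described above: a media type, the text, and a role. The sketch below builds one and rejects roles outside the four listed values. The JSON key names (`type`, `text`, `role`), the default values, and the helper `make_block` are assumptions for illustration, since this section does not spell out the field names.

```python
import json

# Roles listed in the description above.
ALLOWED_ROLES = {"NORMAL", "HEADING", "FOOTER", "HEADER"}


def make_block(text: str, role: str = "NORMAL", media_type: str = "text/plain") -> dict:
    """Build one content block, validating the role against the known values."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    # Key names are hypothetical; check them against the field table.
    return {"type": media_type, "text": text, "role": role}


body = [
    make_block("Quarterly Results", role="HEADING"),
    make_block("Revenue grew compared to the previous quarter."),
]
print(json.dumps(body, indent=2))
```

Validating the role before serializing catches typos such as "Heading" early, on your side.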
Section
The array of sections within the text.
If you detect sections within a document, specify which section this content block belongs to.
The array of parents in order of nesting for this section.
If you have any section metadata, put it here.
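If you track sections as a path of nested headings, the section and its parents can be derived from that path. The sketch below assumes parents are listed outermost first; the text above only says "in order of nesting", so check the direction before relying on it. The key names `name` and `parents` and the helper `split_section_path` are hypothetical.

```python
from typing import List, Tuple


def split_section_path(path: List[str]) -> Tuple[str, List[str]]:
    """Split a heading path into (innermost section, parents outermost first)."""
    if not path:
        raise ValueError("section path must not be empty")
    return path[-1], path[:-1]


section, parents = split_section_path(
    ["Annual Report", "Financial Statements", "Notes"]
)
# Key names are assumptions, not taken from the format.
section_entry = {"name": section, "parents": parents}
```

Here the content block belongs to "Notes", which is nested inside "Financial Statements", which is nested inside "Annual Report".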