This section provides an overview of the Bigdata Document Format (BDDF), which outlines the structure and requirements for submitting data in JSON format. The table below details the base nodes, field names, data types, and whether the fields are required or optional.

Schema

version
string

The schema version you are using. This is the first field in the document.
The schema described on this page is version 1.2.

Document

id
string
required

A unique ID that identifies this document in your universe. If you don’t have one, please generate a random one for each of your documents. We recommend a UUID (see https://www.uuidgenerator.net/ for examples) or an alphanumeric string.
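
If you have no document IDs of your own, a random UUID can be generated locally rather than through a website. The following is a minimal sketch using Python's standard `uuid` module; the helper name `new_document_id` is hypothetical and not part of the format.

```python
import uuid


def new_document_id() -> str:
    """Return a random (version 4) UUID as a string, usable as the document id."""
    return str(uuid.uuid4())


doc_id = new_document_id()
print(doc_id)  # a 36-character string; the value differs on every run
```

Generate the ID once per document and store it on your side, so that later updates to the same document can refer back to it.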

Revision

This array contains information about a specific revision or update to the document.

revision.chain_id
string
required

The document chain identifier from your universe. If you publish updates to your documents and want to keep just the latest version, use this field together with sequence_id to indicate which chain the updated document belongs to. Typically, this is the ID of the first document in the chain.

revision.sequence_id
string
required

The document sequence identifier within the chain. As you publish updates, tell us their order; this helps us keep the latest version for you if, for example, we receive multiple updates in a short period of time. Typically, this is a simple integer that you increase with each update you publish.
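
To illustrate how chain_id and sequence_id work together, the sketch below picks the latest revision per chain from a batch that arrived out of order. It only illustrates the idea and is not a description of how the receiving service is implemented. The function name `latest_per_chain` and the sample IDs are hypothetical, and sequence_id is assumed to be a string holding an integer.

```python
def latest_per_chain(revisions):
    """Keep, for each chain_id, the revision with the highest sequence_id."""
    latest = {}
    for rev in revisions:
        current = latest.get(rev["chain_id"])
        # sequence_id is typed as a string, so compare numerically, not lexically
        if current is None or int(rev["sequence_id"]) > int(current["sequence_id"]):
            latest[rev["chain_id"]] = rev
    return latest


# Three updates to one document arriving out of order, plus an unrelated document
revisions = [
    {"chain_id": "doc-001", "sequence_id": "1"},
    {"chain_id": "doc-001", "sequence_id": "3"},
    {"chain_id": "doc-001", "sequence_id": "2"},
    {"chain_id": "doc-002", "sequence_id": "1"},
]
latest = latest_per_chain(revisions)
print(latest["doc-001"]["sequence_id"])  # prints 3
```

Comparing the values as integers matters: as plain strings, "10" would sort before "9".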

Source

This array contains source-related information.

source.id
string
required

The unique ID of the source in your universe. If you have a source taxonomy on your end, use that. If not, we suggest you think about how you’d like your sources to be organized (per topic, per geography, per website you scrape, etc.). Typically, this is a UUID or an alphanumeric string.

source.name
string

The official canonical name for the source. If you don’t have a source taxonomy on your end, we suggest you use something descriptive here. You won’t be able to change it later on.

Timestamps

This array contains the timestamps associated with the document, all in the UTC time zone.

timestamps.published
timestamp
required

If you maintain multiple different timestamps related to the document in your universe, store the main one here. Otherwise use the date the document was published to its intended audience (in UTC timezone).

timestamps.created
timestamp

Add the timestamp that marks the exact moment the document was first created (in UTC timezone).

timestamps.last_modified
timestamp

Add the date and time when the document was last updated or modified (in UTC timezone).
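
Because all three timestamps are expected in UTC, local times need converting before submission. The sketch below shows one way to do that in Python. This page does not state the exact string format for the timestamp type, so ISO 8601 is assumed here; the helper name `to_utc_iso` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(dt: datetime) -> str:
    """Convert a timezone-aware datetime to a UTC ISO 8601 string (assumed format)."""
    if dt.tzinfo is None:
        # A naive datetime has no offset, so converting it would be a guess
        raise ValueError("naive datetime: attach a timezone before converting")
    return dt.astimezone(timezone.utc).isoformat()


# 09:30 at UTC-5 corresponds to 14:30 UTC
local_time = datetime(2024, 1, 15, 9, 30, tzinfo=timezone(timedelta(hours=-5)))
timestamps = {"published": to_utc_iso(local_time)}
print(timestamps["published"])  # prints 2024-01-15T14:30:00+00:00
```

Rejecting naive datetimes is a deliberate choice in this sketch: silently treating them as UTC would shift every timestamp by the local offset.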

Metadata

This array contains additional information about the document.

metadata.primary_entity_id
string

If you know the main entity this document is about, store your ID for it here. It could be useful to us if we ever decide to map your IDs to ours in order to further enhance our entity detections. For example, the primary entity could be the company related to the transcript: if this document is about Microsoft’s earnings call and you have an internal ID for Microsoft in your universe, this is the place to store it.

metadata.primary_entity_name
string

The canonical name in your system that corresponds to the primary entity. For example, the primary entity could be the company related to the transcript.

metadata.filename
string

Provide the document filename here. Typically, we’ve seen people building some logic based on the filenames so it’s a good practice to have it inside the document as well.

metadata.url
string

The document’s original URL, that is, the web address where the document was originally hosted or made available online.

metadata.media_type
string

Document media type.

metadata.language
string

Include the original language in which the document is written, following the ISO standard as listed here.

Add any copyright information you possess. This could include details about ownership, usage, and legal protection.

Codes

type
string

If you store any standard identifiers, such as ISIN or CUSIP, store the type here and provide the value in the field below.

value
string

Represents the value of the code associated with the document.
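
As an illustration of the type/value pairing, the sketch below turns a mapping of identifiers into a list of code entries. It assumes that codes are submitted as a list of objects, each with a type and a value, which this page does not state explicitly. The helper name `make_codes` is hypothetical and the identifier values are placeholders, not real securities.

```python
def make_codes(identifiers):
    """Turn {"ISIN": "..."} into a list of {"type": ..., "value": ...} entries."""
    return [{"type": code_type, "value": value} for code_type, value in identifiers.items()]


# Placeholder values for illustration only
codes = make_codes({"ISIN": "XX0000000000", "CUSIP": "000000000"})
print(codes)
```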

custom
string

Put any additional metadata here that doesn’t fit in other fields but describes the document, such as labels or custom fields that will enhance its searchability.

Content

content.title
string
required

Add the title/headline of the document.

Body

content.body.content_type
string
required

IANA media type of the content block (text/plain, application/html).

content.body.value
string
required

Add the text of the content block.

content.body.role
string

The role of the content block in the document (NORMAL, HEADING, FOOTER, HEADER).
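
The three body fields can be assembled into a content block as sketched below. The field names and role values come from the descriptions above. The helper name `make_block`, the default content type, and the choice to reject unknown roles are assumptions made for this example.

```python
# Role values as listed in the content.body.role description
ALLOWED_ROLES = {"NORMAL", "HEADING", "FOOTER", "HEADER"}


def make_block(value, content_type="text/plain", role=None):
    """Build one content block; role is optional and is omitted when not given."""
    block = {"content_type": content_type, "value": value}
    if role is not None:
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {role!r}")
        block["role"] = role
    return block


heading = make_block("Quarterly results", role="HEADING")
paragraph = make_block("Revenue grew compared with the previous quarter.")
print(heading)
print(paragraph)
```

Leaving out the optional role field entirely, rather than sending an empty value, keeps the block limited to what you actually know about it.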

Section

The array of sections within the text.

content.section.name
string
required

If you detect sections within a document, specify which section this content block belongs to.

content.section.parents
string

The array of parents in order of nesting for this section.

content.section.metadata
string

If you have any section metadata, put it here.
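
Before submitting, it can help to check that the fields marked as required are present. The sketch below checks a subset of them by dotted path. It assumes the dotted field names map to nested JSON objects, which is an inference from the naming on this page. The body and section fields are left out because their exact nesting is not specified here. The function name `missing_required` and all sample values are hypothetical.

```python
import json

# Required fields from this page whose nesting follows directly from their dotted names
REQUIRED_PATHS = [
    "id",
    "revision.chain_id",
    "revision.sequence_id",
    "source.id",
    "timestamps.published",
    "content.title",
]


def missing_required(document):
    """Return the dotted paths from REQUIRED_PATHS that are absent or empty."""
    missing = []
    for path in REQUIRED_PATHS:
        node = document
        for key in path.split("."):
            # Walk one level down; stop as soon as a level is missing
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node in (None, ""):
            missing.append(path)
    return missing


document = {
    "id": "doc-001",
    "revision": {"chain_id": "doc-001", "sequence_id": "1"},
    "source": {"id": "src-001", "name": "Example Source"},
    "timestamps": {"published": "2024-01-15T14:30:00+00:00"},
    "content": {"title": "Example headline"},
}
incomplete = {"id": "doc-002", "revision": {"chain_id": "doc-002"}}

print(json.dumps(document, indent=2))
print(missing_required(document))    # prints []
print(missing_required(incomplete))
```

A check like this only catches missing fields; it does not validate types or values, so it complements rather than replaces validation on the receiving side.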