title: “Format Schema”

This section provides an overview of the Bigdata Document Format (BDDF), which outlines the structure and requirements for submitting data in JSON format.

Schema

schema.version

string

required

The schema version that you are using. This is the first field in the document.
The schema described on this page is version 1.3.

Document

document.id

string

required

A unique ID that identifies this document in your universe. If you don’t have one, please generate a random one for each of your documents. We recommend a UUID (visit https://www.uuidgenerator.net/ for inspiration) or an alphanumeric string.
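If you need to generate IDs yourself, a minimal sketch using Python’s standard uuid module could look like the following. The helper name new_document_id is hypothetical and only used for illustration; any scheme that yields a unique string should work equally well.

```python
import uuid


def new_document_id() -> str:
    """Return a random (version 4) UUID as a string, for use as document.id."""
    return str(uuid.uuid4())


# Hypothetical usage: only the "id" field of the document object is shown.
document = {"id": new_document_id()}
print(document["id"])
```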

Document - Revision

If you update documents over time and store multiple versions of them, this is the place to share that information with us:

document.revision.chain_id

string

required

Typically the ID of the first document in the update chain. For all future updated versions of the same document, please provide the same chain_id so we know it’s an update, not a completely new document.

document.revision.sequence_id

string

required

The document sequence identifier within the chain. As you keep publishing updates, this field tells us which document version is newer: the higher the number, the newer the document. The most common schemes we’ve seen are a sequential auto-increment (sequence_id = 3, sequence_id = 4, sequence_id = 5…) and a Unix timestamp (sequence_id = 1720915200, sequence_id = 1721260800, sequence_id = 1721692800…). We don’t care which one you use, as long as you follow the rule: the higher the number, the newer the document!

Example:

"revision": {
    "chain_id": "CC1CB50246DB9E924D1390AE239B2A67",
    "sequence_id": "1"
}
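Because sequence_id is typed as a string, it can help to check on your side that your values really order the way you intend. The sketch below assumes that every sequence_id contains only digits and that "higher" means numerically higher; the helper name latest_revision is hypothetical and not part of the format.

```python
def latest_revision(revisions):
    """Return the revision with the numerically highest sequence_id.

    sequence_id is a string in the schema, so it is converted to int
    before comparing; comparing the raw strings would rank "9" above "10".
    """
    return max(revisions, key=lambda rev: int(rev["sequence_id"]))


# Three versions of the same document share one chain_id.
revisions = [
    {"chain_id": "CC1CB50246DB9E924D1390AE239B2A67", "sequence_id": "9"},
    {"chain_id": "CC1CB50246DB9E924D1390AE239B2A67", "sequence_id": "10"},
    {"chain_id": "CC1CB50246DB9E924D1390AE239B2A67", "sequence_id": "2"},
]
print(latest_revision(revisions)["sequence_id"])  # prints 10
```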

Document - Source

This array contains source-related information.

document.source.id

string

required

The unique ID of the source in your universe. If you have a source taxonomy on your end, use that. If not, we suggest you think about how you’d like your sources to be organized (per topic, per geography, per website you scrape, etc.). Typically, this is a UUID or an alphanumeric string.

document.source.name

string

required

The official canonical name for the source. If you don’t have a source taxonomy on your end, we suggest you use something descriptive here. You won’t be able to change it later on.

Document - Timestamps

This array contains information about the timestamps associated with the document in UTC timezone.

document.timestamps_utc.published

timestamp

required

If you maintain multiple different timestamps related to the document in your universe, store the main one here. Otherwise use the date the document was published to its intended audience (in UTC timezone).

document.timestamps_utc.created

timestamp

required

Add the timestamp that marks the exact moment the document was first created (in UTC timezone).
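As an illustration of converting a local time to UTC before filling these fields, here is a small sketch. The exact timestamp string format is not stated on this page, so the ISO 8601 output below is an assumption, and the helper name to_utc_string is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_string(moment: datetime) -> str:
    """Convert a timezone-aware datetime to UTC and format it as ISO 8601.

    ISO 8601 is an assumed format, not one confirmed by the schema.
    """
    if moment.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone before converting")
    return moment.astimezone(timezone.utc).isoformat()


# A document published at 02:00 local time in a UTC+2 timezone.
local_time = datetime(2024, 7, 14, 2, 0, 0, tzinfo=timezone(timedelta(hours=2)))
timestamps_utc = {
    "published": to_utc_string(local_time),
    "created": to_utc_string(local_time),
}
print(timestamps_utc["published"])  # prints 2024-07-14T00:00:00+00:00
```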

Document - Metadata

This object contains additional information about the document.

document.metadata.primary_entity_id

string

If you know the main entity this document is about, store your ID for it here. It could be useful to us if we ever decide to map your IDs to ours, in order to further enhance our entity detection. For example, the primary entity could be the company related to the transcript: if this document is about Microsoft’s earnings call and you have an internal ID for Microsoft in your universe, this is the place to store it.

document.metadata.primary_entity_name

string

The canonical name in your system that corresponds to the primary entity. For example, the primary entity could be the company related to the transcript.

document.metadata.filename

string

Provide the document filename here. Typically, we’ve seen people building some logic based on the filenames so it’s a good practice to have it inside the document as well.

document.metadata.url

string

The original URL of the document, that is, the web address where the document was originally hosted or made available online.

document.metadata.media_type

string

Document media type.

document.metadata.language

string

Include the original language in which the document is written, following ISO standard as listed here.

document.metadata.copyright

string

Add any copyright information you possess. This could include details about ownership, usage and legal protection.

document.metadata.document_type

string

If you identify document types, this is the place to provide your classification values in free-text form. Some examples of values we’ve seen here are: annual-report, interim-report, quarterly-report, earnings-release, shareholder-meetings-notice, earnings-call-transcript, earnings-call-slides, special-events-slides, sustainability-report, initial-registration-statement, podcast, audio-recording…

Document - Metadata - Reporting Period

document.metadata.reporting_period.fiscal_year

int

Fiscal year, for example: 2024

document.metadata.reporting_period.fiscal_period

enum

Fiscal period. This is an enum with the following allowed values: FY, H1, H2, Q1, Q2, Q3, Q4, M01, M02, M03, M04, M05, M06, M07, M08, M09, M10, M11, M12. If you leave this field empty, we’ll assume it’s FY.
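A quick client-side check of fiscal_period before upload might look like the sketch below. The allowed values and the FY default come from the description above; the helper name normalize_fiscal_period and the choice to raise ValueError are illustrative assumptions.

```python
# FY, H1, H2, Q1..Q4 and M01..M12, as listed for fiscal_period.
ALLOWED_FISCAL_PERIODS = (
    {"FY", "H1", "H2"}
    | {f"Q{n}" for n in range(1, 5)}
    | {f"M{n:02d}" for n in range(1, 13)}
)


def normalize_fiscal_period(value):
    """Return a valid fiscal period, treating an empty value as FY."""
    if value is None or value == "":
        return "FY"
    if value not in ALLOWED_FISCAL_PERIODS:
        raise ValueError(f"unsupported fiscal_period: {value!r}")
    return value


reporting_period = {
    "fiscal_year": 2024,
    "fiscal_period": normalize_fiscal_period("Q2"),
}
print(reporting_period)
```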

Document - Metadata - Codes

This is an array that helps our classification better understand what the document is about. We use it to hint our detections, so any standard identifier (such as a ticker or ISIN) can help us a lot.

document.metadata.codes.type

string

If you store any (standard) identifiers, such as ISIN, CUSIP etc., store the type here (and provide the value in the field below).

document.metadata.codes.value

string

Represents the value of the code associated with the document.
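One way to assemble the codes array from identifiers you already hold is sketched below. It assumes each array element is an object with a type and a value key, as the two field names suggest; the helper name make_codes, the spelling of the type labels and the sample identifiers are illustrative only.

```python
def make_codes(identifiers):
    """Turn a {type: value} mapping into a list of code objects."""
    return [{"type": code_type, "value": value}
            for code_type, value in identifiers.items()]


# Illustrative identifiers for a document about Microsoft.
codes = make_codes({"ticker": "MSFT", "ISIN": "US5949181045"})
print(codes)
```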

document.metadata.custom

string

Use this field for any additional metadata that doesn’t fit in the other fields but describes the document, such as labels or custom fields that will enhance its searchability.

Content

Both Title and Body objects follow the same block structure, with the same schema - don’t get confused if you see the same field names below ;)

Content - Title

content.title.content_type

string

required

IANA media type of the content block (text/plain, application/html).

content.title.value

string

required

Add the title of the document

content.title.role

string

The role of the content block in the document. For titles it’s typically HEADING

content.title.url

string

If you have any linked documents in the title or it contains graphical elements such as images, this is the place to put the URL

Content - Title - Section

The array of sections within the text. It doesn’t make much sense for a title, so feel free to skip this element.

content.title.section.name

string

required

If you detect sections within a document, specify which section this content block belongs to. For Title it can just be a simple Title

content.title.section.parents

string[]

The array of parents in order of nesting for this section. For Title it will most likely be empty

content.title.section.metadata

string

If you have any section metadata, put it here.

content.title.pages

string[]

Array of page numbers - doesn’t make much sense for Title so either put 1 or just skip this field completely

Content - Body

content.body.content_type

string

required

IANA media type of the content block (text/plain, application/html).

content.body.value

string

required

Add the text of the content block.

content.body.role

string

The role of the content block in the document (NORMAL, HEADING, FOOTER, HEADER…).

content.body.url

string

If you have images or other asset elements as a part of your documents, this is the place to put their URL

Content - Body - Section

If you detect sections and are able to provide them, then certain fields are mandatory (like content.body.section.name), otherwise feel free to skip the whole section.

content.body.section.name

string

required

If you detect sections within a document, specify which section this content block belongs to.

content.body.section.parents

string[]

The array of parents in order of nesting for this section.

content.body.section.metadata

string

If you have any section metadata, put it here.

content.body.pages

string[]

Array of page numbers. Typically it will be a single int indicating the page of the original document (for example, a PDF) where this content block (paragraph) is. If the paragraph spans pages, provide multiple page values in this array.
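Since Title and Body share the same block structure, a single helper can build both. The sketch below is only an illustration under stated assumptions: the helper name make_block is hypothetical, pages are emitted as strings because the field is typed string[], optional fields are simply omitted when not provided, and the sample text is made up.

```python
import json


def make_block(value, content_type="text/plain", role=None, pages=None):
    """Build one content block with the required fields and any optional ones."""
    block = {"content_type": content_type, "value": value}
    if role is not None:
        block["role"] = role
    if pages:
        # pages is typed string[] in the schema, so numbers are converted.
        block["pages"] = [str(page) for page in pages]
    return block


title = make_block("Example document title", role="HEADING")
paragraph = make_block(
    "Example paragraph that spans two pages.", role="NORMAL", pages=[3, 4]
)
print(json.dumps({"title": title, "paragraph": paragraph}, indent=4))
```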
