title: “Format Schema”
This section provides an overview of the Bigdata Document Format (BDDF), which outlines the structure and requirements for submitting data in JSON format.
Schema
The schema version that you are using. This is the first field in the document and identifies which version of the schema the document follows.
The schema described on this page is version 1.3
Document
A unique ID that identifies this document in your universe. If you don’t have one, we’d kindly ask for you to randomly generate one for each of your documents. We’d recommend a UUID (visit https://www.uuidgenerator.net/ for inspiration) or an alphanumeric string.
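If you need to generate an ID yourself, a random UUID is one straightforward option. The snippet below is a minimal sketch using Python's standard `uuid` module; the helper name `new_document_id` is hypothetical and not part of BDDF.

```python
import uuid


def new_document_id() -> str:
    """Return a random (version 4) UUID as a 36-character string."""
    return str(uuid.uuid4())


# Generate one ID per document and store it alongside the document.
doc_id = new_document_id()
print(doc_id)
```

Generate the ID once when a document first enters your system and reuse it afterwards, so the same document always carries the same ID.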
Document - Revision
If you update documents over time and store multiple versions of them, this is the place to share that information with us:
- chain_id: Typically the ID of the first document in the update chain. For all future updated versions of the same document, please provide the same chain_id so we know it’s an update, not a completely new document.
- sequence_id: The document sequence identifier within the chain. As you keep publishing updates, this field tells us which document version is newer: the higher the number, the newer the document. We’ve seen various ways to keep track of this, but the most common ones are a sequential auto-increment (sequence_id = 3, sequence_id = 4, sequence_id = 5…) and a Unix timestamp (sequence_id = 1720915200, sequence_id = 1721260800, sequence_id = 1721692800…). We don’t care what you use, as long as you follow the rule: the higher the number, the newer the document!
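The ordering rule can be illustrated with a short sketch. It assumes each revision is represented as a flat dictionary holding `chain_id` and `sequence_id`; that shape, the helper name `latest_revision`, and the `doc-001` value are assumptions made for illustration, and the real BDDF nesting may differ.

```python
from typing import Iterable


def latest_revision(revisions: Iterable[dict]) -> dict:
    """Return the revision with the highest sequence_id (the newest one)."""
    return max(revisions, key=lambda r: r["sequence_id"])


# Three versions of the same document: same chain_id, Unix-timestamp sequence_ids.
revisions = [
    {"chain_id": "doc-001", "sequence_id": 1720915200},
    {"chain_id": "doc-001", "sequence_id": 1721692800},
    {"chain_id": "doc-001", "sequence_id": 1721260800},
]

newest = latest_revision(revisions)
print(newest["sequence_id"])  # 1721692800
```

The order in which revisions arrive does not matter here; only the numeric comparison of `sequence_id` decides which version is newer.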
Document - Source
This array contains source-related information.
The unique ID of the source in your universe. If you have a source taxonomy at your end, use that. If not, we suggest you think about how you’d like your sources to be organized (per topic, per geography, per website you scrape, etc.). Typically, this is a UUID or an alphanumeric string.
The official canonical name for the source. If you don’t have a source taxonomy at your end, we suggest you use something descriptive here. You won’t be able to change it later on.
Document - Timestamps
This array contains information about the timestamps associated with the document in UTC timezone.
If you maintain multiple different timestamps related to the document in your universe, store the main one here.
Otherwise use the date the document was published to its intended audience (in UTC timezone).
Add the timestamp that marks the exact moment the document was first created (in UTC timezone).
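If your timestamps are stored in a local timezone, they need converting to UTC first. The following is a minimal sketch with Python's standard `datetime` module. This page does not state the exact string format BDDF expects, so the ISO 8601 output here is an assumption, and `to_utc_iso` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(dt: datetime) -> str:
    """Convert a timezone-aware datetime to a UTC ISO 8601 string."""
    if dt.tzinfo is None:
        # A naive datetime has no offset, so converting it would be a guess.
        raise ValueError("naive datetime: attach a timezone before converting")
    return dt.astimezone(timezone.utc).isoformat()


# A document published at 09:30 in a UTC+2 timezone.
local = datetime(2024, 7, 14, 9, 30, tzinfo=timezone(timedelta(hours=2)))
published_utc = to_utc_iso(local)
print(published_utc)  # 2024-07-14T07:30:00+00:00
```

Rejecting naive datetimes is a deliberate choice in this sketch: silently assuming an offset is a common source of timestamps that are off by a few hours.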
Document - Metadata
This object contains additional information about the document.
If you have information about the main entity this document is about, store your ID here. It could be useful to us if we ever decide to map your IDs to ours in order to further enhance our entity detections. For example, the primary entity could be the company related to the transcript: if this document is about Microsoft’s earnings call and you have an internal ID for Microsoft in your universe, this is the place to store it.
The canonical name in your system that corresponds to the primary entity. For example, the primary entity could be the company related to the transcript.
Provide the document filename here. Typically, we’ve seen people building some logic based on the filenames so it’s a good practice to have it inside the document as well.
The document’s original URL: the web address where the document was originally hosted or made available online.
Document media type.
Include the original language in which the document is written, following ISO standard as listed here.
Add any copyright information you possess. This could include details about ownership, usage and legal protection.
If you identify document types, this is the place to provide your classification values in free text form. Some of the examples of values we’ve seen here are:
annual-report, interim-report, quarterly-report, earnings-release, shareholder-meetings-notice, earnings-call-transcript, earnings-call-slides,
special-events-slides, sustainability-report, initial-registration-statement, podcast, audio-recording…
Document - Metadata - Reporting Period
Fiscal year, for example: 2024
Fiscal period. This should be an enum with the following allowed values: FY, H1, H2, Q1, Q2, Q3, Q4, M01, M02, M03, M04, M05, M06, M07, M08, M09, M10, M11, M12.
If you leave this field empty, we’ll assume it’s FY.
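The allowed values and the FY default can be checked before submission. This is a minimal sketch of such a check; `ALLOWED_PERIODS` and `normalize_fiscal_period` are hypothetical names and not part of BDDF.

```python
from typing import Optional

# FY, the two halves, the four quarters, and months M01..M12.
ALLOWED_PERIODS = {"FY", "H1", "H2", "Q1", "Q2", "Q3", "Q4"} | {
    f"M{month:02d}" for month in range(1, 13)
}


def normalize_fiscal_period(value: Optional[str] = None) -> str:
    """Return a valid fiscal period, treating an empty value as FY."""
    if not value:
        return "FY"  # mirrors the documented default for an empty field
    if value not in ALLOWED_PERIODS:
        raise ValueError(f"unsupported fiscal period: {value!r}")
    return value


print(normalize_fiscal_period("Q3"))  # Q3
print(normalize_fiscal_period())      # FY
```

Note that months are zero-padded, so `M1` is rejected while `M01` is accepted.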
Document - Metadata - Codes
This is an array that helps our classification better understand what the document is about. We use it to hint our detections, so any standard identifier (such as a ticker or ISIN) can help us a lot.
If you store any (standard) identifiers, such as ISIN, CUSIP etc., store the type here (and provide the value in the field below).
Represents the value of the code associated with the document.
Put here any additional metadata that doesn’t fit in other fields but describes the document such as labels or custom fields that will enhance its searchability.
Content
Both Title and Body objects follow the same block structure, with the same schema - don’t get confused if you see the same field names below ;)
Content - Title
IANA media type of the content block (text/plain, application/html).
Add the title of the document
The role of the content block in the document. For titles it’s typically HEADING
If you have any linked documents in the title or it contains graphical elements such as images, this is the place to put the URL
Content - Title - Section
The array of sections within the text. Doesn’t make much sense for title so feel free to skip this element.
If you detect sections within a document, specify which section this content block belongs to. For Title it can just be a simple Title.
The array of parents in order of nesting for this section. For Title it will most likely be empty
If you have any section metadata, put it here.
Array of page numbers - doesn’t make much sense for Title so either put 1 or just skip this field completely
Content - Body
IANA media type of the content block (text/plain, application/html).
Add the text of the content block.
The role of the content block in the document (NORMAL, HEADING, FOOTER, HEADER…).
If you have images or other asset elements as a part of your documents, this is the place to put their URL
Content - Body - Section
If you detect sections and are able to provide them, then certain fields are mandatory (like content.body.section.name), otherwise feel free to skip the whole section.
If you detect sections within a document, specify which section this content block belongs to.
The array of parents in order of nesting for this section.
If you have any section metadata, put it here.
Array of page numbers - typically it will be a single int indicating the page of the original document (for example PDF) where this content block (paragraph) is. In case the paragraph spans pages, provide multiple page values in this array.
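The "skip the section entirely, or provide its name" rule can be sketched as a small pre-submission check. The `name` key is taken from the `content.body.section.name` path mentioned above; representing the section as a dictionary and the helper name `check_section` are assumptions for illustration, and only the name requirement is checked because the other field names are not listed here.

```python
from typing import List, Optional


def check_section(section: Optional[dict]) -> List[str]:
    """Return a list of problems found in an optional section object."""
    if section is None:
        return []  # the whole section may be skipped

    problems = []
    name = section.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("section.name is required when a section is provided")
    return problems


print(check_section(None))                    # []
print(check_section({"name": "Risk Factors"}))  # []
print(check_section({}))  # ['section.name is required when a section is provided']
```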