Previous: Step 3 - Test FTP connection
Next: Step 5 - Validate your first BDDF file
In order for your data to be successfully ingested into our system, it must be provided in a specific format - the Bigdata Document Format (BDDF).
Here we’ll walk you through a simple example before diving into the full format specification and describing what success looks like:
the best practices for preparing your data to maximize results on Bigdata.com.
Our system relies on a JSON-based schema. Below is an example. Don’t worry, we’ll break it down step by step, so just keep reading:
{
"schema": {
"version": "1.1"
},
"document": {
"id": "CC1CB50246DB9E924D1390AE239B2A67",
"revision": {
"chain_id": "CC1CB50246DB9E924D1390AE239B2A67",
"sequence_id": "1"
},
"source": {
"id": "A94637",
"name": "Sports Central"
},
"timestamps_utc": {
"published": "2024-07-01 00:00:00",
"created": "2024-07-01 00:00:00",
"last_modified": "2025-03-17 11:00:00"
},
"metadata": {
"primary_entity_id": "F980BF"
}
},
"content": {
"title": {
"content_type": "text/plain",
"value": "Team A vs Team B: Moneyline, Spread Odds from Sportsbooks",
"role": "HEADING",
"section": {
"name": "title"
}
},
"body": [
{
"content_type": "application/html",
"value": "Team A will face Team B on [REDACTED DATE]. Based on current moneyline odds, Team A is favored to win.\n\nBookmaker1 offers Team A at -136. This represents one of the best moneyline odds for a wager on Team A. More sportsbooks are needed to provide a comprehensive comparison.\n\nTo bet on Team B, Bookmaker1 provides odds of +108. This is currently the only moneyline odd available for Team B. More sportsbooks are needed to compare odds.",
"role": "NORMAL",
"section": {
"name": "body",
"metadata": {
"data": "{\"data\": {\"top_tables_1\": {\"data\": [{\"col\": [{\"col_name\": \"Bookmaker\", \"type\": \"string\", \"value\": \"Bookmaker1\"}, {\"col_name\": \"Home Team Moneyline\", \"type\": \"string\", \"value\": \"-136\"}, {\"col_name\": \"Away Team Moneyline\", \"type\": \"string\", \"value\": \"108\"}, {\"col_name\": \"Home Team Spread Price\", \"type\": \"string\", \"value\": \"NA\"}, {\"col_name\": \"Away Team Spread Price\", \"type\": \"string\", \"value\": \"NA\"}, {\"col_name\": \"Home Team Point Spread\", \"type\": \"string\", \"value\": \"NA\"}, {\"col_name\": \"Away Team Point Spread\", \"type\": \"string\", \"value\": \"NA\"}]}], \"name\": \"Sports Match - Team A vs Team B [YYYY-MM-DD]\"}}}"
}
}
}
]
}
}
Let’s take a closer look, piece by piece.
Schema node
We follow standard software versioning practices here by adopting the major.minor format.
The major version is incremented when there are significant changes that may break backward compatibility, such as redesigned APIs or removed features. We don’t plan to do this, but never say never, right?
The minor version increases when new features are added in a backward-compatible way, meaning existing integrations will continue to work without modification. We plan to do this a lot and extend the format by adding fields specific to different content types: transcripts, podcasts, newsletters, market research, filings, etc.
We are building an interactive editor where you will be able to select the version from a dropdown and see the schema details. Until it is available, please use 1.2, the latest version (described in detail on these pages):
"schema": {
"version": "1.2"
}
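The major.minor rule above can be sketched in a few lines of Python. This is only an illustration of how a consumer might reason about versions, not how the Bigdata.com pipeline actually checks them; the function names `parse_version` and `is_compatible` are hypothetical.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Split a 'major.minor' string such as '1.2' into the integers (1, 2)."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(document_version: str, supported_version: str) -> bool:
    """Assumed rule: same major version, and a minor version that is not newer
    than the one the consumer supports."""
    doc_major, doc_minor = parse_version(document_version)
    sup_major, sup_minor = parse_version(supported_version)
    return doc_major == sup_major and doc_minor <= sup_minor
```

Comparing the parts as integers rather than as text matters once minor versions reach two digits: as strings, "1.10" would sort before "1.9".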
Document node
This section is basically document metadata - the place to store all relevant information describing the document itself.
Populating this section correctly significantly increases your chances of success on Bigdata.com - more details about this will come later, once you get to Step 6 - Understand what success looks like, but for now, just try to understand the basic structure and concepts behind the format.
Things we need from you:
ID
You surely have a way to uniquely identify documents in your universe - this is the place to put your ID, whatever it may be and whatever it looks like.
It will help us link our documents to yours in case we ever need to debug, reprocess or attribute them.
Internally, we use UUIDs, for example:
"id": "CC1CB50246DB9E924D1390AE239B2A67"
Revision
If you update documents over time and store multiple versions of them, this is the place to share that information with us:
- chain_id is typically the ID of the first document in the update chain - for all future updated versions of the same document,
please provide the same chain_id so we know it’s an update, not a completely new document.
- sequence_id tells us which document version is newer - the higher the number, the newer the document. We’ve seen various ways to
keep track of this, but the most common ones are a sequential auto-increment (sequence_id = 3, sequence_id = 4, sequence_id = 5…)
and a Unix timestamp (sequence_id = 1720915200, sequence_id = 1721260800, sequence_id = 1721692800…).
We don’t care what you use, as long as you follow the rule: the higher the number, the newer the document!
"revision": {
"chain_id": "CC1CB50246DB9E924D1390AE239B2A67",
"sequence_id": "1"
}
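To make the "higher number means newer document" rule concrete, here is a minimal Python sketch that picks the newest revision for each chain_id. It assumes sequence_id always holds a whole number, stored as a string as in the example above; the function name `latest_revisions` is hypothetical and this is not a description of the actual ingestion logic.

```python
def latest_revisions(documents: list[dict]) -> dict[str, dict]:
    """Return the newest document per chain_id, comparing sequence_id numerically."""
    newest: dict[str, dict] = {}
    for doc in documents:
        revision = doc["document"]["revision"]
        chain_id = revision["chain_id"]
        # sequence_id is a string in the example, so convert before comparing.
        sequence = int(revision["sequence_id"])
        current = newest.get(chain_id)
        if current is None or sequence > int(current["document"]["revision"]["sequence_id"]):
            newest[chain_id] = doc
    return newest
```

The conversion to an integer is the important design choice here: compared as text, "10" would be considered older than "9".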
Source
- Whether you are a content aggregator, a website scraper or a proprietary content creator, you probably have a way to organize your
content based on where it’s coming from - this is the place for your source taxonomy. Here you can put your site IDs, feed IDs,
publication IDs, podcast show IDs... whatever you use to organize your content into sources at your end.
- This is one of the key pieces of information that will uniquely identify your content as it goes through our pipelines - make
sure you understand it well in Step 6 - Understand what success looks like.
"source": {
"id": "A94637",
"name": "Sports Central"
}
Timestamps
We track several different timestamps and offer you the option to provide them accordingly. Please pay close attention to
mapping these fields to the timestamps you have, so we can correctly interpret your content.
As above, more details in Step 6 - Understand what success looks like.
"timestamps_utc": {
"published": "2024-07-01 00:00:00",
"created": "2024-07-01 00:00:00",
"last_modified": "2025-03-17 11:00:00"
}
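If your own timestamps carry a time zone, they need to be converted to UTC before they go into timestamps_utc. The sketch below shows one way to do that in Python. The "YYYY-MM-DD HH:MM:SS" layout is inferred from the example values above rather than taken from the format specification, and `to_bddf_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Layout inferred from the example values, e.g. "2024-07-01 00:00:00".
BDDF_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_bddf_timestamp(moment: datetime) -> str:
    """Convert a timezone-aware datetime to a UTC string in the example layout."""
    if moment.tzinfo is None:
        # A naive datetime has no time zone, so converting it would be a guess.
        raise ValueError("naive datetime: attach a timezone before converting")
    return moment.astimezone(timezone.utc).strftime(BDDF_TIMESTAMP_FORMAT)


# 02:00 at UTC+2 is midnight UTC on the same day.
local_time = datetime(2024, 7, 1, 2, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_bddf_timestamp(local_time))  # 2024-07-01 00:00:00
```

Rejecting naive datetimes is deliberate: silently assuming a time zone is a common source of timestamps that are off by a few hours.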
Metadata
We tried to keep BDDF simple, so most likely we do not have dedicated fields for all the metadata you have describing your
documents - this is the place to store anything and everything that might help us in our classification and analytics.
More details in Step 6 - Understand what success looks like.
If you are unsure whether to include something, feel free to reach out to us via email at data@bigdata.com, we’ll be happy to help!
"metadata": {
"primary_entity_id": "F980BF",
"language": "en",
"custom": [{
"key": "expert.id",
"value": "1009"
},
{"key": "expert.name",
"value": "John Doe"
},
{"key": "expert.title",
"value": "Blogger at bookworm.blog (June 2013 to Present)"
},
{"key": "category",
"value": "6"
},
{"key": "meta.keywords",
"value": "Product management, Book reviews"
},
{"key": "meta.description",
"value": "Product management book reviews, read more!"
},
{"key": "url",
"value": "bookworm.blog"
}]
}
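If your metadata lives in a flat dictionary, turning it into the key/value list shown above is mechanical. The following Python sketch assumes that every custom value should be a string, which is inferred from the example ("1009", "6") and not confirmed by the specification; `build_metadata` is a hypothetical helper name.

```python
def build_metadata(primary_entity_id: str, language: str, extra: dict[str, object]) -> dict:
    """Build a metadata node with a 'custom' list of {"key", "value"} pairs."""
    # Values are converted to strings to match the example; this is an assumption.
    custom = [{"key": key, "value": str(value)} for key, value in extra.items()]
    return {
        "primary_entity_id": primary_entity_id,
        "language": language,
        "custom": custom,
    }
```

Calling `build_metadata("F980BF", "en", {"expert.id": 1009, "category": 6})` yields two custom entries, with the numbers stored as "1009" and "6".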
Content node
There are two main content blocks in which you should put the payload (the document itself): title and body.
Title
This one is pretty straightforward - you need to define the content_type (plain text, HTML…) of what you put in value.
Role and section information will probably always be the same (as provided in the example), but we are keeping them for consistency with body:
"title": {
"content_type": "text/plain",
"value": "Team A vs Team B: Moneyline, Spread Odds from Sportsbooks",
"role": "HEADING",
"section": {
"name": "title"
}
}
Body
This is where things get interesting - body is an array of content blocks.
For now, think of splitting the document into paragraphs and providing each of them as a separate member of the body array.
If you have any additional information about the sections, please provide that information too. For example, earnings calls
start with the management discussion and, at some point, switch to the Q&A part. If you are a transcript provider, indicating whether the
paragraphs belong to the management discussion or the Q&A part helps us a lot in our analytics, as we can treat those two differently.
Bonus points if you can distinguish questions from answers or provide speaker information in metadata - your chances of success increase significantly!
More information on how to nail document content and metadata in Step 6 - Understand what success looks like.
"body": [
{
"content_type": "application/html",
"value": "Team A will face Team B on [REDACTED DATE]. Based on current moneyline odds, Team A is favored to win.\n\nBookmaker1 offers Team A at -136. This represents one of the best moneyline odds for a wager on Team A. More sportsbooks are needed to provide a comprehensive comparison.\n\nTo bet on Team B, Bookmaker1 provides odds of +108. This is currently the only moneyline odd available for Team B. More sportsbooks are needed to compare odds.",
"role": "NORMAL",
"section": {
"name": "body",
"metadata": {
"data": "{\"data\": {\"top_tables_1\": {\"data\": [{\"col\": [{\"col_name\": \"Bookmaker\", \"type\": \"string\", \"value\": \"Bookmaker1\"}, {\"col_name\": \"Home Team Moneyline\", \"type\": \"string\", \"value\": \"-136\"}, {\"col_name\": \"Away Team Moneyline\", \"type\": \"string\", \"value\": \"108\"}, {\"col_name\": \"Home Team Spread Price\", \"type\": \"string\", \"value\": \"NA\"}, {\"col_name\": \"Away Team Spread Price\", \"type\": \"string\", \"value\": \"NA\"}, {\"col_name\": \"Home Team Point Spread\", \"type\": \"string\", \"value\": \"NA\"}, {\"col_name\": \"Away Team Point Spread\", \"type\": \"string\", \"value\": \"NA\"}]}], \"name\": \"Sports Match - Team A vs Team B [YYYY-MM-DD]\"}}}"
}
}
}
]
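The paragraph-splitting idea described above could look roughly like this in Python. It is a sketch under assumptions: paragraphs are taken to be separated by blank lines, the text is treated as plain text, and the role "NORMAL" and section name "body" are simply copied from the example. The helper name `build_body` is hypothetical, and any other section names you pass in are your own choice rather than values defined by the format.

```python
import re


def build_body(text: str, section_name: str = "body") -> list[dict]:
    """Split text on blank lines and wrap each paragraph in a body content block."""
    paragraphs = [part.strip() for part in re.split(r"\n\s*\n", text)]
    return [
        {
            "content_type": "text/plain",  # assumes the text contains no markup
            "value": paragraph,
            "role": "NORMAL",
            "section": {"name": section_name},
        }
        for paragraph in paragraphs
        if paragraph  # skip empty fragments left by leading or trailing blank lines
    ]
```

A transcript provider could call the function once per part of the call, passing a different `section_name` for each, and concatenate the resulting lists into a single body array.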
Following these basic steps, try to create your own BDDF file. Pay attention to source IDs, timestamps and how you format the content - these are the key pieces.
When you feel you are getting the hang of it and have a first version, proceed to the next step and validate it :)
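Before moving on, a quick local sanity check can catch the most common slip in hand-written files: invalid JSON, such as a trailing comma or a missing quote. The Python sketch below is not the official validator from the next step. It only confirms that the text parses and that the three top-level nodes seen in the example (schema, document, content) are present; treating all three as required is an assumption, and `check_bddf_text` is a hypothetical name.

```python
import json

# Top-level nodes taken from the example on this page; assumed to be required.
REQUIRED_NODES = ("schema", "document", "content")


def check_bddf_text(text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as error:
        return [f"invalid JSON at line {error.lineno}, column {error.colno}: {error.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    return [f"missing top-level node: {name}" for name in REQUIRED_NODES if name not in data]
```

For example, `check_bddf_text('{"schema": {}}')` reports that the document and content nodes are missing, while a file with a trailing comma is reported as invalid JSON together with the line and column where parsing failed.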