To be ingested successfully, your data must be provided in a specific format: the Bigdata Document Format (BDDF). We’ll start with a simple example to get you oriented, then dive into the full format specification and what success looks like, including best practices for preparing your data to get the most out of Bigdata.com.
Our system is built on a JSON-based schema. Here’s a quick example. Don’t worry, we’ll walk through it step by step, so keep going.
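
A minimal sketch of such a document, built as a Python dictionary and serialized to JSON, might look like the following. Only `created`, `published`, `title`, `body`, `content_type`, `value`, `role`, `section`, the `HEADING` role, and version 1.3 come from this guide. The top-level key names and nesting (`version`, `document`, `id`, `source`, `timestamps`, `content`), the ISO 8601 timestamp encoding, the body block's `role` and `section` values, and all sample values are assumptions made for illustration, not the official schema.

```python
import json
from datetime import datetime, timezone


def build_minimal_bddf(doc_id, source_id, created, published, title, paragraphs):
    """Assemble a BDDF-style dict. Top-level keys and nesting are assumed, not official."""
    return {
        "version": "1.3",  # latest version named in this guide; the key name is an assumption
        "document": {
            "id": doc_id,  # your own document identifier, in whatever format you use
            "source": {"id": source_id},  # assumed shape for your source taxonomy
            "timestamps": {
                # ISO 8601 strings are an assumption about the expected encoding
                "created": created.isoformat(),
                "published": published.isoformat(),
            },
        },
        "content": {
            # exactly one title block
            "title": {
                "content_type": "text/plain",
                "value": title,
                "role": "HEADING",
                "section": "title",
            },
            # one body block per paragraph; role and section values are placeholders
            "body": [
                {"content_type": "text/plain", "value": p, "role": "BODY", "section": "body"}
                for p in paragraphs
            ],
        },
    }


doc = build_minimal_bddf(
    doc_id="article-000123",        # hypothetical ID
    source_id="example-news-site",  # hypothetical source ID
    created=datetime(2024, 5, 1, 9, 30, tzinfo=timezone.utc),
    published=datetime(2024, 5, 1, 9, 45, tzinfo=timezone.utc),
    title="Example headline",
    paragraphs=["First paragraph.", "Second paragraph."],
)
print(json.dumps(doc, indent=2))
```

Treat this as a way to see the moving parts in one place. The sections below explain each part, and the schema reference remains the authority on exact field names.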
Schema
We use standard software versioning conventions, following the major.minor format. The major version goes up when there are big changes that could break backward compatibility - like overhauled APIs or removed features. We don’t expect this to happen often... but hey, never say never. The minor version increases when we add new features in a way that doesn’t disrupt existing integrations. You can expect to see a lot of these - especially as we expand the format with fields tailored to different content types like transcripts, podcasts, newsletters, market research, filings, and more.
We’re working on an interactive editor that will let you select a version from a dropdown and view the corresponding schema. Until that’s ready, stick with 1.3, our latest version (fully detailed on these pages), and put that into your BDDF files.
Document
This section is all about document metadata - where you include key information that describes the document itself. Filling out this section properly can greatly improve your chances of success on Bigdata.com. We’ll dive deeper into why that matters in Step 6 - Dive deeper with BDDF schema; for now, focus on getting familiar with the basic structure and underlying concepts of the format. We’ll keep it simple and stick to the required fields. Here’s what we need from you:
ID
You probably already have a way to uniquely identify documents in your system - this is where that ID goes, whatever format it takes. Including it helps us connect your documents with ours if we ever need to debug, reprocess, or trace attribution. On our end, we use UUIDs. Here’s an example:
Source
Whether you’re a content aggregator, a web scraper, or a proprietary content creator, you likely have a system for organizing your content based on where it comes from - this is where your source taxonomy goes. Think site IDs, feed IDs, publication IDs, podcast show IDs - whatever identifiers you use to group content by source on your side. This is a critical piece of information that helps us uniquely identify your content as it moves through our pipelines. It also allows us to group it into meaningful packages. We’ll go deeper into this in Step 6 - Dive deeper with BDDF schema, but for now, just be sure to include the source info - these fields are mandatory in BDDF.
Timestamps
Please take extra care when mapping your timestamp fields to ours - getting this right is crucial for us to accurately interpret your content. It’s surprisingly easy to get this wrong: even when the field names match, the logic behind them might not (true story: we’ve run into this exact issue... twice!). Here’s the basic rule of thumb:
- the created timestamp should reflect the moment the content itself came into existence - it should stay constant over time
- the published timestamp should indicate when that specific version of the document became publicly available. If you update and republish, update this timestamp accordingly.
Here’s how that plays out in a few common scenarios:
- News article scraped from a website:
  - created = when the article first went live online
  - published = when your system scraped that version
  - If the article is updated later and you re-scrape, keep the original created value, but update published to reflect when the new version was captured.
- Earnings call transcript:
  - created = date/time when the call took place
  - published = when the transcript was generated
  - If you release a rough version right after the call and polish it later, created stays fixed, published gets updated.
- Podcast transcript:
  - created = when the episode aired
  - published = when the transcript was produced
  - Follow the same logic as with earnings call transcripts.
- SEC filing:
  - created = when it was originally filed with the SEC
  - published = when you processed the PDF and generated the structured document
  - If you later reprocess your archive (we know it happens), created remains the original filing date, published reflects the reprocessed version.
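
The rule of thumb above can be sketched as a small Python helper that produces the timestamps for a new version of a document. The helper name `republish`, the dictionary shape, and the sample dates are hypothetical. The check that published is not earlier than created is a sanity check assumed here, not a rule documented in BDDF.

```python
from datetime import datetime, timezone


def republish(timestamps, new_published):
    """Return timestamps for a new version of the same document.

    `created` is carried over unchanged; only `published` moves.
    """
    # Assumed sanity check: a version cannot be published before the content existed.
    if new_published < timestamps["created"]:
        raise ValueError("published cannot be earlier than created")
    return {"created": timestamps["created"], "published": new_published}


# News article scenario: first went live, scraped shortly after, re-scraped the next day.
first_version = {
    "created": datetime(2024, 3, 1, 8, 0, tzinfo=timezone.utc),     # article went live
    "published": datetime(2024, 3, 1, 8, 15, tzinfo=timezone.utc),  # first scrape
}
rescrape_time = datetime(2024, 3, 2, 10, 0, tzinfo=timezone.utc)
second_version = republish(first_version, rescrape_time)

print(second_version["created"].isoformat())
print(second_version["published"].isoformat())
```

The same helper covers the transcript and filing scenarios: whatever triggers the new version (a polished transcript, a reprocessed archive), created is copied over and published is replaced.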
Metadata
We’ve included a set of predefined metadata fields, but we’ve intentionally kept BDDF simple, so chances are we don’t have a dedicated field for every piece of metadata you use to describe your documents. That’s totally fine - don’t worry about this section for now, and feel free to skip it. We’re just including a quick example below to give you a glimpse of what we’ll cover later on in Step 6 - Dive deeper with BDDF schema.
Content
There are two main content block types where you should place the actual payload (the content itself): title and body. It probably goes without saying, but just to be clear: there should be only one title block, and as many body blocks as you need.
Title
This one is pretty straightforward - you need to specify the content_type (like text/plain, text/html, etc.) for the text you put in the value field. The role and section will most likely always be the same for the title (HEADING and title, respectively), but we keep these fields in the schema for consistency with the body blocks.
Body
This is where things get interesting - the body is an array of content blocks. For now, just think of splitting the document into paragraphs and providing each paragraph as a separate item in the body array. If you have extra information about sections, we’ll ask you to include that as well. For example, an earnings call transcript usually starts with a management discussion and then moves into a Q&A section. If you’re providing transcripts, letting us know which paragraphs belong to management discussion and which belong to Q&A really helps our analytics, since we can treat those parts differently. Bonus points if you can separate questions from answers or include speaker info in the metadata - this will significantly boost your chances of success! But don’t worry about that just yet. We’ll guide you on how to master document content and metadata later on in Step 6 - Dive deeper with BDDF schema.
Follow these basic steps to create your own BDDF file. Pay close attention to the source IDs, timestamps, and how you split the content - these are the key elements. Once you feel comfortable and have your first version ready, move on to the next step to validate it! :)
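
Before moving on, here is a minimal sketch of the paragraph-splitting step for the body, using the earnings call example. The helper name `to_body_blocks`, the `BODY` role, and the section labels `management_discussion` and `qa` are hypothetical placeholders, and the transcript text is invented sample data. Only the block fields `content_type`, `value`, `role`, and `section` come from this guide.

```python
import re


def to_body_blocks(text, section, role="BODY", content_type="text/plain"):
    """Split text on blank lines and wrap each paragraph as one body block."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        {"content_type": content_type, "value": p, "role": role, "section": section}
        for p in paragraphs
    ]


# Invented sample text for the two parts of an earnings call transcript.
management = "Thank you for joining us today.\n\nRevenue grew in the third quarter."
qa = "What is your outlook for next year?\n\nWe expect continued growth."

# Section labels are placeholders; use the values the schema reference defines.
body = to_body_blocks(management, "management_discussion") + to_body_blocks(qa, "qa")

for block in body:
    print(block["section"], "|", block["value"])
```

Splitting per section first and concatenating afterwards keeps the paragraph order intact while tagging every block with the part of the call it belongs to.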
Next: Step 5 - Validate your first BDDF file