Evaluation
During the evaluation stage, we aim to understand your data as thoroughly as possible. The more we know upfront, the better prepared we’ll be to ingest and work with your content effectively later on. Here are some of the things we are typically interested in (a filled-in example follows the list):
- Original Content Format: XML, JSON, PDF, HTML?
- Content Type: Text, Audio, Video?
- Format Consistency: Is the format consistent across the entire archive?
- Historical Coverage: How many years of history are available?
- Known Gaps: Are there any known missing periods or data?
- Archive Volume: Total size or document count in the archive.
- Data Anomalies: Any known spikes, dips, or anomalies in volume and why?
- Delivery Frequency: Real-time, Intraday, Daily, or Weekly?
- Expected Volume: Average number of documents expected per day (depending on the frequency)
- Metadata: What document metadata do you have available?
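As a purely illustrative example, the answers for a hypothetical provider might be summarized like this (every value below is made up):

```python
# A purely illustrative evaluation summary - all values are made up for this example.
archive_profile = {
    "original_format": "XML",
    "content_type": "text",
    "format_consistent": True,
    "history_years": 12,
    "known_gaps": ["2016-03 to 2016-05 missing after a CMS migration"],
    "archive_volume_documents": 4_500_000,
    "volume_anomalies": "spike in 2020 due to expanded coverage",
    "delivery_frequency": "daily",
    "expected_documents_per_day": 1500,
    "metadata_fields": ["title", "publication_date", "company_identifiers", "language"],
}
```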
Preparation
This part is a two-way street:
- You’ll be focused on preparing your documents - don’t worry, there’s plenty of guidance coming up to help you with that. At the end of this page you’ll find a link to our Quick Start Guide.
- On our side, we’ll be setting up the pipeline: provisioning infrastructure, creating internal IDs, and preparing an SFTP account for you. Most of this is fully automated, but we will make sure the setup is tailored to your specific content type. We recommend you start from the SFTP page, as we’ll need a few pieces of information to get started with the pipeline setup.
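Once your SFTP account has been created, a quick connectivity check is a good first step on your side. Here is a minimal sketch using Python and the paramiko library; the hostname and username are placeholders, and your real details will come from the SFTP setup:

```python
import paramiko

# Placeholder connection details - replace with the ones provided during SFTP setup.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="acme_provider", password="...")  # or key-based auth
sftp = paramiko.SFTPClient.from_transport(transport)

# If this prints the contents of your landing directory, the account is working.
print(sftp.listdir("."))

sftp.close()
transport.close()
```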
Validation
Now that you’re ready to prepare BDDF documents, we’d like to validate a few of them, so we’ll ask that you share a small sample of the actual content. The type of sample can vary depending on your content volume, for example:
- all documents you have related to a specific group of companies
- all content from the most recent month
- all documents of a certain document type (for example, all annual reports) for a given set of companies
- or whatever else makes sense for validating the format, the content, the coverage, etc. (a small selection sketch follows this list)
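For instance, if we settle on the company-based option above, assembling the sample can be a simple filtering step. A rough sketch, assuming one JSON file per document and a hypothetical "companies" metadata field (not the actual BDDF schema):

```python
import json
import shutil
from pathlib import Path

# Hypothetical layout: one JSON document per file, each carrying a
# "companies" metadata field (placeholder name, not the BDDF schema).
SAMPLE_COMPANIES = {"ACME Corp", "Globex"}
source_dir = Path("archive")
sample_dir = Path("validation_sample")
sample_dir.mkdir(exist_ok=True)

for path in source_dir.glob("*.json"):
    doc = json.loads(path.read_text(encoding="utf-8"))
    if SAMPLE_COMPANIES & set(doc.get("companies", [])):
        shutil.copy(path, sample_dir / path.name)
```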
Data load
Our preferred data delivery mechanism is an SFTP server - we recommend you start your journey on the SFTP page so we can get everything sorted and prepared for when you are ready to start uploading. We’re currently working on supporting bundled files - a single large JSON file containing multiple documents. Until that’s ready, please continue sending your documents individually.
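If your existing export produces one bundled JSON file, a small pre-processing step can split it into individual documents before upload. A minimal sketch, assuming a hypothetical bundle containing a JSON array and a placeholder "id" field (the actual BDDF document structure is covered in the Quick Start Guide):

```python
import json
from pathlib import Path

# Hypothetical: split a bundled JSON array of documents into one file per document,
# which is the delivery format we can process today.
bundle_path = Path("bundle.json")   # placeholder input file
output_dir = Path("to_upload")
output_dir.mkdir(exist_ok=True)

documents = json.loads(bundle_path.read_text(encoding="utf-8"))
for doc in documents:
    out_file = output_dir / f"{doc['id']}.json"   # "id" is a placeholder field name
    out_file.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
```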
Real-time Delivery
As you create documents, upload each one separately to the SFTP server, and our processing pipeline will automatically pick them up. Just a quick reminder: you’ve been provided with two separate SFTP usernames. Make sure to use the one designated for real-time delivery when sending real-time documents. The distinction is subtle, but important - it determines how we handle the files on our end. Real-time documents will be picked up and processed immediately, without overloading the system. Loading (large) historical archives, on the other hand, is something we like to do in a semi-manual and controlled manner.
And that’s it! Your documents will begin showing up on Bigdata.com, and you’ll be able to start interacting with them right away. To get the best possible results, though, we’ll also need your historical archive.
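A minimal real-time upload sketch in Python with paramiko; the hostname, username, and file name are placeholders - the important point is to connect with the username designated for real-time delivery:

```python
import paramiko

HOST = "sftp.example.com"           # placeholder hostname
REALTIME_USER = "acme_realtime"     # hypothetical real-time username from your SFTP setup

transport = paramiko.Transport((HOST, 22))
transport.connect(username=REALTIME_USER, password="...")  # or key-based auth
sftp = paramiko.SFTPClient.from_transport(transport)

# Upload each freshly created document as its own file, as soon as it is ready.
sftp.put("to_upload/doc-20240501-0001.json", "doc-20240501-0001.json")

sftp.close()
transport.close()
```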
Historical Data Delivery
Once real-time ingestion is activated and confirmed to be working smoothly, we can begin planning the onboarding of your historical archive. It’s important to note that while real-time ingestion doesn’t require perfect data to get started, onboarding the full historical archive is different. We want to get it right the first time. If there are any major improvements or blockers, we’d prefer to address them upfront, before ingestion begins. This is especially critical for large archives spanning several years or containing millions of documents, as re-processing such large volumes would be time-consuming for both sides. After the validation phase, we will have aligned on the key requirements that must be in place before we move forward with ingesting the archive. Once those are met, we can get started.
Historical Data Onboarding Process
- Archive Scope - as with the initial sample, we may begin with a subset of your archive. This could include a defined number of years or content for a specific set of companies. We’ll confirm the exact scope with you.
- Date Range Definition - we’ll provide you with the specific date range we’re looking to ingest.
- File Format Requirements - please upload each document as an individual file. At this time, we’re unable to process archives that are bundled together in a single file or compressed (e.g., ZIP files).
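Putting the three points together, a historical batch upload might look like the sketch below; the historical username, local directory layout, and date-in-filename convention are all assumptions made for the example:

```python
import paramiko
from pathlib import Path

HOST = "sftp.example.com"                    # placeholder hostname
HISTORICAL_USER = "acme_historical"          # hypothetical username for archive delivery
ARCHIVE_DIR = Path("archive")                # local folder with one JSON file per document
DATE_RANGE = ("2018-01-01", "2019-12-31")    # the range agreed for this batch

transport = paramiko.Transport((HOST, 22))
transport.connect(username=HISTORICAL_USER, password="...")
sftp = paramiko.SFTPClient.from_transport(transport)

for path in sorted(ARCHIVE_DIR.glob("*.json")):
    # Assumes file names start with an ISO date, e.g. "2018-03-14_report.json".
    doc_date = path.name[:10]
    if DATE_RANGE[0] <= doc_date <= DATE_RANGE[1]:
        # Each document is uploaded as its own file - no ZIPs, no bundles.
        sftp.put(str(path), path.name)

sftp.close()
transport.close()
```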
Launch 🚀
Once all content has been successfully onboarded, real-time data is flowing, and historical data is fully available, we move into the final stage: preparing for launch. This includes creating marketing materials, defining the scope of the data package, and getting everything ready for listing in the Bigdata Store. Once that’s in place, we’re all set to go live and make the package available to customers.
Now that you know what to expect, we are ready to get started. Head to the Quick Start Guide and follow our step-by-step guide towards your first upload. Here’s to plunder and no blunder, may ye find fair winds! ⛵

