Here at Bitstrapped, our machine learning team first discovered Airbyte in early 2021, while investigating options for large data migrations and integrations from Oracle, Postgres, and a wide variety of other data sources.
We regularly build custom Apache Spark and Dataflow (Apache Beam) pipelines for large projects, but we found ourselves manually building the tooling for migration resiliency, multi-source ingestion, retry logic, and pipeline monitoring.
When building architectures for data integration, the available automation tools are limited, and data transformation and change data capture require a tremendous amount of customization.
We have vast experience in the classic enterprise function of Extract, Transform, and Load (ETL) from data sources into a data warehouse. However, to help companies realize advanced data warehousing capabilities, we had to combine several Google Cloud technologies, including Dataflow, Composer, and BigQuery, and eventually an ELT tool.
In our research for a data integration platform for ELT, we stumbled onto Airbyte. At the time, it was a little-known data integration platform, but it had a new approach to ELT. First, it is an open-source product, so we could explore its capabilities freely; second, it focused on simplicity, aiming to launch data pipelines in minutes, not months.
So why does open-source ELT even matter? Airbyte's decision to open-source their platform means they could unlock data connectors for the masses. Data connectors, or ELT connectors, are the configuration components of the system that manage the connection between data sources and destinations. Today, source data can come from any system: internal databases, external data sources, SaaS products, and APIs. Building and maintaining connectors in a closed-source product is quite restrictive and expensive, because over time data source APIs change, schemas change, and database versions change.
The reality is that the only way to modernize ELT for the multitude of valuable data sources was to “commoditize” it. Airbyte says their aim is to “commoditize ELT.” In our words, the liberation of ELT is accelerated by liberating data source connector creation. Airbyte has created a framework, the Connector Development Kit (CDK), so users can build their own custom connectors. The CDK provides a standardized framework, so development teams can easily maintain their connectors.
Airbyte connectors can be selected from a list of over 100 pre-built connectors for data sources and destinations, or you can develop your own. Each Airbyte connector runs in its own Docker container, which at first glance may appear to be a subtle feature, but it carries a major architectural advantage: each connector is effectively a self-contained data migration program that you can monitor, refresh, and schedule independently.
Airbyte is architected as a CLI, with a web-based application interface built on top of it. The CLI is a powerful yet familiar framework for data streams. When you run a job, the input from the source and the output of data are structured as a standard message stream. If you are familiar with Unix-based systems, think stdin and stdout, which provide a proven, standard structure for data streams. This is how Airbyte creates a consistent data flow pipeline to sync your source and destination.
Here is a sampling of the many out-of-the-box Airbyte connectors:
We’ve built custom Airbyte connectors for Spotify, Salesforce, Postgres, and more, all in Go. Why did we choose Go? Simple: it is the language of choice for our cloud solutions. Instead of spending months building custom data migration jobs, we are now able to build connectors for complex data integrations within weeks. We won’t dive into the technical details, but if you would like to build a custom Airbyte connector, you can start here.
To implement an ELT tool in an enterprise, there are key features that make it a good long-term investment:
When it comes to ELT in the cloud, there is much debate about whether it should be extract, transform, and load (ETL) or extract, load, and transform (ELT). To find your answer, consider that it is less about which one and more about the nature of your data sets: How big is the data set? How frequently does data transfer? How quickly do you need the data available? How complex are your transformations?
Without diving deep into ETL vs. ELT analysis, the general rule is: for small, simpler data sets, use ETL to transform before storing data in the data warehouse, discarding unnecessary raw data; for larger, more complex data sets, apply ELT, sending raw data directly to the data warehouse. This works because it lets you load more data, from many sources, with little bottleneck, toward building a comprehensive data warehouse.
The choice for a modern data-driven business would be to go with ELT to handle numerous data sources, then transform the data in your data warehouse with a tool like DBT. Remember that your data sources range from internal business data to 3rd-party API data to unstructured data. If you had to write transformation code every time you integrated a new data source, you would never complete data integration, and your downstream systems would be left data-less. This is a mistake many organizations make, and it leads to more data silos and unreliable, disparate data sources.
Here are common questions our clients have asked when adopting Airbyte for their organizations, along with our answers:
What happens when a source schema updates, say a new table, index, or key is added? Airbyte allows you to manually decide whether or not to include the update in your migration job; it is up to your data engineering team to decide on the behavior. Some may seek tools that automatically capture schema changes, but this can also produce wild and undesirable outcomes.
In the short run, you can build a custom connector for your specific needs, maintain and update it as needed, and deploy it in your Airbyte instance. In the longer term, you can take advantage of community connectors as developers bring them online.
Airbyte is new: it is in alpha as of March 2022, with a beta launch planned for April 2022, so it is essentially a bet on the future, and there may be some hesitancy. However, what makes it compelling is its extensibility: it is not perfect, but you can customize it in ways most competitors do not allow. The community and customer base are growing fast, driven by direct feedback from customers.
Airbyte has support for CDC with Postgres, MySQL, and Microsoft SQL Server, with more on the way. Change data capture can be handled in many ways, so it is advisable to consider using technologies built into your data warehouse, like BigQuery, where you can configure advanced CDC to resolve data changes.
The decision to adopt Airbyte ELT across your organization will involve a few decision points. The best outcome we see is that you turn data source nightmares into a seamless collection of jobs that sync data into your data warehouse.
Airbyte solves the data ingestion and load challenges from any source, the EL part of ELT, with tools like DBT focused on data transformation, the T. In your organization, the strength of decision making, application quality, and operations are only as good as the quality of your data warehouse. Timely data integration and high availability will solve challenges for your analytics, software engineering, and data science teams all at once. Solving data ingestion, integration, and transformation is a high-stakes endeavor, so you should do it flexibly in the cloud with the right set of technologies.
We focus on building enterprise-ready Airbyte infrastructure to manage your data integration and your source and destination connectors, improving the productivity of the downstream data services within your organization. We also solve scale challenges with ELT, including Kubernetes deployment, Airflow as an orchestrator, and DBT for transformations.
If you choose Airbyte as your ELT tool, you will have to think about how you will manage it long term: how much upfront development you will invest in, such as custom connectors, data orchestration, and hosting. Whether you pick a managed solution or choose to self-host Airbyte in your cloud VPC, you will need to solve data transformation, scheduling, security, support, SLAs, and monitoring for your ELT jobs.
Bitstrapped is an Airbyte consulting partner with expertise in building enterprise custom connectors, managing massive data integrations and developing advanced data warehouse solutions for organizations.