
Airbyte: The modern ELT data pipeline

By
Bitstrapped
Updated
March 21, 2022


Here at Bitstrapped, our machine learning team first discovered Airbyte in early 2021 while investigating options for large data migrations and integrations from Oracle, Postgres, and a wide variety of other data sources.

We regularly build custom Apache Spark and Dataflow (Apache Beam) pipelines for large projects, but we found ourselves manually building the tooling for migration resiliency, multi-source support, retry logic, and pipeline monitoring.

When building architectures for data integration, the available automation tools are limited, and data transformation and change data capture require a tremendous amount of customization.

ETL experience

We have vast experience in the classic enterprise function of Extract, Transform, and Load (ETL) from data sources into a data warehouse. However, to help companies realize advanced data warehousing capabilities, we had to combine several technologies on Google Cloud, including Dataflow, Composer, and BigQuery, and eventually an ELT tool.

In our research to find a data integration platform for ELT, we stumbled onto Airbyte. At the time, it was a little-known data integration platform, but it had a new approach to ELT. First, it is an open-source product, so we could explore its capabilities freely; second, it was focused on simplicity, aiming to launch data pipelines in minutes, not months.

Why open-source ELT matters

So why does open-source ELT even matter? Well, the decision by Airbyte to open-source their platform means they could unlock data connectors for the masses. Data connectors, or ELT connectors, are the configuration component of the system that manages the connection between data sources and destinations. Today, source data can come from any system: internal databases, external data sources, SaaS products, and APIs. Building and maintaining connectors in a closed-source product is quite restrictive and expensive, because over time data source APIs change, schemas change, and database versions change.

The reality is that the only way to modernize ELT for the multitude of valuable data sources was to “commoditize” it. Airbyte says their aim is to “commoditize ELT”. In our words, the liberation of ELT is accelerated by liberating data source connector creation. Airbyte has created a framework, the Connector Development Kit (CDK), so users can build their own custom connectors. The CDK is a standardized framework, so development teams can easily maintain their connectors.

How Airbyte works

Airbyte connectors can be selected from a list of over 100 pre-built connectors for data sources and destinations, or you can develop your own. An Airbyte connector runs in its own Docker container, which at first glance may appear to be a subtle feature; however, there is a major architectural advantage here. Each connector is an individual container and effectively a self-contained data migration program that you can monitor, refresh, and schedule.

You can write a source connector in any language you want or take advantage of Airbyte's Connector Development Kit (CDK) in Python, C#/.NET, or TypeScript/JavaScript. At Bitstrapped we have developed an Airbyte Golang CDK, which you can find on our GitHub, for building your Go connectors. Airbyte’s CDK framework will generate 75% of the code required for you to write source connectors, and you can customize things like multi-threading, reusable functions, and connection details. The implementation is standardized so you can quickly write connectors for HTTP APIs, databases, and other custom sources. Once you have completed development, there is a code generator to package your connector and run the test suite.
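To make the connector interface concrete, every Airbyte connector implements a small set of standard commands: spec (describe its configuration), check (verify connectivity), discover (list available streams), and read (sync data). Here is a minimal stdlib-only sketch of that command dispatch; the handler bodies and field values are hypothetical stubs, not Airbyte's implementation, but the command names and message types follow the Airbyte protocol.

```python
import json
import sys

def spec():
    """Describe the configuration this connector expects."""
    return {"type": "SPEC", "spec": {"required": ["api_key"]}}

def check(config):
    """Verify the connector can reach the source with this config."""
    status = "SUCCEEDED" if config.get("api_key") else "FAILED"
    return {"type": "CONNECTION_STATUS", "connectionStatus": {"status": status}}

def discover(config):
    """List the streams (tables or endpoints) the source exposes."""
    return {"type": "CATALOG", "catalog": {"streams": [{"name": "users"}]}}

def main(argv):
    # Default to "spec" when invoked with no arguments.
    command = argv[0] if argv else "spec"
    config = json.loads(argv[1]) if len(argv) > 1 else {}
    handlers = {
        "spec": lambda: spec(),
        "check": lambda: check(config),
        "discover": lambda: discover(config),
    }
    # Connectors write protocol messages as JSON lines on stdout.
    sys.stdout.write(json.dumps(handlers[command]()) + "\n")

if __name__ == "__main__":
    main(sys.argv[1:])
```

Because the interface is just commands in and JSON messages out, the same connector can be packaged in a container and driven by Airbyte's scheduler without any language-specific coupling.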

Airbyte is architected as a CLI, with a web-based application interface built on top of it. The CLI is a powerful yet familiar framework for data streams. When you run a job, the input from the source and the output of data are structured as a standard messaging stream. If you are familiar with Unix-based systems, this is stdin and stdout, which provides a proven, standard structure for data streams. This is how Airbyte creates a consistent data flow pipeline to sync your source and destination.
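To make the stdin/stdout stream concrete, here is a minimal sketch of a source emitting its sync output as JSON lines on stdout. The RECORD and STATE message shapes follow the Airbyte protocol; the stream name, rows, and cursor logic are hypothetical stand-ins.

```python
import json
import sys
import time

def record_message(stream, data):
    """Wrap one row of source data in an Airbyte-style RECORD message."""
    return {
        "type": "RECORD",
        "record": {
            "stream": stream,
            "data": data,
            "emitted_at": int(time.time() * 1000),  # epoch milliseconds
        },
    }

def state_message(state):
    """Checkpoint message so an interrupted sync can resume."""
    return {"type": "STATE", "state": {"data": state}}

def read(rows):
    """Emit each row, then a state checkpoint, as JSON lines on stdout."""
    for row in rows:
        sys.stdout.write(json.dumps(record_message("users", row)) + "\n")
    sys.stdout.write(json.dumps(state_message({"cursor": len(rows)})) + "\n")

if __name__ == "__main__":
    read([{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}])
```

A destination connector consumes exactly this stream on its stdin, which is what lets Airbyte pair any source with any destination.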

Airbyte Connectors (out of the box)

Here is a partial list of the out-of-the-box Airbyte connectors:


  • 3PL Central
  • Airtable
  • Amazon SQS
  • Amazon Seller Partner
  • Amazon Ads 
  • Amplitude 
  • Apify 
  • App Store 
  • Asana 
  • AWS CloudTrail
  • Azure Blob Storage
  • Bamboo HR 
  • Bing Ads
  • BigCommerce 
  • BigQuery
  • Braintree
  • Cart.com
  • Chargebee
  • Chargify
  • Chartmogul
  • ClickHouse
  • CockroachDB
  • CSV File
  • Confluence
  • Dixa
  • Drift
  • Drupal
  • DynamoDB
  • Exchange Rates API
  • Facebook Marketing
  • Facebook Pages
  • Files
  • Flexport
  • Freshdesk
  • Freshsales
  • Freshservice
  • Elasticsearch
  • Excel File
  • Feather File
  • Google Ads
  • Google Analytics
  • Google Cloud Storage
  • Google Directory
  • Google Pubsub
  • Google Search Console
  • Google Sheets
  • Google Workspace Admin Reports
  • Greenhouse
  • Harvest
  • Hubspot
  • IBM Db2
  • Instagram
  • Intercom
  • Iterable
  • JSON File
  • Jira
  • Kafka
  • Keen
  • Klaviyo
  • Kustomer
  • Looker
  • Magento
  • Mailchimp
  • Marketo
  • MeiliSearch
  • Microsoft Dynamics Customer Engagement
  • Microsoft Dynamics GP
  • Microsoft Dynamics NAV
  • Microsoft Teams
  • Mixpanel
  • MongoDB
  • MSSQL
  • MySQL
  • Okta
  • Oracle DB
  • Postgres
  • Redshift
  • Parquet File
  • Paypal Transaction
  • Pipedrive
  • Plaid
  • Posthog
  • Quickbooks
  • Recharge
  • SendGrid
  • Shopify
  • Short.io
  • Slack
  • Smartsheets
  • Snapchat Marketing
  • S3
  • Salesforce
  • Snowflake
  • SurveyMonkey
  • Tempo
  • Trello
  • Twilio
  • Typeform
  • US Census
  • WooCommerce
  • WordPress
  • Zencart
  • Zendesk Chat
  • Zendesk Sunshine
  • Zoom
  • Zuora

Writing custom Airbyte Connectors in any language

We’ve built custom Airbyte connectors for Spotify, Salesforce, Postgres, and more, all in Go. Why did we choose Go? It was simple: that was the flavor of choice for our cloud solutions. Instead of spending months building custom data migration jobs, we are now able to build connectors for complex data integrations within weeks. We won’t dive into the technical details, but if you would like to build a custom Airbyte connector, you can start here.

Airbyte for enterprise data integration

To implement an ELT tool in an enterprise, there are key features that make it a good long-term investment:

  1. Security - Ability to run your ELT in a safe enterprise environment, your own network or VPC in the cloud to protect data privacy. Airbyte is open-source and can run in your VPC on a VM, Kubernetes or managed instance. 
  2. ETL Vendor Strength - Strong documentation, a strong developer community, GitHub commits, and funding. At the time of writing, Airbyte has recently raised $150M in Series B financing and has over 1,000 customers and over 100 connectors.
  3. Reliable Architecture - If you are going to entrust your mission-critical data integration jobs to a tool, you want to ensure the architecture is reliable. Airbyte has designed a container-based framework for the interface between data source and data connection, along with the standard stdin/stdout data stream used in Unix-based systems.
  4. Cost - If you are going to move thousands to millions of rows of data between sources, you need to manage your costs. Installing Airbyte within your cloud environment means you can manage ingress costs, CPU and storage much more efficiently.

ELT vs. ETL

When it comes to ELT in the cloud, there is a lot of debate about whether it should be extract, transform, and load (ETL) or extract, load, and transform (ELT). To find your answer, consider that it is less about which one and more about the nature of your data sets. How big is the data set? How frequently does data transfer? How quickly do you need the data available? How complex are your transformations?

Without diving deep into an analysis of ETL vs. ELT, the general rule is: for small and simpler data sets, you use ETL to transform before storing data in the data warehouse, discarding unnecessary raw data; for larger and more complex data sets, you apply ELT, sending raw data directly to the data warehouse. This works because it allows you to load more data, from many sources, with few bottlenecks in building a comprehensive data warehouse.
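The distinction can be sketched in a few lines. In this toy comparison (hypothetical table and column names, with an in-memory SQLite database standing in for the warehouse), ETL shapes rows in application code before loading, while ELT loads the raw rows as-is and transforms them inside the warehouse with SQL:

```python
import sqlite3

raw_rows = [("ada", "1815"), ("grace", "1906")]

db = sqlite3.connect(":memory:")  # stands in for the data warehouse

# ETL: transform in application code first, load only the shaped result.
db.execute("CREATE TABLE etl_users (name TEXT, birth_year INTEGER)")
shaped = [(name.title(), int(year)) for name, year in raw_rows]
db.executemany("INSERT INTO etl_users VALUES (?, ?)", shaped)

# ELT: load raw rows untouched, then transform inside the warehouse with SQL.
db.execute("CREATE TABLE raw_users (name TEXT, birth_year TEXT)")
db.executemany("INSERT INTO raw_users VALUES (?, ?)", raw_rows)
db.execute("""
    CREATE TABLE elt_users AS
    SELECT upper(substr(name, 1, 1)) || substr(name, 2) AS name,
           CAST(birth_year AS INTEGER) AS birth_year
    FROM raw_users
""")

etl = db.execute("SELECT * FROM etl_users ORDER BY name").fetchall()
elt = db.execute("SELECT * FROM elt_users ORDER BY name").fetchall()
```

Both paths end with the same shaped table, but in the ELT path the raw data is still available in the warehouse for new transformations later, which is exactly the property that makes ELT attractive for many sources.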

Choosing ELT

The choice for a modern data-driven business would be to go with ELT to handle numerous data sources and then transform the data in your data warehouse with a tool like DBT. Remember that your data sources range from internal business data to third-party API data and unstructured data. If you had to worry about writing code for transformations every time you integrated a new data source, you would never complete data integration, and your downstream systems would be left data-less. This is a mistake many organizations make, and it leads them to more data silos and unreliable, disparate data sources.

Common Airbyte questions

Here are common questions our clients have asked when adopting Airbyte for their organizations, along with our answers:

Why Airbyte over competitors?

  • Most ELT tools are expensive to maintain, with loads of custom code, data triggers, and developers trying to add to them. Airbyte doesn’t limit your connectors, so with some upfront investment you can build consistent frameworks for connectors and spend the rest of your time focused on transformation and integration.
  • Existing ELT tools offer volume-based pricing, which can cost you thousands of dollars. With Airbyte, you have the option to run it within your own network, secure it, integrate both internal and external sources, and monitor all jobs.

What happens with schema updates to source?

What happens when a source schema updates, for example when a new table, index, or key is added? Airbyte allows you to manually decide whether or not to include the update in your migration job. It is up to your data engineering team to decide on the behavior. Some may seek tools that automatically capture changes to the data schema; however, this can also result in some wild and undesirable outcomes.

What if a connector doesn’t exist?

In the short run, you can build a custom connector for your specific needs, maintain and update it as needed, and deploy it in your Airbyte instance. In the longer term, you can take advantage of community connectors as developers bring them online.

Is Airbyte mature enough to adopt?

Airbyte is new: it is in alpha (as of March 2022) and launches its beta in April 2022, so adopting it is essentially a bet on the future, and there may be some hesitancy. However, what makes Airbyte compelling is that it is extensible: it is not perfect, but you can customize it in ways that most competitors do not allow. The community and customer base are growing fast, with direct feedback from customers.

How does Airbyte handle Change Data Capture (CDC) in the data warehouse?

Airbyte has support for CDC with Postgres, MySQL, and Microsoft SQL Server, with more on the way. Change data capture can be handled in many ways, so it is advisable to consider using technologies built into your data warehouse, like BigQuery, where you can configure advanced CDC to resolve data changes.
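At its core, CDC is just a stream of change events replayed against a destination table. Here is a minimal sketch of that resolution step; the event shape (op, id, row) is a hypothetical simplification, not Airbyte's or any database's actual CDC format.

```python
def apply_changes(table, events):
    """Apply a stream of CDC events to a table keyed by primary key."""
    for event in events:
        op, key = event["op"], event["id"]
        if op == "delete":
            table.pop(key, None)  # tolerate deletes for unseen keys
        else:
            # "insert" and "update" both upsert the latest row image,
            # so replaying events is idempotent per key.
            table[key] = event["row"]
    return table

users = {}
apply_changes(users, [
    {"op": "insert", "id": 1, "row": {"name": "ada"}},
    {"op": "update", "id": 1, "row": {"name": "ada lovelace"}},
    {"op": "insert", "id": 2, "row": {"name": "grace"}},
    {"op": "delete", "id": 2},
])
# users is now {1: {"name": "ada lovelace"}}
```

Warehouses like BigQuery let you express this same resolution declaratively (for example with a MERGE over staged change rows), which is why pushing CDC resolution into the warehouse is often the simpler choice.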

Conclusion

The decision to adopt Airbyte ELT across your organization will involve a few decision points. The best outcome we see is that you turn data source nightmares into a seamless collection of jobs that sync data into your data warehouse.

Airbyte solves the data ingestion and load challenges from any source, or the EL part of ELT, with tools like DBT focused on data transformation, the T. In your organization, the strength of decision making, application quality, and operations are only as good as the quality of your data warehouse. Timely data integration and high availability will solve challenges for your analytics, software engineering, and data science teams all at once. Solving data ingestion, integration, and transformation is a high-stakes endeavor, so you should do it flexibly in the cloud and choose the right set of technologies.

We focus on building enterprise-ready Airbyte infrastructure to manage your data integration and your data source and destination connectors, allowing you to improve the productivity of your downstream data services within your organization. We also solve scale challenges with ELT, including Kubernetes deployment, Airflow as an orchestrator, and DBT for transformations.

If you choose Airbyte as your ELT tool, you will have to think about how you will manage it long term: how much upfront development you will invest in, such as custom connectors, data orchestration, and hosting. Whether you opt for a managed solution or choose to self-host Airbyte in your cloud VPC, you will need to solve data transformation, scheduling, security, support, SLAs, and monitoring for your ELT jobs.

Bitstrapped and Airbyte

Bitstrapped is an Airbyte consulting partner with expertise in building enterprise custom connectors, managing massive data integrations and developing advanced data warehouse solutions for organizations.

