Airflow — Build, Host, Maintain Your Own Dependencies with GitHub and Docker Hub (for free)

Build and maintain custom Airflow images with free, open-source tooling, while keeping your infrastructure as code and your credentials secure.

Photo by Rubaitul Azad on Unsplash


At Mobagel, we use Airflow to run various data processing tasks. As Airflow's documentation suggests, it is highly recommended to maintain your own dependencies in a production environment.

Airflow has more than 60 community-managed providers (installable via extras), and some of the default extras/providers are not used by everyone. Sometimes other extras/providers are needed, and sometimes (very often, actually) you need to add your own custom dependencies, packages, or even custom providers.

This guide shows you how to use open-source community tools to build and maintain your own Airflow image.

Setup Docker Hub Account

Docker Hub is a platform that hosts your Docker images for free, so first create a Docker Hub account (or ask a colleague to give you access to your organization's account).

Go to repositories -> Create Repository

You can name your repository airflow: image names under different namespaces do not conflict with each other, so yours will not clash with the official image.
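To see why the names cannot clash, it helps to look at how a full image reference is put together. The sketch below is only an illustration: the helper name image_ref is hypothetical, mobagel is the namespace used later in this guide, and 2.3.2c1 is an example tag.

```shell
# Hypothetical helper: compose a Docker Hub image reference from its parts.
image_ref() {
  namespace="$1"   # Docker Hub account or organization
  repository="$2"  # repository name, e.g. airflow
  tag="$3"         # image tag
  printf '%s/%s:%s\n' "$namespace" "$repository" "$tag"
}

# Same repository name, different namespaces, so two distinct images.
image_ref apache airflow 2.3.2
image_ref mobagel airflow 2.3.2c1
```

The namespace prefix is what keeps a custom airflow repository separate from the official one.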

Setup Airflow Build Pipeline

Next, go to Airflow's official repository and fork it to your organization or account.

Now add your own build pipeline file. I recommend creating a branch from the desired version first rather than committing to the original Airflow branch, which makes future pulls from upstream easier. In our example, we call the branch custom-release and create it from v2-3-stable.

git clone [your_forked_repo]
cd airflow  # the directory created by git clone; adjust if your fork has a different name
git checkout v2-3-stable
git checkout -b custom-release
git push -u origin custom-release
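Keeping the custom commits on their own branch pays off when upstream publishes fixes. A minimal sketch of that sync follows, assuming a remote named upstream that points at the official repository. The remote name and the helper sync_custom_branch are illustrative and not part of the original setup.

```shell
# Hypothetical helper: merge new upstream commits into the custom branch.
sync_custom_branch() {
  remote="$1"         # e.g. upstream
  stable_branch="$2"  # e.g. v2-3-stable
  custom_branch="$3"  # e.g. custom-release
  git fetch "$remote" &&
    git checkout "$custom_branch" &&
    git merge "$remote/$stable_branch" &&
    git push origin "$custom_branch"
}

# One-time setup, then sync whenever needed:
#   git remote add upstream https://github.com/apache/airflow.git
#   sync_custom_branch upstream v2-3-stable custom-release
```

Because the only custom commits live on custom-release, conflicts during the merge should be limited to the files you added yourself.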

Now we can add our workflow file .github/workflows/cd.yaml (example file).

name: Publish Docker image

# Run only when a release is published on GitHub.
on:
  release:
    types: [published]

jobs:
  push_to_registry:
    name: Push Docker image to Docker Hub
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repo
        uses: actions/checkout@v3

      # Credentials come from GitHub secrets, never from the repository itself.
      - name: Log in to Docker Hub
        uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      # Derives image tags and labels from the release that triggered the run.
      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@98669ae865ea3cffbcbaa878cf57c20bbf1c6c38
        with:
          images: mobagel/airflow

      - name: Build and push Docker image
        uses: docker/build-push-action@ad44023a93711e3deb337508980b4b5e9bcdc5dc
        env:
          # Quoted so the value stays a string regardless of the YAML parser.
          AIRFLOW_GPL_UNIDECODE: "yes"
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          # Build arguments passed to Airflow's Dockerfile.
          build-args: |
            AIRFLOW_VERSION=2.3.2
            AIRFLOW_EXTRAS=async,celery,cncf.kubernetes,dask,docker,grpc,http,ldap,postgres,redis,statsd,virtualenv
            PYTHON_BASE_IMAGE=python:3.8-slim-buster
            ADDITIONAL_PYTHON_DEPS=scikit-learn==1.0.2 slack-sdk==3.14.1 simplejson==3.17.6 glom==22.1.0

A few notes about this pipeline file:

  • We set the trigger to release published, so this pipeline runs whenever a release is published on the GitHub page.
  • In the login step, we need to provide our Docker Hub username and password. We store these values as GitHub secrets. To add a secret, go to the GitHub repository page > Settings > Secrets > Actions.
  • In the build step, we can pass any build arguments supported by Airflow to customize our image. In our case, we removed many of the default AIRFLOW_EXTRAS to reduce the image size, pinned AIRFLOW_VERSION and PYTHON_BASE_IMAGE, and added our own ADDITIONAL_PYTHON_DEPS.
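The two secrets can also be added from a terminal. The sketch below assumes the GitHub CLI (gh) is installed and authenticated; the helper name and the repository argument are illustrative. Each gh secret set call prompts for the value, so it does not end up in your shell history. Docker Hub also accepts a personal access token in place of the account password, which may be a safer value to store in DOCKER_PASSWORD.

```shell
# Hypothetical helper: store the two secrets that the workflow reads.
set_dockerhub_secrets() {
  repo="$1"  # OWNER/REPO of your fork
  gh secret set DOCKER_USERNAME --repo "$repo" &&
    gh secret set DOCKER_PASSWORD --repo "$repo"
}

# Example (replace with your own fork):
#   set_dockerhub_secrets your-org/airflow
```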

After you set up this file, push it to your custom-release branch and you are good to go.

git add .
git commit -m "adding custom build flow"
git push

Start Your Airflow Build

After all this setup, you can tell your team members how to request changes in this repository. Anyone who wants to adjust the dependencies should update the ADDITIONAL_PYTHON_DEPS variable in .github/workflows/cd.yaml and push to the custom-release branch (or create a pull request, depending on your team policy).
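Every dependency in the example workflow is pinned to an exact version with ==. If you want to keep it that way as more people edit the list, a small check like the one below could run before a change is merged. This is only a sketch: the helper name check_pinned is hypothetical, and requiring == pins is an assumed team convention rather than something Airflow enforces.

```shell
# Hypothetical helper: fail if any dependency in the list lacks an == pin.
check_pinned() {
  status=0
  # $1 is intentionally unquoted so the list splits on whitespace.
  for dep in $1; do
    case "$dep" in
      *==*) ;;  # pinned, nothing to report
      *) echo "not pinned: $dep" >&2; status=1 ;;
    esac
  done
  return "$status"
}

# Example with the value from the workflow file:
check_pinned "scikit-learn==1.0.2 slack-sdk==3.14.1 simplejson==3.17.6 glom==22.1.0" &&
  echo "all dependencies are pinned"
```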

To trigger a build, draft a new release from your repository page > Releases > Draft a new release.

Creating Release
  1. In the tag field, enter a tag of your choice. We use [airflow-version]c[our-version], so our tags look like 2.3.2c1. Note that this tag also becomes the tag of your image.
  2. Select the custom-release branch as the target branch.
  3. Fill in the release title and release message, then click Publish release. Your image build is then triggered automatically (you can view the build status from the repository page > Actions).
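The same steps can be scripted. The sketch below assumes the GitHub CLI (gh) is installed and authenticated, and that tags follow the [airflow-version]c[our-version] scheme described above. Both helper names, the release title, and the release notes are illustrative.

```shell
# Hypothetical helper: bump the custom part of a tag such as 2.3.2c1.
next_release_tag() {
  current="$1"
  airflow_version="${current%c*}"  # text before the last "c", e.g. 2.3.2
  custom_version="${current##*c}"  # text after the last "c", e.g. 1
  case "$custom_version" in
    ''|*[!0-9]*)
      echo "unexpected tag format: $current" >&2
      return 1
      ;;
  esac
  echo "${airflow_version}c$((custom_version + 1))"
}

# Hypothetical helper: publish a release from custom-release,
# which in turn triggers the build workflow.
publish_release() {
  tag="$1"
  gh release create "$tag" --target custom-release \
    --title "Airflow image $tag" --notes "Custom Airflow image $tag"
}

# Example:
#   publish_release "$(next_release_tag 2.3.2c1)"
```

The format check is deliberately loose: it only verifies that the part after the last c is a number.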

After the image is built, go to your own Docker Hub page to verify that the image with the new tag has been published.
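If Docker is available on your machine, you can go one step further and check the contents of the image. This is optional and only a sketch: the helper name verify_image is hypothetical, and the import line assumes the four packages listed in ADDITIONAL_PYTHON_DEPS in the example workflow.

```shell
# Hypothetical helper: pull the published image and check what is inside.
verify_image() {
  image="$1:$2"  # e.g. mobagel/airflow:2.3.2c1
  docker pull "$image" &&
    # Confirm the Airflow version baked into the image.
    docker run --rm "$image" airflow version &&
    # Confirm the additional Python dependencies can be imported.
    docker run --rm "$image" python -c "import sklearn, slack_sdk, simplejson, glom"
}

# Example:
#   verify_image mobagel/airflow 2.3.2c1
```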

Concluding Remarks

  1. After introducing this workflow, the company's infrastructure maintainers no longer need to help the team upgrade Airflow images, and it is very easy to train team members to run this flow on their own.
  2. Compared with training team members to build and push images themselves, this approach is much easier: publishers don't need to install Docker on their own computers or understand the docker build command. Security-wise, only the pipeline is allowed to push images, so we don't need to hand out credentials to each developer. IaC-wise, each version of our custom image is version controlled and traceable.
  3. Hosting repositories on GitHub and Docker Hub and using their free services sounds good, but for enterprises the issue has always been whether the code can be open sourced. Since Airflow itself is already open source, and revealing our Airflow dependencies is generally not much of an issue, building and hosting Airflow this way is quite acceptable. If your company customizes Airflow in a way that would reveal business secrets or introduce security concerns, you'll need to consider a private approach of your own.

Conclusion

And with that, you’ve crossed another level to becoming a boss coder. GG! 👏

I hope you found this article instructional and informative. If you have any feedback or queries, please let me know in the comments below. And follow SelectFrom for more tutorials and guides on topics like Big Data, Spark, and data warehousing.



By schwannden