Launch Announcement

Orkes Academy is live! Learn to build agentic workflows

ENGINEERING SOLUTIONS

How to Make Distributed Workflows Retry-Safe

Nick Lotz

Content Engineer

Last updated: March 6, 2026

March 6, 2026

12 min read

Related Blogs

Mar 17, 2026

Workflow Versioning and Backward Compatibility: Stop Breaking Long-Running Executions

Mar 24, 2026

Conductor CLI Guide: Register, Run, Retry, and Recover Durable Workflows Without Leaving Your Terminal 💻

Mar 17, 2026

How to Build an AI-Powered Support Ticket Triage Workflow with Orkes Conductor

Ready to Build Something Amazing?

Join thousands of developers building the future with Orkes.

Start for free

Back to Blogs

ENGINEERING SOLUTIONS

How to Make Distributed Workflows Retry-Safe

Nick Lotz

Content Engineer

Last updated: March 6, 2026

March 6, 2026

12 min read

Idempotency and retry safety blog cover

A practical walkthrough for preventing duplicate side effects with idempotency patterns and Orkes Conductor.

Consider a customer checking out in your app. A worker calls the processor, waits a few seconds, hits a timeout, and retries. The first call actually succeeded, but your system did not know it yet. The retry succeeds too. One order, two charges. Nobody wrote "charge twice" in the code, but that is still the outcome.

This is exactly why idempotency matters. Idempotency means that if the same logical operation is attempted multiple times, the external outcome is the same as if it ran once. Retries are necessary in distributed systems, but retries without idempotency can produce duplicate side effects.

Financial services has seen how expensive control failures can be. In 2020, Citibank accidentally sent about $900 million to Revlon lenders during a payment operation. That incident had broader operational causes than retry logic alone, but it still illustrates that in distributed systems, workflows can be triggered or replayed more than once for reasons you cannot always predict.

How We Got Here

Earlier systems usually kept business logic in one application and one relational database. Consistency was easier because one transaction boundary could protect the operation end to end. If something failed, the transaction rolled back.

As architecture shifted to services, each business action started crossing boundaries through APIs and queues. Technologies like REST services and Kafka improved scaling and team ownership, but they also introduced new failure points between steps.

Architecture evolution from monolith to distributed systems and workflow orchestration

To avoid dropping work, platforms standardized retries and at-least-once delivery. That improved reliability, but it moved correctness into application design. Instead of relying on a single database transaction for safety, developers now must design workers to handle repeated messages safely and work correctly if tasks are replayed.

So we solved one problem very well, resilience to transient failure. The remaining challenge is guaranteeing one business outcome when the system retries after uncertain results.

For example, imagine an order-confirmation workflow that sends an email after inventory is reserved. The send call succeeds, but the worker times out before recording completion. The task is retried, and the customer receives the same confirmation twice. The system was resilient, but the business outcome was wrong. Retry safety is about making that second attempt return the same result instead of repeating the side effect.

The Missing Piece: Idempotent Retries

As systems grow, business processes stop living in one place. A single user action might touch a payment API, an internal service, a queue, and a notification provider. Each step can succeed or fail independently. Teams usually patch this with local retry logic, but absent proper idempotency, that often creates inconsistent behavior across services.

Before and after workflow orchestration with Conductor

You can turn this maze into a more managed process by using software orchestration. In simple terms, orchestration means defining a business flow as a sequence of coordinated steps and letting one system manage execution state, retries, timeouts, dependencies, and failure handling. Instead of every service deciding process behavior on its own, orchestration provides one control plane for the whole workflow.

Tools like Conductors were created to accomplish this. Netflix Conductor was built to be a workflow orchestration for distributed, long-running work. Orkes Conductor provides that engine as a production-ready platform with visibility and operational controls. It gives you a durable workflow state, consistent retry behavior, and clear task lifecycle management. That directly reduces the pain points we started with: you stop relying on scattered retry code, you gain traceability across steps, and when paired with idempotent workers, retries remain reliable without creating duplicate side effects.

Example Walk-Through: Safe Payments Handling

So far, we’ve talked about the problem in the abstract: retries are necessary, but retries alone do not protect business outcomes. Now we’ll make that concrete by building a complete, runnable example that combines orchestration and idempotency in one flow.

The application we’re simulating is a small checkout payment flow. A “start workflow” call represents the checkout app creating a payment request. Conductor schedules a charge_card task, and a Python worker processes it with idempotency protection. The worker calls a simulated payment side effect backed by a local durable ledger, then reports completion to Conductor. You will run all of this in a local VM, replay the same order, and verify that retries still progress the workflow while producing only one effective charge.

Core idempotency flow from workflow start through first execution and replay-safe result return

The key idea in this flow is that Orkes Conductor continues to do what it should do: deliver tasks reliably and retry when needed. Idempotency is enforced at the worker boundary. On each task, the worker first calls TryStart against the idempotency store. If this is the first valid execution, it performs the side effect and stores the durable result. If Conductor replays the same business operation later, the worker sees the existing record and returns the stored result instead of firing the side effect again. That is how you get retry resilience and one business outcome at the same time.

All code for this demo can be found in this GitHub repo, which we will now proceed to walk through together:

https://github.com/orkes-io/conductor-idempotency-demo

Step 1: Prerequisites and setup

Install Multipass if needed. Multipass is a program from Canonical that lets you spawn on-demand Ubuntu VMs on your local machine. We’ll use it to keep this walkthrough sandboxed.

On macOS:

bash

brew install --cask multipass

On Ubuntu/Linux:

bash

sudo snap install multipass

On Windows (PowerShell):

powershell

winget install Canonical.Multipass

Sanity check:

bash

multipass version

Then clone the demo repository and enter it:

bash

git clone https://github.com/orkes-io/conductor-idempotency-demo.git
cd conductor-idempotency-demo

Step 2: Bootstrap deterministic runtime

From repository root:

bash

make vm-up
make vm-provision
make conductor-up

This creates a repeatable local runtime:

VM name: idempotency-retry-demo
Conductor container: idempotency-retry-demo-conductor
VM Python env: /home/ubuntu/.venvs/idempotency-retry-demo

All three commands are idempotent. If a run is interrupted, rerun the same commands.

Step 3: Export API endpoint and verify control plane

Set the API endpoint in your shell:

bash

VM_IP=$(multipass info idempotency-retry-demo | awk '/IPv4/{print $2}')
export CONDUCTOR_SERVER_URL="http://${VM_IP}:8080/api"
echo "${CONDUCTOR_SERVER_URL}"

Check the API responds:

bash

curl -sS "${CONDUCTOR_SERVER_URL}/metadata/taskdefs" | head -c 200 && echo

You should get JSON back.

Step 4: Install worker dependencies

bash

cd code
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Now your shell can run worker and verification scripts directly.

Step 5: Register workflow metadata (idempotently)

bash

./register_conductor_assets.sh

First run usually says Created .... Later runs may say Updated ... or already exists .... All are acceptable.

Step 6: Execute baseline workflow and worker

Start one workflow execution:

bash

export WF_ID_1=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_1}"

Run the worker:

bash

WORKER_INSTANCE_ID=worker-local-1 python3 run_conductor_worker.py --max-empty-polls 10

Verify workflow status:

bash

curl -sS "${CONDUCTOR_SERVER_URL}/workflow/${WF_ID_1}" | jq '.status'

Expected output:

text

"COMPLETED"

Step 7: Replay the same business operation

Start a second workflow with the same business identity:

bash

export WF_ID_2=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_2}"
WORKER_INSTANCE_ID=worker-local-1 python3 run_conductor_worker.py --max-empty-polls 10

Compare charge_id across both workflow runs:

bash

python3 - <<'PY'
import os
import requests

base = os.environ["CONDUCTOR_SERVER_URL"]
wf1 = os.environ["WF_ID_1"]
wf2 = os.environ["WF_ID_2"]

def get_charge_id(wf_id: str) -> str:
    data = requests.get(f"{base}/workflow/{wf_id}", timeout=30).json()
    tasks = data.get("tasks", [])
    charge = [t for t in tasks if t.get("taskType") == "charge_card"]
    if not charge:
        return "missing"
    return charge[-1].get("outputData", {}).get("charge_id", "missing")

charge1 = get_charge_id(wf1)
charge2 = get_charge_id(wf2)
print("wf1 charge_id:", charge1)
print("wf2 charge_id:", charge2)
print("same charge id:", charge1 == charge2)
PY

same charge id: True is the core proof.

Step 8: Validate crash-window behavior explicitly

Isolate the critical failure window (side effect succeeded, completion not yet written):

bash

python3 run_demo.py --reset-db --scenario crash-retry

Expected output includes a simulated crash and replay result with side_effect_executed: false.

Step 9: Validate worker contention behavior

bash

python3 run_demo.py --reset-db --scenario owner-conflict

Expected output includes:

text

owner conflict scenario: in progress

That confirms duplicate worker paths are controlled and retryable.

Step 10: Run the same flow in Orkes Developer Edition

You proved the model locally. Now run the same flow in Orkes Developer Edition to confirm the behavior holds in a managed environment.

Start in the demo directory and set your endpoint and API credentials:

bash

cd blogs/idempotency-and-retry-safety-in-distributed-workflows/code
export CONDUCTOR_SERVER_URL="https://developer.orkescloud.com/api"
export CONDUCTOR_AUTH_KEY="<your-key-id>"
export CONDUCTOR_AUTH_SECRET="<your-key-secret>"

Next, exchange key/secret for a short-lived token. The scripts in this demo use that token automatically.

bash

export CONDUCTOR_AUTH_TOKEN="$(
  curl -sS -X POST \
    -H "Content-Type: application/json" \
    --data "{\"keyId\":\"${CONDUCTOR_AUTH_KEY}\",\"keySecret\":\"${CONDUCTOR_AUTH_SECRET}\"}" \
    "${CONDUCTOR_SERVER_URL}/token" \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["token"])'
)"

bash

./register_conductor_assets.sh

Run the first execution for ORDER_ID=9243, then run the worker to process it.

bash

export WF_ID_CLOUD_1=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_CLOUD_1}"
WORKER_INSTANCE_ID=worker-cloud-1 \
  python3 run_conductor_worker.py \
  --db-path /tmp/idempotency-retry-demo-cloud.sqlite \
  --max-empty-polls 20

Now replay the same business operation (ORDER_ID=9243) and process it again with the same worker state.

bash

export WF_ID_CLOUD_2=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_CLOUD_2}"
WORKER_INSTANCE_ID=worker-cloud-1 \
  python3 run_conductor_worker.py \
  --db-path /tmp/idempotency-retry-demo-cloud.sqlite \
  --max-empty-polls 20

If any command returns 401, refresh CONDUCTOR_AUTH_TOKEN and rerun that command.

Finally, compare both workflow runs. You should see both complete, and both reference the same charge_id.

bash

python3 - <<'PY'
import os
import requests

base = os.environ["CONDUCTOR_SERVER_URL"]
wf1 = os.environ["WF_ID_CLOUD_1"]
wf2 = os.environ["WF_ID_CLOUD_2"]
token = os.getenv("CONDUCTOR_AUTH_TOKEN", "")
headers = {"X-Authorization": token} if token else {}

def read_workflow(workflow_id: str) -> dict:
    return requests.get(f"{base}/workflow/{workflow_id}", headers=headers, timeout=30).json()

def read_charge_id(workflow: dict) -> str:
    tasks = workflow.get("tasks", [])
    charge_tasks = [t for t in tasks if t.get("taskType") == "charge_card"]
    if not charge_tasks:
        return "missing"
    return charge_tasks[-1].get("outputData", {}).get("charge_id", "missing")

wf1_data = read_workflow(wf1)
wf2_data = read_workflow(wf2)
charge1 = read_charge_id(wf1_data)
charge2 = read_charge_id(wf2_data)

print("wf1 status:", wf1_data.get("status"))
print("wf2 status:", wf2_data.get("status"))
print("wf1 charge_id:", charge1)
print("wf2 charge_id:", charge2)
print("same charge id:", charge1 == charge2)
PY

Reliability is not just retrying faster, but rather is making each business operation replay-safe by design. We use orchestration to manage state across failures, and idempotency to make repeated attempts converge on one correct outcome instead of duplicate side effects. And, if you have not run this in Orkes Cloud yet, this is a good point to sign up and validate the same behavior in a managed environment with your own workflows.

Related Blogs

Mar 17, 2026

Workflow Versioning and Backward Compatibility: Stop Breaking Long-Running Executions

Mar 24, 2026

Conductor CLI Guide: Register, Run, Retry, and Recover Durable Workflows Without Leaving Your Terminal 💻

Mar 17, 2026

How to Build an AI-Powered Support Ticket Triage Workflow with Orkes Conductor

Ready to Build Something Amazing?

Join thousands of developers building the future with Orkes.

Start for free

Idempotency and retry safety blog cover

A practical walkthrough for preventing duplicate side effects with idempotency patterns and Orkes Conductor.

How We Got Here

Architecture evolution from monolith to distributed systems and workflow orchestration

So we solved one problem very well, resilience to transient failure. The remaining challenge is guaranteeing one business outcome when the system retries after uncertain results.

The Missing Piece: Idempotent Retries

Before and after workflow orchestration with Conductor

Example Walk-Through: Safe Payments Handling

All code for this demo can be found in this GitHub repo, which we will now proceed to walk through together:

https://github.com/orkes-io/conductor-idempotency-demo

Step 1: Prerequisites and setup

Install Multipass if needed. Multipass is a program from Canonical that lets you spawn on-demand Ubuntu VMs on your local machine. We’ll use it to keep this walkthrough sandboxed.

On macOS:

bash

brew install --cask multipass

On Ubuntu/Linux:

bash

sudo snap install multipass

On Windows (PowerShell):

powershell

winget install Canonical.Multipass

Sanity check:

bash

multipass version

Then clone the demo repository and enter it:

bash

git clone https://github.com/orkes-io/conductor-idempotency-demo.git
cd conductor-idempotency-demo

Step 2: Bootstrap deterministic runtime

From repository root:

bash

make vm-up
make vm-provision
make conductor-up

This creates a repeatable local runtime:

VM name: idempotency-retry-demo
Conductor container: idempotency-retry-demo-conductor
VM Python env: /home/ubuntu/.venvs/idempotency-retry-demo

All three commands are idempotent. If a run is interrupted, rerun the same commands.

Step 3: Export API endpoint and verify control plane

Set the API endpoint in your shell:

bash

VM_IP=$(multipass info idempotency-retry-demo | awk '/IPv4/{print $2}')
export CONDUCTOR_SERVER_URL="http://${VM_IP}:8080/api"
echo "${CONDUCTOR_SERVER_URL}"

Check the API responds:

bash

curl -sS "${CONDUCTOR_SERVER_URL}/metadata/taskdefs" | head -c 200 && echo

You should get JSON back.

Step 4: Install worker dependencies

bash

cd code
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Now your shell can run worker and verification scripts directly.

Step 5: Register workflow metadata (idempotently)

bash

./register_conductor_assets.sh

First run usually says Created .... Later runs may say Updated ... or already exists .... All are acceptable.

Step 6: Execute baseline workflow and worker

Start one workflow execution:

bash

export WF_ID_1=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_1}"

Run the worker:

bash

WORKER_INSTANCE_ID=worker-local-1 python3 run_conductor_worker.py --max-empty-polls 10

Verify workflow status:

bash

curl -sS "${CONDUCTOR_SERVER_URL}/workflow/${WF_ID_1}" | jq '.status'

Expected output:

text

"COMPLETED"

Step 7: Replay the same business operation

Start a second workflow with the same business identity:

bash

export WF_ID_2=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_2}"
WORKER_INSTANCE_ID=worker-local-1 python3 run_conductor_worker.py --max-empty-polls 10

Compare charge_id across both workflow runs:

bash

python3 - <<'PY'
import os
import requests

base = os.environ["CONDUCTOR_SERVER_URL"]
wf1 = os.environ["WF_ID_1"]
wf2 = os.environ["WF_ID_2"]

def get_charge_id(wf_id: str) -> str:
    data = requests.get(f"{base}/workflow/{wf_id}", timeout=30).json()
    tasks = data.get("tasks", [])
    charge = [t for t in tasks if t.get("taskType") == "charge_card"]
    if not charge:
        return "missing"
    return charge[-1].get("outputData", {}).get("charge_id", "missing")

charge1 = get_charge_id(wf1)
charge2 = get_charge_id(wf2)
print("wf1 charge_id:", charge1)
print("wf2 charge_id:", charge2)
print("same charge id:", charge1 == charge2)
PY

same charge id: True is the core proof.

Step 8: Validate crash-window behavior explicitly

Isolate the critical failure window (side effect succeeded, completion not yet written):

bash

python3 run_demo.py --reset-db --scenario crash-retry

Expected output includes a simulated crash and replay result with side_effect_executed: false.

Step 9: Validate worker contention behavior

bash

python3 run_demo.py --reset-db --scenario owner-conflict

Expected output includes:

text

owner conflict scenario: in progress

That confirms duplicate worker paths are controlled and retryable.

Step 10: Run the same flow in Orkes Developer Edition

You proved the model locally. Now run the same flow in Orkes Developer Edition to confirm the behavior holds in a managed environment.

Start in the demo directory and set your endpoint and API credentials:

bash

cd blogs/idempotency-and-retry-safety-in-distributed-workflows/code
export CONDUCTOR_SERVER_URL="https://developer.orkescloud.com/api"
export CONDUCTOR_AUTH_KEY="<your-key-id>"
export CONDUCTOR_AUTH_SECRET="<your-key-secret>"

Next, exchange key/secret for a short-lived token. The scripts in this demo use that token automatically.

bash

export CONDUCTOR_AUTH_TOKEN="$(
  curl -sS -X POST \
    -H "Content-Type: application/json" \
    --data "{\"keyId\":\"${CONDUCTOR_AUTH_KEY}\",\"keySecret\":\"${CONDUCTOR_AUTH_SECRET}\"}" \
    "${CONDUCTOR_SERVER_URL}/token" \
    | python3 -c 'import json,sys; print(json.load(sys.stdin)["token"])'
)"

bash

./register_conductor_assets.sh

Run the first execution for ORDER_ID=9243, then run the worker to process it.

bash

export WF_ID_CLOUD_1=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_CLOUD_1}"
WORKER_INSTANCE_ID=worker-cloud-1 \
  python3 run_conductor_worker.py \
  --db-path /tmp/idempotency-retry-demo-cloud.sqlite \
  --max-empty-polls 20

Now replay the same business operation (ORDER_ID=9243) and process it again with the same worker state.

bash

export WF_ID_CLOUD_2=$(ORDER_ID=9243 ./start_workflow.sh)
echo "${WF_ID_CLOUD_2}"
WORKER_INSTANCE_ID=worker-cloud-1 \
  python3 run_conductor_worker.py \
  --db-path /tmp/idempotency-retry-demo-cloud.sqlite \
  --max-empty-polls 20

If any command returns 401, refresh CONDUCTOR_AUTH_TOKEN and rerun that command.

Finally, compare both workflow runs. You should see both complete, and both reference the same charge_id.

bash

python3 - <<'PY'
import os
import requests

base = os.environ["CONDUCTOR_SERVER_URL"]
wf1 = os.environ["WF_ID_CLOUD_1"]
wf2 = os.environ["WF_ID_CLOUD_2"]
token = os.getenv("CONDUCTOR_AUTH_TOKEN", "")
headers = {"X-Authorization": token} if token else {}

def read_workflow(workflow_id: str) -> dict:
    return requests.get(f"{base}/workflow/{workflow_id}", headers=headers, timeout=30).json()

def read_charge_id(workflow: dict) -> str:
    tasks = workflow.get("tasks", [])
    charge_tasks = [t for t in tasks if t.get("taskType") == "charge_card"]
    if not charge_tasks:
        return "missing"
    return charge_tasks[-1].get("outputData", {}).get("charge_id", "missing")

wf1_data = read_workflow(wf1)
wf2_data = read_workflow(wf2)
charge1 = read_charge_id(wf1_data)
charge2 = read_charge_id(wf2_data)

print("wf1 status:", wf1_data.get("status"))
print("wf2 status:", wf2_data.get("status"))
print("wf1 charge_id:", charge1)
print("wf2 charge_id:", charge2)
print("same charge id:", charge1 == charge2)
PY

Related Blogs

Workflow Versioning and Backward Compatibility: Stop Breaking Long-Running Executions

Conductor CLI Guide: Register, Run, Retry, and Recover Durable Workflows Without Leaving Your Terminal 💻

How to Build an AI-Powered Support Ticket Triage Workflow with Orkes Conductor

Ready to Build Something Amazing?

Company

Product

Community

Use Cases

Compare

A practical walkthrough for preventing duplicate side effects with idempotency patterns and Orkes Conductor.

How We Got Here

The Missing Piece: Idempotent Retries

Example Walk-Through: Safe Payments Handling

Step 1: Prerequisites and setup

Step 2: Bootstrap deterministic runtime

Step 3: Export API endpoint and verify control plane

Step 4: Install worker dependencies

Step 5: Register workflow metadata (idempotently)

Step 6: Execute baseline workflow and worker

Step 7: Replay the same business operation

Step 8: Validate crash-window behavior explicitly

Step 9: Validate worker contention behavior

Step 10: Run the same flow in Orkes Developer Edition

Related Blogs

Workflow Versioning and Backward Compatibility: Stop Breaking Long-Running Executions

Conductor CLI Guide: Register, Run, Retry, and Recover Durable Workflows Without Leaving Your Terminal 💻

How to Build an AI-Powered Support Ticket Triage Workflow with Orkes Conductor

Ready to Build Something Amazing?

A practical walkthrough for preventing duplicate side effects with idempotency patterns and Orkes Conductor.

How We Got Here

The Missing Piece: Idempotent Retries

Example Walk-Through: Safe Payments Handling

Step 1: Prerequisites and setup

Step 2: Bootstrap deterministic runtime

Step 3: Export API endpoint and verify control plane

Step 4: Install worker dependencies

Step 5: Register workflow metadata (idempotently)

Step 6: Execute baseline workflow and worker

Step 7: Replay the same business operation

Step 8: Validate crash-window behavior explicitly

Step 9: Validate worker contention behavior

Step 10: Run the same flow in Orkes Developer Edition