Alon Dahari
7 min read

Categories

  • Infrastructure

Tags

  • K8S
  • Tooling
  • Faktory

My current project relies heavily on background Faktory jobs, which are mission-critical. We have multiple Faktory workers (clients) in our K8S cluster pulling the queued jobs, which gives us ample resiliency on the worker side. But because successfully enqueueing jobs is crucial to our application, a single Faktory server is a single point of failure, and we set out to address that.

Fair warning: the topic discussed here involves unresolved issues, and we ultimately adopted a slightly different approach to address this challenge. If you are seeking a straightforward how-to guide, this might not be for you. I’ve written this post as a learning experience which, in my opinion, contains some useful information.

To illustrate the problem, consider an API endpoint responsible for handling user payments. This endpoint initiates a chain of Faktory jobs to handle money movement and ledgering, all processed asynchronously. If the Faktory server is down at any point, the endpoint fails because the job cannot be enqueued, effectively denying service to our users.
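
To make that concrete, here is a minimal sketch of such an enqueue call using Faktory’s Go client. Our application code isn’t shown in this post, so treat the job type, queue name, and argument as hypothetical placeholders:

package main

import (
	"log"

	faktory "github.com/contribsys/faktory/client"
)

// enqueuePayment kicks off the first job in the payment chain.
func enqueuePayment(paymentID string) error {
	// Open locates the server via FAKTORY_PROVIDER / FAKTORY_URL.
	cl, err := faktory.Open()
	if err != nil {
		return err // single server down: the enqueue, and the endpoint, fail
	}
	defer cl.Close()

	// "ProcessPayment" and the "payments" queue are hypothetical.
	job := faktory.NewJob("ProcessPayment", paymentID)
	job.Queue = "payments"
	return cl.Push(job)
}

func main() {
	if err := enqueuePayment("pay_123"); err != nil {
		log.Fatalf("failed to enqueue payment job: %v", err)
	}
}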

Our starting point of a single server Faktory deployment on K8S, following the Faktory docs, looked like this:

kind: Service
apiVersion: v1
metadata:
  name: faktory-server
spec:
  selector:
    app: faktory-server
  ports:
    - name: faktory
      protocol: TCP
      port: 7419
      targetPort: 7419
    - name: dashboard
      protocol: TCP
      port: 7420
      targetPort: 7420
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: faktory-server
  labels:
    app: faktory-server
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: faktory-server
  template:
    metadata:
      labels:
        app: faktory-server
    spec:
      shareProcessNamespace: true
      terminationGracePeriodSeconds: 10
      containers:
        - name: faktory-server
          image: xxx.dkr.ecr.us-west-2.amazonaws.com/faktory-server
          imagePullPolicy: Always
          ports:
            - containerPort: 7419
              name: faktory
            - containerPort: 7420
              name: dashboard
          volumeMounts:
            - name: faktory-server-storage-volume
              mountPath: "/var/lib/faktory"
          env:
            - name: FAKTORY_ENV
              value: production
            - name: FAKTORY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_PASSWORD
            - name: FAKTORY_LICENSE
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_LICENSE
      volumes:
        - name: faktory-server-storage-volume
          persistentVolumeClaim:
            claimName: faktory-server-storage-pv-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: faktory-server-storage-pv-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: "gp2"
  resources:
    requests:
      storage: 5Gi

Migrating to Redis

As you can see, the Faktory server used on-disk storage, which is not suitable for a multi-server setup. The first priority, therefore, was transitioning to a shared Redis cluster.

I won’t delve into the details of provisioning a Redis cluster, but there is a comprehensive wiki page on this topic in the Faktory repository. In our case, we used ElastiCache within the AWS ecosystem.

After provisioning our Redis cluster and adding the Redis URL to our environment variables (via the faktory-server-env-vars ConfigMap referenced in the manifest), our deployment manifest, now without the PersistentVolumeClaim, looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: faktory-server
  labels:
    app: faktory-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: faktory-server
  template:
    metadata:
      labels:
        app: faktory-server
    spec:
      shareProcessNamespace: true
      terminationGracePeriodSeconds: 10
      containers:
        - name: faktory-server
          image: xxx.dkr.ecr.us-west-2.amazonaws.com/faktory-server
          imagePullPolicy: Always
          ports:
            - containerPort: 7419
              name: faktory
            - containerPort: 7420
              name: dashboard
          env:
            - name: FAKTORY_ENV
              value: production
            - name: FAKTORY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_PASSWORD
            - name: FAKTORY_LICENSE
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_LICENSE
            - name: REDIS_URL
              valueFrom:
                configMapKeyRef:
                  name: faktory-server-env-vars
                  key: REDIS_URL

As you can see above, the image for the deployment is our own image in ECR, based on the Faktory image. It bakes in our configs for scheduled jobs and the like, which live in a separate repository. More on that below.

Adding a Network Load Balancer

Now that we have storage that can be shared across pods, we need a network load balancer to route requests from the workers to the different servers. Luckily, K8S makes this quite easy: we just add the internal load balancer annotation to our service and set its type to LoadBalancer:

kind: Service
apiVersion: v1
metadata:
  name: faktory-server
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: faktory-server
  ports:
    - name: faktory
      protocol: TCP
      port: 7419
      targetPort: 7419
    - name: dashboard
      protocol: TCP
      port: 7420
      targetPort: 7420

Duplicate Scheduled Jobs

After testing this setup for a little while, we ran into an issue with our scheduled jobs. We had many cron-scheduled jobs defined in a separate repository and baked into the image used in our deployment. Since we now had multiple servers, each one contained all the configs, which duplicated the scheduled jobs. That created serious race conditions, since the duplicate jobs ran at exactly the same time.

To mitigate the issue, I moved all those configurations into the K8S manifests and split the servers into a “primary” and “secondaries”. The idea is that only the primary schedules those jobs; other than that, all servers share the same responsibilities. If the primary fails, scheduled jobs are simply not enqueued until we get alerted and fix the issue, which is acceptable since those jobs are not particularly time-sensitive.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: faktory-server-primary
  labels:
    app: faktory-server
spec:
  # The primary can only have 1 replica, so cron is not duplicated
  replicas: 1
  # The role label gives each Deployment a unique selector; the Service
  # still matches all servers through the shared app label.
  selector:
    matchLabels:
      app: faktory-server
      role: primary
  template:
    metadata:
      labels:
        app: faktory-server
        role: primary
    spec:
      shareProcessNamespace: true
      terminationGracePeriodSeconds: 10
      containers:
        - name: faktory-server-primary
          image: docker.contribsys.com/contribsys/faktory-ent:1.6.1
          imagePullPolicy: Always
          ports:
            - containerPort: 7419
              name: faktory
            - containerPort: 7420
              name: dashboard
          env:
            - name: FAKTORY_ENV
              value: production
            - name: FAKTORY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_PASSWORD
            - name: FAKTORY_LICENSE
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_LICENSE
            - name: REDIS_URL
              valueFrom:
                configMapKeyRef:
                  name: faktory-server-env-vars
                  key: REDIS_URL
          readinessProbe:
            tcpSocket:
              port: 7420
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: 7420
            initialDelaySeconds: 15
            periodSeconds: 10
          volumeMounts:
            # Two volumes cannot be mounted at the same path, so both
            # ConfigMaps are projected into a single conf.d volume below.
            - name: faktory-server-conf-volume
              mountPath: "/etc/faktory/conf.d"
      imagePullSecrets:
        - name: faktory-server-ent-login
      volumes:
        - name: faktory-server-conf-volume
          projected:
            sources:
              - configMap:
                  name: faktory-server-cron
              - configMap:
                  name: faktory-server-statsd

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: faktory-server-secondary
  labels:
    app: faktory-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: faktory-server
      role: secondary
  template:
    metadata:
      labels:
        app: faktory-server
        role: secondary
    spec:
      shareProcessNamespace: true
      terminationGracePeriodSeconds: 10
      containers:
        - name: faktory-server-secondary
          image: docker.contribsys.com/contribsys/faktory-ent:1.6.1
          imagePullPolicy: Always
          ports:
            - containerPort: 7419
              name: faktory
            - containerPort: 7420
              name: dashboard
          env:
            - name: FAKTORY_ENV
              value: production
            - name: FAKTORY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_PASSWORD
            - name: FAKTORY_LICENSE
              valueFrom:
                secretKeyRef:
                  name: faktory-secrets
                  key: FAKTORY_LICENSE
            - name: REDIS_URL
              valueFrom:
                configMapKeyRef:
                  name: faktory-server-env-vars
                  key: REDIS_URL
          readinessProbe:
            tcpSocket:
              port: 7420
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            tcpSocket:
              port: 7420
            initialDelaySeconds: 15
            periodSeconds: 10
      imagePullSecrets:
        - name: faktory-server-ent-login

As you can see, the only meaningful difference between the primary and the secondary deployments is the projected config volume, which carries the configs only the primary needs (plus the role label that keeps the two Deployments’ selectors distinct). We also needed to add the enterprise registry login secret, a base64-encoded Docker config, so the image can be pulled.

If you look closely, you can also see that we have a config for the statsd exporter: we export metrics to be consumed by Prometheus. I might write a separate post about that setup.

Our cron config file looks something like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: faktory-server-cron
data:
  cron.toml: |2

    # This file defines the jobs that should be scheduled via the Faktory Server
    
    # All CRON schedule times are in UTC to produce the times in CDT (UTC-5, local Viva time) indicated in the comment
    
    # Faktory uses an exponential backoff algorithm for retries:  15 + count ^ 4 + (rand(30) * (count + 1))
    # The default is 25 - which is approx 3 weeks.  20 tries is about two days.
    
    [[cron]]
    schedule = "0 12 27 * *"
    [cron.job]
    type = "MyWorker"
    args = ["SomeArgument"]
    retry = 5
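
On the worker side, each scheduled job type maps to a handler registered with the worker library. Here is a rough sketch using Faktory’s Go worker package; our real handlers live in the application codebase, so the handler body is a placeholder:

package main

import (
	"context"
	"log"

	worker "github.com/contribsys/faktory_worker_go"
)

// myWorker handles jobs of type "MyWorker" scheduled by the cron config above.
func myWorker(ctx context.Context, args ...interface{}) error {
	log.Printf("running MyWorker with args: %v", args)
	return nil
}

func main() {
	mgr := worker.NewManager()
	mgr.Register("MyWorker", myWorker)
	mgr.Concurrency = 10
	mgr.Run() // blocks, fetching jobs through the load balancer
}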

Issues with Job Acknowledgement

The changes above fixed the scheduling duplication, and I thought we were grooving at that point. However, after some time we started observing some weird behaviour: jobs would finish successfully, only to be enqueued again roughly half an hour after they initially ran. This was really bizarre and baffled me for a while.

After consulting with Faktory’s author, we learned that half an hour is the timeout between when a job is picked up and when it is freed up again if the worker never acknowledges its completion. It turned out that our workers would sometimes fetch a job from server A but acknowledge its completion with server B, so after 30 minutes server A would free the job up to be enqueued again. This happens because the reserved status of a job is held in the server’s memory and is not represented in the persistence layer (Redis in our case), so there is no shared knowledge of in-flight jobs across our Faktory servers.
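
For context, that window is the job’s reservation period (reserve_for in the job payload, 1800 seconds by default). The Go client exposes it per job, though in our situation tuning it would only delay the duplicate run, not prevent it; the job type here is again a hypothetical placeholder:

package jobs

import (
	faktory "github.com/contribsys/faktory/client"
)

// newLongReservationJob builds a job with a wider acknowledgement window.
// This delays the spurious re-enqueue but does not fix the cross-server
// acknowledgement problem described above.
func newLongReservationJob(paymentID string) *faktory.Job {
	job := faktory.NewJob("ProcessPayment", paymentID)
	job.ReserveFor = 3600 // seconds; the default is 1800 (30 minutes)
	return job
}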

This is, unfortunately, where we hit a dead end. After some discussion with Faktory’s author, it became clear that changing this implementation detail is not something he is interested in doing. We decided to keep our deployment structure but run only the single primary replica, and to add a table to our database to dump jobs into in case the server fails, so we don’t lose them while we fix the issue.
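
Here is a rough sketch of that plan B. The failed_enqueues table, its schema, and the Postgres-style placeholders are illustrative assumptions, not our exact implementation:

package jobs

import (
	"database/sql"
	"encoding/json"

	faktory "github.com/contribsys/faktory/client"
)

// pushWithFallback tries to enqueue the job on the Faktory server; if that
// fails, it persists the serialized job to a database table so it can be
// replayed once the server is healthy again.
func pushWithFallback(db *sql.DB, job *faktory.Job) error {
	cl, err := faktory.Open()
	if err == nil {
		defer cl.Close()
		if err = cl.Push(job); err == nil {
			return nil
		}
	}

	// Fallback: dump the job payload into a table instead of losing it.
	payload, jerr := json.Marshal(job)
	if jerr != nil {
		return jerr
	}
	_, dberr := db.Exec(
		"INSERT INTO failed_enqueues (jid, payload) VALUES ($1, $2)",
		job.Jid, payload,
	)
	return dberr
}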

In conclusion, this was a nice learning process, and I’m happy I went down this rabbit hole. Although I don’t agree with the author’s arguments against making the server redundant, I understand his reluctance to change the system drastically just for that. Our plan B solution should work well; here’s hoping it never actually gets put to the test!