Branko (@brankopetric00)'s Twitter Profile
Branko

@brankopetric00

DevOps Engineer | AWS, Terraform, Kubernetes
kubewhisper.com - human language to kubectl commands
@prmptvault - ultimate prompt engineering platform

ID: 1157414015666184192

https://brankopetric.com | Joined: 02-08-2019 22:12:59

442 Tweets

3.3K Followers

275 Following

Implemented chaos engineering in production. Simulated failure of payment service. Actual customers couldn't pay. Chaos very realistic.

We A/B tested our Python Lambda on x86 vs. ARM (Graviton2). Results:

- x86 (1GB mem): 150ms execution time.
- ARM (1GB mem): 110ms execution time.

Graviton2 was 26% faster and 20% cheaper. It wasn't a free migration. We had to rebuild 3 native C++ dependencies for arm64. But…
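
For reference, the switch itself is small once every native dependency has an arm64 build; a minimal boto3 sketch, with a hypothetical function name and artifact location (the architecture travels with the code package, which is why the rebuilds come first):

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical names; adjust for your own function and artifact location.
FUNCTION_NAME = "my-python-fn"
ARM_BUNDLE_BUCKET = "my-artifacts"
ARM_BUNDLE_KEY = "my-python-fn/arm64/bundle.zip"

# Architecture is tied to the code package, so it is switched via
# update_function_code together with a bundle whose native deps
# were rebuilt for arm64.
lambda_client.update_function_code(
    FunctionName=FUNCTION_NAME,
    S3Bucket=ARM_BUNDLE_BUCKET,
    S3Key=ARM_BUNDLE_KEY,
    Architectures=["arm64"],
)

# Confirm the change took effect.
cfg = lambda_client.get_function_configuration(FunctionName=FUNCTION_NAME)
print(cfg["Architectures"], cfg["MemorySize"])
```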

We ran a security scan. 90% of our K8s pods had a Service Account token mounted in /var/run/secrets/... We asked the devs: 'Do you use this to talk to the K8s API?' They said no. The problem: automountServiceAccountToken is true by default. Every pod gets a token, even if it…
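
A script-sized version of that audit, sketched with the kubernetes Python client and local kubeconfig credentials (not the scanner we actually ran):

```python
from kubernetes import client, config

# List pods whose spec still allows the default ServiceAccount token automount.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    # None means "not set", which falls back to the default of true
    # (unless the ServiceAccount itself disables it).
    if pod.spec.automount_service_account_token in (None, True):
        print(f"{pod.metadata.namespace}/{pod.metadata.name} automounts a token")
```

For workloads that never talk to the API, the cleanup is setting automountServiceAccountToken: false on the pod spec or on the ServiceAccount.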

We created VPC Interface Endpoints (PrivateLink) for SQS, SNS, and other AWS services.

Goal: Better security (no internet traffic).
Problem: Our AWS bill increased by $890/month.

- VPC Gateway Endpoints (for S3, DynamoDB) are free.
- VPC Interface Endpoints (for everything…
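
A quick way to see what you are actually running, as a rough boto3 sketch (Interface endpoints bill per AZ-hour plus per GB processed, Gateway endpoints do not; check current pricing for your region):

```python
import boto3

ec2 = boto3.client("ec2")

# Group the VPC endpoints by type: "Gateway" endpoints are free,
# "Interface" endpoints accrue hourly and data-processing charges.
resp = ec2.describe_vpc_endpoints()
for ep in resp["VpcEndpoints"]:
    print(ep["VpcEndpointType"], ep["ServiceName"], ep["VpcId"])
```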

We cut our EKS compute bill by 40% by switching from Cluster Autoscaler to Karpenter. Why:

- Cluster Autoscaler is reactive. Pods go Pending, then it finds a node.
- Karpenter is pod-driven. It observes the pod spec before it's scheduled.
- It then provisions the *perfect*, …
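
To make "pod-driven" concrete, a small illustration (this is not Karpenter's code, just the kubernetes Python client reading the same signal Karpenter starts from: Pending pods and the resources they request):

```python
from kubernetes import client, config

# Assumes local kubeconfig credentials.
config.load_kube_config()
v1 = client.CoreV1Api()

# A pod-driven provisioner works from the Pending pods and their requests,
# rather than waiting for node-group utilization metrics to move.
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items
for pod in pending:
    for c in pod.spec.containers:
        requests = (c.resources.requests or {}) if c.resources else {}
        print(pod.metadata.name, requests.get("cpu"), requests.get("memory"))
```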

The goal of automation is not to make a 10-minute task take 1 minute. It's to make a 10-minute task impossible to do wrong.

Our incident started with a single pod failing. What happened:

- The load balancer correctly removed it.
- Traffic shifted to the remaining 9 pods.
- This slightly increased their memory usage.
- Which pushed them just over the K8s memory limit.
- One by one, the OOMKiller…
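
The arithmetic behind that kind of cascade is worth writing down. The numbers below are made up; only the shape matters:

```python
# 10 replicas share the load, each already close to its memory limit,
# and per-pod memory roughly scales with its share of the traffic.
LIMIT_MB = 1024
BASELINE_MB = 950   # per-pod usage with all 10 replicas healthy
replicas = 10

while replicas > 1:
    replicas -= 1                             # one pod dies or gets evicted
    per_pod_mb = BASELINE_MB * 10 / replicas  # survivors absorb its traffic
    status = "OOMKill" if per_pod_mb > LIMIT_MB else "ok"
    print(f"{replicas} pods left -> ~{per_pod_mb:.0f}MB each ({status})")
```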

A blameless post-mortem isn't about sparing feelings. It's the sober admission that human error is just a symptom of a system that allowed that error to be catastrophic.

For fun, I added a Grafana dashboard tracking the age of the oldest pod in each deployment. A month later, we noticed one service's oldest pod was 30 days old, while all the others were 2-3 days old. We checked it. It was a 'zombie' pod that failed its health check but somehow never got…
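
A script-sized sketch of the same idea, grouping by the app label rather than by deployment and assuming local kubeconfig credentials:

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Track the age of the oldest pod per "app" label.
oldest = {}
now = datetime.now(timezone.utc)
for pod in v1.list_pod_for_all_namespaces().items:
    app = (pod.metadata.labels or {}).get("app", "unlabelled")
    age_days = (now - pod.metadata.creation_timestamp).days
    oldest[app] = max(oldest.get(app, 0), age_days)

# An outlier at the top of this list is exactly the kind of zombie described above.
for app, days in sorted(oldest.items(), key=lambda kv: -kv[1]):
    print(f"{app}: oldest pod is {days}d old")
```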

We reduced deployment time from 45 minutes to 8 minutes. The fix? Stopped running the entire test suite on deploy. Instead, we run it on PR merge and deploy the artifact that passed tests. Seems obvious now, but for 2 years we were testing the exact same code twice because…

S3 has two ways to serve files. They're not the same.

S3 REST API (recommended):
- URL: bucket-name.s3.amazonaws.com/file.jpg
- Private bucket with OAC (Origin Access Control)
- HTTPS between CloudFront and S3
- CloudFront Functions add index.html routing
- CloudFront handles custom error…
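
For the OAC piece, the bucket-side half is just a policy that lets one CloudFront distribution read an otherwise private bucket (the OAC itself is created on the CloudFront side). A sketch with placeholder account and distribution IDs:

```python
import json

import boto3

BUCKET = "bucket-name"
ACCOUNT_ID = "123456789012"           # placeholder
DISTRIBUTION_ID = "EDFDVBD6EXAMPLE"   # placeholder

# Bucket policy that pairs with OAC: the bucket stays private and only
# the named CloudFront distribution may read objects from it.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCloudFrontServicePrincipalReadOnly",
        "Effect": "Allow",
        "Principal": {"Service": "cloudfront.amazonaws.com"},
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
        "Condition": {"StringEquals": {
            "AWS:SourceArn": f"arn:aws:cloudfront::{ACCOUNT_ID}:distribution/{DISTRIBUTION_ID}"
        }},
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```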

Our app handled 10k users fine. At 100k users, everything was still smooth. At 150k users, the database started timing out at 2am every night. The culprit? A nightly analytics query that worked fine at small scale. It was doing a full table scan, and its cost kept climbing as the table grew.

Choosing between eventual consistency and strong consistency isn't a technical decision. It's a business decision disguised as one. When we finally started asking 'what happens if this data is wrong for 30 seconds?' instead of debating CAP theorem, architecture conversations…

Added Redis caching to fix our slow API responses. Performance improved 10x. Two months later:

- Cache invalidation bugs everywhere
- Data inconsistency issues
- Debugging became a nightmare
- Junior devs confused about source of truth

We ripped it out and optimized the…

We had two services that were constantly in a race condition. Service A would call B, but B needed data from A that A hadn't written to the cache yet. We tried everything complex. Distributed locks. Queues. The final fix? A dev added an intentional 500ms delay to Service A's API…

AWS locks S3 bucket names for a couple of hours.

Let's say you had an S3 bucket `frontend-app` created in us-east-1 and you want to change its region to eu-central-1.

You delete the S3 bucket in the us-east-1 region. AWS locks the bucket name and throws the error below:

`A conflicting…`
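
The sequence that triggers it, as a boto3 sketch (assumes the bucket is already empty; expect the create to keep failing for a while after the delete):

```python
import boto3
from botocore.exceptions import ClientError

s3_use1 = boto3.client("s3", region_name="us-east-1")
s3_euc1 = boto3.client("s3", region_name="eu-central-1")

# Delete the bucket in the old region (it must already be empty).
s3_use1.delete_bucket(Bucket="frontend-app")

# Recreating the same name in another region fails until AWS releases it.
try:
    s3_euc1.create_bucket(
        Bucket="frontend-app",
        CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
    )
except ClientError as e:
    print(e.response["Error"]["Code"], e.response["Error"]["Message"])
```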

Stripe should implement charge limits which could be configured by the user during checkout. E.g. you want to subscribe to X app, it's $20/month, so you set the max charge amount on your card to $30.

We used sticky sessions on our load balancer. It was an easy fix for a stateful web app we inherited. When we scaled from 4 to 40 nodes, 'stickiness' meant traffic was horribly imbalanced. Some new nodes got 0 traffic, while old nodes were 100% saturated with 'sticky' users.

Black Friday 2024. Our checkout service went down at around 11 PM. What we discovered:

- Payment gateway was fine
- Database was fine
- Load balancers were fine
- But every request was timing out after exactly 30 seconds

The culprit? A third-party fraud detection service we…
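
The general shape of bounding a call like this, sketched with a hypothetical endpoint, an explicit timeout, and a deliberate fallback (whether to fail open or closed is a business call):

```python
import requests

# Hypothetical fraud-check endpoint; the point is the explicit, short timeout
# instead of inheriting some default 30-second one deep in the stack.
FRAUD_CHECK_URL = "https://fraud.example.com/check"

def check_fraud(order: dict) -> bool:
    """Return True if the order looks fraudulent; fail open on vendor trouble."""
    try:
        resp = requests.post(FRAUD_CHECK_URL, json=order, timeout=(1, 2))
        resp.raise_for_status()
        return resp.json().get("fraudulent", False)
    except requests.RequestException:
        # Checkout keeps working if the vendor is slow or down;
        # flag the order for asynchronous review instead (not shown).
        return False
```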

Our devs set most new Lambda functions to 2048MB of memory, 'just in case.' We ran the AWS Lambda Power Tuning tool. Most functions never used more than 300MB.

- We right-sized 50+ functions in an afternoon.
- The cost-per-invocation dropped significantly.
- Our Lambda bill…
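
A rough boto3 sketch of the audit side (the target sizes themselves came from the Power Tuning runs; the function name in the last call is hypothetical):

```python
import boto3

lambda_client = boto3.client("lambda")

# Find functions still sitting at 2048MB or more "just in case".
oversized = []
for page in lambda_client.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        if fn["MemorySize"] >= 2048:
            oversized.append(fn["FunctionName"])

print(f"{len(oversized)} functions configured with >=2048MB")

# Applying one result (hypothetical function name and target size).
lambda_client.update_function_configuration(
    FunctionName="orders-enricher", MemorySize=512
)
```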