Branko (@brankopetric00)'s Twitter Profile
Branko

@brankopetric00

DevOps Engineer | AWS, Terraform, Kubernetes
kubewhisper.com - human language to kubectl commands
@prmptvault - ultimate prompt engineering platform

ID: 1157414015666184192

https://brankopetric.com | Joined: 02-08-2019 22:12:59

442 Tweets

3.3K Followers

275 Following

Implemented chaos engineering in production. Simulated failure of payment service. Actual customers couldn't pay. Chaos very realistic.

We A/B tested our Python Lambda on x86 vs. ARM (Graviton2). Results:

- x86 (1GB mem): 150ms execution time.
- ARM (1GB mem): 110ms execution time.

Graviton2 was 26% faster and 20% cheaper. It wasn't a free migration. We had to rebuild 3 native C++ dependencies for arm64. But…
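
For reference, the switch itself is small once every native dependency has an arm64 build; a minimal boto3 sketch, with a hypothetical function name and artifact location (the architecture travels with the code package, which is why the rebuilds come first):

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical names; adjust for your own function and artifact location.
FUNCTION_NAME = "my-python-fn"
ARM_BUNDLE_BUCKET = "my-artifacts"
ARM_BUNDLE_KEY = "my-python-fn/arm64/bundle.zip"

# Architecture is tied to the code package, so it is switched via
# update_function_code together with a bundle whose native deps
# were rebuilt for arm64.
lambda_client.update_function_code(
    FunctionName=FUNCTION_NAME,
    S3Bucket=ARM_BUNDLE_BUCKET,
    S3Key=ARM_BUNDLE_KEY,
    Architectures=["arm64"],
)

# Confirm the change took effect.
cfg = lambda_client.get_function_configuration(FunctionName=FUNCTION_NAME)
print(cfg["Architectures"], cfg["MemorySize"])
```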

We ran a security scan. 90% of our K8s pods had a Service Account token mounted in /var/run/secrets/... We asked the devs: 'Do you use this to talk to the K8s API?' They said no. The problem: automountServiceAccountToken is true by default. Every pod gets a token, even if it…
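
A script-sized version of that audit, sketched with the kubernetes Python client and local kubeconfig credentials (not the scanner we actually ran):

```python
from kubernetes import client, config

# List pods whose spec still allows the default ServiceAccount token automount.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    # None means "not set", which falls back to the default of true
    # (unless the ServiceAccount itself disables it).
    if pod.spec.automount_service_account_token in (None, True):
        print(f"{pod.metadata.namespace}/{pod.metadata.name} automounts a token")
```

For workloads that never talk to the API, the cleanup is setting automountServiceAccountToken: false on the pod spec or on the ServiceAccount.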

We created VPC Interface Endpoints (PrivateLink) for SQS, SNS, and other AWS services.

Goal: Better security (no internet traffic).
Problem: Our AWS bill increased by $890/month.

- VPC Gateway Endpoints (for S3, DynamoDB) are free.
- VPC Interface Endpoints (for everything…
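
A quick way to see what you are actually running, as a rough boto3 sketch (Interface endpoints bill per AZ-hour plus per GB processed, Gateway endpoints do not; check current pricing for your region):

```python
import boto3

ec2 = boto3.client("ec2")

# Group the VPC endpoints by type: "Gateway" endpoints are free,
# "Interface" endpoints accrue hourly and data-processing charges.
resp = ec2.describe_vpc_endpoints()
for ep in resp["VpcEndpoints"]:
    print(ep["VpcEndpointType"], ep["ServiceName"], ep["VpcId"])
```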

We cut our EKS compute bill by 40% by switching from Cluster Autoscaler to Karpenter. Why:

- Cluster Autoscaler is reactive. Pods go Pending, then it finds a node.
- Karpenter is pod-driven. It observes the pod spec before it's scheduled.
- It then provisions the *perfect*, …
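
To make "pod-driven" concrete, a small illustration (this is not Karpenter's code, just the kubernetes Python client reading the same signal Karpenter starts from: Pending pods and the resources they request):

```python
from kubernetes import client, config

# Assumes local kubeconfig credentials.
config.load_kube_config()
v1 = client.CoreV1Api()

# A pod-driven provisioner works from the Pending pods and their requests,
# rather than waiting for node-group utilization metrics to move.
pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items
for pod in pending:
    for c in pod.spec.containers:
        requests = (c.resources.requests or {}) if c.resources else {}
        print(pod.metadata.name, requests.get("cpu"), requests.get("memory"))
```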

The goal of automation is not to make a 10-minute task take 1 minute. It's to make a 10-minute task impossible to do wrong.

Our incident started with a single pod failing. What happened:

- The load balancer correctly removed it.
- Traffic shifted to the remaining 9 pods.
- This slightly increased their memory usage.
- Which pushed them just over the K8s memory limit.
- One by one, the OOMKiller…
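
The arithmetic behind that kind of cascade is worth writing down. The numbers below are made up; only the shape matters:

```python
# 10 replicas share the load, each already close to its memory limit,
# and per-pod memory roughly scales with its share of the traffic.
LIMIT_MB = 1024
BASELINE_MB = 950   # per-pod usage with all 10 replicas healthy
replicas = 10

while replicas > 1:
    replicas -= 1                             # one pod dies or gets evicted
    per_pod_mb = BASELINE_MB * 10 / replicas  # survivors absorb its traffic
    status = "OOMKill" if per_pod_mb > LIMIT_MB else "ok"
    print(f"{replicas} pods left -> ~{per_pod_mb:.0f}MB each ({status})")
```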

A blameless post-mortem isn't about sparing feelings. It's the sober admission that human error is just a symptom of a system that allowed that error to be catastrophic.

For fun, I added a Grafana dashboard tracking the age of the oldest pod in each deployment. A month later, we noticed one service's oldest pod was 30 days old, while all the others were 2-3 days old. We checked it. It was a 'zombie' pod that failed its health check but somehow never got…
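
A script-sized sketch of the same idea, grouping by the app label rather than by deployment and assuming local kubeconfig credentials:

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Track the age of the oldest pod per "app" label.
oldest = {}
now = datetime.now(timezone.utc)
for pod in v1.list_pod_for_all_namespaces().items:
    app = (pod.metadata.labels or {}).get("app", "unlabelled")
    age_days = (now - pod.metadata.creation_timestamp).days
    oldest[app] = max(oldest.get(app, 0), age_days)

# An outlier at the top of this list is exactly the kind of zombie described above.
for app, days in sorted(oldest.items(), key=lambda kv: -kv[1]):
    print(f"{app}: oldest pod is {days}d old")
```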

We reduced deployment time from 45 minutes to 8 minutes. The fix? Stopped running the entire test suite on deploy. Instead, we run it on PR merge and deploy the artifact that passed tests. Seems obvious now, but for 2 years we were testing the exact same code twice because…

S3 has two ways to serve files. They're not the same.

S3 REST API (recommended):
- URL: bucket-name.s3.amazonaws.com/file.jpg
- Private bucket with OAC (Origin Access Control)
- HTTPS between CloudFront and S3
- CloudFront Functions add index.html routing
- CloudFront handles custom error…
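
For the OAC piece, the bucket-side half is just a policy that lets one CloudFront distribution read an otherwise private bucket (the OAC itself is created on the CloudFront side). A sketch with placeholder account and distribution IDs:

```python
import json

import boto3

BUCKET = "bucket-name"
ACCOUNT_ID = "123456789012"           # placeholder
DISTRIBUTION_ID = "EDFDVBD6EXAMPLE"   # placeholder

# Bucket policy that pairs with OAC: the bucket stays private and only
# the named CloudFront distribution may read objects from it.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCloudFrontServicePrincipalReadOnly",
        "Effect": "Allow",
        "Principal": {"Service": "cloudfront.amazonaws.com"},
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
        "Condition": {"StringEquals": {
            "AWS:SourceArn": f"arn:aws:cloudfront::{ACCOUNT_ID}:distribution/{DISTRIBUTION_ID}"
        }},
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```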

Our app handled 10k users fine. At 100k users, everything was still smooth. At 150k users, the database started timing out at 2am every night. The culprit? A nightly analytics query that worked fine at small scale. It was doing a full table scan, and its cost kept climbing as the table grew.

Choosing between eventual consistency and strong consistency isn't a technical decision. It's a business decision disguised as one. When we finally started asking 'what happens if this data is wrong for 30 seconds?' instead of debating CAP theorem, architecture conversations…

Added Redis caching to fix our slow API responses. Performance improved 10x. Two months later:

- Cache invalidation bugs everywhere
- Data inconsistency issues
- Debugging became a nightmare
- Junior devs confused about source of truth

We ripped it out and optimized the…

We had two services that were constantly in a race condition. Service A would call B, but B needed data from A that A hadn't written to the cache yet. We tried everything complex. Distributed locks. Queues. The final fix? A dev added an intentional 500ms delay to Service A's API…

AWS locks S3 bucket names for a couple of hours.

Let's say you had an S3 bucket `frontend-app` created in us-east-1 and you want to change its region to eu-central-1.

You delete the S3 bucket in the us-east-1 region. AWS locks the bucket name and throws the error below:

`A conflicting…`
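
The sequence that triggers it, as a boto3 sketch (assumes the bucket is already empty; expect the create to keep failing for a while after the delete):

```python
import boto3
from botocore.exceptions import ClientError

s3_use1 = boto3.client("s3", region_name="us-east-1")
s3_euc1 = boto3.client("s3", region_name="eu-central-1")

# Delete the bucket in the old region (it must already be empty).
s3_use1.delete_bucket(Bucket="frontend-app")

# Recreating the same name in another region fails until AWS releases it.
try:
    s3_euc1.create_bucket(
        Bucket="frontend-app",
        CreateBucketConfiguration={"LocationConstraint": "eu-central-1"},
    )
except ClientError as e:
    print(e.response["Error"]["Code"], e.response["Error"]["Message"])
```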

Stripe should implement charge limits which could be configured by the user during checkout. E.g. you want to subscribe to X app, it's $20/month, so you set the max charge amount on your card to $30.

We used sticky sessions on our load balancer. It was an easy fix for a stateful web app we inherited. When we scaled from 4 to 40 nodes, 'stickiness' meant traffic was horribly imbalanced. Some new nodes got 0 traffic, while old nodes were 100% saturated with 'sticky' users.

Black Friday 2024. Our checkout service went down at around 11 PM. What we discovered:

- Payment gateway was fine
- Database was fine
- Load balancers were fine
- But every request was timing out after exactly 30 seconds

The culprit? A third-party fraud detection service we…
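
The general shape of bounding a call like this, sketched with a hypothetical endpoint, an explicit timeout, and a deliberate fallback (whether to fail open or closed is a business call):

```python
import requests

# Hypothetical fraud-check endpoint; the point is the explicit, short timeout
# instead of inheriting some default 30-second one deep in the stack.
FRAUD_CHECK_URL = "https://fraud.example.com/check"

def check_fraud(order: dict) -> bool:
    """Return True if the order looks fraudulent; fail open on vendor trouble."""
    try:
        resp = requests.post(FRAUD_CHECK_URL, json=order, timeout=(1, 2))
        resp.raise_for_status()
        return resp.json().get("fraudulent", False)
    except requests.RequestException:
        # Checkout keeps working if the vendor is slow or down;
        # flag the order for asynchronous review instead (not shown).
        return False
```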

Our devs set most new Lambda functions to 2048MB of memory, 'just in case.' We ran the AWS Lambda Power Tuning tool. Most functions never used more than 300MB.

- We right-sized 50+ functions in an afternoon.
- The cost-per-invocation dropped significantly.
- Our Lambda bill…
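
A rough boto3 sketch of the audit side (the target sizes themselves came from the Power Tuning runs; the function name in the last call is hypothetical):

```python
import boto3

lambda_client = boto3.client("lambda")

# Find functions still sitting at 2048MB or more "just in case".
oversized = []
for page in lambda_client.get_paginator("list_functions").paginate():
    for fn in page["Functions"]:
        if fn["MemorySize"] >= 2048:
            oversized.append(fn["FunctionName"])

print(f"{len(oversized)} functions configured with >=2048MB")

# Applying one result (hypothetical function name and target size).
lambda_client.update_function_configuration(
    FunctionName="orders-enricher", MemorySize=512
)
```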