Imran M's Blog

🚀 Acing the MLOps Interview: Building a Production-Grade Real-Time Fraud Detection Pipeline on AWS

Imran M — Wed, 28 Jan 2026 14:06:04 GMT

Author's Note: A co-learner's guide to designing a highly reliable, scalable, and secure MLOps architecture, based on a real Domain Expert Interview in which I dissected an LLM-generated response to a complex MLOps design challenge. I share what the response did well, what it missed, and how you can truly ace such interviews.

Introduction

As an MLOps and Cloud Engineer with more than a decade of experience designing and automating solutions for ML models, I recently attended a Domain Expert Interview at Mercor AI covering MLOps, cloud, system design, and data/AI. The interview presented a fascinating challenge: critique an LLM-generated response for designing a real-time fraud detection MLOps pipeline on AWS.

This article serves as a comprehensive guide on how to ace such interviews by identifying gaps, proposing improvements, and demonstrating the depth of knowledge expected from senior engineers.

The Interview Context

🎯 Interview Format: LLM Response Judging

The interview was structured as follows:

Whiteboard (for candidate):

You're viewing an LLM-generated prompt and its response. Please critique out loud how well the response addresses each requirement and what improvements are needed. Tie your points directly to the text of the prompt and response.

Take up to 4 minutes to read silently. Then we'll discuss for about 7 minutes. You may write notes on paper; do not use any external resources. When you've finished reading, say "I'm ready to discuss."

📝 The Original Prompt (to the LLM):

"Design a real-time fraud detection MLOps pipeline on AWS. Your design must include at least the following:

A data ingestion and feature engineering architecture that processes streaming transaction events with low latency.

A secure, versioned model training and deployment workflow using AWS services.

CI/CD processes for both data pipelines and model code, including automated rollback on failures.

Comprehensive monitoring, alerting, and drift detection for both data inputs and model predictions.

Cost-optimization strategies that maintain required SLAs without overspending.

Provide a diagram-level description plus a detailed step-by-step explanation."

💬 Follow-up Question from the AI Interviewer:

"Let us continue with one of the follow-up questions from the whiteboard. How would you evaluate if the monitoring and drift detection in the proposed design are sufficient for a high-volume fraud detection scenario on AWS?"

The Original LLM Response Under Review

Here's the response that was presented for critique:

#	Component	LLM's Proposed Solution
1	Data Ingestion & Feature Engineering	Ingest data via Amazon Kinesis Data Streams; transform raw events with AWS Lambda; write features to Amazon S3.
2	Model Training & Deployment	Trigger AWS SageMaker training jobs using Amazon EventBridge; register models in SageMaker Model Registry; use IAM roles for security.
3	CI/CD Processes	Build Docker images for feature code and model code; push to ECR; configure AWS CodePipeline with CodeCommit, CodeBuild, CodeDeploy; include test stage that rolls back on failures.
4	Monitoring & Drift Detection	Set up CloudWatch alarms and dashboards for endpoint latency and error rate; use SageMaker Model Monitor to detect data drift; send SNS notifications on threshold breaches.
5	Cost Optimization	Optimize costs by using SageMaker Spot Instances for training; implement auto-scaling policies on Kinesis shards based on traffic; lifecycle transition rules to move old data to Glacier.

Deep Dive: Expert Analysis & Improvements

🔍 The Core Interview Question:

"How would you ensure an MLOps pipeline is highly reliable, scalable, and secured for production use-case?"

Let me break this down across three critical dimensions:

Question 1: Data Ingestion Gaps & Optimizations

❌ What's Missing in the Original Response?

The LLM's response suggests: "Ingest data via Amazon Kinesis Data Streams; transform raw events with AWS Lambda; write features to Amazon S3."

Critical Gaps Identified:

┌─────────────────────────────────────────────────────────────────────┐
│                    MISSING ASPECTS IN DATA INGESTION                │
├─────────────────────────────────────────────────────────────────────┤
│ ❌ No Schema Validation/Evolution Strategy                          │
│ ❌ No Dead Letter Queue (DLQ) for Failed Events                     │
│ ❌ No Data Quality Checks at Ingestion                              │
│ ❌ Lambda Cold Start Issues for Real-Time Processing                │
│ ❌ No Feature Store Integration                                     │
│ ❌ No Exactly-Once Processing Guarantees                            │
│ ❌ No Encryption in Transit/At Rest Details                         │
│ ❌ No Backpressure Handling Mechanism                               │
│ ❌ No Data Lineage Tracking                                         │
│ ❌ S3 is NOT Suitable for Real-Time Feature Serving                 │
└─────────────────────────────────────────────────────────────────────┘

✅ Production-Grade Data Ingestion Architecture

                                    ┌──────────────────┐
                                    │  Schema Registry │
                                    │  (AWS Glue)      │
                                    └────────┬─────────┘
                                             │ Validate
                                             ▼
┌─────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Transaction │───▶│ Amazon MSK      │───▶│ Apache Flink    │
│ Sources     │    │ (Kafka)         │    │ (Kinesis Data   │
│             │    │                 │    │  Analytics)     │
└─────────────┘    └────────┬────────┘    └────────┬────────┘
                            │                      │
                            ▼                      ▼
                   ┌─────────────────┐    ┌─────────────────┐
                   │ Dead Letter     │    │ Amazon SageMaker│
                   │ Queue (SQS)     │    │ Feature Store   │
                   └─────────────────┘    └────────┬────────┘
                                                   │
                            ┌──────────────────────┼──────────────────────┐
                            ▼                      ▼                      ▼
                   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
                   │ Online Store    │    │ Offline Store   │    │ ElastiCache     │
                   │ (DynamoDB)      │    │ (S3 + Athena)   │    │ (Redis)         │
                   └─────────────────┘    └─────────────────┘    └─────────────────┘
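
The gap list above calls out missing exactly-once guarantees. One common way to get an exactly-once effect on top of at-least-once delivery is idempotent processing keyed on the transaction ID. The sketch below is illustrative only: `IdempotentProcessor`, `score_amount`, and the in-memory dict are hypothetical stand-ins for a durable store such as Flink keyed state or a DynamoDB conditional write.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class IdempotentProcessor:
    """Applies a handler at most once per transaction_id.

    The in-memory dict is an assumption for illustration; a real pipeline
    would keep this state in a durable, shared store.
    """
    handler: Callable[[dict], float]
    _seen: Dict[str, float] = field(default_factory=dict)

    def process(self, event: dict) -> float:
        key = event["transaction_id"]
        if key in self._seen:
            # Redelivered event: return the stored result, do not re-apply.
            return self._seen[key]
        result = self.handler(event)
        self._seen[key] = result
        return result


calls: List[str] = []


def score_amount(event: dict) -> float:
    """Placeholder computation (not a real model); records each invocation."""
    calls.append(event["transaction_id"])
    return event["amount"] * 0.01


processor = IdempotentProcessor(handler=score_amount)
first = processor.process({"transaction_id": "t-1", "amount": 250.0})
replayed = processor.process({"transaction_id": "t-1", "amount": 250.0})
print(first == replayed, calls)
```

The replayed event returns the stored result without invoking the handler a second time, which is the property a downstream feature store or ledger relies on.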

🛠️ Recommended Tooling Changes

Original	Improved	Rationale
Kinesis Data Streams	Amazon MSK (Managed Kafka)	Better exactly-once semantics, higher throughput, easier replay capabilities, broader ecosystem compatibility
AWS Lambda	Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics)	Avoids Lambda cold starts, provides stateful processing, better windowing functions, handles backpressure natively
S3 for Features	SageMaker Feature Store	Purpose-built for ML, provides both online (low-latency) and offline (batch) stores, automatic feature versioning
(Missing)	AWS Glue Schema Registry	Schema validation, evolution, and compatibility checking
(Missing)	ElastiCache (Redis)	Sub-millisecond feature retrieval for real-time inference
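
To make the schema validation and Dead Letter Queue rows concrete, here is a minimal sketch of validate-then-route logic. It assumes a hypothetical `REQUIRED_FIELDS` schema and plain Python lists in place of the Glue Schema Registry and an SQS queue; it is not the registry's API.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical required schema: field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {
    "transaction_id": str,
    "amount": float,
    "merchant_category": str,
    "event_timestamp": str,
}


def validate_event(event: Dict[str, Any]) -> List[str]:
    """Return a list of schema violations (an empty list means valid)."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            errors.append(f"bad type for {name}: {type(event[name]).__name__}")
    return errors


def route(events: List[Dict[str, Any]]) -> Tuple[List[dict], List[dict]]:
    """Split events into (valid, dead_letter) without dropping anything."""
    valid, dead_letter = [], []
    for event in events:
        errors = validate_event(event)
        if errors:
            # Keep the original payload plus the reasons, so it can be replayed.
            dead_letter.append({"event": event, "errors": errors})
        else:
            valid.append(event)
    return valid, dead_letter


good = {"transaction_id": "t-1", "amount": 42.5,
        "merchant_category": "grocery", "event_timestamp": "2026-01-28T14:06:04Z"}
bad = {"transaction_id": "t-2", "amount": "42.5"}  # wrong type, missing fields

valid_events, dlq_events = route([good, bad])
print(len(valid_events), dlq_events[0]["errors"])
```

Keeping the rejected payload alongside its error list is the design choice that matters: failed events stay inspectable and replayable instead of disappearing.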

🔐 Security Enhancements for Data Ingestion

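# Example: secure Kinesis/MSK configuration with encryption.
# Illustrative settings only; the keys are not an actual AWS API schema.
secure_ingestion_config = {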
    "encryption_config": {
        "in_transit": {
            "tls_enabled": True,
            "tls_version": "TLS_1_2"
        },
        "at_rest": {
            "kms_key_id": "arn:aws:kms:us-east-1:123456789:key/mrk-xxx",
            "encryption_type": "KMS"
        }
    },
    "authentication": {
        "sasl_scram": True,
        "iam_authentication": True
    },
    "network_security": {
        "vpc_config": {
            "private_subnets_only": True,
            "security_groups": ["sg-fraud-detection-ingestion"]
        },
        "private_link_enabled": True
    }
}

📊 Data Quality Gates (Must-Have)

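# Great Expectations integration for data quality.
# Written against the legacy (pre-1.0) Great Expectations API; newer releases differ.
from datetime import datetime, timedelta, timezone

from great_expectations.core import ExpectationConfiguration, ExpectationSuite

fraud_detection_suite = ExpectationSuite(
    expectation_suite_name="fraud_transaction_validation"
)

# Freshness cutoff, computed when the suite is built: a string such as
# "now() - interval 5 minutes" would be passed through as a literal, not evaluated.
freshness_cutoff = datetime.now(timezone.utc) - timedelta(minutes=5)

# Critical expectations for fraud detection
expectations = [
    # Schema validation
    {"expectation_type": "expect_column_to_exist",
     "kwargs": {"column": "transaction_id"}},

    # Data freshness
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "event_timestamp",
                "min_value": freshness_cutoff}},

    # Completeness checks
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "amount"}},

    # Anomaly detection at ingestion
    {"expectation_type": "expect_column_values_to_be_between",
     "kwargs": {"column": "amount", "min_value": 0, "max_value": 1000000}},

    # Cardinality checks
    {"expectation_type": "expect_column_unique_value_count_to_be_between",
     "kwargs": {"column": "merchant_category", "min_value": 1, "max_value": 500}}
]

# Register each expectation on the suite; the list alone validates nothing.
for expectation in expectations:
    fraud_detection_suite.add_expectation(
        ExpectationConfiguration(**expectation)
    )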

Question 2: CloudWatch Observability - Is It Enough?

❌ Short Answer: NO, CloudWatch Alone is NOT Sufficient

The original response states: "Set up CloudWatch alarms and dashboards for endpoint latency and error rate; use SageMaker Model Monitor to detect data drift; send SNS notifications on threshold breaches."

🔍 Critical Analysis

┌─────────────────────────────────────────────────────────────────────────────┐
│              CloudWatch Limitations for ML Observability                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ✅ Good For:                    │  ❌ NOT Sufficient For:                  │
│  ─────────────────────────────── │  ──────────────────────────────────────  │
│  • Infrastructure metrics        │  • Feature drift detection               │
│  • Basic latency/error tracking  │  • Prediction and concept drift          │
│  • Log aggregation               │  • Model explainability tracking         │
│  • Simple threshold alerts       │  • A/B test analysis                     │
│                                  │  • Ground truth comparison               │
│                                  │  • Feature importance drift              │
│                                  │  • Cohort-level performance analysis     │
│                                  │  • Real-time model debugging             │
│                                  │  • ML-specific anomaly detection         │
│                                  │  • Business KPI correlation              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

✅ Production-Grade ML Observability Stack

┌─────────────────────────────────────────────────────────────────────────────┐
│                    COMPREHENSIVE ML OBSERVABILITY ARCHITECTURE               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐           │
│  │   CloudWatch    │   │   Prometheus    │   │    Grafana      │           │
│  │   (Infra)       │   │   (Metrics)     │   │   (Dashboards)  │           │
│  └────────┬────────┘   └────────┬────────┘   └────────┬────────┘           │
│           │                     │                     │                     │
│           └─────────────────────┼─────────────────────┘                     │
│                                 │                                           │
│                                 ▼                                           │
│  ┌─────────────────────────────────────────────────────────────┐           │
│  │                    ML OBSERVABILITY LAYER                    │           │
│  ├─────────────────────────────────────────────────────────────┤           │
│  │                                                             │           │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │           │
│  │  │   Evidently  │  │   Arize AI   │  │   Fiddler    │      │           │
│  │  │   (OSS)      │  │   or         │  │   or         │      │           │
│  │  │              │  │   WhyLabs    │  │   Arthur AI  │      │           │
│  │  └──────────────┘  └──────────────┘  └──────────────┘      │           │
│  │                                                             │           │
│  │  Capabilities:                                              │           │
│  │  • Data Drift Detection (PSI, KL Divergence, JS Distance)  │           │
│  │  • Concept Drift Detection                                  │           │
│  │  • Prediction Drift Monitoring                              │           │
│  │  • Feature Importance Tracking                              │           │
│  │  • Model Explainability (SHAP, LIME)                       │           │
│  │  • Cohort Analysis                                          │           │
│  │  • Ground Truth Integration                                 │           │
│  │                                                             │           │
│  └─────────────────────────────────────────────────────────────┘           │
│                                 │                                           │
│                                 ▼                                           │
│  ┌─────────────────────────────────────────────────────────────┐           │
│  │                   SAGEMAKER MODEL MONITOR                    │           │
│  │  (Enhanced Configuration - Not Basic Setup)                  │           │
│  └─────────────────────────────────────────────────────────────┘           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
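
The capabilities list names PSI and JS Distance as drift measures. As a rough sketch of what those numbers mean, the snippet below computes both over binned transaction amounts using only the standard library. The bin edges, sample values, and helper names (`histogram`, `psi`, `js_distance`) are assumptions for illustration, not part of any of the tools in the diagram.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Bin values into len(edges)-1 buckets and return smoothed proportions.

    Values outside [edges[0], edges[-1]] are ignored in this sketch.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    eps = 1e-6  # smoothing so empty bins do not produce log(0)
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float]) -> float:
    """Population Stability Index between two binned distributions."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance (square root of the base-2 JS divergence)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


edges = [0, 50, 100, 500, 1000]  # hypothetical amount buckets
baseline = [10, 20, 30, 40, 60, 70, 80, 120, 200, 300]
current = [600, 700, 800, 900, 650, 750, 20, 60, 150, 950]

p = histogram(baseline, edges)
q = histogram(current, edges)
score = psi(p, q)
dist = js_distance(p, q)
print(f"PSI={score:.3f} JS distance={dist:.3f}")
```

Identical distributions score zero on both measures; the shifted sample above, where most amounts moved into the top bucket, scores well past the commonly used PSI alert levels.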

🎯 Advanced SageMaker Model Monitor Configuration

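from sagemaker.clarify import BiasConfig, SHAPConfig
from sagemaker.model_monitor import (
    DataCaptureConfig,
    ModelQualityMonitor,
    ModelBiasMonitor,
    ModelExplainabilityMonitor,
    DefaultModelMonitor
)
from sagemaker.network import NetworkConfig

# Assumed to be defined elsewhere: `role` (execution role ARN), `bucket`
# (S3 bucket name), and `baseline_dataset` (a representative baseline row).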

# 1. Enhanced Data Capture
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,  # For fraud, capture everything
    capture_options=["REQUEST", "RESPONSE"],
    destination_s3_uri=f"s3://{bucket}/data-capture",
    kms_key_id="arn:aws:kms:...",  # Encrypt captured data
    csv_content_types=["text/csv"],
    json_content_types=["application/json"]
)

# 2. Model Quality Monitor (Performance Degradation)
model_quality_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=50,
    max_runtime_in_seconds=3600,
    base_job_name='fraud-model-quality',
    network_config=NetworkConfig(
        enable_network_isolation=False,
        security_group_ids=['sg-xxx'],
        subnets=['subnet-xxx']
    )
)

# 3. Bias Monitor (Critical for Fraud - Avoid False Positives on Demographics)
model_bias_monitor = ModelBiasMonitor(
    role=role,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=50,
    max_runtime_in_seconds=3600
)

# Configure bias constraints
bias_config = BiasConfig(
    label_values_or_threshold=[1],  # Fraud label
    facet_name="customer_segment",  # Monitor bias across segments
    facet_values_or_threshold=[0]
)

# 4. Explainability Monitor (Why did the model flag this transaction?)
explainability_monitor = ModelExplainabilityMonitor(
    role=role,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=50,
    max_runtime_in_seconds=3600
)

# SHAP configuration for feature importance tracking
shap_config = SHAPConfig(
    baseline=[baseline_dataset],  # Representative sample
    num_samples=500,
    agg_method="mean_abs",
    save_local_shap_values=True
)

📈 Multi-Dimensional Drift Detection Strategy

# Production-grade drift detection configuration
drift_detection_config = {
    "data_drift": {
        "methods": ["PSI", "KL_Divergence", "JS_Distance", "Wasserstein"],
        "threshold_warning": 0.1,
        "threshold_critical": 0.2,
        "features_to_monitor": "all",
        "baseline_window": "30_days",
        "comparison_window": "1_hour",
        "statistical_tests": ["KS_Test", "Chi_Square", "Anderson_Darling"]
    },
    "prediction_drift": {
        "metrics": ["prediction_distribution", "confidence_distribution"],
        "threshold_warning": 0.05,
        "threshold_critical": 0.1
    },
    "concept_drift": {
        "method": "ADWIN",  # Adaptive Windowing
        "ground_truth_delay": "48_hours",
        "performance_metrics": ["precision", "recall", "f1", "auc_roc"],
        "minimum_samples": 1000
    },
    "feature_importance_drift": {
        "method": "SHAP_comparison",
        "threshold": 0.15,
        "top_k_features": 20
    }
}
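
A configuration like the one above only helps if something turns scores into decisions. The sketch below shows one way the warning/critical thresholds and the `minimum_samples` guard could be applied once delayed ground truth arrives. The function names are hypothetical, and the sample labels are made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def classify_drift(score: float, warning: float, critical: float) -> str:
    """Map a drift score onto the severity levels used in the config."""
    if score >= critical:
        return "critical"
    if score >= warning:
        return "warning"
    return "ok"


def precision_recall(
    y_true: Sequence[int], y_pred: Sequence[int], minimum_samples: int
) -> Optional[Tuple[float, float]]:
    """Return (precision, recall), or None when too few labels have arrived.

    Fraud labels are delayed (chargebacks, investigations), so the guard
    avoids alerting on a handful of early, unrepresentative labels.
    """
    if len(y_true) < minimum_samples:
        return None
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Thresholds mirror the data_drift section above (0.1 warning, 0.2 critical).
print(classify_drift(0.15, warning=0.1, critical=0.2))

labels = [1, 0, 1, 1, 0, 0]       # made-up ground truth
predictions = [1, 0, 0, 1, 1, 0]  # made-up model output
print(precision_recall(labels, predictions, minimum_samples=5))
print(precision_recall(labels, predictions, minimum_samples=1000))
```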

🚨 Alerting Strategy Beyond SNS

# Multi-channel alerting configuration
alerting:
  channels:
    - type: pagerduty
      severity_mapping:
        critical: P1
        warning: P2
        info: P3
      integration_key: ${PAGERDUTY_KEY}

    - type: slack
      webhook_url: ${SLACK_WEBHOOK}
      channels:
        critical: "#fraud-detection-critical"
        warning: "#fraud-detection-alerts"
        info: "#fraud-detection-monitoring"

    - type: email
      recipients:
        - ml-platform-team@company.com
        - fraud-ops@company.com

    - type: sns
      topic_arn: arn:aws:sns:us-east-1:123456789:fraud-alerts

  escalation_policy:
    - condition: "no_ack_within_5_minutes"
      action: escalate_to_senior_oncall

    - condition: "drift_score > 0.3"
      action: auto_rollback_model

    - condition: "latency_p99 > 200ms"
      action: scale_out_endpoint
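
The escalation policy above is declarative; something still has to evaluate it. As a minimal sketch, the three conditions can be expressed as rules over a metrics snapshot. The metric keys (`minutes_without_ack`, `drift_score`, `latency_p99_ms`) are assumed names, and the function only returns action labels rather than calling PagerDuty or SageMaker.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
Rule = Tuple[Callable[[Metrics], bool], str]

# Each rule pairs a condition from the escalation policy with its action.
ESCALATION_RULES: List[Rule] = [
    (lambda m: m.get("minutes_without_ack", 0.0) >= 5, "escalate_to_senior_oncall"),
    (lambda m: m.get("drift_score", 0.0) > 0.3, "auto_rollback_model"),
    (lambda m: m.get("latency_p99_ms", 0.0) > 200, "scale_out_endpoint"),
]


def actions_for(metrics: Metrics) -> List[str]:
    """Return every action whose condition holds for this snapshot."""
    return [action for condition, action in ESCALATION_RULES if condition(metrics)]


snapshot = {"minutes_without_ack": 2, "drift_score": 0.35, "latency_p99_ms": 240}
print(actions_for(snapshot))
```

Keeping rules as data makes it straightforward to unit test the policy before wiring it to real side effects such as a model rollback.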

Question 3: Cost Optimization - What's Missing?

❌ Original Response Gaps

The LLM suggested: "Optimize costs by using SageMaker Spot Instances for training; implement auto-scaling policies on Kinesis shards based on traffic; lifecycle transition rules to move old data to Glacier."

🔍 What's Missing?

┌──────────────────────────────────────────────────────────────────────────────┐
│                    COST OPTIMIZATION GAPS ANALYSIS                           │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ❌ Missing Inference Cost Optimization (Largest Cost Driver!)               │
│  ❌ No Multi-Model Endpoints Strategy                                        │
│  ❌ No Serverless Inference Options                                          │
│  ❌ No Model Compilation/Optimization (Neo, TensorRT)                        │
│  ❌ No Inference Caching Strategy                                            │
│  ❌ No Reserved Capacity Planning                                            │
│  ❌ No Right-Sizing Analysis                                                 │
│  ❌ No Cost Allocation Tagging Strategy                                      │
│  ❌ No FinOps Feedback Loops                                                 │
│  ❌ No Graviton Instance Usage                                               │
│  ❌ No Data Transfer Cost Optimization                                       │
│  ❌ No Feature Store Cost Optimization                                       │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

✅ Comprehensive Cost Optimization Strategy

1️⃣ Inference Cost Optimization (The Big One!)

# Multi-Model Endpoint - run hundreds of models on a single endpoint
import boto3
from sagemaker.multidatamodel import MultiDataModel

# Assumed to be defined elsewhere: `model`, `bucket`, `sagemaker_session`.
endpoint_name = "fraud-detection-endpoint"

mme = MultiDataModel(
    name="fraud-detection-mme",
    model_data_prefix=f"s3://{bucket}/fraud-models/",
    model=model,
    sagemaker_session=sagemaker_session
)

# Deploy; auto-scaling is attached in the next step
predictor = mme.deploy(
    initial_instance_count=2,
    instance_type="ml.c6i.xlarge",  # Cost-effective for inference
    endpoint_name=endpoint_name
)

# Configure aggressive auto-scaling
scaling_client = boto3.client('application-autoscaling')

scaling_client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=f'endpoint/{endpoint_name}/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=20
)

# Scale based on invocations per instance
scaling_client.put_scaling_policy(
    PolicyName='fraud-invocation-scaling',
    ServiceNamespace='sagemaker',
    ResourceId=f'endpoint/{endpoint_name}/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000,  # Invocations per instance
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60  # Fast scale-out for fraud detection
    }
)
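
To reason about what the policy above will do, it helps to work through the arithmetic. `SageMakerVariantInvocationsPerInstance` is a per-minute average per instance, so with a target of 1000 the steady-state instance count is roughly total invocations per minute divided by 1000, clamped to the registered capacity range. The helper below is a back-of-the-envelope approximation under that assumption; real target tracking also involves CloudWatch alarm evaluation periods and cooldowns.

```python
import math


def desired_instances(
    invocations_per_minute: float,
    target_per_instance: float = 1000,
    min_capacity: int = 1,
    max_capacity: int = 20,
) -> int:
    """Approximate steady-state instance count for a target-tracking policy."""
    if target_per_instance <= 0:
        raise ValueError("target_per_instance must be positive")
    raw = math.ceil(invocations_per_minute / target_per_instance)
    # Clamp to the MinCapacity/MaxCapacity registered for the variant.
    return max(min_capacity, min(max_capacity, raw))


# 75 requests/second is 4500 invocations/minute.
print(desired_instances(75 * 60))
print(desired_instances(0))
print(desired_instances(50_000))
```

The last call shows the cap at work: demand beyond `MaxCapacity` times the target is not absorbed by scaling, so the ceiling should be sized against peak traffic.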

2️⃣ Serverless Inference for Variable Workloads

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=4096,
    max_concurrency=50,
    provisioned_concurrency=5  # Warm instances for latency-sensitive requests
)

# Ideal for: batch fraud checks, non-real-time scoring
model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name="fraud-detection-serverless"
)

3️⃣ Model Compilation with SageMaker Neo

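# Compile the model with SageMaker Neo; the speedup depends on the model and target hardware
# Assumes `model` is a sagemaker.pytorch.PyTorchModel and `bucket` is defined elsewhere.
from sagemaker.pytorch import PyTorchModel

compiled_model = model.compile(
    target_instance_family='ml_c6i',  # Confirm this target is in Neo's supported list
    input_shape={'input': [1, 256]},  # Batch size 1, 256 features
    output_path=f's3://{bucket}/compiled-models/',
    framework='pytorch',
    framework_version='2.0',
    compiler_options={
        'dtype': 'float16',  # Mixed precision
        'optimization_level': 3
    }
)

# Reported results: 30-40% cost reduction, 50% latency improvement.
# Treat these as indicative and benchmark on your own model.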

4️⃣ Intelligent Tiered Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                      COST-OPTIMIZED INFERENCE TIERS                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ TIER 1: Hot Path (Real-Time, <50ms)                                 │   │
│  │ ─────────────────────────────────────────────────────────────────── │   │
│  │ • SageMaker Real-Time Endpoints (Provisioned)                       │   │
│  │ • Graviton3 instances (ml.c7g.xlarge) - up to 40% better price-perf │   │
│  │ • ElastiCache for feature caching                                   │   │
│  │ • Use for: High-value transactions, known fraud patterns            │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                                    ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ TIER 2: Warm Path (Near Real-Time, <500ms)                          │   │
│  │ ─────────────────────────────────────────────────────────────────── │   │
│  │ • SageMaker Serverless with Provisioned Concurrency                 │   │
│  │ • Scale to zero during low traffic                                  │   │
│  │ • Use for: Medium-risk transactions, secondary validation          │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                        │
│                                    ▼                                        │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ TIER 3: Cold Path (Batch, Minutes to Hours)                         │   │
│  │ ─────────────────────────────────────────────────────────────────── │   │
│  │ • SageMaker Batch Transform (pay per job, no idle endpoint)         │   │
│  │ • 70-90% cost savings vs real-time                                  │   │
│  │ • Use for: Historical analysis, model retraining, bulk scoring     │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
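
The tiers only save money if requests are actually routed between them. A minimal routing sketch follows, assuming a hypothetical `HIGH_VALUE_AMOUNT` cutoff and a precomputed `matches_known_pattern` flag; the article does not define either, so treat both as placeholders to be tuned.

```python
from enum import Enum


class Tier(Enum):
    HOT = "real-time endpoint"
    WARM = "serverless endpoint"
    COLD = "batch transform"


# Hypothetical cutoff; "high-value" is not defined in the tier diagram.
HIGH_VALUE_AMOUNT = 1_000.0


def choose_tier(amount: float, matches_known_pattern: bool, interactive: bool) -> Tier:
    """Pick an inference tier for a scoring request.

    interactive=False covers historical analysis and bulk scoring, which do
    not need an immediate decision.
    """
    if not interactive:
        return Tier.COLD
    if amount >= HIGH_VALUE_AMOUNT or matches_known_pattern:
        return Tier.HOT
    # Everything else goes to the near-real-time path (assumption).
    return Tier.WARM


print(choose_tier(5_000.0, False, True))
print(choose_tier(40.0, False, True))
print(choose_tier(40.0, False, False))
```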

5️⃣ Complete Cost Optimization Matrix

Category	Original Suggestion	Enhanced Strategy	Savings Potential
Training	Spot Instances	Spot + Managed Warm Pools + Checkpointing	60-90%
Inference	(Not mentioned)	Multi-Model Endpoints + Graviton + Neo Compilation	40-60%
Streaming	Kinesis Auto-scaling	MSK Serverless + Right-sized consumers	30-50%
Storage	S3 Glacier	Intelligent Tiering + Feature Store optimization	20-40%
Data Transfer	(Not mentioned)	VPC Endpoints + Regional affinity	50-80%
Feature Store	(Not mentioned)	Online store TTL + Batch reads	30-50%

6️⃣ FinOps Implementation

# Mandatory cost allocation tags
cost_tags = {
    "Project": "fraud-detection",
    "Environment": "production",
    "Team": "ml-platform",
    "CostCenter": "CC-4521",
    "ModelVersion": "v2.3.1",
    "Pipeline": "real-time-inference"
}

# AWS Cost Anomaly Detection.
# Shape follows the Cost Explorer create_anomaly_monitor request: a DIMENSIONAL
# monitor tracks every service; scoping by tag requires a CUSTOM monitor instead.
anomaly_monitor = {
    "MonitorName": "fraud-detection-costs",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
}

# Budget alerts
budget_config = {
    "BudgetName": "fraud-ml-monthly",
    "BudgetLimit": {"Amount": "50000", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "NotificationsWithSubscribers": [
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80,
                "ThresholdType": "PERCENTAGE"
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "finops@company.com"}
            ]
        }
    ]
}
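
The budget notification above fires when actual spend passes 80% of the 50,000 USD limit. The sketch below reproduces that check and adds a naive linear month-end projection, which is an assumption on my part and far simpler than AWS's own forecasting; it is meant for sanity-checking numbers in a FinOps review, not for replacing AWS Budgets.

```python
def budget_alert(actual_spend: float, budget_limit: float = 50_000.0,
                 threshold_pct: float = 80.0) -> bool:
    """True when spend is strictly greater than the percentage threshold."""
    return actual_spend > budget_limit * threshold_pct / 100.0


def projected_month_end(spend_to_date: float, day_of_month: int,
                        days_in_month: int) -> float:
    """Naive linear projection: assumes the daily run rate stays constant."""
    if day_of_month <= 0:
        raise ValueError("day_of_month must be at least 1")
    return spend_to_date / day_of_month * days_in_month


spend = 20_000.0  # hypothetical spend after 10 days of a 30-day month
projection = projected_month_end(spend, day_of_month=10, days_in_month=30)
print(budget_alert(spend), projection, budget_alert(projection))
```

In this made-up case the actual-spend alert stays silent on day 10 while the projection already exceeds the limit, which is the argument for pairing ACTUAL notifications with a forecast-based one.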

The Production-Grade Architecture Blueprint

┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                    PRODUCTION-GRADE FRAUD DETECTION MLOps ARCHITECTURE                  │
├─────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                         │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐           │
│  │ Transaction │────▶│   Amazon    │────▶│   Apache    │────▶│  SageMaker  │           │
│  │   Sources   │     │    MSK      │     │   Flink     │     │Feature Store│           │
│  └─────────────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘           │
│                             │                   │                   │                   │
│                             ▼                   ▼                   ▼                   │
│                      ┌─────────────┐     ┌─────────────┐     ┌─────────────┐           │
│                      │   Schema    │     │    Data     │     │ ElastiCache │           │
│                      │  Registry   │     │  Quality    │     │  (Redis)    │           │
│                      │  (Glue)     │     │  (GE/Soda)  │     │             │           │
│                      └─────────────┘     └─────────────┘     └──────┬──────┘           │
│                                                                     │                   │
│  ┌──────────────────────────────────────────────────────────────────┼──────────────┐   │
│  │                         ML PLATFORM LAYER                        │              │   │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │              │   │
│  │  │ SageMaker   │────▶│  Model      │────▶│ SageMaker   │◀───────┘              │   │
│  │  │ Pipelines   │     │  Registry   │     │ Endpoint    │                       │   │
│  │  │ (Training)  │     │ (Versioned) │     │ (Inference) │                       │   │
│  │  └──────┬──────┘     └─────────────┘     └──────┬──────┘                       │   │
│  │         │                                       │                              │   │
│  │         ▼                                       ▼                              │   │
│  │  ┌─────────────┐                         ┌─────────────┐                       │   │
│  │  │   MLflow    │                         │   Shadow    │                       │   │
│  │  │ Experiment  │                         │   Deploy    │                       │   │
│  │  │  Tracking   │                         │  (Canary)   │                       │   │
│  │  └─────────────┘                         └─────────────┘                       │   │
│  └────────────────────────────────────────────────────────────────────────────────┘   │
│                                                                                         │
│  ┌──────────────────────────────────────────────────────────────────────────────────┐  │
│  │                              CI/CD & GitOps                                       │  │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │  │
│  │  │   GitHub    │────▶│   GitHub    │────▶│    AWS      │────▶│   ArgoCD    │    │  │
│  │  │  (Source)   │     │  Actions    │     │    ECR      │     │ (GitOps)    │    │  │
│  │  └─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘    │  │
│  └──────────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                         │
│  ┌──────────────────────────────────────────────────────────────────────────────────┐  │
│  │                           OBSERVABILITY STACK                                     │  │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐    │  │
│  │  │ CloudWatch  │     │ Prometheus  │     │  Evidently  │     │   Grafana   │    │  │
│  │  │  (Infra)    │     │ (Metrics)   │     │ (ML Drift)  │     │(Dashboards) │    │  │
│  │  └─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘    │  │
│  │                                                                                   │  │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐                        │  │
│  │  │  PagerDuty  │     │   Slack     │     │ SageMaker   │                        │  │
│  │  │ (Alerting)  │     │  (Notif)    │     │Model Monitor│                        │  │
│  │  └─────────────┘     └─────────────┘     └─────────────┘                        │  │
│  └──────────────────────────────────────────────────────────────────────────────────┘  │
│                                                                                         │
└─────────────────────────────────────────────────────────────────────────────────────────┘

Security Deep Dive (DevSecOps)

🔐 Security Layers Not Addressed in Original Response

┌──────────────────────────────────────────────────────────────────────────────┐
│                    COMPREHENSIVE SECURITY ARCHITECTURE                        │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  LAYER 1: NETWORK SECURITY                                                   │
│  ─────────────────────────────────────────────────────────────────────────   │
│  ✅ VPC with Private Subnets Only                                           │
│  ✅ VPC Endpoints for all AWS services (no internet egress)                 │
│  ✅ Network ACLs + Security Groups (defense in depth)                       │
│  ✅ AWS PrivateLink for cross-account access                                │
│  ✅ AWS Network Firewall for advanced threat detection                      │
│                                                                              │
│  LAYER 2: DATA SECURITY                                                      │
│  ─────────────────────────────────────────────────────────────────────────   │
│  ✅ KMS CMK encryption for all data at rest                                 │
│  ✅ TLS 1.3 for all data in transit                                         │
│  ✅ S3 bucket policies with explicit deny                                   │
│  ✅ Macie for PII/sensitive data detection                                  │
│  ✅ Data classification and tagging                                         │
│                                                                              │
│  LAYER 3: IDENTITY & ACCESS                                                  │
│  ─────────────────────────────────────────────────────────────────────────   │
│  ✅ IAM roles with least privilege                                          │
│  ✅ Service-linked roles for SageMaker                                      │
│  ✅ Cross-account roles with external ID                                    │
│  ✅ MFA enforcement                                                         │
│  ✅ AWS Organizations SCPs                                                  │
│                                                                              │
│  LAYER 4: APPLICATION SECURITY                                               │
│  ─────────────────────────────────────────────────────────────────────────   │
│  ✅ Container image scanning (ECR + Snyk/Trivy)                             │
│  ✅ Dependency vulnerability scanning                                       │
│  ✅ SAST/DAST in CI/CD pipeline                                             │
│  ✅ Secrets management (AWS Secrets Manager)                                │
│  ✅ Model artifact integrity verification                                   │
│                                                                              │
│  LAYER 5: COMPLIANCE & AUDIT                                                 │
│  ─────────────────────────────────────────────────────────────────────────   │
│  ✅ CloudTrail for API audit logging                                        │
│  ✅ AWS Config for compliance rules                                         │
│  ✅ GuardDuty for threat detection                                          │
│  ✅ Security Hub for unified view                                           │
│  ✅ Model lineage and provenance tracking                                   │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
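Two of the Layer 2 items above, explicit-deny bucket policies and TLS in transit, can be expressed in a single policy statement. The following is a minimal sketch, not the policy from the interview design: the function name and the bucket name are hypothetical, while `aws:SecureTransport` is a standard IAM condition key.

```python
import json


def build_tls_only_bucket_policy(bucket_name: str) -> dict:
    """Return an S3 bucket policy that explicitly denies any non-TLS request."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",  # an explicit deny overrides any allow
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    # "fraud-feature-store-example" is a hypothetical bucket name
    print(json.dumps(build_tls_only_bucket_policy("fraud-feature-store-example"), indent=2))
```

Because the statement is a deny, it still applies even if another policy grants broad access to the same bucket.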

🛡️ ML-Specific Security Considerations

# Secure SageMaker Configuration
secure_endpoint_config = {
    "EndpointConfigName": "fraud-detection-secure",
    "ProductionVariants": [{
        "VariantName": "primary",
        "ModelName": "fraud-model-v2",
        "InstanceType": "ml.c6i.xlarge",
        "InitialInstanceCount": 2,
        "ContainerStartupHealthCheckTimeoutInSeconds": 300
    }],
    "DataCaptureConfig": {
        "EnableCapture": True,
        "KmsKeyId": "arn:aws:kms:...",  # Encrypt captured data
        "CaptureOptions": [
            {"CaptureMode": "Input"},
            {"CaptureMode": "Output"}
        ]
    },
    "KmsKeyId": "arn:aws:kms:...",  # Encrypt the storage volume attached to the hosting instances
    "AsyncInferenceConfig": {
        "OutputConfig": {
            "KmsKeyId": "arn:aws:kms:..."  # Encrypt async outputs
        }
    }
}

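# VPC Configuration for SageMaker (VpcConfig accepts only these two keys)
vpc_config = {
    "SecurityGroupIds": ["sg-fraud-sagemaker"],
    "Subnets": ["subnet-private-1", "subnet-private-2"]
}

# Isolation and encryption flags are top-level CreateTrainingJob parameters, not VpcConfig keys
training_security_config = {
    "VpcConfig": vpc_config,
    "EnableNetworkIsolation": True,  # No outbound network calls from the container
    "EnableInterContainerTrafficEncryption": True  # Encrypt distributed training
}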

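# Model Artifact Signing
import boto3
from hashlib import sha256

def sign_model_artifact(model_path: str, bucket: str, key: str, version: str,
                        profile_name: str = 'ml-model-signing-profile'):
    """Hash a local model artifact and start an AWS Signer job for its S3 copy.

    model_path is the local file; bucket, key and version identify the same
    artifact in a versioned S3 bucket (AWS Signer requires the object version).
    """
    signer = boto3.client('signer')

    # Compute artifact hash in chunks so large models are not read into memory at once
    digest = sha256()
    with open(model_path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            digest.update(chunk)
    artifact_hash = digest.hexdigest()

    # Sign with AWS Signer (the signing profile determines the key material)
    signing_job = signer.start_signing_job(
        source={'s3': {'bucketName': bucket, 'key': key, 'version': version}},
        destination={'s3': {'bucketName': bucket, 'prefix': 'signed-models/'}},
        profileName=profile_name
    )

    return signing_job['jobId'], artifact_hash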
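
The hash returned by sign_model_artifact above only helps if something re-checks it before the model is loaded. A minimal sketch of that verification step, assuming the expected hash was stored somewhere trusted at signing time (for example in model registry metadata); the function names are illustrative, not part of any AWS API.

```python
import hashlib
import hmac


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large model artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_artifact(path: str, expected_sha256: str) -> bool:
    """Return True only if the artifact on disk matches the recorded hash."""
    actual = file_sha256(path)
    # compare_digest avoids leaking, through timing, where the strings differ
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A deployment step could call verify_model_artifact and refuse to create the endpoint when it returns False.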

Key Takeaways

📝 Interview Success Factors

Dimension	Basic Answer	Expert Answer
Data Ingestion	Kinesis + Lambda + S3	MSK + Flink + Feature Store + Schema Registry + DLQ + Data Quality
Observability	CloudWatch + Model Monitor	Multi-layer stack + Custom drift detection + Explainability + FinOps integration
Cost Optimization	Spot instances + Glacier	Tiered inference + Multi-model endpoints + Graviton + Neo + Reserved capacity
Security	IAM roles	Defense in depth + Encryption everywhere + Network isolation + Audit logging
CI/CD	Basic pipeline	GitOps + Canary deployments + Automated rollback + Model validation gates

🎯 The 5 Pillars of Production MLOps

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   1. RELIABILITY          2. SCALABILITY        3. SECURITY    │
│   ─────────────           ──────────────        ────────────   │
│   • Circuit breakers      • Auto-scaling        • Zero trust    │
│   • Graceful degradation  • Multi-region        • Encryption    │
│   • Chaos engineering     • Caching layers      • Audit trails  │
│   • Automated rollback    • Async processing    • Compliance    │
│                                                                 │
│   4. OBSERVABILITY                    5. COST EFFICIENCY        │
│   ────────────────                    ──────────────────        │
│   • Multi-dimensional metrics         • Right-sizing            │
│   • Distributed tracing               • Spot/Serverless         │
│   • ML-specific monitoring            • FinOps practices        │
│   • Alerting & on-call                • Continuous optimization │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Conclusion

The original LLM response provided a foundational starting point but lacked the depth required for a production-grade, high-volume fraud detection system. Key gaps included:

Data Ingestion: Missing schema validation, exactly-once processing, feature store integration, and real-time serving capabilities
Observability: CloudWatch alone is insufficient; requires dedicated ML observability tools for drift detection, explainability, and performance monitoring
Cost Optimization: Inference costs (the largest expense) were completely overlooked; missing modern cost-saving techniques like multi-model endpoints, serverless inference, and model compilation

As a senior MLOps engineer, your role is to bridge the gap between theoretical designs and production realities. This means:

Anticipating failure modes and building resilience
Designing for scale from day one
Implementing security as a first-class citizen, not an afterthought
Building comprehensive observability into every component
Continuously optimizing costs while maintaining SLAs

Remember: In a real interview, demonstrating this depth of knowledge—identifying gaps, proposing concrete solutions with specific tools, and understanding trade-offs—is what separates senior engineers from the rest.

📚 Further Reading

🚀 DevSecOps Mastery Guide 101

Imran M — Mon, 26 Jan 2026 18:30:00 GMT

"Emirates Group" screening questions - Every Senior DevOps / Cloud Engineer should know

Master these fundamental DevOps concepts that separate noob engineers from senior ones. From Terraform state locking to idempotency principles, we'll break down complex topics into digestible explanations with real-world analogies.

Question 1

You are managing Terraform state files in an Azure Storage Account blob container to enable collaboration. To prevent state corruption during simultaneous runs by different CI/CD pipelines, you need to ensure state locking is active. Which specific Azure Storage feature does the Terraform backend use to acquire a lock on the state file?

Azure Blob Soft Delete
Azure Blob Lease
Blob Versioning
Immutable Storage Policies

✓ Answer: Azure Blob Lease

💡 Simple Explanation

Think of it like checking out a library book. When someone takes the book (acquires a lease), nobody else can take it until they return it. Terraform "leases" the state file so only one pipeline can modify it at a time. Others have to wait their turn.

🔒 Library Book Analogy: Just like you can't check out a book someone else has, pipelines can't modify a state file that's already "checked out" by another process.
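
To make the checkout idea concrete, here is a toy in-memory model of lease semantics. It is a sketch for intuition only: BlobLeaseSim and run_with_state_lock are hypothetical names, nothing here calls Azure, and the real service adds details (lease durations, renewal, breaking a lease) that are left out.

```python
import uuid
from typing import Callable, Optional


class BlobLeaseSim:
    """In-memory stand-in for a blob lease: at most one holder at a time."""

    def __init__(self) -> None:
        self._lease_id: Optional[str] = None

    def acquire(self) -> str:
        if self._lease_id is not None:
            # A second client cannot acquire while a lease is active
            raise RuntimeError("lease already held by another client")
        self._lease_id = str(uuid.uuid4())
        return self._lease_id

    def release(self, lease_id: str) -> None:
        # Only the holder of the matching lease ID may release it
        if lease_id != self._lease_id:
            raise RuntimeError("lease ID does not match the active lease")
        self._lease_id = None


def run_with_state_lock(lease: BlobLeaseSim, action: Callable[[], object]) -> object:
    """Acquire the lease, run the action, and always release afterwards."""
    lease_id = lease.acquire()
    try:
        return action()
    finally:
        lease.release(lease_id)
```

The try/finally mirrors why a crashed pipeline can leave a state file locked: if the release never runs, the next run is rejected until the lease is cleared.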

Question 2

An application on a production Linux server is crashing, and the logs indicate a "Too many open files" error. You have verified that the system-wide limits are sufficient. Which command allows you to list all open files, network sockets, and pipes specifically associated with the crashing application's Process ID (PID)?

netstat -plnt
ps -aux | grep
lsof -p
top -H -p

✓ Answer: lsof -p

💡 Simple Explanation

lsof = LiSt Open Files

It's like asking "Hey, what files is this program currently using?" The "Too many open files" error means the app opened too many files/connections and forgot to close them. This command shows you exactly what it's holding onto.

# List all open files for process 1234
lsof -p 1234

# Example output shows:
# - Open files
# - Network connections
# - Pipes and sockets
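
On Linux, the same information is exposed under /proc, which is handy when lsof is not installed on the host. A minimal sketch, assuming a Linux system and permission to read the target process's /proc entry; list_open_fds is an illustrative helper name.

```python
import os


def list_open_fds(pid: int) -> dict:
    """Map each open file descriptor of a process to its target (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to a file path, "socket:[...]" or "pipe:[...]"
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink()
            continue
    return targets


if __name__ == "__main__":
    fds = list_open_fds(os.getpid())
    print(f"{len(fds)} descriptors open")
    for fd, target in sorted(fds.items()):
        print(fd, target)
```

A descriptor count that keeps climbing between runs usually points to files or sockets that are opened but never closed.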

Question 3

You are designing an automated pipeline where an Azure Virtual Machine (VM) needs to retrieve database credentials from an Azure Key Vault. You want to avoid storing any secrets (like Client IDs or Client Secrets) within the VM's configuration or code. What is the most secure, cloud-native method to authenticate the VM to the Key Vault?

Create a Service Principal, generate a certificate, and install it on the VM.
Enable a System-assigned Managed Identity on the VM and grant it access policies in Key Vault.
Generate a Shared Access Signature (SAS) token for the Key Vault and bake it into the VM image.
Use the Azure CLI with a dedicated user account and run az login in the startup script.

✓ Answer: Enable System-assigned Managed Identity on the VM and grant it access policies in Key Vault

💡 Simple Explanation

Instead of giving your VM a username/password (which can be stolen), Azure says "I know this VM belongs to you, I'll vouch for it automatically."

🎫 Employee Badge Analogy: It's like being an employee with a badge. You don't need to prove who you are every time — the building recognizes your badge automatically.

🔐 Deep Dive: Managed Identity Security

⚠️ What if the badge is stolen?

The badge (Managed Identity) is tied to the VM itself — it cannot be copied or exported.

The credentials:

Live only inside that specific VM
Rotate automatically (Azure handles this)
Are accessible only from that VM's internal metadata endpoint (169.254.169.254)

If someone steals the badge (compromises the VM), yes they can use it. But that means your VM itself is compromised — you have bigger problems.

✅ Mitigations:

Principle of least privilege — give the identity only the permissions it absolutely needs
Network restrictions — Key Vault firewall rules, private endpoints
Monitoring — alert on suspicious access patterns
Short-lived secrets — even if accessed, credentials expire quickly

AWS Equivalent:

Azure	AWS
VM → Managed Identity → Key Vault	EC2 → IAM Role → Secrets Manager
System-assigned identity	Instance Profile with IAM Role
Metadata endpoint: 169.254.169.254	Metadata endpoint: 169.254.169.254

Same concept:

No hardcoded credentials
Credentials fetched from instance metadata
Automatically rotated by AWS
Role is bound to the instance, not portable
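
To show what "credentials fetched from instance metadata" looks like in practice, here is a sketch of the Azure side using only the standard library. The endpoint path, the api-version value and the Metadata header follow Azure's documented IMDS token request as I understand it; the function names are illustrative, and fetch_access_token only works when run on an Azure VM with a managed identity enabled.

```python
import json
import urllib.parse
import urllib.request

IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"


def build_token_request(resource: str = "https://vault.azure.net") -> urllib.request.Request:
    """Build the request a VM sends to IMDS to get a managed-identity token."""
    query = urllib.parse.urlencode({"api-version": "2018-02-01", "resource": resource})
    # IMDS rejects requests that lack the "Metadata: true" header
    return urllib.request.Request(f"{IMDS_TOKEN_URL}?{query}", headers={"Metadata": "true"})


def fetch_access_token(timeout: float = 2.0) -> str:
    """Call IMDS and return the bearer token (only succeeds on an Azure VM)."""
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        return json.load(resp)["access_token"]
```

No secret appears anywhere in the code or configuration: the VM proves its identity simply by being able to reach the metadata endpoint.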

Question 4

You have written a complex Bash script for a deployment process. You want to ensure that if any command within the pipeline fails (returns a non-zero exit code), the entire script stops execution immediately to prevent cascading errors. Which command should you place at the top of your script?

set -e
set -x
exit 1
trap "echo error" ERR

✓ Answer: set -e

💡 Simple Explanation

Normally, Bash scripts are stubborn — if something fails, they keep going anyway.

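set -e tells the script: "If a command fails, STOP immediately." Two caveats: in a shell pipeline (cmd1 | cmd2) only the last command's exit code counts unless you also add set -o pipefail, and a command whose exit status is being tested (in an if condition, or on the left of && or ||) does not trigger the exit.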

👨‍🍳 Chef Analogy: Like telling a chef: "If you burn the first dish, don't keep cooking the rest."

#!/bin/bash
set -e  # Exit on any error

# If this fails, script stops here (-f makes curl exit non-zero on HTTP errors such as 404)
curl -fsS https://api.example.com/data -o data.json

# This won't run if curl failed
jq '.users' data.json
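
The same fail-fast behaviour applies when a deployment is driven from Python instead of Bash. A minimal sketch, where run_steps is a hypothetical helper: subprocess.run with check=True raises on the first non-zero exit code, so later steps never start.

```python
import subprocess
import sys


def run_steps(steps):
    """Run commands in order and stop at the first non-zero exit code."""
    completed = []
    for cmd in steps:
        # check=True raises CalledProcessError on failure, similar to set -e
        subprocess.run(cmd, check=True)
        completed.append(cmd)
    return completed


if __name__ == "__main__":
    ok = [sys.executable, "-c", "print('step ok')"]
    bad = [sys.executable, "-c", "import sys; sys.exit(3)"]
    try:
        run_steps([ok, bad, ok])  # the third step is never started
    except subprocess.CalledProcessError as exc:
        print(f"stopped: exit code {exc.returncode}")
```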

Question 5

A Junior DevOps engineer has written an automation script that appends a configuration line to a file: echo "config=true" >> /etc/app/config. As a Senior Engineer, you flag this code during review because it violates the principle of "Idempotency." Why?

The script will fail if the file does not exist.
The script does not verify if the user has root privileges.
If the script is run multiple times, it will add the same line repeatedly, potentially breaking the configuration.
The script does not use a variable for the configuration string.

✓ Answer: If the script is run multiple times, it will add the same line repeatedly, potentially breaking the configuration

💡 Simple Explanation

Idempotent = Running something 10 times should give the same result as running it once.

The >> append adds a line every single time. Run it 5 times? You get 5 identical lines. That's bad.

📝 Signup Sheet Analogy: It's like writing your name on a signup sheet every time you walk past it, instead of checking if your name is already there.

🛠️ Deep Dive: Fixing the Idempotency Problem

Why is idempotency needed?

In production:

Scripts run multiple times (retries, CI/CD reruns, recovery)
Automation tools like Ansible/Terraform run repeatedly
Human error — someone accidentally clicks "deploy" twice

If your script isn't idempotent, repeated runs = corrupted system.

How to fix it:

# ❌ Bad (junior's code)
echo "config=true" >> /etc/app/config

# ✅ Good (idempotent)
grep -qxF "config=true" /etc/app/config || echo "config=true" >> /etc/app/config

What this does:

grep -qxF checks if the exact line already exists
|| means "if NOT found, then add it"
Run it 100 times — same result as running once

Even Better for Production:

CONFIG_FILE="/etc/app/config"
CONFIG_LINE="config=true"

# Ensure file exists
touch "$CONFIG_FILE"

# Idempotent append
if ! grep -qxF "$CONFIG_LINE" "$CONFIG_FILE"; then
    echo "$CONFIG_LINE" >> "$CONFIG_FILE"
    echo "Config added successfully"
else
    echo "Config already present, skipping"
fi
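
The same check-then-append logic can be sketched in Python for automation that is not written in Bash. This is an illustrative helper (ensure_line is a hypothetical name), and it assumes the parent directory already exists.

```python
from pathlib import Path


def ensure_line(path: str, line: str) -> bool:
    """Append `line` only if no existing line matches it exactly.

    Returns True if the file was changed, False if the line was already there.
    """
    file = Path(path)
    text = file.read_text() if file.exists() else ""
    if line in text.splitlines():
        return False  # already present: running again changes nothing
    with file.open("a") as f:
        # Start on a fresh line if the file does not end with a newline
        if text and not text.endswith("\n"):
            f.write("\n")
        f.write(line + "\n")
    return True
```

Returning a changed/unchanged flag follows the pattern config management tools use to report whether a run actually modified anything.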

Or Use Proper Config Management:

Ansible — lineinfile module (built-in idempotency)
Puppet/Chef — declarative config management
Terraform — for infrastructure

These tools handle idempotency automatically so you don't reinvent the wheel.

📚 Understanding grep -qxF

The command grep -qxF is a powerful, "quiet" way to check for the existence of an exact line in a file.

Flag	Meaning	Purpose
`-q`	quiet/silent	Suppresses all normal output. Returns exit status (0 for match, 1 for no match)
`-x`	line-regexp	Matches the entire line. "apple" won't match "pineapple"
`-F`	fixed-strings	Treats pattern as literal string, not regex. Special chars like . * $ are normal text
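
The difference the -x and -F flags make is easiest to see side by side. The sketch below only approximates grep with Python's re module (grep's default regex dialect differs in details), and both function names are illustrative.

```python
import re


def matches_regex_anywhere(pattern: str, line: str) -> bool:
    """Roughly what plain `grep pattern` does: regex search anywhere in the line."""
    return re.search(pattern, line) is not None


def matches_fixed_whole_line(pattern: str, line: str) -> bool:
    """What `grep -xF pattern` checks: literal text equal to the whole line."""
    return line == pattern


if __name__ == "__main__":
    # Substring match succeeds, whole-line match does not
    print(matches_regex_anywhere("apple", "pineapple"), matches_fixed_whole_line("apple", "pineapple"))
    # "." is a wildcard in a regex but plain text with -F
    print(matches_regex_anywhere("a.c", "abc"), matches_fixed_whole_line("a.c", "abc"))
```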

💡 Pro Tip:

For case-insensitive matching, add the -i flag: grep -qixF

🎯 Key Takeaways

Always use proper locking mechanisms (Blob Lease) for shared state
Master debugging tools like lsof for production troubleshooting
Prefer Managed Identities over hardcoded credentials
Use set -e to fail fast and prevent cascading errors
Write idempotent scripts that can run multiple times safely

Happy DevSecOps Engineering! 🚀

Master these concepts in depth and level up your infrastructure game in a secured way.

Self-Hosting Runners for GitHub Actions: A Complete Tutorial

Imran M — Tue, 20 May 2025 16:46:48 GMT

🚀 Project Overview

This article provides a comprehensive guide to the GitHub Actions Self-Hosted ECR Image project, detailing each component, the challenges encountered, and the solutions implemented. The primary goal is to offer readers a thorough understanding of the project's structure and functionality.

This exercise was performed to automate the deployment of a simple Python application using GitHub Actions in conjunction with AWS Elastic Container Registry (ECR).
Leveraging self-hosted runners enhances the CI/CD processes, offering greater control and customization over the environment compared to GitHub’s hosted runners.
This setup allows for the management of containerized workflows and facilitates seamless deployment to container registries like AWS ECR.
Throughout this guide, we will break down the setup process, simplify its components, and provide step-by-step instructions on how to replicate this setup for your own projects.
This includes integrating GitHub Actions with self-hosted runners on any cloud provider, focusing on efficient container management and deployment strategies.

📜 Problem Statement

A streamlined CI/CD process is essential for deploying applications, but relying only on GitHub-hosted runners raises several challenges around dependency management and reproducibility:

Limited Customization: Default GitHub-hosted runners offer limited customization options, restricting the ability to tailor the CI/CD environment to specific project requirements, which can hinder the development process.
Resource Constraints: Default hosted runners may not provide the necessary resources for large or resource-intensive builds, leading to slower build times and potential bottlenecks in the CI/CD pipeline.
Dependency Management: Managing dependencies can be challenging due to the lack of control over the runner environment, which can lead to inconsistencies and difficulties in ensuring reproducibility across different builds.
Data Security Concerns: Using public runners may raise data security concerns, as sensitive data and proprietary code are processed on shared infrastructure, potentially exposing them to security vulnerabilities.
Network Access Limitations: Hosted runners may not have access to private networks or internal resources, which can be a significant limitation for projects that require integration with internal systems or databases.
Cost Implications: For organizations with high usage, relying solely on GitHub-hosted runners can increase costs through the consumption of paid runner minutes, especially for private repositories with frequent builds (standard GitHub-hosted runners are free for public repositories).
Scalability Issues: Scaling the CI/CD process can be challenging with hosted runners, as organizations are limited by the availability and capacity of GitHub's infrastructure, potentially impacting the ability to handle increased workloads efficiently.

🔧 Solutions Implemented

Before diving in, let’s understand the motivation behind this setup:

Self-Hosted Runners: Unlike GitHub’s default hosted runners, self-hosted runners provide you with full control over the CI/CD environment. This allows running on our own infrastructure, such as AWS EC2, enabling customization of hardware and access to private resources like AWS ECR without relying on public runners.
GitHub Actions: Utilized GitHub Actions to automate the build and deployment process, streamlining the workflow and enhancing efficiency.
Cost Reduction and Control: By using self-hosted runners, the setup reduces costs associated with GitHub-hosted runner minutes and increases control over the CI/CD environment, allowing for tailored configurations and optimizations.

🌟 Features

Seamless Integration: Effortlessly connect GitHub Actions with self-hosted runners across any cloud environment, ensuring smooth and efficient CI/CD processes.
Container Management: Utilize Docker to build, test, and deploy containerized applications, streamlining the development and deployment lifecycle.
AWS ECR Deployment: Leverage AWS Elastic Container Registry as a secure and scalable container registry for storing Docker images. It integrates seamlessly with AWS services, ensuring our images remain private and accessible within our VPC, while automating deployments to AWS ECR.
Scalable and Cost-Effective: Implement cloud-based self-hosted runners to execute workflows efficiently, optimizing resource usage and reducing costs.
Customizable: Fully configure the setup to accommodate diverse CI/CD pipelines, allowing for tailored solutions that meet specific project requirements.

🚀 Getting Started

📋 Prerequisites

Create an IAM Role for EC2 Instance:
- Establish an IAM role with the necessary permissions for the EC2 instance to interact with AWS services like ECR.
- Attach this role to the EC2 instance to allow secure and managed access without embedding credentials.
Set Up the AWS ECR Repository: We need a place to store our Docker images. AWS ECR can be ideal for this.
- Navigate to the AWS Management Console.
- Go to ECR > Repositories > Create Repository.
- Name the repository (e.g., github-runner or your preferred name) and create it to store your Docker images.
Launch an EC2 instance with Docker installed and AWS CLI configured.
- Set up your AWS credentials to enable connections to AWS ECR.
- Spin up an EC2 instance (e.g., t2.medium with Ubuntu 24.04) suitable for the workload.
- Install Docker and AWS CLI on the instance.
- Ensure the instance has internet access and the security group allows necessary outbound connections.
Configure AWS Credentials:
- Preferred Method: Utilize the IAM role attached to the EC2 instance for seamless and secure access to AWS services.
- Alternative Methods:
  - Use an AWS credentials file located at ~/.aws/credentials.
  - Set environment variables:
```
  export AWS_ACCESS_KEY_ID=your-access-key
  export AWS_SECRET_ACCESS_KEY=your-secret-key
```
  - Note: Avoid hardcoding credentials; prefer IAM roles or AWS credentials files for enhanced security.
A GitHub Actions workflow to build, tag, and push images to ECR.
Set Up Self-Hosted Runners: Install and register a self-hosted runner for the GitHub repository. Learn more from the official documentation on how to set up the runner.

The GitHub self-hosted runner must be running and listening for jobs when the pipeline runs.
Once the runner is registered, a run.sh script is available in the runner directory. Start the runner with:

./run.sh

Keep this command running in the terminal or as a background process before triggering the workflow.
Add an inbound rule to the EC2 instance's security group to allow traffic on the port where the application runs (8080 in this case).
Run the application as a Docker container:
- Authenticate Docker to the ECR registry.
- Pull the image from ECR
- Run the image as a container
- Evaluate docker logs for the app events (for debugging)

Follow the steps in the "Pulling and Running the ECR Image on EC2 instance" section of the README.md file (see Documentation below).
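
For orientation, the four steps listed above usually map to the commands below. This sketch is not taken from the README: the helper name, the container name "app", and the account ID, region and tag in the example are all hypothetical placeholders. The `aws ecr get-login-password ... | docker login` form is the standard way to authenticate Docker to ECR.

```python
def ecr_pull_and_run_commands(account_id: str, region: str, repository: str,
                              tag: str, port: int = 8080) -> list:
    """Return the shell commands for the four steps, in order."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    image = f"{registry}/{repository}:{tag}"
    return [
        # 1. Authenticate Docker to the ECR registry
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        # 2. Pull the image from ECR
        f"docker pull {image}",
        # 3. Run it detached, publishing the application port
        f"docker run -d --name app -p {port}:{port} {image}",
        # 4. Inspect the application logs
        "docker logs app",
    ]


if __name__ == "__main__":
    # All four argument values are placeholders
    for cmd in ecr_pull_and_run_commands("123456789012", "us-east-1", "github-runner", "main-abcde-20-May-25"):
        print(cmd)
```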

📂 Project Structure

├── .github/              # GitHub configuration and workflows
│   └── workflows/        # GitHub Actions workflow files
├── docker/               # Dockerfiles and related resources
├── test.py/              # Test case for ensuring workflow reliability
├── images/               # Image assets for documentation or usage
├── Dockerfile            # Main Docker setup
├── LICENSE               # License for the project
├── pyproject.toml        # Python project configuration
├── README.md             # Project overview and instructions
├── uv.lock               # Python dependency lock file
├── .dockerignore         # Docker ignore rules
├── .gitignore            # Git ignore rules
└── .python-version       # Python version specification

⚙️ GitHub Actions Workflow

Below is an example workflow file (.github/workflows/deploy.yml) for automating deployments to AWS ECR:

name: Build & Deploy Python App to AWS ECR repository

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
  workflow_dispatch:

env:
  AWS_REGION: ${{ vars.AWS_REGION }}
  ECR_REPOSITORY: ${{ vars.ECR_REPOSITORY }}

jobs:
  install:
    name: Install uv & other dependencies
    runs-on: self-hosted

    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version-file: "pyproject.toml"

    - name: Install uv with caching enabled
      uses: astral-sh/setup-uv@v5
      with:
        enable-cache: true

    - name: Install project dependencies
      run: uv sync --locked --all-extras --dev

    - name: Run tests with pytest
      run: uv run pytest -vs test_calculator.py

  build:
    name: Build & push docker image
    needs: install
    runs-on: self-hosted

    steps:
    - uses: actions/checkout@v4    

    - name: Configure AWS credentials
      if: success()
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}

    - name: Login to Amazon ECR repository
      if: success()
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v2

    - name: Build, tag, and push image to ECR
      if: success()
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        ECR_REPOSITORY: ${{ env.ECR_REPOSITORY }}
      run: |
        SHORT_SHA=$(echo $GITHUB_SHA | tail -c 6)
        TAG_DATE=$(date +"%d-%b-%y")
        BRANCH_NAME=$(echo ${GITHUB_REF_NAME} | sed 's/\//-/g')
        IMAGE_TAG="${BRANCH_NAME}-${SHORT_SHA}-${TAG_DATE}"
        docker build . --file Dockerfile --tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
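
The tag logic in the last step is easy to misread, so here is the same computation as a small Python function. It is a sketch for illustration, not part of the workflow. Note that `echo $GITHUB_SHA | tail -c 6` keeps the last six bytes, one of which is the trailing newline from echo, so the tag contains the last five characters of the commit SHA.

```python
import datetime


def build_image_tag(branch: str, sha: str, today: datetime.date) -> str:
    """Reproduce the workflow's tag format: <branch>-<short sha>-<dd-Mon-yy>."""
    # tail -c 6 on "sha + newline" leaves the last 5 characters of the SHA
    short_sha = sha[-5:]
    # Same effect as the sed call: slashes in branch names become hyphens
    safe_branch = branch.replace("/", "-")
    return f"{safe_branch}-{short_sha}-{today.strftime('%d-%b-%y')}"


if __name__ == "__main__":
    example_sha = "a" * 35 + "12345"  # placeholder 40-character SHA
    print(build_image_tag("feature/login", example_sha, datetime.date(2025, 5, 20)))
```

Docker tags may contain only letters, digits, underscores, periods and hyphens, up to 128 characters, so branch names with other characters would need further sanitizing.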

🧪 Testing

Unit tests run with pytest through the uv package manager, using the following command:

uv run pytest -vs

Finally, the application is served using FastAPI, which serves as the web framework, and Uvicorn, which acts as the ASGI server.

When the Docker container starts, it launches the FastAPI application and makes it accessible on port 8080.

This setup allows the application to handle incoming HTTP requests efficiently, leveraging FastAPI's capabilities for building APIs and Uvicorn's performance as a lightweight and fast server.

📊 Visualizations

Figure: Self hosted runner in active state listening for connections

Figure: EC2 instance self hosted runner deployment successful.

Figure: Uvicorn powered FastAPI app running on EC2 instance self hosted runner

📖 Documentation

Detailed documentation is available in my GitHub repository (linked below), including:

Setup Guide: Step-by-step instructions for configuring the project are in the README.md file of the GitHub project gh-actions-self-hosted-ecr.
Best Practices: Tips for optimizing workflows (see the Security Best Practices section of the same README.md file).

🤝 Contributing

Contributions are welcome! To contribute:

Fork the repository.
Create a new branch (git checkout -b feature/your-feature).
Commit your changes (git commit -m "Add your feature").
Push the branch (git push origin feature/your-feature).
Open a Pull Request.

Please follow the Code of Conduct.

Conclusion

This project demonstrates the integration of GitHub Actions with AWS ECR & EC2 instances for deploying Python applications using uv as package manager.
Using self-hosted runners provides flexibility and cost savings.

❤️ Acknowledgments

Special thanks to the open-source community for providing tools and resources that made this exercise possible.