Building a Self-Service AWS Cleanup Solution for Sandbox Accounts using AWS-Nuke
I designed and built this end-to-end solution to help platform engineering teams manage AWS sandbox sprawl across multiple development teams. Here’s the story of how I created a self-service model that significantly reduced costs while empowering developers.
The Problem: Sandbox Sprawl
If you manage AWS sandbox environments for multiple development teams, you’ve probably experienced this: developers spin up EC2 instances for quick tests, create RDS databases for prototyping, allocate Elastic IPs for demos—and then forget about them. Days turn into weeks, weeks into months, and suddenly your monthly AWS bill has grown by 40% with no clear understanding of which resources are actually being used.
We faced this exact challenge. With multiple teams using sandbox accounts for experimentation and development, resource sprawl became a significant problem:
- Cost creep: Monthly bills growing without corresponding value
- Resource zombies: Thousands of forgotten resources running indefinitely
- Manual burden: Platform team spending hours investigating and cleaning up
- Team friction: Developers frustrated when their resources were accidentally deleted
The traditional approaches weren’t working. Full centralized control meant our platform team became a bottleneck, spending time making cleanup decisions they weren’t equipped to make. Complete team autonomy meant chaos—no standards, no oversight, and continued cost growth.
We needed a different approach.
The Solution: Self-Service with Guardrails
I built a solution around a simple principle: “Make the right thing the easy thing.”
Our approach combines centralized infrastructure with distributed ownership. The platform team provides standardized tooling and safety mechanisms, while development teams control their own cleanup policies.
The Model
```
┌──────────────────────────────────────────────────────────┐
│ CENTRALIZED CONTROL         DISTRIBUTED OWNERSHIP        │
│                                                          │
│ Platform Team Provides:     Teams Control:               │
│ • Infrastructure            • What to protect            │
│ • Workflow orchestration    • Cleanup schedules          │
│ • Security & compliance     • Tag-based filters          │
│ • Approval mechanisms       • Regional scope             │
│ • Monitoring & reporting    • Resource type selection    │
└──────────────────────────────────────────────────────────┘
```
This hybrid model gives us the best of both worlds:
- Teams get autonomy to manage their own resources
- Platform team maintains standards and oversight
- Finance gets predictable, controlled costs
Architecture: Three Key Components
The solution consists of three main components, built around aws-nuke, a powerful open-source tool for removing AWS resources.
1. Centralized Infrastructure (Platform Team Manages)
I built a reusable Terraform module that deploys the core infrastructure into each AWS account. This includes:
- Workflow orchestration using AWS Step Functions to coordinate the cleanup process
- Lambda functions for business logic, dry-run generation, and approval handling
- CodeBuild projects that execute aws-nuke with team-specific configurations
- DynamoDB tables for state tracking and approval management
- S3 buckets for configuration storage and detailed reports
- SNS topics for email notifications
- EventBridge rules for scheduled execution
The key insight: deploy once per account, configure many times. Each team gets the same reliable infrastructure with zero maintenance burden.
2. Self-Service Configuration (Teams Manage)
Here’s where the magic happens. Teams manage their own cleanup policies through a Git repository with a simple folder structure:
```
configs/
├── team-alpha/
│   └── sandbox-account-123456789012/
│       ├── us-east-1-config.yaml
│       └── us-west-2-config.yaml
└── team-beta/
    └── dev-account-210987654321/
        └── us-east-1-config.yaml
```
The folder naming convention <account-alias>-<account-id> is intentional. GitLab CI validates that the account ID embedded in the folder name matches the account ID declared in the configuration files—a sanity check that ensures configurations are only uploaded to their intended accounts.
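As a rough sketch of that check, the CI job could run a small Python script that parses the folder path and compares it against the account IDs declared in the config. The script below is illustrative, not the exact pipeline code:

```python
#!/usr/bin/env python3
"""Sanity check run by CI: the account ID embedded in the folder name
must match the account IDs declared inside the config file."""
import re
import sys

import yaml  # PyYAML


def validate(config_path: str) -> None:
    # Path convention: configs/<team>/<account-alias>-<account-id>/<region>-config.yaml
    match = re.search(r"configs/[^/]+/[^/]+-(\d{12})/", config_path)
    if not match:
        sys.exit(f"Unexpected path layout: {config_path}")
    folder_account_id = match.group(1)

    with open(config_path) as f:
        config = yaml.safe_load(f)

    declared_accounts = set(config.get("accounts", {}))
    if declared_accounts != {folder_account_id}:
        sys.exit(
            f"{config_path}: folder says account {folder_account_id}, "
            f"config declares {sorted(declared_accounts)}"
        )


if __name__ == "__main__":
    for path in sys.argv[1:]:
        validate(path)
```

The pipeline would invoke it with the list of changed config files and fail before anything reaches S3.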
Each team defines their cleanup rules using straightforward YAML configuration files. The critical insight: teams explicitly define what to DELETE using an include-only model, while common protection filters safeguard critical infrastructure.
Example configuration:
```yaml
regions:
  - us-east-1
  - us-west-2

accounts:
  "123456789012":
    presets:
      # Use common protection filters that all teams copy from
      - "protect-critical-infrastructure"
    # Explicitly define what resource types to DELETE (include-only model)
    resource-types:
      includes:
        - EC2Instance
        - EC2Volume
        - EC2Address
        - RDSInstance
        - LambdaFunction
    filters:
      # Additional protections beyond the common preset
      EC2Instance:
        - property: tag:Environment
          value: "staging"
      EC2Address:
        - property: tag:DoNotDelete
          value: "true"

presets:
  protect-critical-infrastructure:
    filters:
      # Common protections that all teams inherit
      # Protects the aws-nuke infrastructure itself (also covered by IAM permission boundaries)
      IAMRole:
        - "aws-nuke-*"
        - "OrganizationAccountAccessRole"
      VPC:
        - type: "contains"
          value: "default"
      S3Bucket:
        - property: Name
          value: "aws-nuke-*"
      DynamoDBTable:
        - "aws-nuke-*"
      CloudWatchLogsLogGroup:
        - "/aws/codebuild/aws-nuke-*"
        - "/aws/lambda/aws-nuke-*"
```
This configuration approach provides:
- Include-only deletion: Teams explicitly list which resource types to clean up (EC2, RDS, Lambda, etc.)
- Common protection baseline: All configs inherit standard protections via presets
- Team-specific filters: Additional safeguards for resources with specific tags or names
- Safe by default: If a resource type isn’t in the includes list, it won’t be touched
This configuration format follows the aws-nuke specification. aws-nuke is a battle-tested open-source tool that can identify and remove hundreds of AWS resource types across all regions. By building my orchestration layer around aws-nuke, I got:
- Comprehensive resource coverage (300+ AWS resource types)
- Active maintenance and community support
- Proven reliability across thousands of AWS accounts
- Regular updates for new AWS services
When teams commit changes to their configuration files, GitLab CI automatically validates the syntax and uploads the config to a staging prefix in S3. This triggers a dry-run execution, and an approval email is sent to the team with a link to review the dry-run logs. Teams can see exactly what resources will be affected before the configuration goes live. Once approved, the configuration moves from staging to production and becomes active for scheduled executions.
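To make that flow concrete, here is a hedged sketch of what the CI job might do once validation passes. The bucket name, prefixes, and state machine ARN are assumptions rather than the module's actual values, and in practice the dry run could just as easily be started by an S3 event notification instead of directly from CI:

```python
"""Upload a validated config to the staging prefix and kick off a dry run.
Bucket, prefixes, and the state machine ARN below are illustrative."""
import json

import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

BUCKET = "aws-nuke-configs-123456789012"  # assumption
STATE_MACHINE_ARN = (
    "arn:aws:states:us-east-1:123456789012:stateMachine:aws-nuke-cleanup"  # assumption
)


def stage_and_dry_run(local_path: str, team: str, account: str, region: str) -> None:
    staging_key = f"staging/{team}/{account}/{region}-config.yaml"
    s3.upload_file(local_path, BUCKET, staging_key)

    # Start the workflow in dry-run mode; the Step Function runs aws-nuke
    # without --no-dry-run and publishes the resulting report for review.
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({"configKey": staging_key, "dryRun": True}),
    )
```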
Pull Request Reviews: Currently, configuration changes go through a lightweight PR approval process where SRE provides a quick sanity check. This is especially valuable during initial onboarding when teams are learning the patterns. As teams mature and become more familiar with the system, the plan is to remove this dependency and move to fully autonomous configuration management.
3. Approval Workflow (Safety Net)
Automation is powerful, but I built in multiple safety layers:
Layer 1: IAM Permission Boundaries - Each account has its own custom IAM role with explicit permission boundaries for aws-nuke execution. The role is scoped to that account only and cannot affect resources in other accounts. Even if a team makes a configuration error, the IAM policy prevents deletion of critical infrastructure like VPCs, security groups, network resources, and organizational resources. This is defense-in-depth at the AWS IAM level—the last line of defense against accidental destruction.
Layer 2: GitLab CI Validation - The CI pipeline validates that the account ID in the folder name matches the account ID in the configuration, ensuring configs are only uploaded to their intended accounts.
Layer 3: Pull Request Review - Configuration changes go through PR review where SRE provides guidance during initial onboarding. This is the only point where SRE gets involved, acting as coach rather than gatekeeper. As teams mature, this checkpoint will be removed.
Layer 4: Configuration Filters - Teams explicitly define what to protect using YAML configuration
Layer 5: Dry-Run Previews - Generate detailed reports before any deletion
Layer 6: Email Approvals - Human checkpoint before execution
Layer 7: Self-Protection - The system never deletes its own infrastructure
Here’s how the workflow operates:
Initial Configuration or Changes (Approval Required)
When a team creates a new configuration or modifies an existing one:
Day 1, Config Commit: Team commits configuration changes to Git
- GitLab CI validates the YAML syntax
- Uploads config to staging prefix in S3 (not live yet)
- Triggers dry-run execution with staged config
- Lambda generates preview report from dry-run
- Email sent: “Review 300 resources marked for cleanup - Approve to activate”
- Email includes link to dry-run logs in S3
- Report includes: resource types, IDs, ages, tags, estimated costs
Day 2, Review & Approval: Team reviews dry-run logs and approves
- Opens dry-run logs from S3 via email link
- Checks for any resources that should be protected
- If changes needed: Update configuration and restart process
- If satisfied: Approves via email link
- Configuration moves from staging to production S3 prefix (sketched after this list)
- First cleanup executes on schedule (via EventBridge) after approval
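A minimal sketch of that promotion step, assuming the approval link lands on a Python Lambda; the bucket, prefixes, and DynamoDB table name are illustrative:

```python
"""Approval handler sketch: promote the staged config and record the decision.
Bucket, prefixes, and the DynamoDB table name are illustrative."""
import datetime

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("aws-nuke-approvals")  # assumption

BUCKET = "aws-nuke-configs-123456789012"  # assumption


def handler(event, context):
    staging_key = event["configKey"]  # e.g. staging/team-alpha/...
    production_key = staging_key.replace("staging/", "production/", 1)

    # Promote: copy the approved config into the prefix the scheduler reads from.
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": staging_key},
        Key=production_key,
    )

    # Keep an audit trail of who approved what and when.
    table.put_item(
        Item={
            "configKey": production_key,
            "approvedBy": event.get("approver", "unknown"),
            "approvedAt": datetime.datetime.utcnow().isoformat() + "Z",
            "status": "APPROVED",
        }
    )
    return {"status": "approved", "configKey": production_key}
```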
Scheduled Automatic Execution (No Approval Needed)
Once a configuration is approved, EventBridge runs it automatically on schedule:
Every Monday, 2:00 AM: EventBridge triggers the workflow
- Step Function orchestrates the process
- CodeBuild runs aws-nuke with approved configuration
- Resources cleaned up automatically
- Detailed report uploaded to S3
- Team notified of completion
This design reduces operational overhead—teams only need to review and approve when they make configuration changes, not for every scheduled execution.
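As an illustration of the hand-off in the scheduled path, the orchestration step that launches aws-nuke could look roughly like this; the CodeBuild project name and environment variable names are assumptions:

```python
"""Sketch of the orchestration task that launches aws-nuke in CodeBuild.
Project name and environment variable names are illustrative."""
import boto3

codebuild = boto3.client("codebuild")


def start_cleanup(config_key: str, dry_run: bool = False) -> str:
    build = codebuild.start_build(
        projectName="aws-nuke-runner",  # assumption
        environmentVariablesOverride=[
            {"name": "CONFIG_S3_KEY", "value": config_key, "type": "PLAINTEXT"},
            # In this sketch the buildspec downloads the config from S3 and
            # only adds --no-dry-run when DRY_RUN is "false".
            {"name": "DRY_RUN", "value": str(dry_run).lower(), "type": "PLAINTEXT"},
        ],
    )
    return build["build"]["id"]
```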
Why This Works for Sandbox Accounts
Sandbox environments have unique characteristics that make them perfect for this approach:
- High churn rate - Resources are created and discarded frequently
- Experimentation focus - Teams try things and move on
- Cost sensitivity - Budget constraints demand efficiency
- Lower risk - Non-production data means more aggressive cleanup is acceptable
- Multiple teams - Shared responsibility requires clear ownership
The self-service model aligns perfectly with these characteristics. Teams understand their own resources better than anyone else, and they’re empowered to make the right decisions.
Important note: This approach is specifically designed for sandbox/development accounts. We do not recommend it for production environments, shared services, or compliance-heavy workloads where strict retention requirements exist.
Benefits Across the Organization
For Development Teams
- Autonomy: Control their own cleanup rules without platform team tickets
- Safety: Protect critical resources with simple tags or filters
- Transparency: See exactly what will be deleted before it happens
- Flexibility: Different rules for different regions or accounts
- Speed: Quick onboarding, no waiting
For Platform Team
- Standardization: One solution deployed across all teams
- Visibility: Central reporting and monitoring
- Compliance: Built-in approval workflows and audit trails
- Scalability: New teams self-onboard without platform intervention
- Maintenance: One Terraform module, many deployments
For Finance/Leadership
- Cost Savings: Significant reduction in sandbox spend
- Governance: Centralized oversight with distributed execution
- Metrics: Track usage patterns and identify waste
- Predictability: Scheduled, consistent cleanup cycles
Implementation Lessons Learned
1. Start with Strong Guardrails
I deployed the infrastructure per account using account-level Terraform variables, which allow for custom IAM roles and permission boundaries when necessary. Each account gets its own IAM role for aws-nuke execution, scoped to operate only within that specific account—it cannot affect resources in other accounts.
The custom IAM role with permission boundaries explicitly denies deletion of critical infrastructure components:
- VPCs, subnets, and core networking
- Security groups and NACLs
- AWS Organizations resources
- Critical IAM roles and policies
- Logging and audit infrastructure
- The cleanup system’s own resources (the system never deletes itself)
This IAM-level protection means that even if a team completely misconfigures their YAML file (like using empty filters), AWS itself will prevent catastrophic deletions. It’s the insurance policy against human error.
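The real boundary lives in the Terraform module, but a rough sketch of the kind of explicit-deny policy it attaches might look like this; the action list is trimmed, and the policy and role names are assumptions:

```python
"""Illustrative permission boundary: explicit denies for critical infrastructure.
In practice this is managed in Terraform; names and statements are trimmed."""
import json

import boto3

iam = boto3.client("iam")

boundary_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyCriticalDeletes",
            "Effect": "Deny",
            "Action": [
                "ec2:DeleteVpc",
                "ec2:DeleteSubnet",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteNetworkAcl",
                "organizations:*",
                "cloudtrail:DeleteTrail",
            ],
            "Resource": "*",
        },
        {
            # The cleanup system must not be able to delete its own role.
            "Sid": "ProtectCleanupSystemItself",
            "Effect": "Deny",
            "Action": "iam:DeleteRole",
            "Resource": "arn:aws:iam::*:role/aws-nuke-*",
        },
    ],
}

policy = iam.create_policy(
    PolicyName="aws-nuke-permission-boundary",  # assumption
    PolicyDocument=json.dumps(boundary_document),
)
iam.put_role_permissions_boundary(
    RoleName="aws-nuke-execution",  # assumption
    PermissionsBoundary=policy["Policy"]["Arn"],
)
```

Because a permission boundary caps what the role can ever do, these denies hold even if the execution role's own policy (or a team's config) is far too permissive.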
Additionally, GitLab CI validates that the account ID in the folder structure (<account-alias>-<account-id>) matches the account ID in the configuration files before uploading—ensuring configs can only be deployed to their intended accounts.
2. Tag Strategy is Critical
I educated teams on tagging standards and made them part of the onboarding (a short example of applying the protection tag follows the list):
- DoNotDelete: Universal protection tag
- Environment: Distinguish prod-like resources from experiments
- Owner: Track resource ownership
- ExpiresOn: Optional expiration dates for temporary resources
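In practice, protecting a resource then comes down to tagging it; a minimal example, with a placeholder instance ID:

```python
"""Tag an instance so the DoNotDelete filter in the config protects it.
The instance ID is a placeholder."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_tags(
    Resources=["i-0123456789abcdef0"],  # placeholder
    Tags=[
        {"Key": "DoNotDelete", "Value": "true"},
        {"Key": "Owner", "Value": "team-alpha"},
    ],
)
```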
3. Configuration Validation Matters
The GitLab CI pipeline validates every configuration change (the filter-safety check is sketched after this list):
- YAML syntax checking
- Account ID verification
- Region validation
- Filter safety checks (prevent empty filters that would delete everything)
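A sketch of that last check, which rejects configurations whose protections have been emptied out; what exactly counts as unsafe here is an assumption about the CI rules, not a copy of them:

```python
"""Reject configs that include resource types but define no protection at all.
What counts as "unsafe" here is an assumption, not the exact CI rule."""
import sys

import yaml


def check_filter_safety(config_path: str) -> None:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    for account_id, account in config.get("accounts", {}).items():
        includes = account.get("resource-types", {}).get("includes", [])
        filters = account.get("filters", {}) or {}
        presets = account.get("presets", []) or []

        if includes and not filters and not presets:
            sys.exit(
                f"{config_path}: account {account_id} includes "
                f"{len(includes)} resource types but has no filters or presets"
            )


if __name__ == "__main__":
    for path in sys.argv[1:]:
        check_filter_safety(path)
```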
4. Observability is Essential
I built comprehensive observability from day one (a query sketch follows the list):
- All execution logs in CloudWatch
- All reports archived in S3
- All state tracked in DynamoDB
- Easy to answer: “Who deleted what, when, and why?”
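With state in DynamoDB, answering that question can be a single query; this sketch assumes a hypothetical executions table keyed by account ID and execution timestamp, with illustrative attribute names:

```python
"""Sketch: list cleanup executions for one account over a date range.
Table name, key schema, and attribute names are hypothetical."""
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("aws-nuke-executions")  # hypothetical

response = table.query(
    KeyConditionExpression=(
        Key("accountId").eq("123456789012")
        & Key("executedAt").between("2024-01-01", "2024-01-31")
    )
)
for item in response["Items"]:
    print(item["executedAt"], item["deletedCount"], item["reportS3Key"])
```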
5. Progressive Autonomy
Initially, I implemented a lightweight PR approval process for configuration changes. This gave me a chance to:
- Coach teams during their first configurations
- Catch common mistakes early
- Build trust through collaboration
This is the only touchpoint where SRE gets involved—no tickets, no lengthy reviews, just a quick sanity check. As teams gain confidence and understanding, the plan is to remove this requirement entirely and achieve fully autonomous configuration management. The goal is to work ourselves out of the approval loop.
Results
After rolling this out across multiple teams:
- Significant reduction in sandbox AWS spend
- Fast onboarding - teams self-service without platform team involvement
- No unintended deletions of protected resources - IAM permission boundaries and validation working as designed
The ROI was clear within the first month. More importantly, the conversation shifted from “cost police” to “enablement partner.”
Key Takeaways
If you’re considering a similar approach, here are our recommendations:
1. Empowerment vs Control
Don’t choose between control and autonomy—choose both. The right platform provides guardrails while enabling speed.
2. Configuration as Code
If infrastructure is code, cleanup policies should be too. Git gives you version control, peer review, and rollback capabilities.
3. Include-Only Deletion Model
Rather than “delete everything except…”, use “only delete what you explicitly include”. Teams list the specific resource types they want cleaned (EC2, RDS, Lambda) and inherit common protection filters. This safe-by-default approach prevents accidental cleanup of unexpected resource types.
4. Safe Automation
Automate aggressively, but fail safely. Multiple layers of protection mean teams can move fast without breaking things.
5. Self-Service Scales
Build once, use everywhere. When teams can self-onboard, your solution scales without your team growing.
Getting Started
If you want to build something similar:
- Identify your problem: Measure your sandbox sprawl and cost impact
- Adopt aws-nuke: Start with ekristen/aws-nuke as your foundation
- Add orchestration: Step Functions, Airflow, or your workflow engine to wrap aws-nuke
- Build self-service layer: Configuration repository with CI/CD
- Implement approvals: Email, Slack, or your notification system
- Pilot with one team: Prove the model before scaling
- Iterate and improve: Listen to feedback, refine the experience
- Scale organization-wide: Let success drive adoption
Conclusion
Platform engineering isn’t about gatekeeping—it’s about enablement. This self-service AWS cleanup solution embodies this philosophy. I built centralized infrastructure that’s reliable and secure, then gave teams the autonomy to configure it for their needs.
The result? Happier developers, lower costs, and a platform engineer who’s seen as an enabler rather than a bottleneck.
Sandbox sprawl is a common problem, but the solution doesn’t have to be complex. With the right balance of control and autonomy, you can build systems that teams actually want to use.