Introduction

Amazon Simple Storage Service (S3) is a cloud storage solution known for its scalability, reliability, and security. S3 can be used in many ways and for many use cases; however, they all have one thing in common: the need for cost management.

 

Managing costs on S3 may be demanding for companies that migrate their on-premises solutions to the AWS cloud or for those with non-standardized storage management policies. Improperly managing S3 costs can cause expenses to rapidly escalate and have a serious impact on budgets.

 

This article gives a concise overview of how to optimize costs on S3 by implementing appropriate standards. I will explore strategies such as selecting the right storage classes, leveraging lifecycle policies, and managing data efficiently to help you keep your S3 expenditures under control. We will focus on best practices for storing and processing data files. By understanding the nuances of data file management in S3, you will be better prepared to reduce costs.

Ways to optimize costs

Storage classes

 

Amazon Simple Storage Service (S3) offers multiple storage classes that are designed for different storage needs, optimizing costs and performance. Each storage class is intended for a different use case, providing flexibility in how data is stored and accessed. Understanding these storage classes is crucial for effective data management and cost optimization in the cloud.

 

  • S3 Standard
    Should be used for frequently accessed data. It offers high durability, availability, and low latency.
  • S3 Intelligent-Tiering
    Automatically moves data between access tiers based on changing access patterns.
  • S3 Standard-IA (Infrequent Access)
    Should be used for infrequently accessed data. It offers lower storage costs with a retrieval fee.
  • S3 One Zone-IA
    Similar to Standard-IA but stored in a single availability zone. It offers lower costs for infrequently accessed data with slightly less durability.
  • S3 Glacier Instant Retrieval
    Designed for archival storage with immediate access to data. Data retrieval is instant.
  • S3 Glacier Flexible Retrieval
    Designed for long-term archival with retrieval times ranging from minutes to hours, offering very low storage costs for data that is rarely accessed.
  • S3 Glacier Deep Archive
    The lowest-cost storage option for data that is accessed very infrequently. Retrieval times range from 12 to 48 hours, making it ideal for data that needs to be preserved for years or decades.

 

You can find detailed descriptions of the storage classes in the AWS S3 documentation.

 

Lifecycle Policies

 

Lifecycle policies are sets of rules that define the actions Amazon S3 performs on a group of objects. These rules are used to move objects into different storage classes and to archive or delete them after a certain time. Lifecycle rules are divided into “Transition actions” and “Expiration actions”. For current object versions, actions are based on the creation date; for noncurrent versions, you define the number of days after which the action should be performed.

 

Transition actions

 

Transition actions are responsible for moving objects to another storage class based on their age, i.e. the number of days since they were created. Examples of transition actions:

  1. Move objects from S3 Standard to S3 Standard-IA 30 days after creation
  2. Move objects from S3 Standard to S3 One Zone-IA 60 days after creation
  3. Move objects from S3 Standard-IA to S3 Glacier 90 days after creation
  4. Move objects from S3 Glacier Flexible Retrieval to S3 Glacier Deep Archive 180 days after creation
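For illustration only, below is a minimal boto3 sketch of the first rule. The bucket name, rule ID, and prefix are placeholder assumptions, not part of any existing setup, and note that this call replaces the bucket's entire lifecycle configuration, so in practice all rules would be submitted together:

import boto3

s3 = boto3.client("s3")

# Move current object versions to S3 Standard-IA 30 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",                      # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "standard-to-standard-ia-after-30-days",
                "Filter": {"Prefix": ""},         # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"}
                ],
            }
        ]
    },
)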

 

Expiration actions

 

Expiration actions are responsible for permanently deleting objects, thus freeing up storage space and reducing costs. Examples of expiration actions:

  1. Delete objects 100 days after creation
  2. Permanently delete previous versions of objects older than 15 days
  3. Delete objects under a given prefix 15 days after creation
  4. Delete objects with a given tag 15 days after creation
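A similar minimal boto3 sketch for expiration rules, again with a placeholder bucket name and prefix (and the same caveat that the call overwrites the existing lifecycle configuration):

import boto3

s3 = boto3.client("s3")

# Delete objects 100 days after creation and remove noncurrent versions after 15 days
s3.put_bucket_lifecycle_configuration(
    Bucket="example-bucket",                      # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-after-100-days",
                "Filter": {"Prefix": "temp/"},    # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 100},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 15},
            }
        ]
    },
)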

 

Efficient Data Management

 

Aggregation of small files

 

Small files can cause inefficiencies and higher costs in Amazon S3 due to the overhead associated with storing and retrieving each individual object. Aggregating small files into larger objects can significantly reduce these costs and improve performance.

  • Reduced Storage Overhead: Amazon S3 charges for the storage used, and every object adds its own metadata and request overhead. Combining small files into larger objects reduces the number of objects and, with it, the associated overhead
  • Lower Request Costs: Each PUT, GET, and LIST request incurs a cost. Aggregating files reduces the number of requests needed to manage and access your data
  • Improved Performance: Fewer objects mean fewer metadata operations, which can lead to faster access times and more efficient data retrieval
  • Potentially fewer requests elsewhere:
    • Other services: e.g. with KMS, you reduce the number of encrypt and decrypt requests
    • Data transfer
    • S3 Replication
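As a simple illustration of such aggregation, here is a minimal PySpark sketch that reads many small objects and rewrites them as a few larger ones. It assumes a SparkSession is already available, uses placeholder bucket and prefix names, assumes Parquet input, and the target file count would need to be tuned to the actual data volume:

# Minimal aggregation sketch – paths and the output file count are placeholders
small_files = spark.read.parquet("s3://example-bucket/small-files/")

(
    small_files
    .coalesce(8)                                  # write far fewer, larger files
    .write
    .mode("overwrite")
    .parquet("s3://example-bucket/aggregated/")
)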

 

Data compression

 

Data compression is another effective strategy for reducing storage costs and improving data transfer efficiency. Compressing data before storing it in Amazon S3 reduces the amount of storage space required and can lower data transfer costs. Currently, Amazon S3 does not offer native data compression, so data must be compressed by the data provider or the user before files are uploaded to S3 (and decompressed after they are downloaded).

  • Reduced Storage Costs: Compressed data occupies less space, which directly translates to lower storage costs
  • Faster Data Transfers: Compressed files are smaller, resulting in faster uploads and downloads, and reduced data transfer costs
  • Improved Performance: Smaller file sizes mean quicker retrieval times, enhancing the overall performance of data access

 

Example of data compression:

 

When you work with CSV files, it is beneficial to store them in a compressed .gz format. PySpark can decompress these files when loading data, allowing compressed CSV files to be read directly.

 

# Spark reads gzip-compressed CSV files transparently; no manual decompression is needed
df = spark.read.format("csv").option("header", "true").load("datafile.csv.gz")
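The reverse direction is also possible: if you produce CSV output yourself, Spark can gzip it on write. A minimal sketch, assuming the same df as above and a placeholder output path:

# Write the DataFrame back as gzip-compressed CSV (output path is a placeholder)
(
    df.write
    .option("header", "true")
    .option("compression", "gzip")
    .mode("overwrite")
    .csv("s3://example-bucket/compressed-output/")
)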

 

Algorithm for decreasing S3 costs

 

The diagram below shows how to approach effective cost management on S3 from a data storage perspective.

 


Use cases

Introduction

 

The presented cases are among the most common. They show typical storage management problems, possible solutions, and real benefits resulting from properly applied cost management practices.

 

Case #1

 

Brief introduction

 

A company needs to reduce costs and is looking for solutions. It uses many AWS services and has conducted an audit, which showed that S3 is one of the most expensive ones. The company asked for a cost-effective S3 storage management policy.

 

As-is state:

  • All S3 objects are stored in the S3 Standard storage class
  • No lifecycle policies are set
  • The company receives 3k files per month
  • Expected average new file size for the next 12 months: ~1 GB
  • There is one bucket to optimize
    • Raw bucket – where source files are stored
      • Files are accessed for 30 days and then only once or twice a year, if at all
      • Files should be available immediately when needed
      • Data cannot be deleted
      • CSV file format
      • Region: us-east-1
      • 600k objects
      • Total size: 60 TB
      • Storage cost: $1370 per month
  • Files are not divided into many small parts

 

Possibilities to reduce costs

 

In this scenario, the best storage classes to use are S3 Standard and S3 Glacier Instant Retrieval. S3 Standard will hold files during the first 30 days, while they are actively accessed; after that time, they should be transitioned to S3 Glacier Instant Retrieval. Changing the file format from CSV to Parquet will reduce the total file size by 2 to 10 times. This solution provides high durability, availability, and low latency, and is more cost-efficient.

 

To achieve this, there are two possible steps to reduce costs:

  • Create a lifecycle rule to move objects to the Glacier Instant Retrieval storage class 30 days after object creation
  • Change the file format to Parquet – prepare a job to convert CSV files to Parquet (a minimal conversion sketch is shown below)
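A minimal sketch of such a conversion job is shown below. It assumes Spark with S3 access is already configured and uses placeholder bucket and prefix names; a real job would add schema handling, partitioning, and error handling:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV files (gzip-compressed CSVs are decompressed transparently)
raw = spark.read.option("header", "true").csv("s3://raw-bucket/incoming/")

# Rewrite them as Parquet, which is typically several times smaller
raw.write.mode("overwrite").parquet("s3://raw-bucket/parquet/")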

 

Costs were calculated in June 2024. Current pricing: S3 pricing documentation.

 

Solution | Objects number | Total size | Storage class | Cost per month
As-is | 600k | 60TB | S3 Standard | 1 370,00 USD
CSV to Parquet | 600k | 6TB-30TB | S3 Standard | 138,00 – 690,00 USD
Glacier Instant Retrieval | 600k | 57TB + 3TB | Glacier Instant Retrieval + S3 Standard | 297,00 USD
Parquet + Glacier Instant Retrieval | 600k | 5,7TB-28,5TB + 3TB | Glacier Instant Retrieval + S3 Standard | 91,80 – 183,00 USD

 

We can see the difference between the variants from a cost-per-month perspective. The file format transformation significantly reduces costs even with the S3 Standard storage class, as does moving unused objects to Glacier Instant Retrieval. The combination of the two gives the best result.

 

Let’s compare costs from a 12-month perspective, including transition costs and data delivery.

 

Month | Objects number | CSV total size | Parquet total size | As-is solution | Parquet + Glacier Instant Retrieval (lowest) | Parquet + Glacier Instant Retrieval (highest) | Transition costs
1 | 600 000 | 60TB | 6TB-30TB | 1 370,00 USD | 91,80 USD | 183,00 USD | 5,97 USD
2 | 603 000 | 63TB | 6,3TB-31,5TB | 1 436,00 USD | 93,00 USD | 189,00 USD | 0,03 USD
3 | 606 000 | 66TB | 6,6TB-33TB | 1 502,00 USD | 94,20 USD | 195,00 USD | 0,03 USD
4 | 609 000 | 69TB | 6,9TB-34,5TB | 1 568,00 USD | 95,40 USD | 201,00 USD | 0,03 USD
5 | 612 000 | 72TB | 7,2TB-36TB | 1 634,00 USD | 96,60 USD | 207,00 USD | 0,03 USD
6 | 615 000 | 75TB | 7,5TB-37,5TB | 1 700,00 USD | 97,80 USD | 213,00 USD | 0,03 USD
7 | 618 000 | 78TB | 7,8TB-39TB | 1 766,00 USD | 99,00 USD | 219,00 USD | 0,03 USD
8 | 621 000 | 81TB | 8,1TB-40,5TB | 1 832,00 USD | 100,20 USD | 225,00 USD | 0,03 USD
9 | 624 000 | 84TB | 8,4TB-42TB | 1 898,00 USD | 101,40 USD | 231,00 USD | 0,03 USD
10 | 627 000 | 87TB | 8,7TB-43,5TB | 1 964,00 USD | 102,60 USD | 237,00 USD | 0,03 USD
11 | 630 000 | 90TB | 9TB-45TB | 2 030,00 USD | 103,80 USD | 243,00 USD | 0,03 USD
12 | 633 000 | 93TB | 9,3TB-46,5TB | 2 096,00 USD | 105,00 USD | 249,00 USD | 0,03 USD
SUM | | | | 20 796,00 USD | 1 180,80 USD | 2 592,00 USD | 6,30 USD

 

Summary

 

The predicted costs over 12 months show a huge difference: the company can save roughly 18 000–19 600 USD in 12 months compared with the current solution.

 

Case #2

 

Brief introduction

 

A company is looking for cost improvements for the S3 service and has requested an audit.

 

As-is state:

  • All S3 objects are stored in appropriate storage classes
  • Lifecycle policies are set:
    • Transition actions – move objects to the appropriate storage class
    • Expiration actions – remove objects at the appropriate time
  • The company receives ~5k files per month
    • On average, each logical file is delivered in ~20 parts, so the ~5k objects correspond to ~250 files
      • Average file (part) size: ~5 MB
  • The company uses one bucket
    • Parquet file format
    • Region: us-east-1
  • There is no aggregation job, but it would be possible to create one

 

Possibilities to reduce costs

 

In this scenario, the lifecycle policies are used correctly, so no improvement is needed there, and the file format cannot be changed. It is, however, possible to aggregate data by combining many small files into one, which will reduce the number of requests and be a more cost-effective solution.

 

To achieve this, there are two possible options to reduce costs:

 

  • Create a job to combine files
  • Instruct data providers to send data in one file

 

Creating a job to combine files is not the most cost-effective option, but sometimes it is the only way. You have to build and maintain the new solution, which increases operational costs, and it does not reduce costs as much as the second approach. The better option is to shift this work directly to the data providers, but their cooperation and flexibility are required. The two approaches are compared below:

 

Aspect | Aggregate files in AWS | Instruct data providers to send a single file
Implementation Complexity | Moderate | Low
Cost | Medium | Low
Maintenance | High | Low
Operational Overhead | High | Low
Time to Implement | Moderate to High | Low to Moderate (could be longer)
Integration Complexity | High | Low
Cost Predictability | Moderate | High
Provider Cooperation | Not Required | Essential

 

Summary

 

Our recommendation is to instruct data providers to send the data in a single file. This solution results in lower costs and effort, but cooperation with the data providers is crucial.

 

Case #3

 

Brief introduction

 

A company is looking for cost improvements for the S3 service and has requested an audit.

 

As-is state:

  • All S3 objects are stored in appropriate storage classes
  • Lifecycle rules are set up properly
  • The company uses one bucket
    • Region: us-east-1
    • Parquet file format
    • ~28 million objects
    • Total size: ~3 TB
    • Storage cost: ~ $80 per month
    • Number of API requests: ~1 000 000 000 000
      • Cost: ~ $600 per month
    • Number of requests to KMS: ~230 000 000
      • Cost: ~ $700 per month
    • Number of requests to GuardDuty: ~1 000 000 000 000
      • Cost: ~ $500 per month
    • The company has too many small files: ~100 KB per object

 

Possibilities to reduce costs

 

Implementing Delta Lake can help reduce the costs associated with excessive API requests and inefficient storage usage caused by too many small files. Delta Lake uses compaction techniques that combine smaller files into larger ones. To realize these savings, you need to integrate Delta Lake into your existing environment.
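A minimal sketch of such a compaction run is shown below. It assumes Spark with the Delta Lake library (delta-spark) is already configured and that the table path is a placeholder for an existing Delta table:

from delta.tables import DeltaTable

# Placeholder path to an existing Delta table on S3
delta_table = DeltaTable.forPath(spark, "s3://example-bucket/delta/events")

# Compact many small files into fewer, larger ones
delta_table.optimize().executeCompaction()

# Optionally remove data files no longer referenced by the table (168 hours = 7 days)
delta_table.vacuum(168)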

 

Cost Category | Before Delta Lake | After Delta Lake | Savings
S3 Storage | 80,00 USD | ~ 80,00 USD | 0,00 USD
S3 API Requests | 600,00 USD | 240,00 USD | 360,00 USD
KMS Requests | 700,00 USD | 280,00 USD | 420,00 USD
GuardDuty Requests | 500,00 USD | 200,00 USD | 300,00 USD
Total Monthly Cost | 1 880,00 USD | 800,00 USD | 1 080,00 USD

 

It is worth mentioning that the processing itself is also 5 times faster than before (previously it took around 12 hours).

 

Summary

 

Implementing the Delta Lake solution has resulted in savings of approximately $1 080 per month.

Summary

Cost-effective S3 management can be implemented with a few simple rules or with S3 Intelligent-Tiering, or with very complex configurations containing dedicated rules for specific locations. Proper S3 management is very important for companies dealing with Big Data, as data volumes reach enormous proportions. Failing to manage S3, or managing it in the wrong way, can have a serious impact on budgets.

 

Each case is different, even if they look very similar. The key factors are constraints, client and environment requirements, and, finally, cooperation during the analysis. Openness and cooperation are key to a reliable analysis, which in turn has a huge impact on further actions.