Query content of S3 objects with SQL using Laravel
Recently, I was working on a Laravel application that deals with a lot of CSV data. Different business partners upload a new version of their CSV files to S3 every hour.
My application processes these files to build various charts for reporting. It also provides an interface where authorized users can view the content of the uploaded CSV files, and filter and search for specific records. Most of the time, the application only needs to access the most recent rows.
My initial approach was to download the entire CSV file from S3 into memory on every request and then loop through each row to calculate the data required for the charts. I was caching the calculated results for half an hour to improve performance. This approach had a few drawbacks:
- When the cache expired, I had to download the entire file from S3 again. Some files were over 5 MB, and downloading such large files significantly hurt performance.
- S3 charges for data retrieval, so downloading large files frequently increases the cost.
- Users of my application wanted to see recent records, but most of the time they were viewing old cached data.
- Writing code to filter the CSV content was not as pleasant as writing a SQL query.
Another approach was to fetch the content of the CSV every 15 minutes and insert the records into the SQL database. I did not like this approach because the data would then exist in two different places. It was not scalable either: as the number of records in the CSV file grows, I have to insert or update more and more records in the database. On the bright side, I could use Eloquent or the database query builder to access the data.
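For illustration, here is a rough sketch of what that scheduled-import approach might look like in App\Console\Kernel. The exchange_rates table and its columns are hypothetical; the column names simply mirror the CSV used later in this article.
// app/Console/Kernel.php
protected function schedule(Schedule $schedule): void
{
    $schedule->call(function () {
        $lines = array_filter(explode(PHP_EOL, Storage::disk('s3')->get('exchange_rate_to_usd.csv')));
        $header = str_getcsv(array_shift($lines));

        foreach ($lines as $line) {
            $row = array_combine($header, str_getcsv($line));

            // Upsert so re-runs do not duplicate rows; every run still reads and writes the whole file
            DB::table('exchange_rates')->updateOrInsert(
                ['date' => $row['date']],
                ['canadian_dollar_to_usd' => $row['canadian_dollar_to_usd']]
            );
        }
    })->everyFifteenMinutes();
}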
AWS S3 Select
While Amazon S3 is awesome for storage, it also has a feature called S3 Select. With S3 Select, you can use a simple SQL query to filter the content of stored objects and retrieve only the subset of data that you need.
S3 Select currently works on objects that store data in CSV, JSON, and Apache Parquet formats. It also works with GZIP or BZIP2 compressed CSV and JSON objects.
Example of S3 Select Query with Laravel
For example purposes, I am going to use the following sample data of daily exchange rates to USD. You can download the complete dataset here.
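The queries below reference a date column and a canadian_dollar_to_usd column, so the file is assumed to look roughly like this (the values shown are illustrative, not taken from the dataset):
date,canadian_dollar_to_usd
2022-02-27,0.7850
2022-02-28,0.7871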
Laravel's filesystem does not support the SelectObjectContent operation out of the box, so I created the following macro to reuse the same logic in many places.
<?php

namespace App\Providers;

use Illuminate\Filesystem\AwsS3V3Adapter;
use Illuminate\Support\ServiceProvider;

class MacroServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        AwsS3V3Adapter::macro('query', function (string $key, string $expression, array $overrides = []) {
            /** @var AwsS3V3Adapter $this */
            $default = [
                'Bucket' => $this->config['bucket'],
                'Expression' => $expression,
                'ExpressionType' => 'SQL',
                'InputSerialization' => [
                    'CSV' => [
                        // Treat the first line of the CSV as a header row
                        'FileHeaderInfo' => 'USE',
                        'FieldDelimiter' => ',',
                    ],
                ],
                'Key' => $key,
                'OutputSerialization' => [
                    'CSV' => [
                        // Separate returned records with a newline so results can be split on PHP_EOL
                        'RecordDelimiter' => PHP_EOL,
                    ],
                ],
                'RequestProgress' => [
                    'Enabled' => false,
                ],
            ];

            return $this->getClient()->selectObjectContent(array_merge($default, $overrides));
        });
    }
}
This macro uses the underlying S3Client to execute the SelectObjectContent command. It accepts the key of the object, a query, and an optional array of overrides for the SelectObjectContent parameters, and it returns an Aws\Result instance.
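As mentioned earlier, S3 Select can also read GZIP or BZIP2 compressed objects. Since the macro's third argument is merged over the defaults with a plain array_merge, the whole InputSerialization block is replaced when you override it. A sketch of querying a compressed copy of the file (the .csv.gz key is hypothetical):
$query = 'SELECT * FROM s3object LIMIT 5';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv.gz', $query, [
    'InputSerialization' => [
        // BZIP2 is the other supported value; omit CompressionType for uncompressed files
        'CompressionType' => 'GZIP',
        'CSV' => [
            'FileHeaderInfo' => 'USE',
            'FieldDelimiter' => ',',
        ],
    ],
]);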
Select & Limit Clause
S3 Select uses simple SQL statements. The following code snippet retrieves the first five records from our example dataset.
$records = collect();

// S3 query to select the first 5 records
$query = 'SELECT "date", "canadian_dollar_to_usd" FROM s3object LIMIT 5';

// Execute the query and fetch the results
$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to collect the results
foreach ($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        $records = $records->merge(
            Str::of($event['Records']['Payload'])
                ->explode(PHP_EOL)
                ->filter()
                ->map(fn (string $item) => Str::of($item)->explode(','))
        );
    }
}

dd($records);
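Since this payload-parsing loop is the part you will end up repeating, you could pull it into a small helper. A minimal sketch; the function name and where you place it are up to you:
use Aws\Result;
use Illuminate\Support\Collection;
use Illuminate\Support\Str;

// Hypothetical helper: turn a SelectObjectContent result into a collection of rows
function collectSelectRecords(Result $result): Collection
{
    $records = collect();

    foreach ($result['Payload'] as $event) {
        // Only "Records" events carry data; Stats, Progress, and End events are skipped
        if (isset($event['Records'])) {
            $records = $records->merge(
                Str::of($event['Records']['Payload'])
                    ->explode(PHP_EOL)
                    ->filter()
                    ->map(fn (string $row) => Str::of($row)->explode(','))
            );
        }
    }

    return $records;
}
With that in place, the example above reduces to $records = collectSelectRecords(Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query));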
If your CSV file does not have headers, you can refer to columns by index number in your query. You will also have to update the FileHeaderInfo parameter. The possible values are USE, IGNORE, or NONE. If your CSV does not include headers, use NONE; if it includes headers that you do not want to reference, use IGNORE.
$query = 'SELECT _1, _2 FROM s3object LIMIT 5';
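For example, a headerless file could be queried by passing the macro's optional overrides. A sketch, using a hypothetical key for a copy of the data without a header row; since the macro uses a plain array_merge, the whole InputSerialization block is supplied again:
$result = Storage::disk('s3')->query('rates_without_headers.csv', $query, [
    'InputSerialization' => [
        'CSV' => [
            // No header row in the file, so columns are addressed as _1, _2, ...
            'FileHeaderInfo' => 'NONE',
            'FieldDelimiter' => ',',
        ],
    ],
]);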
Like a regular SQL statement, you can also use * to retrieve all the columns.
$query = 'SELECT * FROM s3object LIMIT 5';
Where Clause
To explore the WHERE clause, let's get the CAD to USD conversion rates since March 1, 2022. Notice that date is a reserved keyword in S3 Select. To use a reserved keyword as a column name, you have to wrap it in double quotes. You can find the list of reserved keywords here.
// ...
$query = 'SELECT "date", canadian_dollar_to_usd FROM s3object WHERE CAST("date" AS TIMESTAMP) >= CAST(\'2022-03-01\' AS TIMESTAMP)';
$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);
// ...
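The same CAST approach also works for a bounded date range. For instance, a sketch using BETWEEN, which S3 Select supports in WHERE clauses:
$query = 'SELECT "date", canadian_dollar_to_usd FROM s3object '
    . 'WHERE CAST("date" AS TIMESTAMP) '
    . 'BETWEEN CAST(\'2022-03-01\' AS TIMESTAMP) AND CAST(\'2022-03-31\' AS TIMESTAMP)';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);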
Using Functions
Like SQL, S3 Select also allows us to use functions. You can find the complete list of functions in the AWS documentation. In the following code, we will get the total number of records in a file using the COUNT aggregate function.
$count = null;

$query = 'SELECT COUNT(*) FROM s3object';

$result = Storage::disk('s3')->query('exchange_rate_to_usd.csv', $query);

// Loop through payload events to get the count
foreach ($result['Payload'] as $event) {
    if (isset($event['Records'])) {
        // The payload is the raw query output: the count followed by the record delimiter
        $count = (int) trim($event['Records']['Payload']);
    }
}

dd($count);
Cost of S3 Select
Amazon S3 Select is very cheap. The cost depends on the number of select requests, the amount of data scanned, and the amount of data returned. Let's say I have 50 GB of data stored in S3, make 100,000 Select requests per month, and return 20 GB of data. It will cost me less than 2 US dollars. The following is the estimate from the AWS pricing calculator as of March 2022.
Tiered price for: 50 GB
50 GB x 0.0230000000 USD = 1.15 USD
Total tier cost = 1.1500 USD (S3 Standard storage cost)
100,000 GET requests in a month x 0.0000004 USD per request = 0.04 USD (S3 Standard GET requests cost)
20 GB x 0.0007 USD = 0.014 USD (S3 select returned cost)
30 GB x 0.002 USD = 0.06 USD (S3 select scanned cost)
1.15 USD + 0.04 USD + 0.014 USD + 0.06 USD = 1.26 USD (Total S3 Standard Storage, data requests, S3 select cost)
S3 Standard cost (monthly): 1.26 USD
Conclusion
In conclusion, you can reduce data transfer time and cost by using S3 Select. Although S3 Select only supports simple queries, it can be quite useful in some scenarios.