Amazon Web Services (AWS) is the biggest cloud platform in the world, with over 200 features. In this article, we break down 10 AWS services that support at least some SQL syntax, talk about their use cases, and give examples of how to write queries.
Service | Description | SQL Support | Use Case |
RDS | Postgres, MySQL, etc. | Full | Small-medium web apps |
Aurora | Serverless databases | Full | Serverless apps |
Redshift | Data warehouse | Full | OLAP, Petabytes of data, analytics |
DynamoDB | NoSQL database | Some - PartiSQL | Ecommerce, building fast |
Keyspaces | Managed Cassandra (key value) | Some - CQL | Messaging |
Neptune | Graph database | Some - openCypher | Social networks |
Timestream | Time series database | Partial | IOT, Logging |
Quantum Ledger | Cryptographically verified transactions | Some - PartiSQL | Finance |
Athena | Ad-hoc queries on S3 | Some - CTAS | Historical data |
Babelfish | MSFT SQL Server on Aurora | Full | .NET |
The table above shows how SQL support varies between the services. A graph database cannot be queried in the same way as a classic relational database, and various subsets of SQL, like PartiQL, have emerged to fit these models. In fact, even within standard SQL, there are many SQL dialects for different companies like Oracle and Microsoft.
AWS database services
From relational, to graph, to ledger, to time series, the services below cover almost every conceivable hosted database use case. Let's examine what each is for and how to write a SQL query against them!
RDS
Amazon RDS is one of the most basic AWS database services, used mainly for offloading your database management operations to a platform. Therefore, it is used for small or medium enterprises where the data volume is limited, and the functionalities required for company operations are not too complex.
Amazon RDS supports database engines such as MySQL, MariaDB, PostgreSQL, Oracle, and Microsoft SQL Server. It comes with workflows to secure your RDS instance using SSH and offers a straightforward cloud console for connecting.
Benefits
- Amazon RDS is the most inexpensive service, thanks to its ease of usage and lack of complexity.
- It is highly scalable and allows you to scale up to 32 vCPUs and 244 Gb of RAM.
- This service is also easy to use and pretty fast.
Example SQL
RDS is just some nice management options around the core databases so it supports the exact SQL that each database supports. Here would be a subquery that could be run against Postgres, MySQL, or Oracle.
SELECT
*
FROM
users
WHERE
id IN (
SELECT
user_id
FROM
transactions
WHERE
amount > 1000
);
AWS Aurora
Amazon Aurora is an elevated version of Amazon RDS. Large enterprises use this since their data volume and complexity of operations are much higher. It doesn’t support all the same database engines as Amazon RDS, and instead only supports MySQL and PostgreSQL. Aurora scales up and down as the load on your database increases and decreases. Newer providers like PlanetScale also offer this capability with additional schema migration features and lower costs.
Amazon Aurora, like RDS, can perform replication. It actually offers about 15 different types of replications, and one replication can be done within milliseconds. On the other hand, RDS can perform only five types of replications, taking more time.
Some of the use cases that can depict the strength of Amazon Aurora are enterprise applications, SaaS applications, and web/mobile gaming.
Benefits:
- Auto-scaling allows scaling operations to be carried out automatically without affecting the DB performance. It allows up to 128 TB per database instance.
- Aurora backup operations are automatic, continuous, incremental, and have about 99.99999999% durability.
- Aurora can detect database failure and recover in less than a minute. Furthermore, in case of a permanent failure, it can automatically move to a replica without data loss.
Redshift
Like Amazon Aurora, Amazon Redshift is used by large enterprises. However, Redshift is more complex, can handle more data, and is referred to as a data warehouse. This is because Redshift is built for OLAP (Online Analytical Processing).
Furthermore, Redshift can scale up to Petabytes of data and supports up to 60 user-defined databases per cluster. On the other hand, Aurora can only scale till Terrabytes and support up to 40. Besides this, the security and the maintenance of both the database services are pretty much the same.
A few use cases of Amazon Redshift are creating machine models for forecasting operations, optimizing your company’s business intelligence, and increasing developer productivity.
Benefits
- Redshift possesses the highest scaling capabilities amongst the three options we've examined.
- Its performance is much faster and more durable.
- Amazon Redshift can also handle a more significant amount of data and analyze it within a shorter period.
Example SQL
Redshift supports some SQL functions and queries which would generally only be necessary with large data warehouse applications. For example, PERCENTILE_CONT
computes a linear interpolation to return a percentile.
SELECT
TOP 10 salesid,
sum(pricepaid),
percentile_cont(0.6) WITHIN GROUP (
ORDER BY
salesid
),
median (salesid)
FROM
sales
GROUP BY
salesid,
pricepaid;
DynamoDB
DynamoDB is Amazon's answer to MongoDB, a NoSQL database that works on JSON documents. These databases rely heavily on nested data and do not enforce any strict schema, unless the developer turns that option on. That means that DynamoDB is great for high-volume sites like a CMS or mobile apps with a lot of traffic. For example, both Major League Baseball and Duolingo make use of DynamoDB.
Example SQL
Because DynamoDB is not relational and does not enforce ACID by default, it must use a modified version of standard SQL. Amazon has developed a query language called PartiQL which uses many SQL concepts but is built for highly nested data. The query below takes advantage of the key-value underpinnings of DynamoDB in a relatively SQL standard way.
UPDATE
Music
SET
AwardsWon = 1
SET
AwardDetail = { 'Grammys': [ 2020, 2018 ] }
WHERE
Artist = 'Acme Band'
AND SongTitle = 'PartiQL Rocks'
Keyspaces
Cassandra is a famous distributed key-value database. iMessage, while it uses SQLite on the client side, sends all its messages through a massive Cassandra cluster at Apple. Other developers building distributed services with high fault tolerance needs may also look to Cassandra, albeit without wanting the headache of managing a large cluster. This is where AWS Keyspaces comes in by proving managed Cassandra.
Example SQL
The Cassandra project has developed its own SQL subset called CQL which uses the concepts of Keyspaces, Tables, Partitions, Rows, and Columns to query data. The example below is a CQL instruction to use Cassandra's famous fault tolerance to restore a keyspace at a point in time.
RESTORE TABLE mykeyspace.mytable_restored
FROM TABLE mykeyspace.my_table
WITH restore_timestamp = '2020-06-30T04:05:00+0000'
AND custom_properties = {
'point_in_time_recovery': { 'status' :'disabled' },
'capacity_mode': {
'throughput_mode': 'PROVISIONED',
'read_capacity_units': 10,
'write_capacity_units': 20
}
}
AND TAGS = {
'key1' :'val1',
'key2' :'val2'
};
AWS Neptune
The most well-known graph database is probably Neo4j, but for developers in the AWS ecosystem, Neptune can fill that role. Graph databases are very helpful for modeling relationships between N items in a network. Social networks are an obvious application, but fraud detection and even infrastructure management are good use cases for graph DBs too. Neo4j developed an open-source graph query language called openCypher, which AWS Neptune supports.
Example SQL
As you can see, it's a little charitable to call the below query a SQL query. In fact, the openCypher language is probably the most exotic SQL language here. In this query, we define a route between two airports. This is basically an INSERT
. Later on, we could query for the best route options between airports in a ticketing system.
MATCH (a:airport {code:'SEA'}), (b:airport {code:'ANC'}),
CREATE (a)-[r:route]->(b)
RETURN r
Timestream
Time-series databases are increasingly popular thanks to the emergence of unlimited logging, IoT, blockchain, and fintech. A time series database attempts to hide the pain of querying across date ranges, storing unlimited historical data, and keeping current data up to date. Compression is often a key metric for the use of a time-series database.
AWS Timestream specifically helps with managing the lifecycle of this data and also integrates with other AWS services you might expect, including AWS IoT Core!
Example SQL
When making queries over time-series data, we make heavy use of common table expressions (CTEs) and aggregates. AWS Timestream supports a number of useful functions which make these queries easier. Because do you want to implement linear interpolation in SQL? In the query below we find the average CPU utilization binned at 30-second intervals for a specific EC2 host over the past 2 hours, filling in the missing values using linear interpolation. This data could then be displayed on a handy line chart or otherwise used for service monitoring.
WITH binned_timeseries AS (
SELECT
hostname,
BIN(time, 30s) AS binned_timestamp,
ROUND(AVG(measure_value: :double), 2) AS avg_cpu_utilization
FROM
"sampleDB" .DevOps
WHERE
measure_name = 'cpu_utilization'
AND hostname = 'host-Hovjv'
AND time > ago(2h)
GROUP BY
hostname,
BIN(time, 30s)
),
interpolated_timeseries AS (
SELECT
hostname,
INTERPOLATE_LINEAR(
CREATE_TIME_SERIES(binned_timestamp, avg_cpu_utilization),
SEQUENCE(min(binned_timestamp), max(binned_timestamp), 15s)
) AS interpolated_avg_cpu_utilization
FROM
binned_timeseries
GROUP BY
hostname
)
SELECT
time,
ROUND(value, 2) AS interpolated_cpu
FROM
interpolated_timeseries
CROSS JOIN UNNEST(interpolated_avg_cpu_utilization)
Quantum Ledger (QLDB)
QLDB is a massive ledger, or a history of financial events. Organizations like banks or the Census Bureau use ledgers to ensure that no matter what happens with their data, they can be absolutely sure of the record. We will note that the AWS docs work hard not to use the term blockchain. Indeed, if QLDB was distributed around the world and open access, it would be very close to Bitcoin.
The key properties of QLDB are that it is immutable. No way, no how can the commit log for the database be altered. And if it is the ledger can report this or refuse to process additional transactions. The underpinnings of QLDB are a JSON-like data structure called AWS Ion. We saw earlier how PartiQL provides a SQL-esque query language for highly nested data. The example below inserts a record into QLDB to track vehicle registrations.
Example SQL
Once this record is entered into the ledger it becomes immutable and cannot be altered without inserting a new record. In other words, there is really no concept of UPDATE
for QLDB.
INSERT INTO VehicleRegistration VALUE
{
'VIN' : 'KM8SRDHF6EU074761', --string
'RegNum' : 1722, --integer
'State' : 'WA',
'City' : 'Kent',
'PendingPenaltyTicketAmount' : 130.75, --decimal
'Owners' : { --nested struct
'PrimaryOwner' : { 'PersonId': '294jJ3YUoH1IEEm8GSabOs' },
'SecondaryOwners' : [ --list of structs
{ 'PersonId' : '1nmeDdLo3AhGswBtyM1eYh' },
{ 'PersonId': 'IN7MvYtUjkp1GMZu0F6CG9' }
]
},
'ValidFromDate' : `2017-09-14T`, --Ion timestamp literal with day precision
'ValidToDate' : `2020-06-25T`
}
AWS Athena
Do you ever just shove data in S3 and forget about it? S3 often stores terabytes of JSON, CSV, text, or other unstructured data. AWS Athena is a SQL service to pull that data out of S3 and push it into a relational structure. It's great for inspecting buckets, transforming data through ETL processes, or cleaning data to send to services like Hadoop. It's also compatible with DataStation.
There is also another way to query S3 with a service called S3 Select. More information is coming to the Arctype blog on this topic soon! Athena is more flexible than S3 Select because it can touch multiple objects and buckets in the same query. This is also known as making "ad-hoc" queries.
Example SQL
The key concept of AWS Athena is called Creating a Table from Query Results (CTAS). Obviously, there are no tables that exist before S3 is queried, so we need to make use of INSERT INTO
when we pull data out to create the table. In the query below, we graph dates from before 2015 from CSVs in our S3 bucket.
INSERT INTO new_parquet
SELECT id,
date,
element,
datavalue,
mflag,
qflag,
sflag,
obstime,
substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) < 2015
Babelfish for Aurora PostgreSQL
We would be remiss not to mention Babelfish as the last SQL service on this list. Technically Babelfish is more of a technology than a fully-fledged database but it accepts SQL queries so here we go!
With Babelfish, you can connect applications that you have written within the Microsoft ecosystem, for example using .NET and TSQL, to AWS Aurora which is backed by Postgres. Although we will note that the list of limitations and gotchas seems fairly long, for organizations with massive codebases making use of TSQL, this will in fact be an easier migration path than rewriting every query to support Postgres.
Example query
Just a simple one here. This query would not run inside Postgres because it uses Microsoft-specific syntax. Yet with Babelfish backed by Aurora, it will return the correct result.
SELECT @@VERSION AS version;
Fully managed databases vs. managing servers
All of the options above fit into a category where AWS software is doing the heavy lifting of provisioning, backups, scale-out, upgrades, and other tasks associated with operating a database. You can, of course, host all of the software above on your own servers and do this yourself.
The Elastic Compute Cloud is almost the opposite of AWS Managed Services (AMS). It was created as a virtual machine to create multiple instances simultaneously. Each instance can run operating systems or software packages based on your requirements and needs. While features such as Auto Scaling are involved, users are left on their own to carry out significant functions for security, resource utilization, maintenance, and more. This leads to another set of pros and cons:
Pros:
- It is highly scalable
- It is integrated with most AWS services.
- The presence of the API makes your operations very easy
- You have complete control of the instances.
Cons:
- You’ll have to look at the security of the public instances.
- Managing resource utilization, maintenance, and management of the lifecycle can become pretty tedious.
- In specific scenarios, EC2 is not cost-efficient for companies.
Conclusion
This is a (fairly) complete overview of the AWS database services for databases and for writing SQL. All of these allow you to manipulate and query data via SQL, which makes them easy to use while also being highly scalable. One thing that stands out is the sheer variety of architectures, syntax, and targeted use cases. This speaks to both the depth of Amazon's offerings and the size of the cloud computing market. Take heart, you're working in an industry that's growing 40% year over year!