Athena vs redshift spectrum

12/6/2023

To further understand how Redshift takes advantage of its own disks, it is important to know how it operates, its query language, and what structures might be used to prepare data for upcoming queries.

To perform all these tasks, it stores metadata related to the actual data that is being held in the cluster. Work distribution is also possible because a leader node can have multiple compute nodes attached, fully scalable and divided by slices, which are sectors of a computing node with their own disk space and memory assigned. Let’s dive deeper into Redshift to understand how the best performance might be achieved, starting with its architecture.Įach cluster consists of a leader node and one or more compute nodes. The leader node represents the SQL endpoint and is responsible for job delegation and parallelization. The infrastructure of Redshift is cluster-based, which means that there is an infrastructure (compute and storage) that you pay for (if you pause a Redshift cluster you won’t get compute charges, but you still get storage charges). For optimal performance, your data needs to be loaded into a Redshift cluster and sorted/distributed in a certain way to lower the execution time of your queries. After the hands-on example, we will discuss how we can connect Tableau with Redshift, and finally, we will conclude with an overview of everything Redshift has to offer.Īmazon Redshift is a fully-managed data warehouse service in the AWS cloud which scales to petabytes of data. It is designed for On-Line Analytical Processing (OLAP) and BI performance drops when used for Transaction Processing (OLTP). from TPC-H, and we will see how Spectrum performs, compared to using data stored in Redshift. In this example we will use the same dataset and queries used in our previous blogs, i.e.

We will also examine a feature in Redshift called Spectrum, which allows querying data in S3 then we will walk through a hands-on example to see how Redshift is used. In this article, we will go through the basic concepts of Redshift and also discuss some technical aspects thanks to which the data stored in Redshift can be optimised for querying. Actually, the combination of S3, Athena and Redshift is what AWS proposes as a data lakehouse. We are not going to make a thorough comparison between Athena and Redshift, but if you are interested in the comparison of these two technologies and what situations are more suited to one or the other, you can find interesting articles online such as this one. In this article, we are going to learn about Amazon Redshift, an AWS data warehouse that, in some situations, might be better suited to your analytical workloads than Athena. As we commented, Athena is great for relatively simple ad hoc queries in S3 data lakes even when data is large, but there are situations (complex queries, heavy usage of reporting tools, concurrency) in which it is important to consider alternative approaches, such as data warehousing technologies. In our second article, we introduced Athena and its serverless querying capabilities. The rest of tables are left unpartitioned.

Partitioned Parquets: 32.5 GB – the largest tables, which are partitioned, are lineitem with 21.5GB and orders with 5GB, with one partition per day each partition has one file and there around 2,000 partitions per table.
Parquets without partitions: 31.5 GB – the largest tables are lineitem with 21GB and orders with 4.5GB, also split into 80 files.
Raw (CSV): 100 GB – the largest tables are lineitem with 76GB and orders with 16GB, split into 80 files.
In that example, we used a dataset from the popular TPC-H benchmark, and generated three versions of the TPC-H dataset: We also introduced the concept of the data lakehouse, as well as giving an example of how to convert raw data (most data landing in data lakes is in a raw format such as CSV) into partitioned Parquet files with Athena and Glue in AWS. In the first article of the series, we discussed how to optimise data lakes by using proper file formats ( Apache Parquet) and other optimisation mechanisms (partitioning). This is the third article in the ‘Data Lake Querying in AWS’ blog series, in which we introduce different technologies to query data lakes in AWS, i.e.

0 Comments

Athena vs redshift spectrum

Leave a Reply.

Author

Archives

Categories